Revisions to Alternatives to iterrow loops in python pandas dataframes

added 6 characters in body

Source Link

edited Jul 9, 2024 at 20:37

2.3k
1
10
21

prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_wd['Product_w'] * d.Rating_wd['Rating_w'],
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)

prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_w * d.Rating_w,
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)

prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d['Product_w'] * d['Rating_w'],
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)

added 145 characters in body

Source Link

edited Jul 9, 2024 at 18:32

tdy

2.3k
1
10
21

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just returnbuild a numpy array z.

Suggested pandas approach:

Use vectorized pandas methods. Compute in long form and then reshape into your desired wide form:

Merge into a long cross table
Compute the weights and weighted prices
Pivot Pivot into a wide cross table
Zero outMask the diagonal

This notNot only reducesdoes this reduce your 100+ lines of code to ~15 lines:

prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_w * d.Rating_w,
).pivotpivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).wheremask( #4
    cond=lambda d: np.identity(len(d.index)) !=, 1dtype=bool),
    other=0,
)

But isit's also significantly faster:

>>> prices
Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just return a numpy array z.

Suggested pandas approach:

Compute in long form and then reshape into your desired wide form:

Merge into a long cross table
Compute the weights and weighted prices
Pivot into a wide cross table
Zero out the diagonal

This not only reduces your 100+ lines of code to ~15 lines:

data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_w * d.Rating_w,
).pivot( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).where( #4
    cond=lambda d: np.identity(len(d.index)) != 1,
    other=0,
)

But is also significantly faster:

Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just build a numpy array.

Suggested approach:

Use vectorized pandas methods. Compute in long form and then reshape into your desired wide form:

Merge into a long cross table
Compute the weights and weighted prices
Pivot into a wide cross table
Mask the diagonal

Not only does this reduce your 100+ lines of code to ~15 lines:

prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_w * d.Rating_w,
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)

But it's also significantly faster:

>>> prices
Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

added 193 characters in body

Source Link

edited Jul 9, 2024 at 4:31

tdy

2.3k
1
10
21

I’d assume the root cause is my loops in the matrix parts.

Issues: Yes, looping is an anti-pattern in pandas, so iterrows should almost always be avoided.

You suspected correctly that iterrows is very inefficient and should almost always be avoided

Computing the final prices (z) at the numpy layer means that you lose the dataframe labels

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just return a numpy array z.

SuggestionsSuggested pandas approach:

Perform the computationsCompute in long form and then reshape into your desired wide form:

Merge into a long cross table
Compute the weights and weighted priceprices
Pivot into a wide cross table
Zero out the diagonal

This not only reduces your 100+ lines of code to ~15 lines:

Note how thisBut is also significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

And retains yourthe dataframe labels:

Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

And is significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Issues:

You suspected correctly that iterrows is very inefficient and should almost always be avoided

Computing the final prices (z) at the numpy layer means that you lose the dataframe labels

Suggestions:

Perform the computations in long form and then reshape into your desired wide form:

Merge into long cross table
Compute weighted price
Pivot into wide cross table
Zero out diagonal

Note how this retains your dataframe labels:

Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

And is significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I’d assume the root cause is my loops in the matrix parts.

Yes, looping is an anti-pattern in pandas, so iterrows should almost always be avoided.

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just return a numpy array z.

Suggested pandas approach:

Compute in long form and then reshape into your desired wide form:

Merge into a long cross table
Compute the weights and weighted prices
Pivot into a wide cross table
Zero out the diagonal

This not only reduces your 100+ lines of code to ~15 lines:

But is also significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

And retains the dataframe labels:

Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

added 232 characters in body

Source Link

edited Jul 9, 2024 at 0:08

tdy

2.3k
1
10
21

Loading

edited body

Source Link

edited Jul 8, 2024 at 23:34

tdy

2.3k
1
10
21

Loading

Source Link

answered Jul 8, 2024 at 23:21

tdy

2.3k
1
10
21

Loading

Stack Exchange Network

Return to Answer