Skip to main content
added 6 characters in body
Source Link
tdy
  • 2.3k
  • 1
  • 10
  • 21
prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_wd['Product_w'] * d.Rating_wd['Rating_w'],
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)
prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_w * d.Rating_w,
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)
prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d['Product_w'] * d['Rating_w'],
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)
added 145 characters in body
Source Link
tdy
  • 2.3k
  • 1
  • 10
  • 21

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just returnbuild a numpy array z.

Suggested pandas approach:

Use vectorized pandas methods. Compute in long form and then reshape into your desired wide form:

  1. Merge into a long cross table
  2. Compute the weights and weighted prices
  3. PivotPivot into a wide cross table
  4. Zero outMask the diagonal

This notNot only reducesdoes this reduce your 100+ lines of code to ~15 lines:

prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_w * d.Rating_w,
).pivotpivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).wheremask( #4
    cond=lambda d: np.identity(len(d.index)) !=, 1dtype=bool),
    other=0,
)

But isit's also significantly faster:

>>> prices
Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just return a numpy array z.

Suggested pandas approach:

Compute in long form and then reshape into your desired wide form:

  1. Merge into a long cross table
  2. Compute the weights and weighted prices
  3. Pivot into a wide cross table
  4. Zero out the diagonal

This not only reduces your 100+ lines of code to ~15 lines:

data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_w * d.Rating_w,
).pivot( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).where( #4
    cond=lambda d: np.identity(len(d.index)) != 1,
    other=0,
)

But is also significantly faster:

Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just build a numpy array.

Suggested approach:

Use vectorized pandas methods. Compute in long form and then reshape into your desired wide form:

  1. Merge into a long cross table
  2. Compute the weights and weighted prices
  3. Pivot into a wide cross table
  4. Mask the diagonal

Not only does this reduce your 100+ lines of code to ~15 lines:

prices = data.merge( #1
    right=data,
    how='cross',
).assign( #2
    Product_w=lambda d: np.where(d['Product_x'] == d['Product_y'], 1, 0.5),
    Rating_w=lambda d: np.where(d['Rating_x'] == d['Rating_y'], 1, 0.8),
    Price=lambda d: d['Price_x'] * d['Price_y'] * d.Product_w * d.Rating_w,
).pivot_table( #3
    index=['Product_x', 'Year_x'],
    columns=['Product_y', 'Year_y'],
    values='Price',
).mask( #4
    cond=lambda d: np.identity(len(d.index), dtype=bool),
    other=0,
)

But it's also significantly faster:

>>> prices
Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0
added 193 characters in body
Source Link
tdy
  • 2.3k
  • 1
  • 10
  • 21

I’d assume the root cause is my loops in the matrix parts.

Issues: Yes, looping is an anti-pattern in pandas, so iterrows should almost always be avoided.

  • You suspected correctly that iterrows is very inefficient and should almost always be avoided
  • Computing the final prices (z) at the numpy layer means that you lose the dataframe labels

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just return a numpy array z.


SuggestionsSuggested pandas approach:

Perform the computationsCompute in long form and then reshape into your desired wide form:

  1. Merge into a long cross table
  2. Compute the weights and weighted priceprices
  3. Pivot into a wide cross table
  4. Zero out the diagonal

This not only reduces your 100+ lines of code to ~15 lines:

Note how thisBut is also significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

And retains yourthe dataframe labels:

Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

And is significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Issues:

  • You suspected correctly that iterrows is very inefficient and should almost always be avoided
  • Computing the final prices (z) at the numpy layer means that you lose the dataframe labels

Suggestions:

Perform the computations in long form and then reshape into your desired wide form:

  1. Merge into long cross table
  2. Compute weighted price
  3. Pivot into wide cross table
  4. Zero out diagonal

Note how this retains your dataframe labels:

Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0

And is significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I’d assume the root cause is my loops in the matrix parts.

Yes, looping is an anti-pattern in pandas, so iterrows should almost always be avoided.

On top of that, after doing all those expensive dataframe loops, you then discard the dataframe labels and just return a numpy array z.


Suggested pandas approach:

Compute in long form and then reshape into your desired wide form:

  1. Merge into a long cross table
  2. Compute the weights and weighted prices
  3. Pivot into a wide cross table
  4. Zero out the diagonal

This not only reduces your 100+ lines of code to ~15 lines:

But is also significantly faster:

>>> %timeit original()
15.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit suggested()
1.82 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

And retains the dataframe labels:

Product_y        Product 1                                 Product 2
Year_y              Year 1  Year 2  Year 3  Year 4  Year 5    Year 1   Year 2   Year 3   Year 4   Year 5
Product_x Year_x                                                                                        
Product 1 Year 1       0.0   748.0  1020.0  1700.0  1496.0    2720.0   1400.8   2040.0   4202.4   2720.0
          Year 2     748.0     0.0   660.0  1100.0   968.0    1760.0    906.4   1320.0   2719.2   1760.0
          Year 3    1020.0   660.0     0.0  1500.0  1320.0    2400.0   1236.0   1800.0   3708.0   2400.0
          Year 4    1700.0  1100.0  1500.0     0.0  2200.0    4000.0   2060.0   3000.0   6180.0   4000.0
          Year 5    1496.0   968.0  1320.0  2200.0     0.0    3520.0   1812.8   2640.0   5438.4   3520.0
Product 2 Year 1    2720.0  1760.0  2400.0  4000.0  3520.0       0.0  20600.0  30000.0  61800.0  40000.0
          Year 2    1400.8   906.4  1236.0  2060.0  1812.8   20600.0      0.0  15450.0  31827.0  20600.0
          Year 3    2040.0  1320.0  1800.0  3000.0  2640.0   30000.0  15450.0      0.0  46350.0  30000.0
          Year 4    4202.4  2719.2  3708.0  6180.0  5438.4   61800.0  31827.0  46350.0      0.0  61800.0
          Year 5    2720.0  1760.0  2400.0  4000.0  3520.0   40000.0  20600.0  30000.0  61800.0      0.0
added 232 characters in body
Source Link
tdy
  • 2.3k
  • 1
  • 10
  • 21
Loading
edited body
Source Link
tdy
  • 2.3k
  • 1
  • 10
  • 21
Loading
Source Link
tdy
  • 2.3k
  • 1
  • 10
  • 21
Loading