
I have the following dataframe (called df):

   user_id  product_id  probReorder
0        1         196          1.0
1        1       10258          0.9
2        1       10326          0.1
3        1       12427          1.0
4        1       13032          0.3
...

For each user_id in df, I'd like to retain only the N rows with the largest values in the "probReorder" column. Also, I want N to depend on user_id. In my current approach, I have a dict "lastReordNumber" whose key-value pairs are (user_id, int), and I select the rows as follows:

import pandas as pd

predictions = []
# For each user, keep the N rows with the largest probReorder,
# where N = lastReordNumber[usr]
for usr, data in df.groupby(by="user_id"):
    data = data.nlargest(lastReordNumber[usr], "probReorder")
    predictions.append(data)
df = pd.concat(predictions)

The problem is that this is really slow. The dataframe has around 13M rows and 200k unique user_id's. Is there a faster/better approach?

EDIT: The previous code produces unexpected output when there are duplicate values in the probReorder column for a given user_id. Example:

lastReordNumber = {1: 2, 2: 3}
df = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 2, 2],
                   "probReorder": [0.9, 0.6, 0.9, 0.1, 1, 0.5, 0.4],
                   "product_id": [1, 2, 3, 4, 5, 6, 7]})

I get the output:

   probReorder  product_id  user_id
0          0.9           1        1
1          0.9           3        1
2          0.9           1        1
3          0.9           3        1
4          1.0           5        2
5          0.5           6        2
6          0.4           7        2

which for user_id=2 is what I expect, but for user_id=1 there are duplicate rows. My expected output is:

   probReorder  product_id  user_id
0          0.9           1        1
1          0.9           3        1
2          1.0           5        2
3          0.5           6        2
4          0.4           7        2

This can be obtained with the following simpler piece of code:

predictions = []
# Sort each user's rows by probReorder (descending) and keep the top N
for usr, data in df.groupby(by="user_id"):
    predictions.append(data.sort_values('probReorder', ascending=False).head(lastReordNumber[usr]))
predictions = pd.concat(predictions, ignore_index=True)

in which each group is sorted in full and then truncated. This is also reasonably efficient. I haven't yet understood how to interpret the result of the nlargest() method, though.
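For what it's worth, here is a minimal sketch of how nlargest is supposed to treat ties, assuming a pandas version where the duplication issue is fixed (e.g. >= 0.20, per the discussion below): with the default keep='first', it returns exactly n rows and breaks ties by original row order.

import pandas as pd

# Toy frame with a tie at 0.9; nlargest(2) should keep rows 0 and 2,
# not duplicate them (assumes pandas >= 0.20)
toy = pd.DataFrame({"probReorder": [0.9, 0.6, 0.9], "product_id": [1, 2, 3]})
print(toy.nlargest(2, "probReorder"))
#    probReorder  product_id
# 0          0.9           1
# 2          0.9           3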

3 Comments
  • What happens when you have two or more rows that equal the max? Commented Jul 7, 2017 at 2:57
  • @BobHaffner good question. It looks like nlargest is not behaving as I expected, and is duplicating some rows. Should I post the output of a test case? Commented Jul 7, 2017 at 3:29
  • I would post some additional sample data that contains another user_id. And post your desired output too Commented Jul 7, 2017 at 3:38

1 Answer


You can use sort_values with groupby and head:

df1 = (df.sort_values('probReorder', ascending=False)
         .groupby('user_id', group_keys=False)
         .apply(lambda x: x.head(lastReordNumber[x.name])))
print (df1)
   probReorder  product_id  user_id
0          0.9           1        1
2          0.9           3        1
4          1.0           5        2
5          0.5           6        2
6          0.4           7        2

Another solution with nlargest:

df1 = (df.groupby('user_id', group_keys=False)
         .apply(lambda x: x.nlargest(lastReordNumber[x.name], 'probReorder')))
print (df1)
   probReorder  product_id  user_id
0          0.9           1        1
2          0.9           3        1
4          1.0           5        2
5          0.5           6        2
6          0.4           7        2
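
If apply over 200k groups is still too slow, a loop-free variant may be worth trying. This is a sketch of my own (not code from the answer, using only the df and lastReordNumber defined above): sort once globally, rank rows within each user_id with cumcount, and keep a row only while its rank is below that user's cutoff.

# Loop-free sketch: no per-group Python calls, so it should scale
# better to 13M rows (an assumption, not a benchmark)
df1 = df.sort_values('probReorder', ascending=False)
rank = df1.groupby('user_id').cumcount()        # 0, 1, 2, ... within each user
cutoff = df1['user_id'].map(lastReordNumber)    # per-row N for that user
df1 = df1[rank < cutoff]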

5 Comments

Thanks for the answer. A few comments: drop_duplicates() does not do anything in this case, as there are no duplicate (user_id, product_id) pairs. Your first solution should be equivalent to the one I provided in the edit, but it's more elegant and maybe more efficient. Your second solution does not work correctly on my machine; it produces the same "wrong" output I provided above. It could be a bug in nlargest(); I have to look it up.
As I said, there are no duplicates in the ("user_id", "product_id") columns (correct me if I'm wrong), so your call to drop_duplicates does not do anything. Your two solutions are equivalent to my two solutions, but one of them does not behave as expected on my system. I consider my original question solved, but I still don't understand the issue with nlargest().
If there are no duplicates, simply remove drop_duplicates. As for why nlargest does not work, that is a hard question; I don't know, maybe a bug. For me it works fine in pandas 0.20.2. Are you using the latest version of pandas? Check with print(pd.show_versions())
I have version 0.19.2
Is it possible to upgrade?
