
I have the following dataframe (called df):

   user_id  product_id  probReorder
0        1         196          1.0
1        1       10258          0.9
2        1       10326          0.1
3        1       12427          1.0
4        1       13032          0.3
...

For each user_id in df, I'd like to retain only the N rows with the largest values in the "probReorder" column. Also, I want N to depend on user_id. In my current approach, I have a dict "lastReordNumber" whose key-value pairs are (user_id, int), and I select the rows as follows:

import pandas as pd

predictions = []
# For each user, keep the N rows with the largest probReorder,
# where N = lastReordNumber[usr]
for usr, data in df.groupby(by="user_id"):
    data = data.nlargest(lastReordNumber[usr], "probReorder")
    predictions.append(data)
df = pd.concat(predictions)

The problem is that this is really slow. The dataframe has around 13M rows and 200k unique user_id's. Is there a faster/better approach?

EDIT: The previous code produces unexpected output when there are duplicate values in the probReorder column for a given user_id. Example:

lastReordNumber = {1: 2, 2: 3}
df = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 2, 2],
                   "probReorder": [0.9, 0.6, 0.9, 0.1, 1, 0.5, 0.4],
                   "product_id": [1, 2, 3, 4, 5, 6, 7]})

I get the output:

   probReorder  product_id  user_id
0          0.9           1        1
1          0.9           3        1
2          0.9           1        1
3          0.9           3        1
4          1.0           5        2
5          0.5           6        2
6          0.4           7        2

which for user_id=2 is what I expect, but for user_id=1 there are duplicate rows. My expected output is:

   probReorder  product_id  user_id
0          0.9           1        1
1          0.9           3        1
2          1.0           5        2
3          0.5           6        2
4          0.4           7        2

This can be obtained with the following simpler piece of code:

predictions = []
# Sort each user's rows by probReorder (descending) and keep the top N
for usr, data in df.groupby(by="user_id"):
    predictions.append(data.sort_values('probReorder', ascending=False).head(lastReordNumber[usr]))
predictions = pd.concat(predictions, ignore_index=True)

in which each group is sorted in full and then truncated. This is also reasonably efficient. I haven't yet understood how to interpret the result of the nlargest() method, though.
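For what it's worth, here is a minimal sketch of how nlargest is supposed to treat ties, assuming a pandas version where the duplication issue is fixed (e.g. >= 0.20, per the discussion below): with the default keep='first', it returns exactly n rows and breaks ties by original row order.

import pandas as pd

# Toy frame with a tie at 0.9; nlargest(2) should keep rows 0 and 2,
# not duplicate them (assumes pandas >= 0.20)
toy = pd.DataFrame({"probReorder": [0.9, 0.6, 0.9], "product_id": [1, 2, 3]})
print(toy.nlargest(2, "probReorder"))
#    probReorder  product_id
# 0          0.9           1
# 2          0.9           3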

3 Comments
  • What happens when you have two or more rows that equal the max? Commented Jul 7, 2017 at 2:57
  • @BobHaffner good question. It looks like nlargest is not behaving as I expected, and is duplicating some rows. Should I post the output of a test case? Commented Jul 7, 2017 at 3:29
  • I would post some additional sample data that contains another user_id. And post your desired output too Commented Jul 7, 2017 at 3:38

1 Answer


You can use sort_values with groupby and head:

df1 = (df.sort_values('probReorder', ascending=False)
         .groupby('user_id', group_keys=False)
         .apply(lambda x: x.head(lastReordNumber[x.name])))
print (df1)
   probReorder  product_id  user_id
0          0.9           1        1
2          0.9           3        1
4          1.0           5        2
5          0.5           6        2
6          0.4           7        2

Another solution with nlargest:

df1 = (df.groupby('user_id', group_keys=False)
         .apply(lambda x: x.nlargest(lastReordNumber[x.name], 'probReorder')))
print (df1)
   probReorder  product_id  user_id
0          0.9           1        1
2          0.9           3        1
4          1.0           5        2
5          0.5           6        2
6          0.4           7        2
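
If apply over 200k groups is still too slow, a loop-free variant may be worth trying. This is a sketch of my own (not code from the answer, using only the df and lastReordNumber defined above): sort once globally, rank rows within each user_id with cumcount, and keep a row only while its rank is below that user's cutoff.

# Loop-free sketch: no per-group Python calls, so it should scale
# better to 13M rows (an assumption, not a benchmark)
df1 = df.sort_values('probReorder', ascending=False)
rank = df1.groupby('user_id').cumcount()        # 0, 1, 2, ... within each user
cutoff = df1['user_id'].map(lastReordNumber)    # per-row N for that user
df1 = df1[rank < cutoff]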

5 Comments

Thanks for the answer. A few comments: drop_duplicates() does not do anything in this case, as there are no duplicate (user_id, product_id) pairs. Your first solution should be equivalent to the one I provided in the edit, but it's more elegant and maybe more efficient. Your second solution does not work correctly on my machine; it produces the same "wrong" output I provided above. It could be a bug in nlargest(); I have to look it up.
As I said, there are no duplicates in the ("user_id", "product_id") columns (correct me if I'm wrong), so your call to drop_duplicates does not do anything. Your two solutions are equivalent to my two solutions, but one of them does not behave as expected on my system. I consider my original question solved, but I still don't understand the issue with nlargest().
If there are no duplicates, simply remove drop_duplicates. As for why nlargest does not work, that is a hard question; I don't know, maybe a bug. For me it works fine in pandas 0.20.2. Are you using the latest version of pandas? Check with print(pd.show_versions())
I have version 0.19.2
Is it possible to upgrade?
