I have the following dataframe (called df):
   user_id  product_id  probReorder
0        1         196          1.0
1        1       10258          0.9
2        1       10326          0.1
3        1       12427          1.0
4        1       13032          0.3
...
For each user_id in df, I'd like to keep only the N rows with the largest values in the "probReorder" column, where N depends on the user_id. In my current approach, I have a dict "lastReordNumber" whose key-value pairs are (user_id, int), and I select the rows as follows:
predictions = []
for usr, data in df.groupby(by="user_id"):
    data = data.nlargest(lastReordNumber[usr], "probReorder")
    predictions.append(data)
df = pd.concat(predictions)
The problem is that this is really slow: the dataframe has around 13M rows and about 200k unique user_ids. Is there a faster/better approach?
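For reference, one loop-free idea I've been toying with (just a sketch, not yet verified on the full data) is to sort once, number the rows within each user with cumcount, and compare that against the per-user limit mapped from lastReordNumber:

import pandas as pd

# Sketch of a vectorized variant (untested at scale): sort once, number rows
# within each user in descending probReorder order, and keep only the rows
# whose position is below that user's allowance from lastReordNumber.
df_sorted = df.sort_values(["user_id", "probReorder"], ascending=[True, False])
rank = df_sorted.groupby("user_id").cumcount()      # 0, 1, 2, ... within each user
limit = df_sorted["user_id"].map(lastReordNumber)   # that user's N, row by row
result = df_sorted[rank < limit].reset_index(drop=True)

I don't know whether this is actually faster than the groupby loop, though, or how it behaves compared to nlargest when there are ties.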
EDIT: The previous code produces unexpected output when there are duplicate values in the probReorder column for a given user_id. Example:
lastReordNumber = {1: 2, 2: 3}
df = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 2, 2],
                   "probReorder": [0.9, 0.6, 0.9, 0.1, 1, 0.5, 0.4],
                   "product_id": [1, 2, 3, 4, 5, 6, 7]})
I get the output:
   probReorder  product_id  user_id
0          0.9           1        1
1          0.9           3        1
2          0.9           1        1
3          0.9           3        1
4          1.0           5        2
5          0.5           6        2
6          0.4           7        2
which for user_id=2 is what I expect, but for user_id=1 there are duplicate rows. My expected output is:
   probReorder  product_id  user_id
0          0.9           1        1
1          0.9           3        1
2          1.0           5        2
3          0.5           6        2
4          0.4           7        2
This can be obtained by using the simpler piece of code
predictions = []
for usr, data in df.groupby(by="user_id"):
    predictions.append(data.sort_values('probReorder', ascending=False).head(lastReordNumber[usr]))
predictions = pd.concat(predictions, ignore_index=True)
in which each group is sorted completely and then truncated. This is also reasonably efficient, but I still haven't understood how to interpret the result of the nlargest() method.
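For instance, on the small example above I would naively expect nlargest to return at most N rows per group (a minimal check, assuming the default keep='first' behaviour):

# Minimal check on a single group: with n=2 and a tie on 0.9, I'd expect only
# the first two rows in sort order (products 1 and 3), not four rows.
group = df[df["user_id"] == 1]
print(group.nlargest(2, "probReorder"))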