
I have a very big CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. I have to build a scorer to identify which categories users like and dislike.

My problem comes when I have to load the CSV with pandas.read_csv. I would like to use the chunksize parameter to split it, but since I have to perform a groupby operation on user_id for my calculation (I don't want my score to be too trivial), I don't know how to split the data smartly: if I only use chunksize, a user's rows can end up scattered across chunks and I won't be able to use groupby properly.

To keep it simple: I want to do a calculation for each user that depends on the timestamp and the category clicked. For instance, give a click 1 point if it happened a month ago, 2 points if it happened two weeks ago, and 4 points if it happened last week.
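
For illustration, here is a minimal sketch of the per-user scoring I have in mind, assuming the timestamps are already parsed to datetimes; the function name and the exact cut-offs are just my example above:

```python
import pandas as pd

def score_user(clicks: pd.DataFrame, now: pd.Timestamp) -> pd.Series:
    """Recency points per category for one user's clicks."""
    age = now - clicks["time_stamp"]
    points = pd.Series(1, index=clicks.index)   # about a month old: 1 point
    points[age <= pd.Timedelta(weeks=2)] = 2    # within two weeks: 2 points
    points[age <= pd.Timedelta(weeks=1)] = 4    # within the last week: 4 points
    return points.groupby(clicks["category_clicked"]).sum()
```

This works fine on a DataFrame that fits in memory; the whole difficulty is applying it per user when the file does not.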

How can I do this? Am I missing something?

  • Possibly related: Pandas GroupBy Mean of Large DataSet in CSV. Commented Aug 19, 2014 at 15:45
  • Sorry, but it is not really related: the solution given there is specific to mean() and won't work in my case. Commented Aug 19, 2014 at 15:49
  • You basically need to do this: stackoverflow.com/questions/15798209/…. In a nutshell: read your data in with read_csv and save it to an HDFStore (table format); then you can get the groupby keys (user_id) and aggregate as needed with a minimum of queries. This is quite scalable; a sketch of this approach follows after these comments. Commented Aug 19, 2014 at 16:04
  • I would really appreciate it if someone could give more details for the case where there are more than a million groups. I think it could be useful to others besides me. Commented Aug 21, 2014 at 7:41
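
For concreteness, here is a hedged sketch of the HDFStore approach described in the comment above, reusing the score_user sketch from the question; the file names, the chunk size, and the assumption that user ids are numeric are mine, not from the thread:

```python
import pandas as pd

# 1) Stream the CSV into an on-disk, queryable table. data_columns makes
#    user_id usable in `where` clauses later.
store = pd.HDFStore("clicks.h5", mode="w")
for chunk in pd.read_csv("weblogs.csv", parse_dates=["time_stamp"],
                         chunksize=1_000_000):
    store.append("clicks", chunk, data_columns=["user_id"], index=False)

# 2) Collect the group keys with a single cheap column read.
user_ids = store.select_column("clicks", "user_id").unique()

# 3) Query one group at a time and aggregate. With a million-plus groups,
#    batching many ids per select ("user_id in [...]") would cut the
#    number of queries considerably.
now = pd.Timestamp.now()
scores = {}
for uid in user_ids:
    # assumes numeric ids; quote the value if user_id is a string column
    clicks = store.select("clicks", where="user_id == %d" % uid)
    scores[uid] = score_user(clicks, now)  # scoring sketch from the question

store.close()
```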
