
I have a very big CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. I have to build a scorer to identify which categories users like and dislike.

My problem comes when I have to load the CSV with pandas.read_csv. I would like to use the chunksize parameter to split it, but since I have to perform a groupby operation on user_id for my calculation (I don't want my score to be too trivial), I don't know how to split the data smartly: if I only use chunksize, a user's rows can end up scattered across chunks and I won't be able to use groupby properly.

To keep it simple: I want to do a calculation for each user that depends on the timestamp and the category clicked. For instance, give a click 1 point if it happened a month ago, 2 points if it happened two weeks ago, and 4 points if it happened last week.
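
For illustration, here is a minimal sketch of the per-user scoring I have in mind, assuming the timestamps are already parsed to datetimes; the function name and the exact cut-offs are just my example above:

```python
import pandas as pd

def score_user(clicks: pd.DataFrame, now: pd.Timestamp) -> pd.Series:
    """Recency points per category for one user's clicks."""
    age = now - clicks["time_stamp"]
    points = pd.Series(1, index=clicks.index)   # about a month old: 1 point
    points[age <= pd.Timedelta(weeks=2)] = 2    # within two weeks: 2 points
    points[age <= pd.Timedelta(weeks=1)] = 4    # within the last week: 4 points
    return points.groupby(clicks["category_clicked"]).sum()
```

This works fine on a DataFrame that fits in memory; the whole difficulty is applying it per user when the file does not.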

How can I do this? Am I missing something?

  • Possibly related: Pandas GroupBy Mean of Large DataSet in CSV. Commented Aug 19, 2014 at 15:45
  • Sorry, but it is not really related: the solution given there is specific to mean() and won't work in my case. Commented Aug 19, 2014 at 15:49
  • You basically need to do this: stackoverflow.com/questions/15798209/…. In a nutshell: read your data in with read_csv and save it to an HDFStore (table format); then you can get the groupby keys (user_id) and aggregate as needed with a minimum of queries. This is quite scalable; a sketch of this approach follows after these comments. Commented Aug 19, 2014 at 16:04
  • I would really appreciate it if someone could give more details for the case where there are more than a million groups. I think it could be useful to others besides me. Commented Aug 21, 2014 at 7:41
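
For concreteness, here is a hedged sketch of the HDFStore approach described in the comment above, reusing the score_user sketch from the question; the file names, the chunk size, and the assumption that user ids are numeric are mine, not from the thread:

```python
import pandas as pd

# 1) Stream the CSV into an on-disk, queryable table. data_columns makes
#    user_id usable in `where` clauses later.
store = pd.HDFStore("clicks.h5", mode="w")
for chunk in pd.read_csv("weblogs.csv", parse_dates=["time_stamp"],
                         chunksize=1_000_000):
    store.append("clicks", chunk, data_columns=["user_id"], index=False)

# 2) Collect the group keys with a single cheap column read.
user_ids = store.select_column("clicks", "user_id").unique()

# 3) Query one group at a time and aggregate. With a million-plus groups,
#    batching many ids per select ("user_id in [...]") would cut the
#    number of queries considerably.
now = pd.Timestamp.now()
scores = {}
for uid in user_ids:
    # assumes numeric ids; quote the value if user_id is a string column
    clicks = store.select("clicks", where="user_id == %d" % uid)
    scores[uid] = score_user(clicks, now)  # scoring sketch from the question

store.close()
```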
