
I have converted a large (6GB) dataframe to an HDF5 file. It contains roughly 30 million transactions from 100k user IDs over 10 years. I want to recreate the state of each user ID at 48 different points in time - for example, "for each user ID, how many transactions did they have before 1 Jan 2012?". That is one summary field.
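For concreteness, here is a toy version of that one summary field (the column names user_id and date stand in for the real ones):

```python
import pandas as pd

# Toy frame standing in for the real data; the column names
# 'user_id' and 'date' are placeholders.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2011-05-01", "2012-03-01", "2010-07-01", "2011-12-31", "2013-01-15"]
    ),
})

cutoff = pd.Timestamp("2012-01-01")

# "How many transactions did each user have before 1 Jan 2012?"
counts = df.loc[df["date"] < cutoff].groupby("user_id").size()
print(counts)  # user 1 -> 1, user 2 -> 2
```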

My current method is to query the HDF5 file for each user_id using pandas' HDF5 query support, then cycle through each date and output the summary at that point in time. The problem is that this takes too long per user ID - sometimes 10-20 seconds each - and the HDF5 query appears to account for most of that time.
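Schematically, the current approach looks something like this (file name, store key, ID list, and snapshot dates are all placeholders):

```python
import pandas as pd

store = pd.HDFStore("transactions.h5", mode="r")
dates = pd.date_range("2006-01-01", periods=48, freq="QS")  # the 48 snapshots
user_ids = [1, 2, 3]  # really ~100k IDs

results = {}
for uid in user_ids:
    # One on-disk query per user -- the step that takes 10-20 seconds.
    # This requires the table to be stored with format='table' and
    # user_id declared as a data column.
    txns = store.select("transactions", where=f"user_id == {uid}")
    # The 48 per-date summaries are cheap once the rows are in memory.
    results[uid] = [(txns["date"] < d).sum() for d in dates]
store.close()
```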

Is there a way to create an index on the HDF5 file to speed up the query, or would sorting it by user_id help? All of the data manipulation is in pandas, but I am unable to distribute the work across multiple machines due to environment constraints.
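For example, would declaring user_id as a data column and building an on-disk index along these lines be the right direction? (File, key, and column names are placeholders; a toy frame stands in for the real 6GB one, which would be appended in chunks instead.)

```python
import pandas as pd

# Toy stand-in for the real frame.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "date": pd.to_datetime(["2011-05-01", "2012-03-01", "2010-07-01"]),
})

# format='table' plus data_columns makes user_id queryable on disk.
df.to_hdf("transactions_indexed.h5", key="transactions",
          format="table", data_columns=["user_id", "date"])

# Upgrade to a full, optimized PyTables index on user_id (a 'full'
# index is also what ptrepack's --sortby requires).
with pd.HDFStore("transactions_indexed.h5") as store:
    store.create_table_index("transactions", columns=["user_id"],
                             optlevel=9, kind="full")
```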

  • Take a look at indexing in the HDF5 docs, and the cookbook. In particular, refer to this answer and this one. Commented Sep 22, 2016 at 5:06
  • What is your chunking scheme in the HDF5 file? Try to order it so that you read the minimum number of chunks each time (a sketch of re-sorting the file follows these comments). Commented Sep 22, 2016 at 15:33
  • I don't know what the chunking scheme is. Initially I had the data in month-by-month form, then I wrote a loop to append it all to one file. Commented Sep 23, 2016 at 9:38
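
Following up on the ordering suggestion in the comments: PyTables ships a ptrepack utility whose --sortby option can rewrite a file ordered by a fully indexed column, so each user's rows land in a few contiguous chunks. A sketch of what that might look like, assuming the full index on user_id from above exists (file names are placeholders, and the exact node path may need adjusting for the pandas on-disk layout):

```python
import subprocess

# ptrepack comes with PyTables; --sortby requires a 'full' index on
# the chosen column. --propindexes carries the indexes over to the
# new file.
subprocess.run(
    ["ptrepack", "--chunkshape=auto", "--propindexes",
     "--sortby=user_id",
     "transactions_indexed.h5", "transactions_sorted.h5"],
    check=True,
)
```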
