4

I have a script that takes all the CSV files in a directory and merges them side-by-side, using an outer join. The problem is that my computer chokes (MemoryError) when I try to use it on the files I need to join (about two dozen files, 6-12 GB each). I am aware that itertools can be used to make loops more efficient, but I am unclear as to whether or how it could be applied to this situation. The other alternative I can think of is to install MySQL, learn the basics, and do the join there. Obviously I'd rather do this in Python if possible, because I'm already learning it. An R-based solution would also be acceptable.

Here is my code:

import os
import glob
import pandas as pd
os.chdir("\\path\\containing\\files")

files = glob.glob("*.csv")
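# Read the first file, then outer-join each remaining file onto the running result on the key columns.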
sdf = pd.read_csv(files[0], sep=',')

for filename in files[1:]:
    df = pd.read_csv(filename, sep=',')
    sdf = pd.merge(sdf, df, how='outer', on=['Factor1', 'Factor2'])

Any advice for how to do this with files too big for my computer's memory would be greatly appreciated.

3 Answers

6

Use HDF5; in my opinion it would suit your needs very well. It also handles out-of-core queries, so you won't have to face a MemoryError.

import os
import glob
import pandas as pd
os.chdir("\\path\\containing\\files")

files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    # This compresses the final file at level 5 using blosc. You can drop that or
    # change it as per your needs.
    for filename in files:
        # data_columns makes 'Factor1' and 'Factor2' queryable and indexable later on.
        store.append('table_name', pd.read_csv(filename, sep=','),
                     index=False, data_columns=['Factor1', 'Factor2'])
    # Then create the indexes, if you need them
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'], optlevel=9, kind='full')
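To give a rough idea of what an out-of-core read-back could look like, here is a minimal sketch (the chunk size, output file name and the commented filter value are just placeholders, not something taken from your data):

import pandas as pd  # repeated so this snippet runs on its own

hdf_path = 'my_concatenated_file.h5'  # same store as above

with pd.HDFStore(hdf_path, mode='r') as store:
    first = True
    # Stream the stored table back in manageable chunks; the whole table
    # never has to fit in memory. 500000 rows per chunk is a placeholder.
    for chunk in store.select('table_name', chunksize=500000):
        # A where clause on the data columns is also possible, e.g.
        # store.select('table_name', where="Factor1 == 'some_value'")
        chunk.to_csv('merged_output.csv', mode='a', header=first, index=False)
        first = False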

12 Comments

that threw a series of errors, the first of which was: line 2885, in run_code exec(code_obj, self.user_global_ns, self.user_ns)
OK, thanks. Eventually the result will need to be in CSV, but there is nothing to stop me from saving it as such afterwards, right? And the files I need to put together DO share the same columns; I was getting that error because of an irrelevant file that was in the same directory. So it looks like your solution will work!
Nope, nothing should stop you from going from HDF to CSV. HDF just lets you do the querying out-of-core, so you can join quite easily. For instance, you can read the main table in chunks, extract the values of 'Factor1' and 'Factor2', get only the rows which contain those values from all the other tables, merge them, and write them to a CSV file. You will note that HDF5 is much faster and more manageable than CSV. So, unless you have a compelling need to go back to CSV, I think you are better off staying in HDF5. And HDF5 will soon have an ODBC driver: hdfgroup.org/wp/tag/hdf5-odbc-driver
Since your files do share the same column structure, I suggest you also take a look at these answers: stackoverflow.com/questions/15798209/… and stackoverflow.com/questions/25459982/… They will help you query your final table such that you only end up with rows sharing the same elements in 'Factor1' and 'Factor2', which you can easily reshape to get a side-by-side table for your final csv output. Also, use the previous version of my answer.
Thanks for the additional info. I tried your code with my actual data and it was a lot faster (as in, it finished without error faster than the initial attempt crashed). I'm still figuring out how to work with h5 in R (my destination for the joined data), but there seems to be a lot of help available for that, so I should be good from here. Marked as answered :)
0

There is a chance dask will be well suited to your use case. It might depend on what you want to do after the merge.
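For instance, a minimal sketch of what that could look like (assuming dask is installed; the key columns come from your question, the output pattern is a placeholder, and the chained pairwise merge mirrors your loop rather than anything tuned for data this size):

import glob

import dask.dataframe as dd

files = glob.glob("*.csv")

# Reads are lazy: nothing is loaded into memory until to_csv runs.
merged = dd.read_csv(files[0])
for filename in files[1:]:
    df = dd.read_csv(filename)
    merged = dd.merge(merged, df, how='outer', on=['Factor1', 'Factor2'])

# Writes one CSV per partition (merged-0.csv, merged-1.csv, ...); they can be
# concatenated afterwards if a single file is needed.
merged.to_csv('merged-*.csv', index=False)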

1 Comment

I'm going to save it into a single enormous csv.
0

You should be able to do this with Python, but I don't think reading all the CSVs at once will be the most efficient use of your memory.

How to read a CSV file from a stream and process each line as it is written?

2 Comments

I'm not sure I understand how the stream works, but I think it might be a problem, because I'm not simply concatenating lines together; rather, the presence or absence of the same key in different files changes how the lines will be aligned.
You could do whatever you want as long as it fits in memory: only read in the bits you need, as you need them, and flush what is complete/matched to disk.
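For example, a sketch of that idea (the helper names are made up, and it only shows collecting the keys from one file and flushing matching rows from another, not the full multi-file outer join):

import csv

def collect_keys(path):
    """Collect the (Factor1, Factor2) keys of one file without loading it whole."""
    keys = set()
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            keys.add((row['Factor1'], row['Factor2']))
    return keys

def flush_matching(path, keys, out_path):
    """Append rows of `path` whose key appears in `keys` to `out_path`."""
    with open(path, newline='') as f, open(out_path, 'a', newline='') as out:
        reader = csv.DictReader(f)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        if out.tell() == 0:          # write the header only once
            writer.writeheader()
        for row in reader:
            if (row['Factor1'], row['Factor2']) in keys:
                writer.writerow(row)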
