4

I have a script that takes all the CSV files in a directory and merges them side-by-side, using an outer join. The problem is that my computer chokes (MemoryError) when I try to use it on the files I need to join (about two dozen files, 6-12 GB each). I am aware that itertools can be used to make loops more efficient, but I am unclear as to whether or how it could be applied to this situation. The other alternative I can think of is to install MySQL, learn the basics, and do the join there. Obviously I'd rather do this in Python if possible, because I'm already learning it. An R-based solution would also be acceptable.

Here is my code:

import os
import glob
import pandas as pd
os.chdir("\\path\\containing\\files")

files = glob.glob("*.csv")
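# Read the first file, then outer-join each remaining file onto the running result on the key columns.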
sdf = pd.read_csv(files[0], sep=',')

for filename in files[1:]:
    df = pd.read_csv(filename, sep=',')
    sdf = pd.merge(sdf, df, how='outer', on=['Factor1', 'Factor2'])

Any advice for how to do this with files too big for my computer's memory would be greatly appreciated.

3 Answers

6

Use HDF5; in my opinion it would suit your needs very well. It also handles out-of-core queries, so you won't have to face a MemoryError.

import os
import glob
import pandas as pd
os.chdir("\\path\\containing\\files")

files = glob.glob("*.csv")
hdf_path = 'my_concatenated_file.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    # This compresses the final file at level 5 using blosc. You can drop that or
    # change it as per your needs.
    for filename in files:
        # data_columns makes 'Factor1' and 'Factor2' queryable and indexable later on.
        store.append('table_name', pd.read_csv(filename, sep=','),
                     index=False, data_columns=['Factor1', 'Factor2'])
    # Then create the indexes, if you need them
    store.create_table_index('table_name', columns=['Factor1', 'Factor2'], optlevel=9, kind='full')
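To give a rough idea of what an out-of-core read-back could look like, here is a minimal sketch (the chunk size, output file name and the commented filter value are just placeholders, not something taken from your data):

import pandas as pd  # repeated so this snippet runs on its own

hdf_path = 'my_concatenated_file.h5'  # same store as above

with pd.HDFStore(hdf_path, mode='r') as store:
    first = True
    # Stream the stored table back in manageable chunks; the whole table
    # never has to fit in memory. 500000 rows per chunk is a placeholder.
    for chunk in store.select('table_name', chunksize=500000):
        # A where clause on the data columns is also possible, e.g.
        # store.select('table_name', where="Factor1 == 'some_value'")
        chunk.to_csv('merged_output.csv', mode='a', header=first, index=False)
        first = False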

12 Comments

that threw a series of errors, the first of which was: line 2885, in run_code exec(code_obj, self.user_global_ns, self.user_ns)
OK, thanks. Eventually the result will need to be in CSV, but there is nothing to stop me from saving it as such afterwards, right? And the files I need to put together DO share the same columns; I was getting that error because of an irrelevant file that was in the same directory. So it looks like your solution will work!
Nope, nothing should stop you from going from HDF to CSV. HDF just lets you do the querying out-of-core, so you can join quite easily. For instance, you can read the main table in chunks, extract the values of 'Factor1' and 'Factor2', get only the rows which contain those values from all the other tables, merge them, and write them to a CSV file. You will note that HDF5 is much faster and more manageable than CSV. So, unless you have a compelling need to go back to CSV, I think you are better off staying in HDF5. And HDF5 will soon have an ODBC driver: hdfgroup.org/wp/tag/hdf5-odbc-driver
Since your files do share the same column structure, I suggest you also take a look at these answers: stackoverflow.com/questions/15798209/… and stackoverflow.com/questions/25459982/… They will help you query your final table such that you only end up with rows sharing the same elements in 'Factor1' and 'Factor2', which you can easily reshape to get a side-by-side table for your final csv output. Also, use the previous version of my answer.
Thanks for the additional info. I tried your code with my actual data and it was a lot faster (as in, it finished without error faster than the initial attempt crashed). I'm still figuring out how to work with h5 in R (my destination for the joined data), but there seems to be a lot of help available for that, so I should be good from here. Marked as answered :)
0

There is a chance dask will be well suited to your use case. It might depend on what you want to do after the merge.
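For instance, a minimal sketch of what that could look like (assuming dask is installed; the key columns come from your question, the output pattern is a placeholder, and the chained pairwise merge mirrors your loop rather than anything tuned for data this size):

import glob

import dask.dataframe as dd

files = glob.glob("*.csv")

# Reads are lazy: nothing is loaded into memory until to_csv runs.
merged = dd.read_csv(files[0])
for filename in files[1:]:
    df = dd.read_csv(filename)
    merged = dd.merge(merged, df, how='outer', on=['Factor1', 'Factor2'])

# Writes one CSV per partition (merged-0.csv, merged-1.csv, ...); they can be
# concatenated afterwards if a single file is needed.
merged.to_csv('merged-*.csv', index=False)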

1 Comment

I'm going to save it into a single enormous csv.
0

You should be able to do this with Python, but I don't think reading all the CSVs at once will be the most efficient use of your memory.

How to read a CSV file from a stream and process each line as it is written?

2 Comments

I'm not sure I understand how the stream works, but I think it might be a problem, because I'm not simply concatenating lines together; rather, the presence or absence of the same key in different files changes how the lines will be aligned.
You could do whatever you want as long as it fits in memory: only read in the bits you need, as you need them, and flush what is complete/matched to disk.
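For example, a sketch of that idea (the helper names are made up, and it only shows collecting the keys from one file and flushing matching rows from another, not the full multi-file outer join):

import csv

def collect_keys(path):
    """Collect the (Factor1, Factor2) keys of one file without loading it whole."""
    keys = set()
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            keys.add((row['Factor1'], row['Factor2']))
    return keys

def flush_matching(path, keys, out_path):
    """Append rows of `path` whose key appears in `keys` to `out_path`."""
    with open(path, newline='') as f, open(out_path, 'a', newline='') as out:
        reader = csv.DictReader(f)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        if out.tell() == 0:          # write the header only once
            writer.writeheader()
        for row in reader:
            if (row['Factor1'], row['Factor2']) in keys:
                writer.writerow(row)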
