
I am using pandas (in Python) to map a function over a big CSV file (~50 GB), like this:

import pandas as pd

df = pd.read_csv("huge_file.csv")
df["results1"], df["results2"] = df.map(foo)
df.to_csv("output.csv")

Is there a way I can use parallelization on this? Perhaps using multiprocessing's map function?

Thanks, Jose

1 Answer


See the pandas docs on reading a CSV by chunks, the cookbook example, and the docs on appending to an HDF store.

You are much better off reading your CSV in chunks, processing each chunk, and then appending the results to an output CSV (of course, you are even better off converting to HDF). This approach (sketched after the list below):

  • Takes a relatively constant amount of memory
  • Is efficient and can be done in parallel (though this usually requires an HDF file that you can select sections from; a CSV is not good for this)
  • Is less complicated than trying to do multiprocessing directly
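
A minimal sketch of that loop, assuming foo is the function from the question and returns a pair of values per row; the chunk size and output file names are only illustrative:

import pandas as pd

chunksize = 100_000  # rows per chunk; tune to your memory budget
first = True
for chunk in pd.read_csv("huge_file.csv", chunksize=chunksize):
    # foo (from the question) is assumed to return a pair of values per row
    chunk["results1"], chunk["results2"] = zip(*chunk.apply(foo, axis=1))
    # append each processed chunk to the output, writing the header only once
    chunk.to_csv("output.csv", mode="w" if first else "a", header=first, index=False)
    first = False
    # alternative: append to an HDF5 store instead (requires PyTables), so
    # sections can later be selected and processed in parallel:
    # chunk.to_hdf("output.h5", key="results", format="table", append=True)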

1 Comment

Note that (much like sharding in a Mongo database) chunk-level parallelism doesn't work well if you need overlapping data (like a rolling time series regression) in the operations to be mapped. In those cases, it's much faster to form the pandas groups first and manually dispatch them to different resources for computing, like each scattered to an engine in IPython.parallel.
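
As a rough sketch of that group-then-dispatch idea, here is the same pattern using multiprocessing.Pool in place of IPython.parallel engines; the "ticker" and "value" columns and the rolling computation are purely hypothetical:

import pandas as pd
from multiprocessing import Pool

def process_group(group):
    # placeholder for a computation that needs the whole group in memory,
    # e.g. a rolling time-series calculation over one ticker
    return group.assign(rolling_mean=group["value"].rolling(30).mean())

if __name__ == "__main__":
    df = pd.read_csv("huge_file.csv")              # or select from an HDF store
    groups = [g for _, g in df.groupby("ticker")]  # one self-contained group each
    with Pool() as pool:
        results = pool.map(process_group, groups)  # each group goes to a worker
    pd.concat(results).to_csv("output.csv", index=False)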
