
I am using pandas (in Python) to map a function over a big CSV file (~50 GB), like this:

import pandas as pd

df = pd.read_csv("huge_file.csv")
df["results1"], df["results2"] = df.map(foo)
df.to_csv("output.csv")

Is there a way I can use parallelization on this? Perhaps using multiprocessing's map function?

Thanks, Jose

1 Answer


See the pandas docs on reading a CSV by chunks, the cookbook example, and the docs on appending to an HDF store.

You are much better off reading your CSV in chunks, processing each chunk, and then appending the results to an output CSV (of course, you are even better off converting to HDF). This approach (sketched after the list below):

  • Takes a relatively constant amount of memory
  • Is efficient and can be done in parallel (though this usually requires an HDF file that you can select sections from; a CSV is not good for this)
  • Is less complicated than trying to do multiprocessing directly
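
A minimal sketch of that loop, assuming foo is the function from the question and returns a pair of values per row; the chunk size and output file names are only illustrative:

import pandas as pd

chunksize = 100_000  # rows per chunk; tune to your memory budget
first = True
for chunk in pd.read_csv("huge_file.csv", chunksize=chunksize):
    # foo (from the question) is assumed to return a pair of values per row
    chunk["results1"], chunk["results2"] = zip(*chunk.apply(foo, axis=1))
    # append each processed chunk to the output, writing the header only once
    chunk.to_csv("output.csv", mode="w" if first else "a", header=first, index=False)
    first = False
    # alternative: append to an HDF5 store instead (requires PyTables), so
    # sections can later be selected and processed in parallel:
    # chunk.to_hdf("output.h5", key="results", format="table", append=True)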

1 Comment

Note that (much like sharding in a Mongo database) chunk-level parallelism doesn't work well if you need overlapping data (like a rolling time series regression) in the operations to be mapped. In those cases, it's much faster to form the pandas groups first and manually dispatch them to different resources for computing, like each scattered to an engine in IPython.parallel.
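
As a rough sketch of that group-then-dispatch idea, here is the same pattern using multiprocessing.Pool in place of IPython.parallel engines; the "ticker" and "value" columns and the rolling computation are purely hypothetical:

import pandas as pd
from multiprocessing import Pool

def process_group(group):
    # placeholder for a computation that needs the whole group in memory,
    # e.g. a rolling time-series calculation over one ticker
    return group.assign(rolling_mean=group["value"].rolling(30).mean())

if __name__ == "__main__":
    df = pd.read_csv("huge_file.csv")              # or select from an HDF store
    groups = [g for _, g in df.groupby("ticker")]  # one self-contained group each
    with Pool() as pool:
        results = pool.map(process_group, groups)  # each group goes to a worker
    pd.concat(results).to_csv("output.csv", index=False)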
