I run out of RAM when I generate a 4,500 x 1,000,000 DataFrame of correlated simulations in one go. In the code below, I instead break the work into ten parts (ten batches of 100,000 simulations each across the 4,500 time series, tied together by the rank correlation matrix corr_matrix), which keeps me just under the RAM ceiling:
```python
import pandas as pd
import os
from multiprocessing import Pool
from scipy.stats.distributions import t
from time import time
from statsmodels.sandbox.distributions.multivariate import multivariate_t_rvs as mv_t

filename_prefix = 'generation\\copulas'


def sim(iterable) -> pd.DataFrame:
    corr_file, year, part_num, n_sims, df = iterable
    corr = pd.read_pickle(corr_file)
    # Draw multivariate-t samples and push each margin through the t CDF to get the copula.
    copula = pd.DataFrame(t.cdf(mv_t(m=([0] * corr.shape[0]), S=corr, df=df, n=n_sims), df=df))
    copula.columns = corr.columns
    copula.columns.names = corr.columns.names
    copula.to_pickle('%s\\year_%s\\part_%s.pkl' % (filename_prefix, (year + 1), part_num))
    return copula


def foo(corr_file: str, n_years: int, n_sims: int, n_parts: int = 10, df: int = 3):
    start = time()
    for year in range(n_years):
        part_size: int = int(n_sims / 10)
        temp_dir: str = '%s\\year_%s' % (filename_prefix, year + 1)
        temp_file: str = '%s\\year' % temp_dir
        os.makedirs('%s\\year_%s' % (filename_prefix, year + 1))
        # Generate the parts in parallel, then stitch them back into one DataFrame per year.
        with Pool(3) as p:
            collection = p.map(func=sim, iterable=[(corr_file, year, x, part_size, df) for x in range(n_parts)])
        temp = pd.concat(collection)
        temp.to_pickle('%s\\year_%s.pkl' % (filename_prefix, year + 1))
    print('\tRun time = %s' % (time() - start))
```
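For context, here is a back-of-the-envelope estimate of the sizes involved, assuming the values end up stored as float64 (8 bytes each):

```python
# Rough in-memory footprint, assuming float64 values (8 bytes per cell).
n_series = 4_500
n_sims = 1_000_000
n_parts = 10

full_gb = n_series * n_sims * 8 / 1e9   # ~36 GB for the single 4,500 x 1,000,000 DataFrame
part_gb = full_gb / n_parts             # ~3.6 GB for each 4,500 x 100,000 part

print('full: %.1f GB, per part: %.1f GB' % (full_gb, part_gb))
```

So the single DataFrame alone is on the order of 36 GB before counting any intermediate copies, while each part is only about 3.6 GB.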
My questions are:
- Why do I run out of memory when I create a single 4,500 x 1,000,000 DataFrame but not when I create ten 4,500 x 100,000 DataFrames?
- Is there anything I can do to reduce my memory usage?
- Are there any egregious mistakes or poor practices in the above code?
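For completeness, this is roughly how I invoke the function; the pickle path and year count below are placeholders rather than my real inputs:

```python
# Illustrative call site only; 'corr_matrix.pkl' and n_years=1 are placeholders.
# The __main__ guard is there because multiprocessing on Windows spawns fresh
# interpreters that re-import this module.
if __name__ == '__main__':
    foo(corr_file='corr_matrix.pkl', n_years=1, n_sims=1_000_000)
```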
Thank you for your kind assistance and time!