I run out of RAM when I generate a 4,500 x 1,000,000 DataFrame of correlated simulations in one go. In the code below, I instead break the work into ten parts (ten batches of 100,000 simulations each across the 4,500 time series, tied together by the rank correlation matrix corr_matrix), which keeps me just under the RAM ceiling:
```python
import pandas as pd
import os
from multiprocessing import Pool
from scipy.stats.distributions import t
from time import time
from statsmodels.sandbox.distributions.multivariate import multivariate_t_rvs as mv_t

filename_prefix = 'generation\\copulas'


def sim(iterable) -> pd.DataFrame:
    corr_file, year, part_num, n_sims, df = iterable
    corr = pd.read_pickle(corr_file)
    # Draw multivariate-t samples and push each margin through the t CDF to get the copula.
    copula = pd.DataFrame(t.cdf(mv_t(m=([0] * corr.shape[0]), S=corr, df=df, n=n_sims), df=df))
    copula.columns = corr.columns
    copula.columns.names = corr.columns.names
    copula.to_pickle('%s\\year_%s\\part_%s.pkl' % (filename_prefix, (year + 1), part_num))
    return copula


def foo(corr_file: str, n_years: int, n_sims: int, n_parts: int = 10, df: int = 3):
    start = time()
    for year in range(n_years):
        part_size: int = int(n_sims / 10)
        temp_dir: str = '%s\\year_%s' % (filename_prefix, year + 1)
        temp_file: str = '%s\\year' % temp_dir
        os.makedirs('%s\\year_%s' % (filename_prefix, year + 1))
        # Generate the parts in parallel, then stitch them back into one DataFrame per year.
        with Pool(3) as p:
            collection = p.map(func=sim, iterable=[(corr_file, year, x, part_size, df) for x in range(n_parts)])
        temp = pd.concat(collection)
        temp.to_pickle('%s\\year_%s.pkl' % (filename_prefix, year + 1))
    print('\tRun time = %s' % (time() - start))
```
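For context, here is a back-of-the-envelope estimate of the sizes involved, assuming the values end up stored as float64 (8 bytes each):

```python
# Rough in-memory footprint, assuming float64 values (8 bytes per cell).
n_series = 4_500
n_sims = 1_000_000
n_parts = 10

full_gb = n_series * n_sims * 8 / 1e9   # ~36 GB for the single 4,500 x 1,000,000 DataFrame
part_gb = full_gb / n_parts             # ~3.6 GB for each 4,500 x 100,000 part

print('full: %.1f GB, per part: %.1f GB' % (full_gb, part_gb))
```

So the single DataFrame alone is on the order of 36 GB before counting any intermediate copies, while each part is only about 3.6 GB.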
My questions are:
- Why do I run out of memory when I create a single 4,500 x 1,000,000 DataFrame but not when I create ten 4,500 x 100,000 DataFrames?
- Is there anything I can do to reduce my memory usage?
- Are there any egregious mistakes or poor practices in the above code?
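For completeness, this is roughly how I invoke the function; the pickle path and year count below are placeholders rather than my real inputs:

```python
# Illustrative call site only; 'corr_matrix.pkl' and n_years=1 are placeholders.
# The __main__ guard is there because multiprocessing on Windows spawns fresh
# interpreters that re-import this module.
if __name__ == '__main__':
    foo(corr_file='corr_matrix.pkl', n_years=1, n_sims=1_000_000)
```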
Thank you for your kind assistance and time!