I have a very large pandas DataFrame and want to sample rows from it for modeling, but I keep running into out-of-memory errors like this one:
MemoryError: Unable to allocate 6.59 GiB for an array with shape (40, 22117797) and data type float64
This error is weird, since I shouldn't need to allocate such a large amount of memory: my sampled DataFrame is only 1% of the original data. My code is below.
Specifically, the original data has about 20 million rows and most columns are np.float64. After loading the data from a parquet file using pyarrow, the Jupyter kernel uses about 3 GB of memory. After the variable assignments using "d0['r_%s'%(t)] = d0.col0", the kernel uses 6 GB. However, once I run the sampling command "d0s = d0.iloc[id1,:]", memory usage climbs to 13 GB and the program stops with the out-of-memory error shown above.
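For scale, here is the back-of-the-envelope arithmetic (just row and column counts times 8 bytes per float64 value; the 42-column figure matches the example below):

rows = 22117797                      # rows in the original data
failed_alloc = 40 * rows * 8         # ~7.08e9 bytes, i.e. the 6.59 GiB in the error
sample_rows = round(rows * 0.01)     # 221,178 rows in the 1% sample
sample_bytes = sample_rows * 42 * 8  # ~74 MB for all 42 columns of the sample

So the failing allocation is the size of 40 full-length float64 columns, i.e. almost the whole frame, while the sample itself should only take around 74 MB.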
The code below is a minimal working example that reproduces the error on a machine with 16 GB of memory, using pandas 1.2.3.
import pandas as pd
import numpy as np

# start with 22,117,797 rows and 12 float64 columns
d0 = pd.DataFrame(np.random.rand(22117797, 12))

# add 30 more columns, one at a time
for t in range(30):
    d0['r_%s' % t] = d0[0]

# sample 1% of the rows
id1 = np.random.randint(low=0, high=d0.shape[0], size=round(d0.shape[0] * 0.01))
d0s = d0.iloc[id1, :]  # memory spikes here and the MemoryError is raised
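To see where the memory jumps while running this example, a small helper like the one below can be printed after each step (psutil is just one convenient option; watching the kernel in an OS-level memory monitor shows the same thing):

import os
import psutil  # third-party package, used here only to read the current process's memory

def rss_gb():
    # resident memory of this process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3

# Call print(rss_gb()) after the DataFrame construction, after the column loop,
# and right before d0.iloc[id1, :] to see where the usage jumps.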
Note that the following code does not produce the error if I build the big DataFrame directly:
import pandas as pd
import numpy as np

# same shape as before, but all 42 columns are created in a single call
d0 = pd.DataFrame(np.random.rand(22117797, 42))
id1 = np.random.randint(low=0, high=d0.shape[0], size=round(d0.shape[0] * 0.01))
d0s = d0.iloc[id1, :]  # no MemoryError here
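In case it helps with diagnosing the difference, the internal layout of the two frames can be compared through the private "_mgr" attribute (pandas-internal and undocumented, so only a debugging aid); my expectation is that the column-by-column version is split across many float64 blocks, while this directly constructed version sits in a single block:

# Run after each of the two constructions above and compare the counts.
# _mgr is the internal BlockManager and may change between pandas versions.
print(d0._mgr.nblocks)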