I am reading HDF5 files with large amounts of data. I want to store the data in a pandas DataFrame (it will contain around 1.3e9 rows). At the moment I am using the following procedure:
import numpy as np
import pandas as pd

# h5assembly is an already-open HDF5 group; build the frame column by column
df = pd.DataFrame()
for key in ['Column1', 'Column2', 'Column3']:
    df[key] = np.array(h5assembly.get(key))
I have timed this, and it takes ~110 seconds.
If I just assign the values to NumPy arrays instead, like this:
v1 = np.array(h5assembly.get('Column1'))
v2 = np.array(h5assembly.get('Column2'))
v3 = np.array(h5assembly.get('Column3'))
It takes ~22 seconds.
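In case it helps to reproduce the gap, a scaled-down, self-contained setup would look something like this (I am assuming h5py here; the filename is a placeholder and N is reduced so it runs quickly):

import h5py
import numpy as np

N = 10_000_000  # scaled down; the real file has ~1.3e9 rows
with h5py.File('test.h5', 'w') as f:  # placeholder filename
    for key in ['Column1', 'Column2', 'Column3']:
        f.create_dataset(key, data=np.random.rand(N))

h5assembly = h5py.File('test.h5', 'r')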
Am I doing something wrong? Is it expected that creating the DataFrame is so much slower than reading the raw arrays? Is there any way to speed this up?
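For example, would building the DataFrame in a single call from a dict of arrays avoid the per-column assignments? Something like the following (I have not verified that it is actually faster):

# read all columns first, then construct the DataFrame once
cols = ['Column1', 'Column2', 'Column3']
data = {key: np.array(h5assembly.get(key)) for key in cols}
df = pd.DataFrame(data)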