
I am reading HDF5 files with large amounts of data. I want to store the data in a DataFrame (it will contain around 1.3e9 rows). For the moment I am using the following procedure:

import numpy as np
import pandas as pd

# h5assembly is the already-opened h5py group holding the datasets
df = pd.DataFrame()
for key in ['Column1', 'Column2', 'Column3']:
    df[key] = np.array(h5assembly.get(key))

I have timed it and it takes ~110 seconds.

If I just assign the values to numpy arrays, like this:

v1 = np.array(h5assembly.get('Column1'))
v2 = np.array(h5assembly.get('Column2'))
v3 = np.array(h5assembly.get('Column3'))

It takes ~22 seconds.

Am I doing something wrong? Is it expected that the creation of the dataframe is so much slower? Is there any way to accelerate this process?

2 Answers


Yes, it is expected that building a DataFrame will take longer than working with NumPy arrays directly. This is due to various reasons, and I won't list them all: it is partly the way NumPy allocates and frees memory, and partly the fact that NumPy operations are implemented in C, a compiled language, which gives them a performance edge over the extra bookkeeping a DataFrame does on top of the underlying arrays.
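Part of the measured gap likely also comes from inserting columns one at a time into an empty DataFrame, which copies each array into pandas' internal storage. Building the frame in a single constructor call may reduce that overhead; here is a minimal sketch, assuming an h5py file ('data.h5' and the group layout are placeholders, substitute your own):

import h5py
import numpy as np
import pandas as pd

# 'data.h5' is a placeholder path; adjust to your file.
with h5py.File('data.h5', 'r') as f:
    h5assembly = f  # or f['some_group'] if the datasets live inside a group
    # One constructor call instead of three separate column insertions:
    df = pd.DataFrame(
        {key: np.asarray(h5assembly[key])
         for key in ['Column1', 'Column2', 'Column3']}
    )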

An interesting comparison between pandas and NumPy performance can be seen here: https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/

A package that aims to speed up Pandas using parallelization is Modin: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html
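As an illustration of Modin's drop-in claim (assuming Modin is installed with one of its engines, e.g. pip install "modin[ray]"), the original loop only needs a different import; 'data.h5' is again a placeholder:

import h5py
import numpy as np
import modin.pandas as pd  # drop-in replacement for the pandas import

# 'data.h5' is a placeholder path; the loop is unchanged from the question.
with h5py.File('data.h5', 'r') as f:
    df = pd.DataFrame()
    for key in ['Column1', 'Column2', 'Column3']:
        df[key] = np.array(f[key])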

There is also a package called 'PyPolars', which aims to work in a very similar way to Pandas, with greater performance due to its implementation in Rust: https://www.analyticsvidhya.com/blog/2021/02/is-pypolars-the-new-alternative-to-pandas/
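For comparison, a minimal sketch of the same load with Polars (the library that article covers under its earlier name PyPolars); the file name is a placeholder, and the DataFrame constructor here is given a dict of NumPy arrays:

import h5py
import numpy as np
import polars as pl

# 'data.h5' and the column names are placeholders for illustration.
with h5py.File('data.h5', 'r') as f:
    df = pl.DataFrame(
        {key: np.asarray(f[key]) for key in ['Column1', 'Column2', 'Column3']}
    )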




You can use pandas.read_hdf to read HDF5 files directly into a DataFrame.

df = pd.read_hdf('./store.h5')
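Note that read_hdf only understands HDF5 files written in the pandas/PyTables format (e.g. with DataFrame.to_hdf). For a file holding several stored objects you pass the key, and for objects saved with format='table' you can load just a subset of columns. A minimal sketch, with 'store.h5' and the key 'df' as placeholders:

import pandas as pd

# 'store.h5' and key='df' are placeholders for illustration.
df = pd.read_hdf('./store.h5', key='df')

# If the object was saved with format='table', a column subset can be loaded:
subset = pd.read_hdf('./store.h5', key='df', columns=['Column1', 'Column2'])

If the HDF5 file was produced by another tool rather than by pandas, this function will not be able to read it, which would explain an error like the one in the comment below.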

1 Comment

I get the error ValueError: No dataset in HDF5 file.
