
I am reading HDF5 files with large amounts of data. I want to store the data in a DataFrame (it will contain around 1.3e9 rows). For the moment I am using the following procedure:

import numpy as np
import pandas as pd

# h5assembly is the already-opened h5py group holding the datasets
df = pd.DataFrame()
for key in ['Column1', 'Column2', 'Column3']:
    df[key] = np.array(h5assembly.get(key))

I have timed it and it takes ~110 seconds.

If I just assign the values to numpy arrays, like this:

v1 = np.array(h5assembly.get('Column1'))
v2 = np.array(h5assembly.get('Column2'))
v3 = np.array(h5assembly.get('Column3'))

It takes ~22 seconds.

Am I doing something wrong? Is it expected that the creation of the dataframe is so much slower? Is there any way to accelerate this process?

2 Answers


Yes, it is expected that building a DataFrame will take longer than working with NumPy arrays directly. This is due to various reasons, and I won't list them all: it is partly the way NumPy allocates and frees memory, and partly the fact that NumPy operations are implemented in C, a compiled language, which gives them a performance edge over the extra bookkeeping a DataFrame does on top of the underlying arrays.
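Part of the measured gap likely also comes from inserting columns one at a time into an empty DataFrame, which copies each array into pandas' internal storage. Building the frame in a single constructor call may reduce that overhead; here is a minimal sketch, assuming an h5py file ('data.h5' and the group layout are placeholders, substitute your own):

import h5py
import numpy as np
import pandas as pd

# 'data.h5' is a placeholder path; adjust to your file.
with h5py.File('data.h5', 'r') as f:
    h5assembly = f  # or f['some_group'] if the datasets live inside a group
    # One constructor call instead of three separate column insertions:
    df = pd.DataFrame(
        {key: np.asarray(h5assembly[key])
         for key in ['Column1', 'Column2', 'Column3']}
    )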

An interesting comparison between pandas and NumPy performance can be seen here: https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/

A package that aims to speed up Pandas using parallelization is Modin: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html
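As an illustration of Modin's drop-in claim (assuming Modin is installed with one of its engines, e.g. pip install "modin[ray]"), the original loop only needs a different import; 'data.h5' is again a placeholder:

import h5py
import numpy as np
import modin.pandas as pd  # drop-in replacement for the pandas import

# 'data.h5' is a placeholder path; the loop is unchanged from the question.
with h5py.File('data.h5', 'r') as f:
    df = pd.DataFrame()
    for key in ['Column1', 'Column2', 'Column3']:
        df[key] = np.array(f[key])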

There is also a package called 'PyPolars', which aims to work in a very similar way to Pandas, with greater performance due to its implementation in Rust: https://www.analyticsvidhya.com/blog/2021/02/is-pypolars-the-new-alternative-to-pandas/
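For comparison, a minimal sketch of the same load with Polars (the library that article covers under its earlier name PyPolars); the file name is a placeholder, and the DataFrame constructor here is given a dict of NumPy arrays:

import h5py
import numpy as np
import polars as pl

# 'data.h5' and the column names are placeholders for illustration.
with h5py.File('data.h5', 'r') as f:
    df = pl.DataFrame(
        {key: np.asarray(f[key]) for key in ['Column1', 'Column2', 'Column3']}
    )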




You can use pandas.read_hdf to read HDF5 files directly into a DataFrame.

df = pd.read_hdf('./store.h5')
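Note that read_hdf only understands HDF5 files written in the pandas/PyTables format (e.g. with DataFrame.to_hdf). For a file holding several stored objects you pass the key, and for objects saved with format='table' you can load just a subset of columns. A minimal sketch, with 'store.h5' and the key 'df' as placeholders:

import pandas as pd

# 'store.h5' and key='df' are placeholders for illustration.
df = pd.read_hdf('./store.h5', key='df')

# If the object was saved with format='table', a column subset can be loaded:
subset = pd.read_hdf('./store.h5', key='df', columns=['Column1', 'Column2'])

If the HDF5 file was produced by another tool rather than by pandas, this function will not be able to read it, which would explain an error like the one in the comment below.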

1 Comment

I get the error ValueError: No dataset in HDF5 file.
