I have been testing out pandas and PyTables for some large financial data sets and have run into a real stumbling block:
When storing a DataFrame in a PyTables file, pandas appears to be storing multidimensional data in massively long rows, not columns.
Try this:

import numpy as np
from pandas import DataFrame, HDFStore

# two float64 columns of 100,000,000 random values each
df = DataFrame({'col1': np.random.randn(100000000),
                'col2': np.random.randn(100000000)})
store = HDFStore('test.h5')
store['data'] = df  # a warning fires here about exceeding the maximum recommended rowsize
store.handle
Output:

File(filename=test.h5, title='', mode='a', rootUEP='/', filters=Filters(complevel=0, shuffle=False, fletcher32=False))
/ (RootGroup) ''
/data (Group) ''
/data/axis0 (Array(2,)) ''
  atom := StringAtom(itemsize=4, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/data/axis1 (Array(100000000,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/data/block0_items (Array(2,)) ''
  atom := StringAtom(itemsize=4, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/data/block0_values (Array(2, 100000000)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
I'm not totally sure, but I reckon that, combined with the warning message, the Array(2, 100000000) means a 2D array with 2 rows and 100,000,000 columns; in other words, each DataFrame column is stored as one enormously long row. This is also how it's shown in HDFView.
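To double-check that reading, here's a quick sketch that opens the file with PyTables directly and prints the shape of the stored values array (the node path block0_values comes straight from the listing above):

import tables

# Inspect the block of float values that pandas wrote.
# (open_file is spelled openFile in older PyTables releases.)
h5 = tables.open_file('test.h5', mode='r')
values = h5.root.data.block0_values
print(values.shape)  # expecting (2, 100000000): 2 rows, 100,000,000 columns
h5.close()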
I've been experiencing extremely poor performance in some cases (10 seconds just for data['ticks'].head() on my real data set). Is this layout what's to blame?
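For reference, this is roughly how I'm measuring it, as a sketch against the toy store from above rather than my real tick data:

import time
from pandas import HDFStore

store = HDFStore('test.h5')
t0 = time.time()
head = store['data'].head()  # store['data'] loads the entire frame before .head() runs
print('%.1f seconds' % (time.time() - t0))
store.close()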