I have been testing pandas and PyTables on some large financial data sets, and have run into a real stumbling block:

When storing in a pytables file, pandas appears to be storing multidimensional data in massively long rows, not columns.

Try this:

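import numpy as np
from pandas import DataFrame, HDFStore

n = 100000000  # 100 million rows per column
df = DataFrame({'col1': np.random.randn(n), 'col2': np.random.randn(n)})
store = HDFStore('test.h5')
store['data'] = df    # should be a warning here about exceeding the maximum recommended rowsize
store.handle          # inspect the underlying PyTables file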

Output:

File(filename=test7.h5, title='', mode='a', rootUEP='/', filters=Filters(complevel=0, shuffle=False, fletcher32=False))
/ (RootGroup) ''
/data (Group) ''
/data/axis0 (Array(2,)) ''
  atom := StringAtom(itemsize=4, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/data/axis1 (Array(100000000,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/data/block0_items (Array(2,)) ''
  atom := StringAtom(itemsize=4, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/data/block0_values (Array(2, 100000000)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

I'm not totally sure, but I reckon that, combined with the warning, the Array(2, 100000000) means a 2D array with 2 rows and 100,000,000 columns. This is also how it's shown in HDFView.
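To make that reading of the shape concrete, here is a minimal NumPy-only sketch. It does not touch PyTables or HDF5 at all; `block_values` and the tiny `n_rows` are hypothetical stand-ins for `/data/block0_values` and the 100,000,000 rows above, and it only assumes the array is laid out in C (row-major) order, like a default NumPy array.

```python
import numpy as np

n_rows = 5  # small stand-in for the 100,000,000 rows in the question

col1 = np.arange(n_rows, dtype=np.float64)
col2 = np.arange(n_rows, dtype=np.float64) * 10

# Hypothetical stand-in for block0_values: one array row per DataFrame column,
# so two columns of n_rows values give shape (2, n_rows).
block_values = np.vstack([col1, col2])
print(block_values.shape)

# In row-major order each array row is one contiguous run of n_rows float64s,
# so stepping to the next array row skips n_rows * 8 bytes.
print(block_values.strides)

# The DataFrame-shaped view (n_rows x 2) is just the transpose.
frame_like = block_values.T
print(frame_like.shape)
```

Under these assumptions, each "row" of the stored array holds one entire DataFrame column, which is why a 2-column frame shows up as 2 very long rows.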

I've been experiencing extremely poor performance (10 seconds for data['ticks'].head() in some cases). Is this layout to blame?

  • github.com/pydata/pandas/pull/1834 and github.com/pydata/pandas/issues/1824 fix this. Is data now saved column-wise? Is that optimal for time series data, which q/kdb+ and similar systems save column-wise? I couldn't find much else on how HDFStore saves data, besides it being PyTables. Commented Oct 20, 2012 at 12:10
  • It's currently stored in a rather odd way. Try looking at the handle for an HDF5 store you've created for more info, or take a look at it in HDFView. Commented Oct 22, 2012 at 12:48

1 Answer

I've cross-linked the issue on GitHub:

http://github.com/pydata/pandas/issues/1824

I was not personally aware of this issue, and frankly it's a bit disappointing that this is a problem for PyTables or HDF5 (whoever is the culprit).
