I have been testing out pandas and PyTables for some large financial data sets and have run into a real stumbling block:
When storing a DataFrame in a PyTables file, pandas appears to be storing multidimensional data in massively long rows, not columns.
Try this:

import numpy as np
from pandas import DataFrame, HDFStore

# two float64 columns of 100,000,000 random values each
df = DataFrame({'col1': np.random.randn(100000000),
                'col2': np.random.randn(100000000)})
store = HDFStore('test.h5')
store['data'] = df  # a warning fires here about exceeding the maximum recommended rowsize
store.handle
Output:

File(filename=test.h5, title='', mode='a', rootUEP='/', filters=Filters(complevel=0, shuffle=False, fletcher32=False))
/ (RootGroup) ''
/data (Group) ''
/data/axis0 (Array(2,)) ''
  atom := StringAtom(itemsize=4, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/data/axis1 (Array(100000000,)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/data/block0_items (Array(2,)) ''
  atom := StringAtom(itemsize=4, shape=(), dflt='')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/data/block0_values (Array(2, 100000000)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
I'm not totally sure, but I reckon that, combined with the warning message, the Array(2, 100000000) means a 2D array with 2 rows and 100,000,000 columns; in other words, each DataFrame column is stored as one enormously long row. This is also how it's shown in HDFView.
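To double-check that reading, here's a quick sketch that opens the file with PyTables directly and prints the shape of the stored values array (the node path block0_values comes straight from the listing above):

import tables

# Inspect the block of float values that pandas wrote.
# (open_file is spelled openFile in older PyTables releases.)
h5 = tables.open_file('test.h5', mode='r')
values = h5.root.data.block0_values
print(values.shape)  # expecting (2, 100000000): 2 rows, 100,000,000 columns
h5.close()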
I've been experiencing extremely poor performance in some cases (10 seconds just for data['ticks'].head() on my real data set). Is this layout what's to blame?
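For reference, this is roughly how I'm measuring it, as a sketch against the toy store from above rather than my real tick data:

import time
from pandas import HDFStore

store = HDFStore('test.h5')
t0 = time.time()
head = store['data'].head()  # store['data'] loads the entire frame before .head() runs
print('%.1f seconds' % (time.time() - t0))
store.close()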