I need to make a strategic decision about which data structure to use as the basis for holding statistical data frames in my program.
I store hundreds of thousands of records in one big table. The fields are of different types, including short strings. I'd perform multiple regression analysis and other manipulations on the data, and these need to run quickly, in real time. I also need to use something that is relatively popular and well supported.
I know about the following contestants:
list of array.array
That is the most basic thing to do. Unfortunately it doesn't support strings. And I need to use numpy anyway for its statistical part, so this one is out of the question.
numpy.ndarray
The ndarray can hold a different type in each column by using a structured dtype (e.g. np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])). It seems like a natural winner, but...
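To make the structured-dtype option concrete, here is a minimal sketch built on the dtype quoted above. The sample records (and the names `student_dtype`, `records`, `mean_per_student`) are made up for illustration only.

```python
import numpy as np

# Structured dtype from the example above: a name of up to 16 characters
# and two float64 grades per record.
student_dtype = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])

# Hypothetical sample records, purely illustrative.
records = np.array(
    [('Alice', (8.0, 7.5)), ('Bob', (6.0, 9.0))],
    dtype=student_dtype,
)

# Each field is accessed by name and comes back as a regular ndarray.
names = records['name']                  # unicode string array
grades = records['grades']               # float64 array of shape (2, 2)
mean_per_student = grades.mean(axis=1)   # one mean per record
```

Numeric fields pulled out this way can be passed straight to numpy's statistical routines, while the string field stays alongside them in the same table.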
pandas.DataFrame
This one is built with statistical use in mind, but is it efficient enough?
I read that pandas.DataFrame is no longer based on numpy.ndarray (although it shares the same interface). Can anyone shed some light on this? Or maybe there is an even better data structure out there?
"pandas.DataFrame is no longer based on the numpy.ndarray". Not really - the API change you're referring to just means that pandas.Series subclasses NDFrame rather than directly subclassing numpy.ndarray, but the internal storage used by NDFrame still consists of numpy.ndarrays.
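A quick way to check this yourself is sketched below. It assumes a pandas version that provides `Series.to_numpy()`, and the column names and values are invented for the example. Only numeric columns are used, since the way string columns are stored can differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data, purely illustrative.
df = pd.DataFrame({'height': [1.60, 1.75, 1.82],
                   'weight': [55.0, 70.0, 81.0]})

# After the API change, a Series is not itself an ndarray subclass...
series_is_ndarray = isinstance(df['height'], np.ndarray)

# ...but the values of a numeric column are still exposed as a plain ndarray.
column_values = df['height'].to_numpy()
values_are_ndarray = isinstance(column_values, np.ndarray)

print(series_is_ndarray, values_are_ndarray, column_values.dtype)
```

So numpy functions keep working on the column data, even though the Series class no longer inherits from numpy.ndarray.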