
I need to make a strategic decision about which data structure to use for holding statistical data frames in my program.

I store hundreds of thousands of records in one big table. Each field would be of a different type, including short strings. I'd perform multiple regression analyses and other manipulations on the data that need to be done quickly, in real time. I also need to use something that is relatively popular and well supported.

I know about the following candidates:

list of array.array

That is the most basic thing to do. Unfortunately it doesn't support strings. And I need to use numpy anyway for its statistical part, so this one is out of the question.
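To illustrate the limitation: array.array is restricted to homogeneous numeric type codes, so there is simply no way to put a column of names into one (the sample data below is made up):

```python
from array import array

# array.array stores compact homogeneous numeric data; there is no
# type code for variable-length strings.
grades = array('d', [4.5, 3.0, 5.0])
print(grades[0])  # plain Python floats come back out

try:
    array('d', ['Alice'])  # strings don't fit the model at all
except TypeError as e:
    print('TypeError:', e)
```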

numpy.ndarray

The ndarray has the ability to hold arrays of a different type in each column (e.g. np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])). It seems a natural winner, but...
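A minimal sketch of that structured dtype in action (field names and values are just the ones from the example above):

```python
import numpy as np

# A structured dtype mixes fixed-width strings and numeric columns
# in a single ndarray.
dt = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])
table = np.array([('Alice', (4.0, 5.0)),
                  ('Bob',   (3.5, 4.5))], dtype=dt)

print(table['name'])           # column access by field name
print(table['grades'].mean())  # numeric fields support vectorized math
```

Note that the string field is fixed-width (16 characters here), which is part of what keeps the layout compact.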

pandas.DataFrame

This one is built with statistical use in mind, but is it efficient enough?
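For comparison, the same kind of heterogeneous table as a DataFrame might look like this (column names and data are illustrative):

```python
import pandas as pd

# A DataFrame stores each column with its own dtype, so mixed
# string/numeric tables are natural.
df = pd.DataFrame({
    'name':  ['Alice', 'Bob'],
    'grade': [4.0, 3.5],
    'group': ['A', 'B'],
})

print(df.dtypes)           # per-column types
print(df['grade'].mean())  # vectorized column math, backed by numpy
```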

I have read that pandas.DataFrame is no longer based on numpy.ndarray (although it shares the same interface). Can anyone shed some light on this? Or maybe there is an even better data structure out there?

5 Comments
  • "I read that pandas.DataFrame is no longer based on numpy.ndarray." Not really - the API change you're referring to just means that pandas.Series subclasses NDFrame rather than directly subclassing numpy.ndarray, but the internal storage used by NDFrame still consists of numpy.ndarrays. Commented Aug 8, 2014 at 11:03
  • Run some tests. With some test data and the operation that you will most likely do the most, build up a way to do it in both numpy.ndarray and pandas. Time the results to determine which method is faster. While building the tests, you'll also notice which one has the functionality you need, as well as the ease of implementation. Commented Aug 8, 2014 at 14:03
  • @RyanG Running tests would mean I'd need to make two versions of my application and write more tests than I deem my application really needs. I chose Python because I expect to finish this task in a few working days at most. I asked this question to get a subjective opinion from those of you who have some experience with both frameworks (or maybe more). Commented Aug 9, 2014 at 13:11
  • @AdamRyczkowski - You don't necessarily need two full versions of your program. Just extract a single function for testing. The idea behind building the tests is not just to see which is faster, but also to learn each library a bit more. You should discover which library gives you the easiest implementation of what you're trying to do. This may be tedious at first, but you'll gain the knowledge, so the next time you come across a similar problem, you'll immediately know which option to take. Having a faster run time is a bonus when coupled with implementation time. But it's your call on this. Commented Aug 11, 2014 at 14:49
  • Related question: stackoverflow.com/questions/12052067/… Commented Sep 2, 2016 at 17:56

1 Answer


pandas.DataFrame is awesome and interacts very well with much of numpy. Much of the DataFrame is written in Cython and is quite optimized. I suspect the ease of use and the richness of the pandas API will greatly outweigh any potential benefit you could obtain by rolling your own interfaces around numpy.
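And for the regression use case in the question, the two interoperate cleanly: DataFrame columns expose their underlying numpy arrays, so numpy's linear algebra works on them directly. A small sketch with synthetic data (the column names and values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic data lying exactly on the line y = 1 + 2x.
df = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0],
                   'y': [1.0, 3.0, 5.0, 7.0]})

# Ordinary least squares via numpy, fed straight from DataFrame columns.
X = np.column_stack([np.ones(len(df)), df['x'].to_numpy()])
coef, *_ = np.linalg.lstsq(X, df['y'].to_numpy(), rcond=None)
print(coef)  # [intercept, slope] -> approximately [1.0, 2.0]
```

So choosing pandas does not lock you out of numpy's numerical routines; it sits on top of them.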


1 Comment

I know it is against the rules of SO, but that is exactly the opinion I was looking for. Thank you!
