
I need to make a strategic decision about which data structure to use for holding statistical data frames in my program.

I store hundreds of thousands of records in one big table. Each field would be of a different type, including short strings. I'd perform multiple regression analyses and other manipulations on the data that need to be done quickly, in real time. I also need to use something that is relatively popular and well supported.

I know about the following candidates:

list of array.array

That is the most basic thing to do. Unfortunately it doesn't support strings. And I need to use numpy anyway for its statistical part, so this one is out of the question.
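To illustrate the limitation: array.array is restricted to homogeneous numeric type codes, so there is simply no way to put a column of names into one (the sample data below is made up):

```python
from array import array

# array.array stores compact homogeneous numeric data; there is no
# type code for variable-length strings.
grades = array('d', [4.5, 3.0, 5.0])
print(grades[0])  # plain Python floats come back out

try:
    array('d', ['Alice'])  # strings don't fit the model at all
except TypeError as e:
    print('TypeError:', e)
```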

numpy.ndarray

The ndarray has the ability to hold arrays of a different type in each column (e.g. np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])). It seems a natural winner, but...
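A minimal sketch of that structured dtype in action (field names and values are just the ones from the example above):

```python
import numpy as np

# A structured dtype mixes fixed-width strings and numeric columns
# in a single ndarray.
dt = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])
table = np.array([('Alice', (4.0, 5.0)),
                  ('Bob',   (3.5, 4.5))], dtype=dt)

print(table['name'])           # column access by field name
print(table['grades'].mean())  # numeric fields support vectorized math
```

Note that the string field is fixed-width (16 characters here), which is part of what keeps the layout compact.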

pandas.DataFrame

This one is built with statistical use in mind, but is it efficient enough?
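For comparison, the same kind of heterogeneous table as a DataFrame might look like this (column names and data are illustrative):

```python
import pandas as pd

# A DataFrame stores each column with its own dtype, so mixed
# string/numeric tables are natural.
df = pd.DataFrame({
    'name':  ['Alice', 'Bob'],
    'grade': [4.0, 3.5],
    'group': ['A', 'B'],
})

print(df.dtypes)           # per-column types
print(df['grade'].mean())  # vectorized column math, backed by numpy
```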

I have read that pandas.DataFrame is no longer based on numpy.ndarray (although it shares the same interface). Can anyone shed some light on this? Or maybe there is an even better data structure out there?

5 Comments
  • "I read that pandas.DataFrame is no longer based on numpy.ndarray." Not really - the API change you're referring to just means that pandas.Series subclasses NDFrame rather than directly subclassing numpy.ndarray, but the internal storage used by NDFrame still consists of numpy.ndarrays. Commented Aug 8, 2014 at 11:03
  • Run some tests. With some test data and the operation that you will most likely do the most, build up a way to do it in both numpy.ndarray and pandas. Time the results to determine which method is faster. While building the tests, you'll also notice which one has the functionality you need, as well as the ease of implementation. Commented Aug 8, 2014 at 14:03
  • @RyanG Running tests would mean I'd need to make two versions of my application and write more tests than I deem my application really needs. I chose Python because I expect to finish this task in a few working days at most. I asked this question to get a subjective opinion from those of you who have some experience with both frameworks (or maybe more). Commented Aug 9, 2014 at 13:11
  • @AdamRyczkowski - You don't necessarily need two full versions of your program. Just extract a single function for testing. The idea behind building the tests is not just to see which is faster, but also to learn each library a bit more. You should discover which library gives you the easiest implementation of what you're trying to do. This may be tedious at first, but you'll gain the knowledge, so the next time you come across a similar problem, you'll immediately know which option to take. Having a faster run time is a bonus when coupled with implementation time. But it's your call on this. Commented Aug 11, 2014 at 14:49
  • Related question: stackoverflow.com/questions/12052067/… Commented Sep 2, 2016 at 17:56

1 Answer


pandas.DataFrame is awesome and interacts very well with much of numpy. Much of the DataFrame is written in Cython and is quite optimized. I suspect the ease of use and the richness of the pandas API will greatly outweigh any potential benefit you could obtain by rolling your own interfaces around numpy.
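And for the regression use case in the question, the two interoperate cleanly: DataFrame columns expose their underlying numpy arrays, so numpy's linear algebra works on them directly. A small sketch with synthetic data (the column names and values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic data lying exactly on the line y = 1 + 2x.
df = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0],
                   'y': [1.0, 3.0, 5.0, 7.0]})

# Ordinary least squares via numpy, fed straight from DataFrame columns.
X = np.column_stack([np.ones(len(df)), df['x'].to_numpy()])
coef, *_ = np.linalg.lstsq(X, df['y'].to_numpy(), rcond=None)
print(coef)  # [intercept, slope] -> approximately [1.0, 2.0]
```

So choosing pandas does not lock you out of numpy's numerical routines; it sits on top of them.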


1 Comment

I know it is against the rules of SO, but that is exactly the opinion I was looking for. Thank you!
