Using MultiIndex on DataFrame

Question

This is a follow-up question to the answer to this question:

pandas performance issue - need help to optimize

The following suggestion works:

df = DataFrame(np.arange(20).reshape(5,4))
df2 = df.set_index(keys=[0,1,2])
df2.ix[(4,5,6)]

for using a MultiIndex
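
For readers on a current pandas release: `.ix` was deprecated and later removed, so the snippet above no longer runs as written. A minimal sketch of the same lookup using `.loc`, assuming a recent pandas version (the variable name `row` is only for illustration):

```python
import numpy as np
import pandas as pd

# Same 5x4 frame as in the suggestion; the columns are labelled 0..3.
df = pd.DataFrame(np.arange(20).reshape(5, 4))

# Move columns 0, 1 and 2 into a three-level MultiIndex.
df2 = df.set_index(keys=[0, 1, 2])

# Label-based lookup with a full tuple key (one value per index level).
row = df2.loc[(4, 5, 6)]
print(row)
```

Because the key names every level and the index is unique here, the result is a single row returned as a Series holding the remaining column `3`.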

So I created a file sample_data.csv that looks like this:

col1,col2,year,amount 
111111,3.5,2012,700 
111112,3.5,2011,600 
222221,4.0,2012,222 
...

I then ran the following:

import numpy as np 
import pandas as pd 
sd=pd.read_csv('sample_data.csv') 
sd2=sd.set_index(keys=['col2','year']) 
sd2.ix[(4.0,2012)]

But this produces the following error: IndexError: index out of bounds

Any ideas why it works in the former case but not the latter? This is what the error looks like:

IndexError                                Traceback (most recent call last)
<ipython-input-19-1d72a961db95> in <module>()
----> 1 sd2.ix[(4.0,2012)]

/Library/Python/2.7/site-packages/pandas-0.8.1-py2.7-macosx-10.7-intel.egg/pandas/core/indexing.pyc in __getitem__(self, key)
     31                 pass
     32 
---> 33             return self._getitem_tuple(key)
     34         else:
     35             return self._getitem_axis(key, axis=0)

Your code works for me. Which version of pandas are you using?
– joris, Commented Feb 7, 2013 at 12:42
It works for me as well (in pandas 0.10.0). You can also skip the set_index step if you use: pd.read_csv('sample_data.csv', index_col=['col2','year'])
– Rutger Kassies, Commented Feb 7, 2013 at 12:53
It could be a bug in pandas 0.8.1, I don't know. Anyway, if possible, you had better upgrade your version (pandas is still evolving rapidly and keeps gaining new features).
– joris, Commented Feb 7, 2013 at 13:35
I switched to pandas 0.10 and I still get the same error. Are you using the same expression as I am using above, i.e. sd2.ix[(4.0,2012)]?
– femibyte, Commented Feb 7, 2013 at 23:25
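
The `index_col` suggestion from the comments can be sketched as follows. This assumes a recent pandas version (hence `.loc` rather than `.ix`) and reads the three sample rows from an in-memory string instead of `sample_data.csv`; `csv_text` and `row` are illustrative names only.

```python
from io import StringIO

import pandas as pd

# The three sample rows from the question, without duplicates.
csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
"""

# index_col builds the (col2, year) MultiIndex while parsing,
# so no separate set_index call is needed.
sd2 = pd.read_csv(StringIO(csv_text), index_col=["col2", "year"])

# Full-key lookup on a unique MultiIndex returns one row as a Series.
row = sd2.loc[(4.0, 2012)]
print(row)
```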

joris · Accepted Answer · 2013-02-08 09:06:34Z

To show it works for me (pandas 0.10.1):

In [1]: from StringIO import StringIO
In [2]: import numpy as np 
In [3]: import pandas as pd 
In [4]: s = StringIO("""col1,col2,year,amount 
   ...: 111111,3.5,2012,700 
   ...: 111112,3.5,2011,600 
   ...: 222221,4.0,2012,222""")

In [5]: sd=pd.read_csv(s) 
In [6]: sd2=sd.set_index(keys=['col2','year']) 
In [7]: sd2.ix[(4.0,2012)] 
Out[7]: 
col1       222221
amount        222
Name: (4.0, 2012)

However, if I add a row with a duplicate index, I also get the same error:

In [8]: s = StringIO("""col1,col2,year,amount 
   ...: 111111,3.5,2012,700 
   ...: 111112,3.5,2011,600 
   ...: 222221,4.0,2012,222
   ...: 222221,4.0,2012,223""")

In [9]: sd=pd.read_csv(s) 
In [10]: sd2=sd.set_index(keys=['col2','year']) 
In [11]: sd2.ix[(4.0,2012)] 
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-7-1b787a1d99df> in <module>()
----> 1 sd2.ix[(4.0,2012)]

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
     32                 pass
     33 
---> 34             return self._getitem_tuple(key)
     35         else:
     36             return self._getitem_axis(key, axis=0)

...

IndexError: index out of bounds

Is it possible that you have duplicate values in ('col2', 'year')?

I don't know whether this is a bug or just a constraint on the MultiIndex (in which case the error message could be clearer, I think). Either way, you can remove the duplicate values before setting the index as follows:

In [21]: sd=pd.read_csv(s) 

In [22]: sd = sd.drop_duplicates(['col2', 'year'])

In [23]: sd2=sd.set_index(keys=['col2','year']) 

In [24]: sd2.ix[(4.0,2012)] 
Out[24]: 
col1       222221
amount        222
Name: (4.0, 2012)

For more information on this, see: http://pandas.pydata.org/pandas-docs/stable/indexing.html#duplicate-data and http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop_duplicates.html.
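
Before dropping anything, it can help to look at which rows actually collide on the index columns. A minimal sketch, assuming a recent pandas version and the four-row sample above; `csv_text`, `dupes`, and `deduped` are illustrative names, not part of the original session.

```python
from io import StringIO

import pandas as pd

# Sample data with two rows sharing (col2, year) == (4.0, 2012).
csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223
"""
sd = pd.read_csv(StringIO(csv_text))

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = sd[sd.duplicated(subset=["col2", "year"], keep=False)]
print(dupes)

# drop_duplicates keeps the first row of each group by default.
deduped = sd.drop_duplicates(subset=["col2", "year"])
sd2 = deduped.set_index(["col2", "year"])

# The index is now unique, so a full-key lookup returns a single row.
print(sd2.index.is_unique)
row = sd2.loc[(4.0, 2012)]
print(row)
```

Note that `drop_duplicates` silently discards the row with amount 223 here, so inspecting `dupes` first shows what would be lost.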

Yes, that was the issue. Thanks a lot for the insight. I was going to use a MultiIndex as a more efficient means of selecting rows of a DataFrame based on multiple columns (see stackoverflow.com/questions/14737566/…), but since the index has to be unique, I can't use this approach.
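
The uniqueness limitation described in this thread applies to the pandas versions discussed here (0.8.1 and 0.10.x). As far as I know, current pandas releases accept a non-unique MultiIndex in `.loc` and return every matching row. A minimal sketch under that assumption, with illustrative names `csv_text` and `rows`:

```python
from io import StringIO

import pandas as pd

# Same four-row sample, duplicates left in place.
csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223
"""
sd = pd.read_csv(StringIO(csv_text))

# Sorting the index keeps lookups efficient and avoids a PerformanceWarning.
sd2 = sd.set_index(["col2", "year"]).sort_index()

# With a non-unique index, the full-key lookup yields a DataFrame
# containing all rows that share the key.
rows = sd2.loc[(4.0, 2012), :]
print(rows)
```

Whether this is faster than a boolean mask on the two columns depends on the data size and how many lookups are made, so it is worth timing both on the real data.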
