I have two dataframes with the same columns and indices. I would like to combine them into a third dataframe with a hierarchical index, maintaining the current index and adding a second that identifies where each DataFrame came from. This is what I tried:

import itertools

import numpy as np
import pandas as pd
from numpy.random import randn

df_a = pd.DataFrame(randn(3, 2), columns=["x", "y"], index=range(3))
df_b = pd.DataFrame(randn(3, 2), columns=["x", "y"], index=range(3))
tuples = list(itertools.product(["a", "b"], range(3)))
df = pd.DataFrame(columns=["x", "y"], index=pd.MultiIndex.from_tuples(tuples))
df.loc["a"] = df_a
df.loc["b"] = df_b

However, df remains full of NaNs, when I expected it to get filled in with the values from df_a and df_b. This does work:

df.loc["a"] = np.array(df_a)

But that seems both roundabout and wrong.

What don't I understand about hierarchical indices? And what is the best way to accomplish my objective?

2 Answers

In [1]: df_a = pd.DataFrame(randn(3, 2), columns=["x", "y"], index=range(3))

In [2]: df_b = pd.DataFrame(randn(3, 2), columns=["x", "y"], index=range(3))

In [3]: pd.concat([df_a, df_b], keys=['a', 'b'])
Out[3]: 
            x         y
a 0  0.913812 -1.719241
  1  0.544462  0.845426
  2 -0.269518 -1.549679
b 0  0.534311  1.693824
  1  0.119147 -0.171002
  2  0.595658  0.588252
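
For anyone who wants to run this outside IPython, here is a minimal self-contained sketch of the same pd.concat call. The seeded generator and the level names "source" and "row" are my own choices for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable
df_a = pd.DataFrame(rng.standard_normal((3, 2)), columns=["x", "y"], index=range(3))
df_b = pd.DataFrame(rng.standard_normal((3, 2)), columns=["x", "y"], index=range(3))

# keys= becomes the outer index level; names= (assumed labels) names both levels.
combined = pd.concat([df_a, df_b], keys=["a", "b"], names=["source", "row"])

# Selecting an outer key gives back that frame's rows under the inner index.
part_a = combined.loc["a"]
print(combined)
```

Selecting combined.loc["b"] should likewise return the rows that came from df_b, indexed 0 to 2.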

Another way to achieve this, instead of populating the dataframe df, is to add the MultiIndex to the original dataframes (df_a and df_b) and then concatenate them (see below).

The reason df does not get filled is that pandas aligns data on the index. When you assign another dataframe to df.loc["a"], only the values whose index labels match are filled in. To illustrate this:

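>>> import itertools
>>> import pandas as pd
>>> from numpy import zeros
>>> from numpy.random import randn
>>> df = pd.DataFrame(randn(3, 2), columns=["x", "y"], index=range(3))
>>> df2 = pd.DataFrame(zeros((1, 2)), columns=["x", "y"], index=range(2,3))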
>>> df
          x         y
0 -0.995116  0.132438
1 -0.023010 -0.211612
2 -0.053206  0.427369
>>> df2
   x  y
2  0  0
>>> df.loc[:] = df2
>>> df
    x   y
0 NaN NaN
1 NaN NaN
2   0   0

When assigning a numpy array (or a list, etc.), there are no indices to match, so the values simply fill the dataframe (and, in this case, the single row is also broadcast to every row):

>>> df.loc[:] = df2.values
>>> df
   x  y
0  0  0
1  0  0
2  0  0
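
As a standalone sketch of the two behaviours above, written with .loc (.ix was removed in pandas 1.0): the names target, source, aligned and broadcast are only illustrative, and the frame of ones stands in for the random data.

```python
import numpy as np
import pandas as pd

target = pd.DataFrame(np.ones((3, 2)), columns=["x", "y"], index=range(3))
source = pd.DataFrame(np.zeros((1, 2)), columns=["x", "y"], index=[2])

# Assigning a DataFrame: rows are matched by label, unmatched rows become NaN.
aligned = target.copy()
aligned.loc[:] = source

# Assigning the raw array: there are no labels, so the single row is broadcast.
broadcast = target.copy()
broadcast.loc[:] = source.to_numpy()

print(aligned)
print(broadcast)
```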

So, in your case, when you try to assign df_a to df.loc['a'], the indices do not match (MultiIndex vs. a regular index), so nothing gets assigned (or, more exactly, the rows are filled with NaNs). But if you first give df_a the same MultiIndex, it does work:

>>> df_a = pd.DataFrame(randn(3, 2), columns=["x", "y"], index=range(3))
>>> df_b = pd.DataFrame(randn(3, 2), columns=["x", "y"], index=range(3))
>>> 
>>> tuples = list(itertools.product(["a", "b"], range(3)))
>>> df = pd.DataFrame(columns=["x", "y"], index=pd.MultiIndex.from_tuples(tuples))
>>> 
>>> df_a.index = pd.MultiIndex.from_tuples([('a', i) for i in df_a.index])
>>> 
>>> df.loc["a"] = df_a
>>> df
             x          y
a 0   1.533881   1.276075
  1 -0.5143746 -0.3400633
  2  -1.071509   1.831282
b 0        NaN        NaN
  1        NaN        NaN
  2        NaN        NaN

Or, as above, using a numpy array also works (the .values attribute returns the data as a numpy array):

>>> df.loc["b"] = df_b.values
>>> df
               x          y
a 0     1.533881   1.276075
  1   -0.5143746 -0.3400633
  2    -1.071509   1.831282
b 0   0.06535034 -0.6276186
  1  0.008100781  0.9512881
  2   0.08688541 -0.7101486
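
Here is a rough, self-contained sketch of the three assignments discussed above, using .loc and fixed numbers instead of randn so the effect is easy to see. The dtype=float argument, the flag all_nan_after_frame and the copy df_b_labelled are my own additions for illustration.

```python
import itertools

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["x", "y"], index=range(3))
df_b = pd.DataFrame(np.arange(6.0, 12.0).reshape(3, 2), columns=["x", "y"], index=range(3))

index = pd.MultiIndex.from_tuples(list(itertools.product(["a", "b"], range(3))))
df = pd.DataFrame(columns=["x", "y"], index=index, dtype=float)

# Labels 0..2 do not match the (label, row) tuples, so the "a" block stays NaN.
df.loc["a"] = df_a
all_nan_after_frame = bool(df.loc["a"].isna().all().all())

# A raw array carries no labels and is written positionally.
df.loc["a"] = df_a.to_numpy()

# A frame whose index already holds the matching tuples aligns as intended.
df_b_labelled = df_b.copy()
df_b_labelled.index = pd.MultiIndex.from_product([["b"], df_b.index])
df.loc["b"] = df_b_labelled

print(df)
```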

But I think a better way to achieve this, instead of populating the dataframe df, is to add the MultiIndex to the original dataframes and then concatenate them.

To convert the index to a MultiIndex, you can do this:

>>> df_a['df'] = 'a'
>>> df_b['df'] = 'b'
>>> 
>>> df_a = df_a.set_index('df', append=True)
>>> df_b = df_b.set_index('df', append=True)

or like this (this puts the label in the outer level directly, so the swaplevel step shown further down is not needed):

>>> df_a.index = pd.MultiIndex.from_tuples([('a', i) for i in df_a.index])
>>> df_b.index = pd.MultiIndex.from_tuples([('b', i) for i in df_b.index])

and then you can concatenate them:

>>> df = pd.concat([df_a, df_b])
>>> df
             x         y
  df                    
0 a  -0.225156 -0.846229
1 a   1.566139  0.892763
2 a  -1.291920 -0.517408
0 b   1.464853  0.792709
1 b  -1.307375 -0.360373
2 b   0.467406  1.249325
>>> 
>>> df.swaplevel(0,1)
             x         y
df                      
a  0 -0.225156 -0.846229
   1  1.566139  0.892763
   2 -1.291920 -0.517408
b  0  1.464853  0.792709
   1 -1.307375 -0.360373
   2  0.467406  1.249325
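
To check that this route ends up in the same place as pd.concat with keys, here is a small sketch under my own assumptions: fixed numbers instead of randn, assign() in place of the in-place column assignment, and the illustrative names tagged, via_set_index and via_keys.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["x", "y"], index=range(3))
df_b = pd.DataFrame(np.arange(6.0, 12.0).reshape(3, 2), columns=["x", "y"], index=range(3))

# Route 1: tag each frame with a "df" column, append it to the index,
# concatenate, then swap the two index levels.
tagged = [
    frame.assign(df=label).set_index("df", append=True)
    for label, frame in (("a", df_a), ("b", df_b))
]
via_set_index = pd.concat(tagged).swaplevel(0, 1)

# Route 2: let concat build the outer level from keys.
via_keys = pd.concat([df_a, df_b], keys=["a", "b"])

print(via_set_index)
print(via_keys)
```

Both routes should print the same rows in the same order; the only visible difference is that the first one names the outer level "df".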
