I am using Python 2.7 and pandas 0.11.0.

I am trying to fill a column of a DataFrame using DataFrame.apply(func). func() is supposed to return a numpy array (1x3) for each row.

import pandas as pd
import numpy as np

df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
print(df)

              A         B         C
    0  0.910142  0.788300  0.114164
    1 -0.603282 -0.625895  2.843130
    2  1.823752 -0.091736 -0.107781
    3  0.447743 -0.163605  0.514052

The function used for testing purposes:

def test(row):
   # some complex calc here 
   # based on the values from different columns 
   return np.array((1,2,3))

df['D'] = df.apply(test, axis=1)

[...]
ValueError: Wrong number of items passed 1, indices imply 3

The funny thing is that when I create the DataFrame from scratch, with the arrays already in place, it works fine and gives the result I expect:

dic = {'A': {0: 0.9, 1: -0.6, 2: 1.8, 3: 0.4}, 
     'C': {0: 0.1, 1: 2.8, 2: -0.1, 3: 0.5}, 
     'B': {0: 0.7, 1: -0.6, 2: -0.1, 3: -0.1},
     'D': {0:np.array((1,2,3)), 
          1:np.array((1,2,3)), 
          2:np.array((1,2,3)), 
          3:np.array((1,2,3))}}

df= pd.DataFrame(dic)
print(df)
         A    B    C          D
    0  0.9  0.7  0.1  [1, 2, 3]
    1 -0.6 -0.6  2.8  [1, 2, 3]
    2  1.8 -0.1 -0.1  [1, 2, 3]
    3  0.4 -0.1  0.5  [1, 2, 3]

Thanks in advance

2 Comments

  • You should avoid using lists/tuples in DataFrames or Series. Why not just have 3 columns in df or a separate DataFrame with your columns? Commented Sep 5, 2013 at 16:49
  • I guess vector form is sometimes more natural for some quantities, e.g., coordinates. df.endPoint - df.startPoint is obviously preferable to np.c_[df.endX-df.startX, df.endY-df.startY, df.endZ-df.startZ]. Commented Oct 29, 2013 at 5:36

1 Answer

If the function passed to apply returns multiple values, and the DataFrame you call apply on has the same number of items along the axis (in this case, columns) as the number of values returned, pandas creates a DataFrame from the return values, with the same labels as the original DataFrame. You can see this if you just do:

>>> def test(row):
...     return [1, 2, 3]
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
   A  B  C
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3

And that is why you get the error: you cannot assign a three-column DataFrame to a single DataFrame column.
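A quick way to see which case you are in is to inspect what apply returns before assigning it. This is only a sketch: `three_values` is a hypothetical stand-in for the question's function, and the outcome depends on the pandas version (the behaviour described here was observed on 0.11.0, and other versions may hand back a Series of lists instead).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))

def three_values(row):
    # hypothetical stand-in: returns as many values as df has columns
    return [1, 2, 3]

result = df.apply(three_values, axis=1)

# A DataFrame cannot be assigned to a single column; a Series can
kind = 'DataFrame' if isinstance(result, pd.DataFrame) else 'Series'
print('apply returned a %s with %d rows' % (kind, len(result)))
```

If it prints DataFrame, the assignment `df['D'] = result` is the step that fails.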

If you return any other number of values, apply returns just a Series object, which can be assigned:

>>> def test(row):
...     return [1, 2]
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
3    [1, 2]
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C       D
0  0.333535  0.209745 -0.972413  [1, 2]
1  0.469590  0.107491 -1.248670  [1, 2]
2  0.234444  0.093290 -0.853348  [1, 2]
3  1.021356  0.092704 -0.406727  [1, 2]

I'm not sure why pandas does this, or why it does so only when the return value is a list or an ndarray; it does not happen if you return a tuple:

>>> def test(row):
...     return (1, 2, 3)
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df['D'] = df.apply(test, axis=1)
>>> df
          A         B         C          D
0  0.121136  0.541198 -0.281972  (1, 2, 3)
1  0.569091  0.944344  0.861057  (1, 2, 3)
2 -1.742484 -0.077317  0.181656  (1, 2, 3)
3 -1.541244  0.174428  0.660123  (1, 2, 3)
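
If you really want one numpy array per cell, one option that does not depend on how apply interprets the return value is to fill an object array yourself, one element at a time. This is a minimal sketch rather than a recommended pattern; `make_vector` is a hypothetical stand-in for the question's test function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))

def make_vector(row):
    # hypothetical stand-in for the question's test(): one length-3 array per row
    return np.array((1, 2, 3))

# A 1-D object array, so each slot can hold a whole ndarray
col = np.empty(len(df), dtype=object)
for i, (_, row) in enumerate(df.iterrows()):
    col[i] = make_vector(row)  # element-wise assignment stores the array itself

df['D'] = col
print(df)
```

Filling the slots individually keeps numpy from treating the four length-3 arrays as a 4x3 block. The column has dtype object, so vectorised arithmetic on it is not something to count on.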

7 Comments

Hi Viktor! Thanks for the answer. So if I understand you correctly, there is no way to pass a numpy array?
@Nic If the length of the numpy array is not the same as the number of columns your code will work, but it's not intended to be used in such a way. As Phillip Cloud said you should avoid placing lists or arrays in your Series. You should create multiple Series (that is, multiple columns in your DataFrame).
Thanks guys. I'll then follow your advice, and go for 3 columns. @Phillip: sorry I missed your comment at first reading.
I want to keep some arrays in the same dataframe, and I wish there was a supported way to do this.
Is there any alternative to pandas that would work? I don't understand the point of not letting users choose what object they want to put inside a dataframe.
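
The three-column approach suggested in the comments could be sketched as follows. The names `compute_vector`, `D_x`, `D_y` and `D_z` are made up for illustration, and the function body is a placeholder for the real per-row calculation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))

def compute_vector(row):
    # hypothetical placeholder for the per-row calculation
    return np.array((1, 2, 3))

new_cols = ['D_x', 'D_y', 'D_z']  # hypothetical column names

# Evaluate the function row by row, then turn the results into their own frame
rows = [compute_vector(row) for _, row in df.iterrows()]
expanded = pd.DataFrame(rows, index=df.index, columns=new_cols)

# Align on the shared index and attach the three new columns
result = df.join(expanded)
print(result)
```

With one scalar per cell, the vector quantities can still be combined in one step, for example by subtracting two blocks of columns via their `.values`.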