Can anyone explain why some operations raise errors and others do not when a pandas.DataFrame has a duplicate column name?

Minimal, Reproducible Example

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'b'])

If I try to assign a list to column 'a', I get an error about a dimension mismatch:

df.loc[:, 'a'] = list(range(5))

Traceback (most recent call last):
...
ValueError: cannot copy sequence with size 5 to array axis with dimension 0

The same happens with 'b', although the message differs:

df.loc[:, 'b'] = list(range(5))

Traceback (most recent call last):
...
ValueError: could not broadcast input array from shape (5) into shape (0,2)

However, if I assign to an entirely new column, I don't get an error. Assigning to 'a' or 'b' afterwards still fails:

df.loc[:, 'c'] = list(range(5))
print(df)

     a    b    b  c
0  NaN  NaN  NaN  0
1  NaN  NaN  NaN  1
2  NaN  NaN  NaN  2
3  NaN  NaN  NaN  3
4  NaN  NaN  NaN  4

df.loc[:, 'a'] = list(range(5))

Traceback (most recent call last):
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

All of these errors disappear if I remove the duplicate column 'b'.
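
For reference, this is roughly how the duplicate can be detected and dropped before assigning. It is only a minimal sketch: the names dup_mask and deduped are my own, and keeping the first occurrence of each label is an assumption about which column should survive.

```python
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'b'])

# Boolean mask that is True for every repeat of a label seen earlier.
dup_mask = df.columns.duplicated()
print(dup_mask.tolist())  # [False, False, True]

# Keep only the first occurrence of each column label.
# .copy() makes the result an independent frame, so later assignments
# do not act on a view of df.
deduped = df.loc[:, ~dup_mask].copy()
print(list(deduped.columns))  # ['a', 'b']

# With unique column labels, plain column assignment fills the empty frame.
deduped['a'] = list(range(5))
print(deduped)
```

Checking df.columns.is_unique up front is another cheap way to notice the problem before any assignment is attempted.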


Additional information

pandas==1.0.2

  • It is the duplicate column name - see it asked here: stackoverflow.com/questions/27065133/… Commented Dec 11, 2020 at 16:43
  • My guess is that creating a new column first creates a Series and then joins it to the DataFrame, whereas assigning to an existing column attempts to put the values into the pre-allocated positions. Commented Dec 11, 2020 at 16:53
  • @AmyChodorowski - Do df['a'] = list(range(5)) and df['b'] = list(range(1,6)) work? Commented Dec 15, 2020 at 11:14
  • This issue still exists in 1.3.0, although the error message is slightly different: ValueError: cannot copy sequence with size 5 to array axis with dimension 0. I guess I'll open an issue for this. Commented Dec 16, 2020 at 12:27
  • For anyone interested in tracking the issue: github.com/pandas-dev/pandas/issues/38521 Commented Dec 16, 2020 at 13:18

1 Answer

Why use loc and not just:

df['a'] = list(range(5))

This gives no error and seems to produce what you need:

   a    b    b
0  0  NaN  NaN
1  1  NaN  NaN
2  2  NaN  NaN
3  3  NaN  NaN
4  4  NaN  NaN

The same works for creating column 'c':

df['c'] = list(range(5))