Multiple columns with the same name in Pandas

Question

I am creating a dataframe from a CSV file. I have just started with Pandas and have gone through the docs, multiple SO posts, and links, but I still don't get it. The CSV file has multiple columns with the same name, say a.

After creating the dataframe, which value will df['a'] return? It does not return all of the values.

Also, only one of those columns will hold a string; the rest will be None. How can I get that column?

Answering this question: "when I do df['a'] which value will it return": it'll return a DF with all columns named a.
– MaxU - stand with Ukraine, Commented Oct 11, 2016 at 21:35
@MaxU It was not returning all values, which is why I was surprised!
– vks, Commented Oct 11, 2016 at 21:36

piRSquared · Accepted Answer · 2016-10-11 21:29:02Z

22

The relevant parameter of read_csv is mangle_dupe_cols.

From the docs:

mangle_dupe_cols : boolean, default True
    Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'

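By default, your duplicate 'a' columns are renamed so that every name is unique. In practice the first one keeps the name 'a' and the later ones become 'a.1', 'a.2', ..., 'a.N' (the docstring quoted above describes the pattern as 'X.0'...'X.N').

If you used mangle_dupe_cols=False, importing this CSV would produce an error. (As far as I know, the parameter was deprecated in pandas 1.5 and removed in 2.0; the renaming itself is still the default behaviour.)

You can get all of your 'a' columns with the following. Note that like='a' matches every column whose name contains 'a', so it can also pick up unrelated columns such as 'data':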

df.filter(like='a')

demonstration

from io import StringIO  # Python 3; the Python 2 spelling was "from StringIO import StringIO"
import pandas as pd

txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""

df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df

df.filter(like='a')
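As a minimal sketch of what the demonstration above produces, the snippet below prints the de-duplicated column names and selects only the 'a' family with an anchored regular expression instead of like='a'. The regex pattern is an assumption based on the default 'a', 'a.1', 'a.2' naming; adjust it if your headers follow a different scheme.

```python
from io import StringIO

import pandas as pd

txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""

df = pd.read_csv(StringIO(txt), skipinitialspace=True)

# Duplicate headers are made unique on read: the first keeps its name,
# later ones get a numeric suffix.
print(list(df.columns))  # ['a', 'a.1', 'a.2', 'b', 'c', 'd']

# Anchored regex: matches exactly 'a' or 'a.<digits>', nothing else.
a_cols = df.filter(regex=r"^a(\.\d+)?$")
print(a_cols)
```

Unlike like='a', the anchored pattern would not match a column such as 'data' or 'area'.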

edited Oct 11, 2016 at 21:29

answered Oct 11, 2016 at 21:22


6 Comments

vks Over a year ago

Thanks a lot, it works fine! Also, in my case only one of the columns can have a value; the rest will be NaN. So how can I pick the only non-NaN value, or NaN if all values are NaN?

piRSquared Over a year ago

@vks This question and answers can be used to get you what you want. If it isn't specific enough for you, ask a new question. stackoverflow.com/q/39903090/2336654

rjurney Over a year ago

I don't see you actually use the parameter mangle_dupe_cols. Why am I the only one confused? :)

abhishah901 Over a year ago

@rjurney you are not the only one. He needs to use it somewhere. By default if it was true then it would've handled the duplicate columns.

abhishah901 Over a year ago

How can I change the column names after importing and preprocessing?
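For the follow-up in the comments about picking the single non-NaN value per row, one possible approach (not necessarily the one in the linked question) is to back-fill across the duplicate columns and keep the first one. The frame below is hypothetical sample data, and the regex assumes the default 'a', 'a.1', ... naming.

```python
import pandas as pd

# Hypothetical data: at most one of the 'a' columns is filled in each row.
df = pd.DataFrame({
    "a": ["x", None, None],
    "a.1": [None, "y", None],
    "a.2": [None, None, None],
    "b": [1, 2, 3],
})

a_cols = df.filter(regex=r"^a(\.\d+)?$")

# Back-fill along each row so the first column ends up holding the first
# non-missing value of that row; rows with no value stay missing.
first_valid = a_cols.bfill(axis=1).iloc[:, 0]
print(first_valid)
```

Rows in which every 'a' column is missing come out as missing in the result, which matches the "NaN if all values are NaN" requirement.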


JDenman6 · Accepted Answer · 2020-10-08 15:00:59Z

I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.

In : df.columns
Out: 
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')

In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']

In : df.columns
Out: 
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')

From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
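A small self-contained sketch of the same idea, using made-up values for the columns shown above. It also illustrates the point from the question: with a duplicated label, df['id'] returns a DataFrame containing both columns, not a single Series.

```python
import pandas as pd

# Hypothetical one-row frame with the same column labels as above.
df = pd.DataFrame(
    [[True, "2020-10-08", 1, None, 101]],
    columns=["success", "created", "id", "errors", "id"],
)

# Duplicated label: selection returns both columns as a DataFrame.
both = df["id"]
print(type(both).__name__, both.shape)

# Assign a full list of unique names, one per column, in order.
df.columns = ["success", "created", "id1", "errors", "id2"]

# Each renamed column is now an ordinary Series.
id2 = df["id2"]
print(id2)
```

The list assigned to df.columns must have exactly one entry per column, so this works best when the frame is narrow enough to write the names out by hand.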

Ofir Shorer · Accepted Answer · 2021-04-06 09:23:06Z

This is what I usually do with my gene expression dataset, where the same gene name can occur more than once because of slightly different genetic sequences of the same gene.

First, create a list of the duplicated columns in the dataframe (that is, column names which appear more than once):

duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and column not in duplicated_columns_list:
        duplicated_columns_list.append(column)
duplicated_columns_list

Then use list.index(), which returns the position of the first matching element, to rename each duplicated name in turn by appending a numbered suffix:

for column in duplicated_columns_list:
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'

After this loop, every duplicated column carries a suffix, so every column has a distinct name.

This specific code handles names that appear exactly 2 times: once the first occurrence has been renamed, .index() finds the second. It can be modified for names that appear more than 2 times in your dataframe.

Finally, assign the suffixed names back to the dataframe:

df.columns = list_of_all_columns

That's it, I hope it helps :)
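One way the approach above might be extended to names that occur any number of times is to keep a running count per name. This is only a sketch: the gene names are hypothetical, and it assumes none of the generated names (such as 'geneA_1') already exists as a column.

```python
from collections import Counter

import pandas as pd

# Hypothetical frame in which 'geneA' appears three times.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["geneA", "geneB", "geneA", "geneA"])

totals = Counter(df.columns)  # how many times each name occurs overall
seen = Counter()              # how many times each name has been seen so far

new_columns = []
for name in df.columns:
    if totals[name] > 1:
        # Duplicated name: append a running suffix _1, _2, _3, ...
        seen[name] += 1
        new_columns.append(f"{name}_{seen[name]}")
    else:
        # Unique name: leave it unchanged.
        new_columns.append(name)

df.columns = new_columns
print(list(df.columns))  # ['geneA_1', 'geneB', 'geneA_2', 'geneA_3']
```

This makes a single pass over the column names, so it also avoids the repeated .count() and .index() scans on wide dataframes.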

DavidBoja · Accepted Answer · 2021-10-26 08:36:40Z

2

Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id'). Hence, calling

df['id']

returns 2 columns. You can use

df.iloc[:,ind]

where ind is the positional index of the column, according to how the columns are ordered in the df. You can find the indices using:

indices = [i for i,x in enumerate(df.columns) if x == 'id']

where you replace 'id' with the name of the column you are searching for.
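Put together, the two steps might look like the sketch below. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with two columns both named 'id'.
df = pd.DataFrame(
    [[True, 1, None, 101], [False, 2, "oops", 102]],
    columns=["success", "id", "errors", "id"],
)

# Positions of every column whose label is 'id'.
indices = [i for i, x in enumerate(df.columns) if x == "id"]
print(indices)  # [1, 3]

# Positional selection returns a single Series even with duplicated labels.
first_id = df.iloc[:, indices[0]]
second_id = df.iloc[:, indices[1]]
print(second_id)
```

Because iloc selects by position, this works without renaming anything, which can be handy when the original labels have to be preserved.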
