29

I am creating a DataFrame from a CSV file. I have just started with pandas and have gone through the docs and multiple SO posts, but I still don't get it. The CSV file has multiple columns with the same name, say a.

After building the DataFrame, which values does df['a'] return? It does not return all of them.

Also, only one of those columns will hold a string; the rest will be None. How can I get that column?

4
  • Answering this question: "when I do df['a'] which value will it return": it'll return a DF with all columns named a. Commented Oct 11, 2016 at 21:35
  • @MaxU it was not returning all values...that is why i was surprised!!!! Commented Oct 11, 2016 at 21:36
  • I tested it under Pandas 0.19.0 - what version do you use? Commented Oct 11, 2016 at 21:37
  • @MaxU same....when i use like it works fine Commented Oct 11, 2016 at 21:38

4 Answers

22

the relevant parameter is mangle_dupe_cols

from the docs

mangle_dupe_cols : boolean, default True
    Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'

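by default, your duplicate 'a' columns are renamed so that each name is unique. the docs quoted above say 'a.0'...'a.N', but in practice the first column keeps the name 'a' and the later ones become 'a.1'...'a.N'.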

if you used mangle_dupe_cols=False, importing this csv would produce an error.

you can get all of your 'a' columns with the following (note that like='a' matches every column whose name contains 'a')

df.filter(like='a')

demonstration

from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
import pandas as pd

txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""

df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df


df.filter(like='a')

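For reference, here is a text-only version of the demonstration: a minimal sketch, assuming Python 3 and a reasonably recent pandas. It prints the column names that read_csv actually produces for the sample above, and it selects the 'a' family with a regular expression instead of like='a', so that an unrelated column whose name merely contains the letter a would not be picked up. The regex pattern and the name dupes are illustrative assumptions.

```python
from io import StringIO

import pandas as pd

txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""

# skipinitialspace=True strips the blank after each comma, including in the header
df = pd.read_csv(StringIO(txt), skipinitialspace=True)

# Duplicate headers are renamed on import: the first keeps its name,
# later ones get a numeric suffix.
print(list(df.columns))  # ['a', 'a.1', 'a.2', 'b', 'c', 'd']

# Match exactly 'a' or 'a.<number>' rather than any name containing 'a'
dupes = df.filter(regex=r'^a(\.\d+)?$')
print(dupes)
```

Selecting df['a'] on this frame returns only the first of the three columns, which is consistent with the observation in the question that not all values come back.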


6 Comments

Thanks a lot, works fine! Also, in my case only one of the columns can have a value; the rest will be NaN. So how can I pick the only non-NaN value, or NaN if all values are NaN?
@vks This question and answers can be used to get you what you want. If it isn't specific enough for you, ask a new question. stackoverflow.com/q/39903090/2336654
I don't see you actually use the parameter mangle_dupe_cols. Why am I the only one confused? :)
@rjurney you are not the only one. He needs to use it somewhere. By default if it was true then it would've handled the duplicate columns.
How can I change the column names after importing and preprocessing?
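On the follow-up in the comments about picking the single non-NaN value out of the duplicated columns: one possible approach, sketched here under assumptions, is to back-fill across the columns and keep the first one. The sample frame, its column names ('a', 'a.1', 'a.2') and the variable names are hypothetical; the linked question may well suggest a different technique.

```python
import numpy as np
import pandas as pd

# Hypothetical data: in each row at most one of the duplicated columns is filled
dupes = pd.DataFrame(
    {
        'a':   [np.nan, 'x',    np.nan],
        'a.1': ['y',    np.nan, np.nan],
        'a.2': [np.nan, np.nan, np.nan],
    },
    dtype=object,
)

# bfill(axis=1) pulls each row's values leftwards over NaN gaps, so the first
# column ends up holding the first non-NaN value of the row. Rows with no
# value at all stay NaN.
first_valid = dupes.bfill(axis=1).iloc[:, 0]
print(first_valid.tolist())
```

If more than one column could be filled in the same row, this keeps the leftmost value, which may or may not be what is wanted.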
3

I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.

In : df.columns
Out: 
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')

In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']

In : df.columns
Out: 
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')

From here, I was able to call 'id1' or 'id2' to get just the column I wanted.


2

This is what I usually do with my gene expression dataset, where the same gene name can occur more than once because of slightly different genetic sequences of the same gene:

  1. create a list of the duplicated columns in my dataframe (refers to column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and column not in duplicated_columns_list:
        duplicated_columns_list.append(column)
duplicated_columns_list
  2. Use the list method .index(), which returns the position of the first remaining occurrence of a name, to rename each duplicate in turn with a numbered suffix:
for column in duplicated_columns_list:
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'

This for loop adds a numbered suffix to each duplicated column, so every column now has a distinct name.

This specific code is relevant for columns that appear exactly 2 times, but it can be modified for columns that appear even more than 2 times in your dataframe.

  3. Finally, rename your columns with the suffixed elements:

df.columns = list_of_all_columns

That's it, I hope it helps :)
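As noted above, the loop handles names that appear exactly twice. A possible generalization for any number of repeats is sketched below. The helper name number_duplicates and the sample frame are hypothetical, and the '_1', '_2', ... suffix scheme simply follows the convention used in this answer.

```python
from collections import Counter

import pandas as pd


def number_duplicates(columns):
    """Return a new list where every repeated name gets a _1, _2, ... suffix."""
    totals = Counter(columns)   # how often each name occurs overall
    seen = Counter()            # how often each name has occurred so far
    renamed = []
    for name in columns:
        if totals[name] > 1:
            seen[name] += 1
            renamed.append(f"{name}_{seen[name]}")
        else:
            renamed.append(name)  # unique names are left untouched
    return renamed


# Hypothetical frame with one name repeated three times
df = pd.DataFrame([[1, 2, 3, 4]], columns=['id', 'x', 'id', 'id'])
df.columns = number_duplicates(list(df.columns))
print(list(df.columns))  # ['id_1', 'x', 'id_2', 'id_3']
```

Because the helper makes a single pass with counters instead of calling .count() and .index() repeatedly, it also avoids rescanning the column list for every name.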


2

Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id'). Hence, calling

df['id']

returns 2 columns. You can use

df.iloc[:,ind]

where ind is the integer position of the column, according to how the columns are ordered in the df. You can find the positions using:

indices = [i for i,x in enumerate(df.columns) if x == 'id']

where you replace 'id' with the name of the column you are searching for.
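Put together, the two snippets above can be tried on a small frame. This is only a sketch; the sample data and the name second_id are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with two columns that share the name 'id'
df = pd.DataFrame(
    [[1, 'ok', 10], [2, 'no', 20]],
    columns=['id', 'status', 'id'],
)

# Integer positions of every column called 'id'
indices = [i for i, x in enumerate(df.columns) if x == 'id']
print(indices)  # [0, 2]

# Select the second 'id' column by position rather than by its ambiguous label
second_id = df.iloc[:, indices[1]]
print(second_id.tolist())  # [10, 20]
```

Selecting by position sidesteps the ambiguity entirely, at the cost of depending on column order.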

