Merging two pandas dataframes results in "duplicate" columns

Question

I'm trying to merge two dataframes that share the same key column. Some of the other columns also have identical headers (although the two frames do not have the same number of rows), and after merging, these columns are "duplicated", with the original headers given a suffix: _x, _y, etc.

Does anyone know how to get pandas to drop the duplicate columns in the example below?

This is my Python code:

import pandas as pd

holding_df = pd.read_csv('holding.csv')
invest_df = pd.read_csv('invest.csv')

merge_df = pd.merge(holding_df, invest_df, on='key', how='left').fillna(0)
merge_df.to_csv('merged.csv', index=False)

And the CSV files contain this:

First rows of left-dataframe (holding_df)

key, dept_name, res_name, year, need, holding
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1
...

Right-dataframe (invest_df)

key, dept_name, res_name, year, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1000000
DeptA_ResB_2015, DeptA, ResB, 2015, 2, 6000000
DeptB_ResB_2015, DeptB, ResB, 2015, 1, 6000000
...

Merged result

key, dept_name_x, res_name_x, year_x, need, holding, dept_name_y, res_name_y, year_y, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1, DeptA, ResA, 2015.0, 1.0, 1000000.0
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2018, DeptA, ResA, 2018, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2019, DeptA, ResA, 2019, 1, 1, 0, 0, 0.0, 0.0, 0.0
...

Would adding more columns to merge on still give you the desired result? merge_df = pd.merge(holding_df, invest_df, on=['key', 'dept_name', 'res_name', 'year'], how='left').fillna(0)
– EdChum, Commented Dec 5, 2014 at 10:24
The _x and _y columns originate from the left and right frames in the merge. You'll need to specify more columns to indicate that they're the same (pandas doesn't know that).
– Simeon Visser, Commented Dec 5, 2014 at 10:27
You can pass a list of columns to drop, but rename requires passing a dict.
– EdChum, Commented Dec 5, 2014 at 10:31
That indicates that the values do not agree or are missing from the lhs or rhs. You therefore need to rename the _x columns and drop all the _y columns, using drop and rename as suggested. I can post a dynamic method to do this.
– EdChum, Commented Dec 5, 2014 at 11:03
At a glance this seems to be a duplicate of Pandas Merge - How to avoid duplicating columns, but I'm just passing through while trying to find something else so I haven't looked at all the details.
– wjandrea, Commented May 9, 2024 at 17:25

EdChum · Accepted Answer · 2014-12-05 11:54:06Z

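The additional columns with suffixes '_x' and '_y' appear because dept_name, res_name and year exist in both frames but are not part of the merge key, so pandas keeps both copies and adds a suffix to tell them apart. The two copies also do not have matching values in your data: with a left merge the '_y' copy is empty wherever a key has no match in the right frame, which is where the zeros from fillna(0) come from. In that case you need to drop the additional '_y' columns and rename the '_x' columns: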

In [145]:
# define our drop function
def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)

drop_y(merged)
merged
Out[145]:
               key  dept_name_x  res_name_x   year_x   need   holding  \
0  DeptA_ResA_2015        DeptA        ResA     2015      1         1   
1  DeptA_ResA_2016        DeptA        ResA     2016      1         1   
2  DeptA_ResA_2017        DeptA        ResA     2017      1         1   

    no_of_inv   inv_cost_wo_ice  
0           1           1000000  
1           0                 0  
2           0                 0  
In [146]:
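# func to rename '_x' cols
def rename_x(df):
    # Slice off the 2-character suffix; rstrip('_x') strips any trailing
    # '_' or 'x' characters and would turn e.g. 'tax_x' into 'ta'.
    renames = {col: col[:-2] for col in df.columns if col.endswith('_x')}
    df.rename(columns=renames, inplace=True)
rename_x(merged)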
merged
Out[146]:
               key  dept_name  res_name   year   need   holding   no_of_inv  \
0  DeptA_ResA_2015      DeptA      ResA   2015      1         1           1   
1  DeptA_ResA_2016      DeptA      ResA   2016      1         1           0   
2  DeptA_ResA_2017      DeptA      ResA   2017      1         1           0   

    inv_cost_wo_ice  
0           1000000  
1                 0  
2                 0
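
The drop-and-rename steps above can also be folded into the merge call itself. This is only a sketch, assuming the left-hand copies are the ones worth keeping; the two small frames and the '_dup' tag are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical stand-ins for holding_df and invest_df
holding_df = pd.DataFrame({
    "key": ["DeptA_ResA_2015", "DeptA_ResA_2016"],
    "dept_name": ["DeptA", "DeptA"],
    "year": [2015, 2016],
    "holding": [1, 1],
})
invest_df = pd.DataFrame({
    "key": ["DeptA_ResA_2015"],
    "dept_name": ["DeptA"],
    "year": [2015],
    "no_of_inv": [1],
})

# An empty left suffix leaves the left-hand names untouched, so only the
# right-hand copies of the overlapping columns get tagged.
merged = pd.merge(holding_df, invest_df, on="key", how="left",
                  suffixes=("", "_dup"))

# Drop every tagged column; nothing needs renaming afterwards.
merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_dup")])

print(merged.columns.tolist())
# ['key', 'dept_name', 'year', 'holding', 'no_of_inv']
```

The tag should be a string that no real column name ends with, otherwise the final drop would remove genuine columns as well.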

EDIT: If you add the common columns to the merge keys, each of them appears only once in the result, so no suffixed duplicates are produced. Rows are then matched only where all of those columns agree:

merge_df = pd.merge(holding_df, invest_df, on=['key', 'dept_name', 'res_name', 'year'], how='left').fillna(0)
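
As a rough illustration of the multi-key merge, the sketch below builds the key list from whatever columns the two frames share instead of typing it out. The frames are invented miniatures of the CSVs in the question, and the name shared_cols is hypothetical.

```python
import pandas as pd

# Hypothetical miniatures of holding.csv and invest.csv
holding_df = pd.DataFrame({
    "key": ["DeptA_ResA_2015", "DeptA_ResA_2016"],
    "dept_name": ["DeptA", "DeptA"],
    "res_name": ["ResA", "ResA"],
    "year": [2015, 2016],
    "need": [1, 1],
    "holding": [1, 1],
})
invest_df = pd.DataFrame({
    "key": ["DeptA_ResA_2015"],
    "dept_name": ["DeptA"],
    "res_name": ["ResA"],
    "year": [2015],
    "no_of_inv": [1],
    "inv_cost_wo_ice": [1000000],
})

# Every column present in both frames becomes part of the merge key.
shared_cols = [c for c in holding_df.columns if c in invest_df.columns]

merge_df = pd.merge(holding_df, invest_df, on=shared_cols, how="left").fillna(0)

print(shared_cols)
# ['key', 'dept_name', 'res_name', 'year']
print(merge_df.columns.tolist())
# ['key', 'dept_name', 'res_name', 'year', 'need', 'holding', 'no_of_inv', 'inv_cost_wo_ice']
```

This only gives the intended rows when the shared columns really do agree for matching keys; a row whose dept_name or year differs between the frames is treated as unmatched and ends up with zeros after fillna(0).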

But they DO have matching values! They have matching keys as well as matching values in exactly the columns which are duplicated, and then two additional columns which are only in the right dataframe and not in the left (hence the merge).
No, that shouldn't happen; duplicate columns only occur if the keys are not the same, so there is something wrong with your data.
Dammit, I must have messed up, because adding multiple columns to the on-parameter (as you first suggested in a comment) DOES produce the desired result. So sorry about that, don't know what I did wrong when I tested that myself. If you write your comment as a short answer I will mark it as correct.
I just did this, and I have duplicate _x _y columns, and there are zero matching values. I just dropped it into Excel and created new columns to check if the values in this_x = this_y, and it is TRUE for every single row. So this cannot be correct.

desmond · Answer · 2015-02-25 16:50:30Z


I have the same problem with duplicate columns after left joins even when the columns' data is identical. After some searching I found that, in pandas 0.14, NaN values are considered different even when both columns are NaN. Once you upgrade to 0.15 the problem disappears, which would explain why it later worked for you: you probably upgraded.
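
Independently of the version-specific behaviour described above (which this sketch does not attempt to confirm), NaN handling matters when checking by hand whether an '_x' column and its '_y' twin really hold the same data. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame with a missing value in the same row of both copies
merged = pd.DataFrame({
    "year_x": [2015.0, 2016.0, np.nan],
    "year_y": [2015.0, 2016.0, np.nan],
})

# Element-wise == follows IEEE rules: NaN is not equal to anything, even NaN,
# so the last row compares as False.
naive_same = bool((merged["year_x"] == merged["year_y"]).all())

# Series.equals treats NaNs in the same positions as equal.
nan_aware_same = merged["year_x"].equals(merged["year_y"])

print(naive_same, nan_aware_same)
# False True
```

A spreadsheet-style check of this_x == this_y can therefore disagree with a NaN-aware comparison whenever both sides are missing.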


fenandosr · Answer · 2017-04-25 23:35:39Z


Not exactly the answer, but pd.merge provides an argument to help you decide which suffixes should be added to your overlapping columns:

merge_df = pd.merge(holding_df, invest_df, on='key', how='left', suffixes=('_holding', '_invest')).fillna(0)

More meaningful names could be helpful if you decide to keep both (or to check why the columns are kept).

See the pd.merge documentation for details.
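
To show what the suffixes argument does to the column names, here is a small sketch with invented one-row frames. Only columns that overlap and are not merge keys receive a suffix.

```python
import pandas as pd

# Hypothetical one-row stand-ins for the two frames
holding_df = pd.DataFrame({"key": ["DeptA_ResA_2015"], "year": [2015], "holding": [1]})
invest_df = pd.DataFrame({"key": ["DeptA_ResA_2015"], "year": [2015], "no_of_inv": [1]})

merge_df = pd.merge(holding_df, invest_df, on="key", how="left",
                    suffixes=("_holding", "_invest"))

# 'year' exists in both frames and is not a key, so it is suffixed;
# 'holding' and 'no_of_inv' are unique to one frame and keep their names.
print(merge_df.columns.tolist())
# ['key', 'year_holding', 'holding', 'year_invest', 'no_of_inv']
```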
