I'm trying to merge two dataframes which contain the same key column. Some of the other columns also have identical headers, although not an equal number of rows, and after merging these columns are "duplicated" with the original headers given a postscript _x, _y, etc.
Does anyone know how to get pandas to drop the duplicate columns in the example below?
This is my python code:
import pandas as pd
holding_df = pd.read_csv('holding.csv')
invest_df = pd.read_csv('invest.csv')
merge_df = pd.merge(holding_df, invest_df, on='key', how='left').fillna(0)
merge_df.to_csv('merged.csv', index=False)
And the CSV files contain this:
First rows of left-dataframe (holding_df)
key, dept_name, res_name, year, need, holding
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1
...
Right-dataframe (invest_df)
key, dept_name, res_name, year, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1000000
DeptA_ResB_2015, DeptA, ResB, 2015, 2, 6000000
DeptB_ResB_2015, DeptB, ResB, 2015, 1, 6000000
...
Merged result
key, dept_name_x, res_name_x, year_x, need, holding, dept_name_y, res_name_y, year_y, no_of_inv, inv_cost_wo_ice
DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1, DeptA, ResA, 2015.0, 1.0, 1000000.0
DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2018, DeptA, ResA, 2018, 1, 1, 0, 0, 0.0, 0.0, 0.0
DeptA_ResA_2019, DeptA, ResA, 2019, 1, 1, 0, 0, 0.0, 0.0, 0.0
...
merge_df = pd.merge(holding_df, invest_df, on=['key', 'dept_name', 'res_name', 'year'], how='left').fillna(0)_xand_ycolumns originate from the left and right frames in the merge. You'll need to specify more columns to indicate that they're the same (pandas doesn't know that).dropbut rename will require passing a dict torename_xcolumns and drop all the_ycolumns, you'll need to usedropandrenameas suggested I can post a dynamic method to do this