Merging multiple Pandas data-frames on multiple columns

Question

I am trying to merge additional dataframes (DF_B, DF_C) onto DF_A to produce DF_D.

The only way to tie the additional dataframes to DF_A is through column B_2, so I am trying to merge them on B_2. I tried this code below to merge the first additional dataframe (DF_B).

DF_D = pd.merge(DF_A, DF_B, how='left', on='B_2')

This almost worked, but it creates additional columns.

So I thought adding left_on= might work, but it did not.

DF_D = pd.merge(DF_A, DF_B, how='left', left_on=['B_2','C_3', 'D_4'])

I'm looking for a way to write the additional dataframes over the main dataframe until DF_D is filled out. I would also like DF_D to retain all rows and the original column names, even when a merge finds no match.

Original main dataframe A:

     A_1   B_2 C_3   D_4
0  03/17  3001          
1  03/17  2002   L  BLUE
2  03/17  3777          
3  04/17  5555          
4  04/17  3232          
5  04/17  5000          
6  04/17  5151          
7  05/17  2212   S   RED

Additional dataframe B:

    B_2 C_3    D_4
0  3001   M   GRAY
1  3131   S   BLUE
2  3333  XS  GREEN
3  3232   L   PINK
4  3000   M    RED

Used like:

DF_1 = pd.merge(DF_A, DF_B, how='left', on='B_2')

Additional dataframe C:

    B_2 C_3    D_4
0  5151   S   BLUE
1  5545   M   PINK
2  5555  XL    RED
3  5222   L   GRAY
4  5112   S  GREEN

Used like:

DF_D = pd.merge(DF_1, DF_C, how='left', on='B_2')

Result, final DF_D:

     A_1   B_2 C_3   D_4
0  03/17  3001   M  GRAY
1  03/17  2002   L  BLUE
2  03/17  3777          
3  04/17  5555  XL   RED
4  04/17  3232   L  PINK
5  04/17  5000          
6  04/17  5151   S  BLUE
7  05/17  2212   S   RED

mcskinner · Accepted Answer · 2020-04-18 19:52:29Z

It sounds like you want something like this:

# Make DF_A look like DF_B and DF_C. Same columns, no missing values.
DF_A_filt = DF_A[['B_2', 'C_3', 'D_4']]
DF_A_filt = DF_A_filt[DF_A_filt['C_3'].notnull()]

# Put all the "feature" data together.
df_data = pd.concat([DF_A_filt, DF_B, DF_C], ignore_index=True)

# Drop duplicates by the join key B_2 to keep only the first match.
# This will prefer DF_A, then DF_B, then DF_C.
df_data = df_data.drop_duplicates('B_2')

# Merge the features back onto the keys by B_2.
DF_D = DF_A[['A_1', 'B_2']].merge(df_data, on='B_2', how='left')

The data along the way looks like so:

DF_A_filt
#     B_2 C_3   D_4
# 1  2002   L  BLUE
# 7  2212   S   RED

df_data
#      B_2 C_3    D_4
# 0   2002   L   BLUE
# 1   2212   S    RED
# 2   3001   M   GRAY
# 3   3131   S   BLUE
# 4   3333  XS  GREEN
# 5   3232   L   PINK
# 6   3000   M    RED
# 7   5151   S   BLUE
# 8   5545   M   PINK
# 9   5555  XL    RED
# 10  5222   L   GRAY
# 11  5112   S  GREEN

DF_D
#      A_1   B_2  C_3   D_4
# 0  03/17  3001    M  GRAY
# 1  03/17  2002    L  BLUE
# 2  03/17  3777  NaN   NaN
# 3  04/17  5555   XL   RED
# 4  04/17  3232    L  PINK
# 5  04/17  5000  NaN   NaN
# 6  04/17  5151    S  BLUE
# 7  05/17  2212    S   RED
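
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the sample frames from the question and wraps the steps above in a small function. The name `fill_features` and its parameters are hypothetical (not a pandas API), and the blank cells in DF_A are assumed to be NaN. One small difference from the code above: it treats a DF_A row as "known" if any of the value columns is filled, not just C_3.

```python
import numpy as np
import pandas as pd


def fill_features(main, extras, key, value_cols):
    """Fill value_cols in `main` by key, preferring main, then each extra in order.

    Hypothetical helper for illustration, not part of pandas.
    """
    cols = [key] + list(value_cols)
    # Rows of `main` that already carry at least one value.
    known = main.loc[main[list(value_cols)].notna().any(axis=1), cols]
    # Stack every source, then keep only the first row seen for each key.
    lookup = pd.concat([known] + [df[cols] for df in extras], ignore_index=True)
    lookup = lookup.drop_duplicates(subset=key)
    # Merge the values back onto the remaining columns of `main`.
    rest = [c for c in main.columns if c not in value_cols]
    merged = main[rest].merge(lookup, on=key, how="left")
    return merged[list(main.columns)]


# Sample data from the question; blanks are assumed to be NaN.
DF_A = pd.DataFrame({
    "A_1": ["03/17", "03/17", "03/17", "04/17", "04/17", "04/17", "04/17", "05/17"],
    "B_2": [3001, 2002, 3777, 5555, 3232, 5000, 5151, 2212],
    "C_3": [np.nan, "L", np.nan, np.nan, np.nan, np.nan, np.nan, "S"],
    "D_4": [np.nan, "BLUE", np.nan, np.nan, np.nan, np.nan, np.nan, "RED"],
})
DF_B = pd.DataFrame({
    "B_2": [3001, 3131, 3333, 3232, 3000],
    "C_3": ["M", "S", "XS", "L", "M"],
    "D_4": ["GRAY", "BLUE", "GREEN", "PINK", "RED"],
})
DF_C = pd.DataFrame({
    "B_2": [5151, 5545, 5555, 5222, 5112],
    "C_3": ["S", "M", "XL", "L", "S"],
    "D_4": ["BLUE", "PINK", "RED", "GRAY", "GREEN"],
})

DF_D = fill_features(DF_A, [DF_B, DF_C], key="B_2", value_cols=["C_3", "D_4"])
print(DF_D)
```

If the blank cells in DF_A are actually empty strings rather than NaN, they would likely need converting first (for example with `DF_A.replace('', np.nan)`), since `notna()` does not treat an empty string as missing.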

Thank you @mcskinner - this solved my problem. Appreciate you laying out the data flow along the way. Your answer helped me better understand how to think about processing data. Cheers!

Parfait · Answer · 2020-04-18 20:50:08Z

Consider building a list of data frames, each one merged onto dfA, then bfill across the sorted columns, followed by concat + groupby + first:

# MERGE EACH df TO dfA
df_list = [dfA.merge(df, on='B_2', how='left', suffixes=['','_'])
              for df in [dfB, dfC]]

# SORT BY COLUMN NAMES THEN bfill BY ROW
df_list = [df.reindex(sorted(df.columns.to_list()), axis='columns') 
             .bfill(axis=1) for df in df_list]

# CONCAT + GROUPBY + FIRST
final_df = (pd.concat(df_list)
              .reindex(dfA.columns.to_list(), axis='columns')
              .groupby(['A_1', 'B_2'], as_index = False, sort=False)
              .first())

print(final_df)
#      A_1   B_2  C_3   D_4
# 0  03/17  3001    M  GRAY
# 1  03/17  2002    L  BLUE
# 2  03/17  3777  NaN   NaN
# 3  04/17  5555   XL   RED
# 4  04/17  3232    L  PINK
# 5  04/17  5000  NaN   NaN
# 6  04/17  5151    S  BLUE
# 7  05/17  2212    S   RED
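
To see why the `suffixes` and `bfill` steps matter, here is a small sketch on a two-row slice of the sample data (the frames are trimmed for illustration, and blanks are assumed to be NaN). With the default suffixes, the overlapping columns come back as `_x`/`_y` pairs, which is where the "additional columns" in the question come from. With `suffixes=('', '_')`, the left-hand names stay as they are, so sorting the columns puts each original next to its counterpart and a row-wise bfill can pull the value across.

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame({
    "A_1": ["03/17", "03/17"],
    "B_2": [3001, 2002],
    "C_3": [np.nan, "L"],
    "D_4": [np.nan, "BLUE"],
})
dfB = pd.DataFrame({"B_2": [3001], "C_3": ["M"], "D_4": ["GRAY"]})

# Default suffixes: overlapping columns are renamed on both sides.
default = dfA.merge(dfB, on="B_2", how="left")
print(default.columns.tolist())
# ['A_1', 'B_2', 'C_3_x', 'D_4_x', 'C_3_y', 'D_4_y']

# Empty left suffix: dfA's names are kept, dfB's get a trailing underscore.
merged = dfA.merge(dfB, on="B_2", how="left", suffixes=("", "_"))

# Sorting places C_3 next to C_3_ and D_4 next to D_4_.
paired = merged.reindex(sorted(merged.columns), axis="columns")
print(paired.columns.tolist())
# ['A_1', 'B_2', 'C_3', 'C_3_', 'D_4', 'D_4_']

# Row-wise backfill copies the next non-missing value to the right into each gap.
filled = paired.bfill(axis=1)[dfA.columns.tolist()]
print(filled)
```

One caveat with this pattern: bfill takes the next non-missing value to its right, whatever column that is, so a row missing both C_3 and C_3_ but with a D_4 value would get the D_4 value in C_3. That does not happen with the sample data, but it is worth checking on real data.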

Hi Parfait, thank you for your answer. I'm sure your solution works fine, but I was able to complete my task with mcskinner's answer first. Cheers!
