I am trying to merge additional dataframes (DF_B, DF_C) onto DF_A to produce DF_D.

The only way to tie the additional dataframes to DF_A is through column B_2, so I am trying to merge them on B_2. I tried this code below to merge the first additional dataframe (DF_B).

DF_D = pd.merge(DF_A, DF_B, how='left', on='B_2') 

This almost worked, but it creates additional columns.
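To show what I mean by additional columns, here is a minimal sketch using cut-down versions of the frames. The names `left`, `right` and `merged` and the two-row data are only for illustration, not my real data. Because C_3 and D_4 exist in both frames, the merge keeps both copies and renames them with pandas' default `_x` / `_y` suffixes:

```python
import pandas as pd

# Cut-down stand-ins for DF_A and DF_B (illustrative data only).
left = pd.DataFrame({
    'A_1': ['03/17', '04/17'],
    'B_2': [3001, 5000],
    'C_3': [None, None],
    'D_4': [None, None],
})
right = pd.DataFrame({'B_2': [3001], 'C_3': ['M'], 'D_4': ['GRAY']})

# Same call as above: only B_2 is the join key, so the other shared
# columns (C_3, D_4) are duplicated rather than filled in.
merged = pd.merge(left, right, how='left', on='B_2')

print(merged.columns.tolist())
# ['A_1', 'B_2', 'C_3_x', 'D_4_x', 'C_3_y', 'D_4_y']
```

So the values I want end up in C_3_y / D_4_y instead of in the original C_3 / D_4 columns.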

So I thought adding left_on= might work, but it did not.

DF_D = pd.merge(DF_A, DF_B, how='left', left_on=['B_2','C_3', 'D_4'])


I'm looking for a way to write the additional dataframes over the main dataframe until DF_D is filled out. I would also like DF_D to retain all rows and the original columns and column names, even if there is no match during the merge.

Original main dataframe A:

     A_1   B_2 C_3   D_4
0  03/17  3001          
1  03/17  2002   L  BLUE
2  03/17  3777          
3  04/17  5555          
4  04/17  3232          
5  04/17  5000          
6  04/17  5151          
7  05/17  2212   S   RED

Additional dataframe B:

    B_2 C_3    D_4
0  3001   M   GRAY
1  3131   S   BLUE
2  3333  XS  GREEN
3  3232   L   PINK
4  3000   M    RED

Used like:

DF_1 = pd.merge(DF_A, DF_B, how='left', on='B_2')

Additional dataframe C:

    B_2 C_3    D_4
0  5151   S   BLUE
1  5545   M   PINK
2  5555  XL    RED
3  5222   L   GRAY
4  5112   S  GREEN

Used like:

DF_D = pd.merge(DF_1, DF_C, how='left', on='B_2')

Result, final DF_D:

     A_1   B_2 C_3   D_4
0  03/17  3001   M  GRAY
1  03/17  2002   L  BLUE
2  03/17  3777          
3  04/17  5555  XL   RED
4  04/17  3232   L  PINK
5  04/17  5000          
6  04/17  5151   S  BLUE
7  05/17  2212   S   RED

2 Answers

It sounds like you want something like this:

# Make DF_A look like DF_B and DF_C. Same columns, no missing values.
DF_A_filt = DF_A[['B_2', 'C_3', 'D_4']]
DF_A_filt = DF_A_filt[DF_A_filt['C_3'].notnull()]

# Put all the "feature" data together.
df_data = pd.concat([DF_A_filt, DF_B, DF_C], ignore_index=True)

# Drop duplicates by the join key B_2 to keep only the first match.
# This will prefer DF_A, then DF_B, then DF_C.
df_data = df_data.drop_duplicates('B_2')

# Merge the features back onto the keys by B_2.
DF_D = DF_A[['A_1', 'B_2']].merge(df_data, on='B_2', how='left')

The data along the way looks like so:

DF_A_filt
#     B_2 C_3   D_4
# 1  2002   L  BLUE
# 7  2212   S   RED

df_data
#      B_2 C_3    D_4
# 0   2002   L   BLUE
# 1   2212   S    RED
# 2   3001   M   GRAY
# 3   3131   S   BLUE
# 4   3333  XS  GREEN
# 5   3232   L   PINK
# 6   3000   M    RED
# 7   5151   S   BLUE
# 8   5545   M   PINK
# 9   5555  XL    RED
# 10  5222   L   GRAY
# 11  5112   S  GREEN

DF_D
#      A_1   B_2  C_3   D_4
# 0  03/17  3001    M  GRAY
# 1  03/17  2002    L  BLUE
# 2  03/17  3777  NaN   NaN
# 3  04/17  5555   XL   RED
# 4  04/17  3232    L  PINK
# 5  04/17  5000  NaN   NaN
# 6  04/17  5151    S  BLUE
# 7  05/17  2212    S   RED
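
To run the steps above end to end, here is a minimal self-contained sketch that rebuilds the sample frames from the question. It assumes the blank cells in DF_A are missing values (None/NaN) rather than empty strings; empty strings would need converting to NaN first, otherwise the `notnull()` filter would not drop them. It also assumes B_2 is stored as integers.

```python
import pandas as pd

# Sample frames from the question. Assumption: blank cells are missing
# values (None), not empty strings.
DF_A = pd.DataFrame({
    'A_1': ['03/17', '03/17', '03/17', '04/17',
            '04/17', '04/17', '04/17', '05/17'],
    'B_2': [3001, 2002, 3777, 5555, 3232, 5000, 5151, 2212],
    'C_3': [None, 'L', None, None, None, None, None, 'S'],
    'D_4': [None, 'BLUE', None, None, None, None, None, 'RED'],
})
DF_B = pd.DataFrame({
    'B_2': [3001, 3131, 3333, 3232, 3000],
    'C_3': ['M', 'S', 'XS', 'L', 'M'],
    'D_4': ['GRAY', 'BLUE', 'GREEN', 'PINK', 'RED'],
})
DF_C = pd.DataFrame({
    'B_2': [5151, 5545, 5555, 5222, 5112],
    'C_3': ['S', 'M', 'XL', 'L', 'S'],
    'D_4': ['BLUE', 'PINK', 'RED', 'GRAY', 'GREEN'],
})

# Keep only the DF_A rows that already have feature data.
DF_A_filt = DF_A[['B_2', 'C_3', 'D_4']]
DF_A_filt = DF_A_filt[DF_A_filt['C_3'].notnull()]

# Stack all feature rows; on a duplicate key the earliest frame wins
# (DF_A, then DF_B, then DF_C).
df_data = pd.concat([DF_A_filt, DF_B, DF_C], ignore_index=True)
df_data = df_data.drop_duplicates('B_2')

# Left-merge so every DF_A row is kept, matched or not.
DF_D = DF_A[['A_1', 'B_2']].merge(df_data, on='B_2', how='left')

print(DF_D)
```

Because the final merge is a left merge from DF_A's keys, keys that appear only in DF_B or DF_C (3131, 5545, and so on) do not add rows to DF_D.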

1 Comment

Thank you @mcskinner - this solved my problem. Appreciate you laying out the data flow along the way. Your answer helped me better understand how to think about processing data. Cheers!
Consider merging each additional data frame onto dfA, then back-filling across the sorted columns with bfill, followed by concat + groupby + first (here dfA, dfB and dfC stand for the question's DF_A, DF_B and DF_C):

# MERGE EACH df TO dfA
df_list = [dfA.merge(df, on='B_2', how='left', suffixes=['','_'])
              for df in [dfB, dfC]]

# SORT BY COLUMN NAMES THEN bfill BY ROW
df_list = [df.reindex(sorted(df.columns.to_list()), axis='columns') 
             .bfill(axis=1) for df in df_list]

# CONCAT + GROUPBY + FIRST
final_df = (pd.concat(df_list)
              .reindex(dfA.columns.to_list(), axis='columns')
              .groupby(['A_1', 'B_2'], as_index = False, sort=False)
              .first())

print(final_df)
#      A_1   B_2  C_3   D_4
# 0  03/17  3001    M  GRAY
# 1  03/17  2002    L  BLUE
# 2  03/17  3777  NaN   NaN
# 3  04/17  5555   XL   RED
# 4  04/17  3232    L  PINK
# 5  04/17  5000  NaN   NaN
# 6  04/17  5151    S  BLUE
# 7  05/17  2212    S   RED

1 Comment

Hi Parfait, thank you for your answer. I'm sure your solution works fine, but I was able to complete my task with mcskinner's answer first. Cheers!
