Adding missing rows from one dataframe to another based on condition

Question

My sample data is below:

import pandas as pd

data1 = {'index':  ['001', '001', '001', '002', '002', '003', '004','004'],
        'type' : ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
        'class' : ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']}
df1 = pd.DataFrame(data1, columns=['index', 'type', 'class'])
df1
    index   type    class
0   001     red     A
1   001     red     A
2   001     red     A
3   002     yellow  A
4   002     red     A
5   003     green   A
6   004     blue    A
7   004     blue    A

data2 = {'index':  ['001', '001', '002', '003', '004'],
        'type' : ['red', 'red', 'yellow', 'green', 'blue'],
        'class' : ['A', 'A', 'A', 'B', 'A'],
        'outcome': ['in', 'in', 'out', 'in', 'out']}
df2 = pd.DataFrame(data2, columns=['index', 'type', 'class', 'outcome'])
df2
    index   type    class   outcome
0   001     red     A       in
1   001     red     A       in
2   002     yellow  A       out
3   003     green   B       in
4   004     blue    A       out

In df1, class is always A; in df2 it can be A, B or C. I want to add to df2 the rows from df1 that are missing from it. df1 holds the correct number of rows for each index and type: for example, index 001 appears 3 times in df1, so it should also appear 3 times in the result. For rows that come from df1 and have no match in df2, the outcome column should be NaN. The expected output is:

    index   type    class   outcome
0   001     red     A       in
1   001     red     A       in
2   001     red     A       NaN
3   002     yellow  A       out
4   002     red     A       NaN
5   003     green   A       NaN
6   003     green   B       in
7   004     blue    A       out
8   004     blue    A       NaN

I tried with pd.concat and pd.merge but I kept getting duplicates or wrong rows added. Does someone have an idea of how to do this?

jezrael · Accepted Answer · 2020-07-03 09:31:44Z

Use GroupBy.cumcount to add a counter that makes repeated rows unique, then do an outer join with DataFrame.merge on the original columns plus that counter:

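# number repeated (index, type, class) rows 0, 1, 2, ... in each frame
df1['group'] = df1.groupby(['index','type','class']).cumcount()
df2['group'] = df2.groupby(['index','type','class']).cumcount()

# outer join pairs the n-th duplicate in df1 with the n-th duplicate in df2;
# rows found only in df1 get NaN in outcome
df = (df1.merge(df2, on=['index','type','class','group'], how='outer')
         .sort_values(by=['index', 'class'])
         .drop(columns='group'))
print(df)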
  index    type class outcome
0   001     red     A      in
1   001     red     A      in
2   001     red     A     NaN
3   002  yellow     A     out
4   002     red     A     NaN
5   003   green     A     NaN
8   003   green     B      in
6   004    blue     A     out
7   004    blue     A     NaN
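
For reuse, the same idea can be wrapped in a small function that does not modify the input frames. This is only a sketch built on the sample data from the question; the function name add_missing_rows and the temporary column _n are made up for illustration, and the row order of the result is not guaranteed to match the output above.

```python
import pandas as pd

def add_missing_rows(left, right, keys):
    """Outer-merge two frames, pairing repeated key rows one-to-one."""
    # Counter 0, 1, 2, ... within each repeated key combination (hypothetical column _n)
    left = left.assign(_n=left.groupby(keys).cumcount())
    right = right.assign(_n=right.groupby(keys).cumcount())
    # The n-th duplicate on the left only matches the n-th duplicate on the right
    merged = left.merge(right, on=keys + ['_n'], how='outer')
    return merged.drop(columns='_n').reset_index(drop=True)

df1 = pd.DataFrame({
    'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
    'type': ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
    'class': ['A'] * 8,
})
df2 = pd.DataFrame({
    'index': ['001', '001', '002', '003', '004'],
    'type': ['red', 'red', 'yellow', 'green', 'blue'],
    'class': ['A', 'A', 'A', 'B', 'A'],
    'outcome': ['in', 'in', 'out', 'in', 'out'],
})

result = add_missing_rows(df1, df2, ['index', 'type', 'class'])
print(result)
```

Because assign returns copies, df1 and df2 keep their original columns, so there is no helper column to clean up afterwards.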

Sin-seok Seo · Accepted Answer · 2020-07-03 09:32:58Z

df1['index_id'] = df1.groupby('index').cumcount()
df2['index_id'] = df2.groupby('index').cumcount()

merged = (
    df2
    .merge(df1, how='outer', on=['index', 'type', 'class', 'index_id'])
    .sort_values(by=['index', 'class'])
    .reset_index(drop=True)
    .drop(columns='index_id')
)

print(merged)
    index   type  class outcome
0   001     red    A    in
1   001     red    A    in
2   001     red    A    NaN
3   002     yellow A    out
4   002     red    A    NaN
5   003     green  A    NaN
6   003     green  B    in
7   004     blue   A    out
8   004     blue   A    NaN
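
One thing to watch with a counter computed per index only: it works on the sample data, but it appears to depend on rows being in the same order within each index in both frames. The sketch below uses two small made-up frames (a and b are hypothetical, not from the question) to compare counting per index with counting per (index, type, class).

```python
import pandas as pd

# Hypothetical frames: b holds only the 'red' row, which is second in a
a = pd.DataFrame({'index': ['002', '002'],
                  'type': ['yellow', 'red'],
                  'class': ['A', 'A']})
b = pd.DataFrame({'index': ['002'], 'type': ['red'],
                  'class': ['A'], 'outcome': ['out']})

def outer_with_counter(left, right, counter_keys):
    keys = ['index', 'type', 'class']
    # Counter column n is computed over counter_keys, then used as a join key
    left = left.assign(n=left.groupby(counter_keys).cumcount())
    right = right.assign(n=right.groupby(counter_keys).cumcount())
    return left.merge(right, on=keys + ['n'], how='outer').drop(columns='n')

by_index = outer_with_counter(a, b, ['index'])
by_keys = outer_with_counter(a, b, ['index', 'type', 'class'])

print(len(by_index))  # the red row is not matched, so it shows up twice
print(len(by_keys))   # the red row is matched once
```

If the row order inside each index can differ between the two frames, counting over all the join columns looks like the safer choice.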
