My sample data is below:
import pandas as pd

data1 = {'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
         'type': ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
         'class': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']}
df1 = pd.DataFrame(data1, columns=['index', 'type', 'class'])
df1
  index    type class
0   001     red     A
1   001     red     A
2   001     red     A
3   002  yellow     A
4   002     red     A
5   003   green     A
6   004    blue     A
7   004    blue     A
data2 = {'index': ['001', '001', '002', '003', '004'],
         'type': ['red', 'red', 'yellow', 'green', 'blue'],
         'class': ['A', 'A', 'A', 'B', 'A'],
         'outcome': ['in', 'in', 'out', 'in', 'out']}
df2 = pd.DataFrame(data2, columns=['index', 'type', 'class', 'outcome'])
df2
  index    type class outcome
0   001     red     A      in
1   001     red     A      in
2   002  yellow     A     out
3   003   green     B      in
4   004    blue     A     out
In df1, class is always A; in df2 it can be A, B or C. I want to add to df2 the rows from df1 that are missing from it. df1 has the counts of each type for each index. For example, if index 001 appears 3 times in df1, it should also appear 3 times in df2. For rows from df1 that are not in df2, the outcome column should be NaN. The expected output is:
  index    type class outcome
0   001     red     A      in
1   001     red     A      in
2   001     red     A     NaN
3   002  yellow     A     out
4   002     red     A     NaN
5   003   green     A     NaN
6   003   green     B      in
7   004    blue     A     out
8   004    blue     A     NaN
I tried pd.concat and pd.merge, but I kept getting duplicated or wrong rows. Does anyone have an idea of how to do this?
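One possible approach, sketched here under assumptions rather than as a definitive answer: number the repeated rows within each (index, type, class) group using groupby(...).cumcount(), then do an outer merge on those three columns plus the counter. The counter is what stops a plain merge from multiplying the duplicated rows. The helper column name occurrence and the variable names keys, left, right and result are made up for this illustration, and the sketch assumes that the first rows of a duplicate group in df2 pair up with the first rows of the same group in df1.

```python
import pandas as pd

df1 = pd.DataFrame({
    'index': ['001', '001', '001', '002', '002', '003', '004', '004'],
    'type': ['red', 'red', 'red', 'yellow', 'red', 'green', 'blue', 'blue'],
    'class': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
})
df2 = pd.DataFrame({
    'index': ['001', '001', '002', '003', '004'],
    'type': ['red', 'red', 'yellow', 'green', 'blue'],
    'class': ['A', 'A', 'A', 'B', 'A'],
    'outcome': ['in', 'in', 'out', 'in', 'out'],
})

keys = ['index', 'type', 'class']

# Hypothetical helper column: 0, 1, 2, ... for each repeat of the same key combination.
left = df1.assign(occurrence=df1.groupby(keys).cumcount())
right = df2.assign(occurrence=df2.groupby(keys).cumcount())

# The outer merge keeps every row from both frames. Rows that exist only
# in df1 have no matching outcome, so pandas fills it with NaN.
result = (
    left.merge(right, on=keys + ['occurrence'], how='outer')
        .sort_values(keys + ['occurrence'])
        .drop(columns='occurrence')
        .reset_index(drop=True)
)
print(result)
```

This should give the same nine rows as the expected output, with four NaN values in outcome. Because the result is sorted by the key columns, the order inside an index can differ from the table above (002 red sorts before 002 yellow); if the original df1 order matters, the sort step would need to change.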