Combine Three DataFrames Using Pandas in Python

Question

I am trying to combine three pandas DataFrames in Python. Below are the three DataFrames and my desired output (where NaN is null). Joining the tables with a left merge did not work. What is the correct sequence of two merges to achieve the desired output? (It does not have to be one line of code if that is not possible.) Thanks!

df1
    +--------+
    | x      |
    +--------+
    | 1      |
    | 2      |
    | 3      |
    +--------+

df2
    +--------+---+
    | x      | b |
    +--------+---+
    | 1      | A |
    | 1      | B |
    | 1      | C |
    | 2      | D |
    | 2      | E |
    | 2      | F |
    | 3      | G |
    +--------+---+

df3
    +--------+---+
    | x      | c |
    +--------+---+
    | 1      | L |
    | 1      | M |
    | 2      | N |
    | 3      | O |
    | 3      | P |
    | 3      | Q |
    +--------+---+

df_result
    +----------------+-----+-----+
    | x              |  b  |  c  |
    +----------------+-----+-----+
    | 1              | A   | NaN |
    | 1              | B   | NaN |
    | 1              | C   | NaN |
    | 1              | NaN | L   |
    | 1              | NaN | M   |
    | 2              | D   | NaN |
    | 2              | E   | NaN |
    | 2              | F   | NaN |
    | 2              | NaN | N   |
    | 3              | G   | NaN |
    | 3              | NaN | O   |
    | 3              | NaN | P   |
    | 3              | NaN | Q   |
    +----------------+-----+-----+

The following attempts do not produce the df_result DataFrame shown above:

attempt1:
df_step1 = df1.merge(df2, on='x', how='left')
df_result = df_step1.merge(df3, on='x', how='left')
df_result

I have tried the above with various combinations of left, right, outer, and inner joins/merges.
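A minimal sketch of why the chained merge cannot give the desired shape, assuming the three DataFrames are built exactly as in the tables above: for each value of x, merge pairs every matching row of df2 with every matching row of df3, so b and c are always filled in together rather than on separate rows.

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]})
df2 = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2, 3],
                    'b': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
df3 = pd.DataFrame({'x': [1, 1, 2, 3, 3, 3],
                    'c': ['L', 'M', 'N', 'O', 'P', 'Q']})

merged = df1.merge(df2, on='x', how='left').merge(df3, on='x', how='left')

# Each x contributes (rows in df2) * (rows in df3) rows: 3*2 + 3*1 + 1*3 = 12
print(len(merged))                   # 12
# Every row has both b and c set, so there are no NaN cells at all
print(merged.isna().any().any())     # False
```

The desired df_result has 13 rows (7 from df2 plus 6 from df3), which is a row-wise stack rather than a join.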

attempt2:
df_result = pd.concat([df1, df2, df3], axis=1, sort=False)
df_result

This also does not produce the desired df_result DataFrame.

Perhaps I need a combination of concat and merge? Or, since each entry basically gets its own row, I could write a for loop that adds each of these entries as a new row in the DataFrame. Something like this:

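pieces = []
for x_val in df1['x']:
    # rows of df2 that share this x value
    pieces.append(df2[df2['x'] == x_val])
    # rows of df3 that share this x value
    pieces.append(df3[df3['x'] == x_val])

# DataFrame.append was removed in pandas 2.0, so the pieces are stacked with concat
df_result = pd.concat(pieces, sort=False)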

And why is "x" the only value that gets only 1 row and not 2? Or are "x", "b", and "c" column names? – cs95, Commented Feb 11, 2019 at 16:00
I suppose you want an analogue to FULL OUTER JOIN, don't you? – Alex Yu, Commented Feb 11, 2019 at 16:00
@coldspeed yes. I did not paste the tables correctly, so the formatting in the post is off. The first row should be the column names. I will edit the original post. – Jon, Commented Feb 11, 2019 at 17:28

James Dellinger · Accepted Answer · 2019-02-11 19:34:36Z

I found that concatenating the second and third DataFrames and then sorting by the x column produces a DataFrame that matches the expected output defined above in df_result:

import pandas as pd

df1 = pd.DataFrame({'x': [1,2,3]})
df2 = pd.DataFrame({'x': [1,1,1,2,2,2,3],
                    'b': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
df3 = pd.DataFrame({'x': [1,1,2,3,3,3],
                    'c': ['L', 'M', 'N', 'O', 'P', 'Q']})


pd.concat([df2, df3], sort=False).sort_values('x').set_index('x', drop=True)

    b   c
x       
1   A   NaN
1   B   NaN
1   C   NaN
1   NaN L
1   NaN M
2   D   NaN
2   E   NaN
2   F   NaN
2   NaN N
3   G   NaN
3   NaN O
3   NaN P
3   NaN Q
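The row order within each x group depends on how ties are sorted. A hedged variant of the same idea, assuming x should stay an ordinary column as in df_result: requesting a stable sort (kind='mergesort') keeps rows with equal x in their concatenation order, so the df2 rows come before the df3 rows in each group.

```python
import pandas as pd

df2 = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2, 3],
                    'b': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
df3 = pd.DataFrame({'x': [1, 1, 2, 3, 3, 3],
                    'c': ['L', 'M', 'N', 'O', 'P', 'Q']})

stacked = (
    pd.concat([df2, df3], sort=False)        # stack rows; missing cells become NaN
      .sort_values('x', kind='mergesort')    # stable sort: ties keep concat order
      .reset_index(drop=True)                # fresh 0..n-1 index, x stays a column
)
print(stacked)
```

This keeps the columns in the order x, b, c and gives 13 rows, matching the layout of df_result in the question.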

Alex Yu · Accepted Answer · 2019-02-11 19:18:59Z

Is this what you need?

import pandas as pd
df2 = pd.DataFrame(data=[(1, 'A'),
                         (1, 'B'),
                         (1, 'C'),
                         (2, 'D'),
                         (2, 'E'),
                         (3, 'F'),
                         (3, 'G')], columns=("x", "b"))

df3 = pd.DataFrame(data=[(1, 'L'),
                         (1, 'M'),
                         (2, 'N'),
                         (3, 'L'),
                         (3, 'O'),
                         (3, 'P'),
                         (3, 'Q')], columns=("x", "c"))

df2["c"] = float('nan')
df3["b"] = float('nan')

df_result = pd.concat((df2, df3), sort=True)[["x", "b", "c"]]
df_result.sort_values("x")

I got:

   x     b     c
0  1    A  NaN
1  1    B  NaN
2  1    C  NaN
0  1  NaN    L
1  1  NaN    M
3  2    D  NaN
4  2    E  NaN
2  2  NaN    N
5  3    F  NaN
6  3    G  NaN
3  3  NaN    L
4  3  NaN    O
5  3  NaN    P
6  3  NaN    Q
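Neither approach above uses df1. If df1 is meant to define which x values should appear in the result (an assumption; the question does not say so), one possible sketch is to filter the stacked rows with isin. The reduced df1 below is hypothetical, chosen only so that the filter has a visible effect; df2 and df3 follow the tables in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2]})   # hypothetical: only x = 1 and x = 2 are wanted
df2 = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2, 3],
                    'b': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
df3 = pd.DataFrame({'x': [1, 1, 2, 3, 3, 3],
                    'c': ['L', 'M', 'N', 'O', 'P', 'Q']})

stacked = pd.concat([df2, df3], sort=False)

# Keep only rows whose x also occurs in df1, then order them by x
kept = (
    stacked[stacked['x'].isin(df1['x'])]
      .sort_values('x', kind='mergesort')   # stable sort keeps df2 rows before df3 rows
      .reset_index(drop=True)
)
print(kept)
```

With the df1 from the question (x = 1, 2, 3), every row passes the filter, so the result is the same as the plain concatenation.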
