2

I am trying to combine three pandas DataFrames in Python. Below are the three DataFrames and my desired output (where NaN is null). Joining the tables with left merges did not work. What is the correct sequence of two merges to achieve the desired output? (It does not have to be one line of code if that is not possible.) Thanks!

df1
    +--------+
    | x      |
    +--------+
    | 1      |
    | 2      |
    | 3      |
    +--------+

df2
    +--------+---+
    | x      | b |
    +--------+---+
    | 1      | A |
    | 1      | B |
    | 1      | C |
    | 2      | D |
    | 2      | E |
    | 2      | F |
    | 3      | G |
    +--------+---+

df3
    +--------+---+
    | x      | c |
    +--------+---+
    | 1      | L |
    | 1      | M |
    | 2      | N |
    | 3      | O |
    | 3      | P |
    | 3      | Q |
    +--------+---+

df_result
    +----------------+-----+-----+
    | x              |  b  |  c  |
    +----------------+-----+-----+
    | 1              | A   | NaN |
    | 1              | B   | NaN |
    | 1              | C   | NaN |
    | 1              | NaN | L   |
    | 1              | NaN | M   |
    | 2              | D   | NaN |
    | 2              | E   | NaN |
    | 2              | F   | NaN |
    | 2              | NaN | N   |
    | 3              | G   | NaN |
    | 3              | NaN | O   |
    | 3              | NaN | P   |
    | 3              | NaN | Q   |
    +----------------+-----+-----+

The following attempts do not produce the df_result DataFrame shown above:

attempt1:
df_step1 = df1.merge(df2, on='x', how='left')
df_result = df_step1.merge(df3, on='x', how='left')
df_result

I have tried the above with a varying combination of left, right, outer and inner joins / merges
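
For reference, here is a minimal sketch of what the chained left merges return on the sample data above. The name `merged` is only illustrative. Because `x` repeats in both df2 and df3, each merge pairs every matching row with every other matching row, so the result holds combinations of `b` and `c` rather than stacked rows:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"x": [1, 1, 1, 2, 2, 2, 3],
                    "b": ["A", "B", "C", "D", "E", "F", "G"]})
df3 = pd.DataFrame({"x": [1, 1, 2, 3, 3, 3],
                    "c": ["L", "M", "N", "O", "P", "Q"]})

merged = df1.merge(df2, on="x", how="left").merge(df3, on="x", how="left")

# Each x contributes (rows in df2) * (rows in df3) combinations:
# x=1: 3*2, x=2: 3*1, x=3: 1*3
print(len(merged))                # 12 rows, not the 13 in df_result
print(merged.isna().any().any())  # False: every row has both b and c
```

Changing `how` only affects keys missing from one side, and here every `x` appears in all three frames, which would explain why the other join types gave the same picture.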

attempt2:
df_result = pd.concat([df1, df2, df3], axis=1, sort=False)
df_result

This also does not produce the desired df_result DataFrame.

Perhaps I need a combination of a concat and a merge? Or, since each entry basically becomes its own row, I could write a for loop that adds the rows for each value of x to the result. Something like this:

pieces = []
for x in df1['x']:
    # rows for this x: first those from df2, then those from df3
    pieces.append(df2[df2['x'] == x])
    pieces.append(df3[df3['x'] == x])

df_result = pd.concat(pieces, sort=False)
  • where is O from Table 2 Commented Feb 11, 2019 at 15:56
  • And why is "x" the only value that gets just 1 row and not 2? Or are "x", "b", and "c" column names? Commented Feb 11, 2019 at 16:00
  • I suppose you want an analogue to FULL OUTER JOIN, don't you? Commented Feb 11, 2019 at 16:00
  • Hmm, is pd.concat([df1, df2]) what you want? Commented Feb 11, 2019 at 16:03
  • @coldspeed yes. I did not paste the tables correctly so the formatting in the post is off. The first row should be the column names. I will edit the original post. Commented Feb 11, 2019 at 17:28

2 Answers

1

I found that concatenating the second and third DataFrames and then sorting by the x column produces a DataFrame that matches the expected output defined above in df_result:

import pandas as pd

df1 = pd.DataFrame({'x': [1,2,3]})
df2 = pd.DataFrame({'x': [1,1,1,2,2,2,3],
                    'b': ['A', 'B', 'C', 'D', 'E', 'F', 'G']})
df3 = pd.DataFrame({'x': [1,1,2,3,3,3],
                    'c': ['L', 'M', 'N', 'O', 'P', 'Q']})


pd.concat([df2, df3], sort=False).sort_values('x').set_index('x', drop=True)

    b   c
x       
1   A   NaN
1   B   NaN
1   C   NaN
1   NaN L
1   NaN M
2   D   NaN
2   E   NaN
2   F   NaN
2   NaN N
3   G   NaN
3   NaN O
3   NaN P
3   NaN Q
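
One caveat with the sort step: the default `kind='quicksort'` in `sort_values` is not documented as stable, so the order of rows within each `x` is not guaranteed. A minimal sketch, assuming you want the df2 rows before the df3 rows for each `x` and want to keep `x` as a regular column as in df_result (the names `stacked` and `result` are only illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({"x": [1, 1, 1, 2, 2, 2, 3],
                    "b": ["A", "B", "C", "D", "E", "F", "G"]})
df3 = pd.DataFrame({"x": [1, 1, 2, 3, 3, 3],
                    "c": ["L", "M", "N", "O", "P", "Q"]})

# Stack the frames; columns missing from one frame are filled with NaN.
stacked = pd.concat([df2, df3], sort=False)

# mergesort is a stable sort, so rows with equal x keep their stacked order.
result = stacked.sort_values("x", kind="mergesort").reset_index(drop=True)
print(result)
```

`reset_index(drop=True)` replaces the repeated index labels left over from the two inputs with a fresh 0..12 range.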

0

Is this what you need?

import pandas as pd
df2 = pd.DataFrame(data=[(1, 'A'),
 (1, 'B'),
 (1, 'C'),
 (2, 'D'),
 (2, 'E'),
 (3, 'F'),
 (3, 'G')], columns = ("x","b"))

df3 = pd.DataFrame(data=[(1, 'L'),
 (1, 'M'),
 (2, 'N'),
 (3, 'L'),
 (3, 'O'),
 (3, 'P'),
 (3, 'Q')], columns= ("x","c"))

df2["c"] = float('nan')
df3["b"] = float('nan')

df_result=pd.concat((df2,df3), sort=True)[["x","b","c"]]
df_result.sort_values("x")

I got:

   x     b     c
0  1    A  NaN
1  1    B  NaN
2  1    C  NaN
0  1  NaN    L
1  1  NaN    M
3  2    D  NaN
4  2    E  NaN
2  2  NaN    N
5  3    F  NaN
6  3    G  NaN
3  3  NaN    L
4  3  NaN    O
5  3  NaN    P
6  3  NaN    Q
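
Note that the index labels in this output repeat (0, 1, 2, 0, 1, ...) because `pd.concat` keeps each input's original index. A small sketch of what that means in practice, using shortened, made-up frames rather than the data above, and of `ignore_index=True` as one possible way around it:

```python
import pandas as pd

# Shortened, hypothetical frames just to show the index behaviour.
df2 = pd.DataFrame({"x": [1, 1, 2], "b": ["A", "B", "D"]})
df3 = pd.DataFrame({"x": [1, 2], "c": ["L", "N"]})

stacked = pd.concat([df2, df3], sort=False)
print(stacked.index.is_unique)  # False
print(len(stacked.loc[0]))      # 2: label 0 matches one row from each frame

# ignore_index=True discards the original labels and numbers the rows 0..n-1.
renumbered = pd.concat([df2, df3], sort=False, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

Whether this matters depends on how the result is used later; it only becomes a problem if rows are looked up by label.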

1 Comment

Aha! Yes! Thank you. This solves it. Much appreciated.