I have the following 2 PySpark DataFrames, both with the same number of rows (say 100 rows):

df1:
 |_ Column_a
 |_ Column_b

df2:
 |_ Column_c
 |_ Column_d

How do I create df_final, which has 100 rows and the following columns?

df_final:
 |_ Column_a
 |_ Column_b
 |_ Column_c
 |_ Column_d

I looked at concat(), join(), and union(), but I don't think any of those is right.

Comment: You need a common field to join them. (Aug 11, 2021)
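(A hedged aside, not part of the original thread: one way to manufacture the common field the comment refers to is to attach a positional index to each DataFrame with zipWithIndex and then join on it. This assumes the current row order of each DataFrame is the pairing you want; the helper name with_row_index and the column name row_index are made up for illustration.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small example data shaped like the question (values are illustrative)
df1 = spark.createDataFrame([(2, 3), (4, 5)], ["a", "b"])
df2 = spark.createDataFrame([(20, 30), (40, 50)], ["c", "d"])

def with_row_index(df, index_col="row_index"):
    # zipWithIndex pairs each Row with its 0-based position; Row is a tuple
    # subclass, so appending the index gives a plain tuple that toDF can
    # rebuild using the original column names plus the index column.
    return (
        df.rdd.zipWithIndex()
          .map(lambda pair: tuple(pair[0]) + (pair[1],))
          .toDF(df.columns + [index_col])
    )

df_final = (
    with_row_index(df1)
    .join(with_row_index(df2), on="row_index", how="inner")
    .drop("row_index")
)
df_final.show()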

1 Answer

Try zipping the two RDDs:

>>> df1.show()
+---+---+
|  a|  b|
+---+---+
|  2|  3|
|  4|  5|
+---+---+

>>> df2.show()
+---+---+
|  c|  d|
+---+---+
| 20| 30|
| 40| 50|
+---+---+

>>> df1.rdd.zip(df2.rdd).map(lambda x: (x[0][0],x[0][1],x[1][0],x[1][1])).toDF(['a','b','c','d']).show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  2|  3| 20| 30|
|  4|  5| 40| 50|
+---+---+---+---+
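
Note that RDD.zip only works when the two RDDs have the same number of partitions and the same number of elements in each partition, so equal row counts alone are not always enough. As a sketch (not part of the original answer), the same idea can be written without hard-coding the positional indices, by concatenating the two Rows as tuples and reusing each DataFrame's column list:

>>> df1.rdd.zip(df2.rdd).map(lambda x: tuple(x[0]) + tuple(x[1])).toDF(df1.columns + df2.columns).show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  2|  3| 20| 30|
|  4|  5| 40| 50|
+---+---+---+---+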

1 Comment

This works, thank you! I did not quite understand the mapping part, but through trial and error I was able to join my df1 (3 columns) and df2 (3 columns) with map(lambda x: (x[0][0],x[0][1],x[0][2],x[1][0],x[1][1],x[1][2])). Sharing in case the pattern is helpful to others.
