I have the following 2 PySpark DataFrames, both with the same number of rows (say 100 rows):

df1:
 |_ Column_a
 |_ Column_b

df2:
 |_ Column_c
 |_ Column_d

How do I create df_final, which has 100 rows and the following columns?

df_final:
 |_ Column_a
 |_ Column_b
 |_ Column_c
 |_ Column_d

I looked at concat(), join(), and union(), but I don't think any of those is right.

Comment: You need a common field to join them. (Aug 11, 2021)
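(A hedged aside, not part of the original thread: one way to manufacture the common field the comment refers to is to attach a positional index to each DataFrame with zipWithIndex and then join on it. This assumes the current row order of each DataFrame is the pairing you want; the helper name with_row_index and the column name row_index are made up for illustration.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small example data shaped like the question (values are illustrative)
df1 = spark.createDataFrame([(2, 3), (4, 5)], ["a", "b"])
df2 = spark.createDataFrame([(20, 30), (40, 50)], ["c", "d"])

def with_row_index(df, index_col="row_index"):
    # zipWithIndex pairs each Row with its 0-based position; Row is a tuple
    # subclass, so appending the index gives a plain tuple that toDF can
    # rebuild using the original column names plus the index column.
    return (
        df.rdd.zipWithIndex()
          .map(lambda pair: tuple(pair[0]) + (pair[1],))
          .toDF(df.columns + [index_col])
    )

df_final = (
    with_row_index(df1)
    .join(with_row_index(df2), on="row_index", how="inner")
    .drop("row_index")
)
df_final.show()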

1 Answer

Try zipping the two RDDs:

>>> df1.show()
+---+---+
|  a|  b|
+---+---+
|  2|  3|
|  4|  5|
+---+---+

>>> df2.show()
+---+---+
|  c|  d|
+---+---+
| 20| 30|
| 40| 50|
+---+---+

>>> df1.rdd.zip(df2.rdd).map(lambda x: (x[0][0],x[0][1],x[1][0],x[1][1])).toDF(['a','b','c','d']).show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  2|  3| 20| 30|
|  4|  5| 40| 50|
+---+---+---+---+
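
Note that RDD.zip only works when the two RDDs have the same number of partitions and the same number of elements in each partition, so equal row counts alone are not always enough. As a sketch (not part of the original answer), the same idea can be written without hard-coding the positional indices, by concatenating the two Rows as tuples and reusing each DataFrame's column list:

>>> df1.rdd.zip(df2.rdd).map(lambda x: tuple(x[0]) + tuple(x[1])).toDF(df1.columns + df2.columns).show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  2|  3| 20| 30|
|  4|  5| 40| 50|
+---+---+---+---+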

1 Comment

This works, thank you! I did not quite understand the mapping part, but through trial and error I was able to join my df1 (3 columns) and df2 (3 columns) with map(lambda x: (x[0][0],x[0][1],x[0][2],x[1][0],x[1][1],x[1][2])). Sharing in case the pattern is helpful to others.
