
Assume the following two DataFrames in PySpark, each with the same number of rows:
df1:
 |_ Column1a
 |_ Column1b

df2:
 |_ Column2a
 |_ Column2b

I wish to create a new DataFrame "df" which has Column1a and Column2a only. What would be the best solution for this?

  • Possible duplicate of How do I add a new column to a Spark DataFrame (using PySpark)? Commented Nov 16, 2016 at 7:00
  • That solution looks at transforming existing columns in a DataFrame or creating a new column, whereas I want to pick one column from each DataFrame to form a new DataFrame. Commented Nov 16, 2016 at 8:30
  • Is the context of the join based on the position? For example, would using the rownumber() approach in this answer work? stackoverflow.com/a/40626348/1100699 Commented Nov 16, 2016 at 18:28
  • I need to give this a try, that might just do it. I'll try it over the weekend. Thanks for the help. I'll get back about how it goes. Commented Nov 18, 2016 at 0:56

1 Answer


Denny Lee's row_number() approach (linked in the comments above) is the way to go.
It involves adding a Unique_Row_ID column to both DataFrames, giving each row a positional identifier. You then join the two DataFrames on Unique_Row_ID, and finally drop Unique_Row_ID if it is no longer needed.
