
Assume the following two DataFrames in PySpark, each with the same number of rows:
df1:
 |_ Column1a
 |_ Column1b

df2:
 |_ Column2a
 |_ Column2b

I wish to create a new DataFrame "df" which has Column1a and Column2a only. What would be the best solution for this?

  • Possible duplicate of How do I add a new column to a Spark DataFrame (using PySpark)? Commented Nov 16, 2016 at 7:00
  • That solution looks at transforming existing columns in a DataFrame or creating a new column, whereas I want to pick one column from each DataFrame to form a new DataFrame. Commented Nov 16, 2016 at 8:30
  • Is the context of the join based on the position? For example, would using the rownumber() approach in this answer work? stackoverflow.com/a/40626348/1100699 Commented Nov 16, 2016 at 18:28
  • I need to give this a try, that might just do it. I'll try it over the weekend. Thanks for the help. I'll get back about how it goes. Commented Nov 18, 2016 at 0:56

1 Answer


Denny Lee's row_number() approach (linked in the comments above) is the way to go.
It involves adding a Unique_Row_ID column to both DataFrames, giving each row a positional identifier. You then join the two DataFrames on Unique_Row_ID, and finally drop Unique_Row_ID if it is no longer needed.
