
I have a PySpark DataFrame, df1:

data1 = [("u1", 'w1', 20), ("u2", 'w1', 30), ("u3", 'w2', 40)]
df1 = spark.createDataFrame(data1, ["ID", "week", "var"])
df1.show()

+---+----+---+
| ID|week|var|
+---+----+---+
| u1|  w1| 20|
| u2|  w1| 30|
| u3|  w2| 40|
+---+----+---+

I have another PySpark DataFrame, df2:

data2 = [("u1", 'w1', 20), ("u1", 'w2', 10), ("u2", 'w1', 30), ("u3", 'w2', 40), ("u3", 'w2', 50), ("u4", 'w1', 100), ("u4", 'w2', 0)]
df2 = spark.createDataFrame(data2, ["ID", "week", "var"])
df2.show()

+---+----+---+
| ID|week|var|
+---+----+---+
| u1|  w1| 20|
| u1|  w2| 10|
| u2|  w1| 30|
| u3|  w2| 40|
| u3|  w2| 50|
| u4|  w1|100|
| u4|  w2|  0|
+---+----+---+

I only want to keep the rows of df2 for which df2.ID is present in df1.ID.

The desired output is:

+---+----+---+
| ID|week|var|
+---+----+---+
| u1|  w1| 20|
| u1|  w2| 10|
| u2|  w1| 30|
| u3|  w2| 40|
| u3|  w2| 50|
+---+----+---+

How can I get this done?

  • I kind of solved the problem, but I am not sure if it is the right way to do it. Can someone please review my solution? `df_new = df1.join(df2, df1['ID'] == df2['ID'], 'inner').select(df2.ID, df2.week, df2.var); df_new.show()` Commented Dec 10, 2020 at 5:51

1 Answer

You can use a left_semi join for this kind of matching-record condition:

df3 = df2.join(df1, df2.ID == df1.ID, 'left_semi')

df3 will contain all the records of df2 (all of df2's columns) whose ID has a match in df1.
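For reference, here is a minimal end-to-end sketch of the semi join. It assumes a running SparkSession named spark, as in the question; the variable names df1, df2, and df3 follow the question and answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the two DataFrames from the question
data1 = [("u1", "w1", 20), ("u2", "w1", 30), ("u3", "w2", 40)]
df1 = spark.createDataFrame(data1, ["ID", "week", "var"])

data2 = [("u1", "w1", 20), ("u1", "w2", 10), ("u2", "w1", 30),
         ("u3", "w2", 40), ("u3", "w2", 50), ("u4", "w1", 100), ("u4", "w2", 0)]
df2 = spark.createDataFrame(data2, ["ID", "week", "var"])

# left_semi keeps only the df2 rows whose ID exists in df1
# and returns df2's columns only (no columns from df1 are added)
df3 = df2.join(df1, df2.ID == df1.ID, "left_semi")
df3.show()

Unlike the inner join followed by a select (as in the comment on the question), a left_semi join never duplicates df2 rows when df1 contains repeated IDs and never carries df1's columns into the result, so no deduplication or column pruning is needed afterwards.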
