
I have a PySpark DataFrame, df1:

data1 = [("u1", 'w1', 20), ("u2", 'w1', 30), ("u3", 'w2', 40)]
df1 = spark.createDataFrame(data1, ["ID", "week", "var"])
df1.show()

+---+----+---+
| ID|week|var|
+---+----+---+
| u1|  w1| 20|
| u2|  w1| 30|
| u3|  w2| 40|
+---+----+---+

I have another PySpark DataFrame, df2:

data2 = [("u1", 'w1', 20), ("u1", 'w2', 10), ("u2", 'w1', 30), ("u3", 'w2', 40), ("u3", 'w2', 50), ("u4", 'w1', 100), ("u4", 'w2', 0)]
df2 = spark.createDataFrame(data2, ["ID", "week", "var"])
df2.show()

+---+----+---+
| ID|week|var|
+---+----+---+
| u1|  w1| 20|
| u1|  w2| 10|
| u2|  w1| 30|
| u3|  w2| 40|
| u3|  w2| 50|
| u4|  w1|100|
| u4|  w2|  0|
+---+----+---+

I only want to keep the rows of df2 for which df2.ID is present in df1.ID.

The desired output is:

+---+----+---+
| ID|week|var|
+---+----+---+
| u1|  w1| 20|
| u1|  w2| 10|
| u2|  w1| 30|
| u3|  w2| 40|
| u3|  w2| 50|
+---+----+---+

How can I get this done?

  • I kind of solved the problem, but I am not sure if it is the right way to do it. Can someone please review my solution? `df_new = df1.join(df2, df1['ID'] == df2['ID'], 'inner').select(df2.ID, df2.week, df2.var); df_new.show()` Commented Dec 10, 2020 at 5:51

1 Answer

You can use a left_semi join for this kind of matching-record condition:

df3 = df2.join(df1, df2.ID == df1.ID, 'left_semi')

df3 will contain all the records of df2 (all of df2's columns) whose ID has a match in df1.
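For reference, here is a minimal end-to-end sketch of the semi join. It assumes a running SparkSession named spark, as in the question; the variable names df1, df2, and df3 follow the question and answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the two DataFrames from the question
data1 = [("u1", "w1", 20), ("u2", "w1", 30), ("u3", "w2", 40)]
df1 = spark.createDataFrame(data1, ["ID", "week", "var"])

data2 = [("u1", "w1", 20), ("u1", "w2", 10), ("u2", "w1", 30),
         ("u3", "w2", 40), ("u3", "w2", 50), ("u4", "w1", 100), ("u4", "w2", 0)]
df2 = spark.createDataFrame(data2, ["ID", "week", "var"])

# left_semi keeps only the df2 rows whose ID exists in df1
# and returns df2's columns only (no columns from df1 are added)
df3 = df2.join(df1, df2.ID == df1.ID, "left_semi")
df3.show()

Unlike the inner join followed by a select (as in the comment on the question), a left_semi join never duplicates df2 rows when df1 contains repeated IDs and never carries df1's columns into the result, so no deduplication or column pruning is needed afterwards.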
