I am looking for a way to find the difference between two DataFrames based on one column. For example:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)
df_a = sql_context.createDataFrame([("fa", 3), ("fb", 5), ("fc", 7)], ["first name", "id"])
df_b = sql_context.createDataFrame([("la", 3), ("lb", 10), ("lc", 13)], ["last name", "id"])
DataFrame A:
+----------+---+
|first name| id|
+----------+---+
| fa| 3|
| fb| 5|
| fc| 7|
+----------+---+
DataFrame B:
+---------+---+
|last name| id|
+---------+---+
| la| 3|
| lb| 10|
| lc| 13|
+---------+---+
My goal is to find the difference between DataFrame A and DataFrame B with respect to the id column; the output would be the following DataFrame:
+---------+---+
|last name| id|
+---------+---+
| lb| 10|
| lc| 13|
+---------+---+
I don't want to use the following method:
from pyspark.sql.functions import col

a_ids = set(df_a.rdd.map(lambda r: r.id).collect())
df_c = df_b.filter(~col('id').isin(a_ids))
I'm looking for a method that is efficient in terms of both memory and speed, so that I don't have to collect the ids (there can be billions of them), maybe something like the RDD subtractByKey but for DataFrames.
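For reference, this is a minimal sketch of the subtractByKey semantics I'm after, written at the RDD level; note that it converts both DataFrames to pair RDDs, which is exactly what I'd like to avoid for df_b (see the PS below):

# Sketch of the RDD-level behavior I want (not what I'd run in
# practice, since it also converts df_b to an RDD):
rdd_a = df_a.rdd.map(lambda r: (r["id"], r))  # pair RDD keyed by id
rdd_b = df_b.rdd.map(lambda r: (r["id"], r))
rdd_c = rdd_b.subtractByKey(rdd_a)            # keep rows of B whose id is not in A
df_c = sql_context.createDataFrame(rdd_c.values())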
PS: I can map df_a to an RDD, but I don't want to map df_b to an RDD.