I am looking for a way to find the difference of two DataFrames based on one column. For example:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)

df_a = sql_context.createDataFrame([("fa", 3), ("fb", 5), ("fc", 7)], ["first name", "id"])

df_b = sql_context.createDataFrame([("la", 3), ("lb", 10), ("lc", 13)], ["last name", "id"])
DataFrame A:

+----------+---+
|first name| id|
+----------+---+
|        fa|  3|
|        fb|  5|
|        fc|  7|
+----------+---+
DataFrame B:

+---------+---+
|last name| id|
+---------+---+
|       la|  3|
|       lb| 10|
|       lc| 13|
+---------+---+
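
Side note: on Spark 2.0+, where SQLContext is deprecated, the same setup can be written with SparkSession; a minimal sketch:

from pyspark.sql import SparkSession

# Equivalent setup on Spark 2.0+, where SparkSession replaces SQLContext
spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([("fa", 3), ("fb", 5), ("fc", 7)], ["first name", "id"])
df_b = spark.createDataFrame([("la", 3), ("lb", 10), ("lc", 13)], ["last name", "id"])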

My goal is to find the difference between DataFrame A and DataFrame B based on the id column; the output would be the following DataFrame:

+---------+---+
|last name| id|
+---------+---+
|       lb| 10|
|       lc| 13|
+---------+---+

I don't want to use the following method:

from pyspark.sql.functions import col
# Collect every id of df_a to the driver, then filter df_b against that set
a_ids = set(df_a.rdd.map(lambda r: r.id).collect())
df_c = df_b.filter(~col('id').isin(a_ids))

I'm looking for a method that is efficient in terms of memory and speed, where I don't have to collect the ids (their number can be in the billions), maybe something like the RDD subtractByKey, but for DataFrames.

PS: I can map df_a to an RDD, but I don't want to map df_b to an RDD.
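
For reference, the RDD-level subtractByKey route alluded to above might look like the sketch below (the names b_pairs, a_keys, remaining_rows are just for illustration). It avoids collecting ids to the driver, but it converts both DataFrames to RDDs, including df_b, which the PS wants to avoid, and it drops the DataFrame schema:

# Sketch: keep the (id, row) pairs of df_b whose id is absent from df_a
b_pairs = df_b.rdd.map(lambda r: (r.id, r))      # (id, Row) pairs from df_b
a_keys = df_a.rdd.map(lambda r: (r.id, None))    # only df_a's keys matter here
remaining_rows = b_pairs.subtractByKey(a_keys).values()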

1 Answer

You can do a left_anti join on column id:

# Keep only the rows of df_b whose id has no match in df_a
df_b.join(df_a.select('id'), how='left_anti', on=['id']).show()
+---+---------+
| id|last name|
+---+---------+
| 10|       lb|
| 13|       lc|
+---+---------+
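
The anti join runs as an ordinary distributed join, so no ids are ever collected to the driver. As a further note, if df_a's id set were small (unlike in the question, where it can reach billions), a broadcast hint, a standard pyspark.sql.functions helper, could nudge Spark toward a broadcast join; a sketch:

from pyspark.sql.functions import broadcast

# Hint Spark to ship df_a's ids to every executor instead of shuffling df_b
df_c = df_b.join(broadcast(df_a.select('id')), how='left_anti', on=['id'])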