I am looking for a way to find the difference between two DataFrames based on one column. For example:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)
df_a = sql_context.createDataFrame([("fa", 3), ("fb", 5), ("fc", 7)], ["first name", "id"])
df_b = sql_context.createDataFrame([("la", 3), ("lb", 10), ("lc", 13)], ["last name", "id"])
DataFrame A:
+----------+---+
|first name| id|
+----------+---+
| fa| 3|
| fb| 5|
| fc| 7|
+----------+---+
DataFrame B:
+---------+---+
|last name| id|
+---------+---+
| la| 3|
| lb| 10|
| lc| 13|
+---------+---+
My goal is to find the difference between DataFrame A and DataFrame B with respect to the id column; the output would be the following DataFrame:
+---------+---+
|last name| id|
+---------+---+
| lb| 10|
| lc| 13|
+---------+---+
I don't want to use the following method:
from pyspark.sql.functions import col

a_ids = set(df_a.rdd.map(lambda r: r.id).collect())
df_c = df_b.filter(~col('id').isin(a_ids))
I'm looking for a method that is efficient in terms of both memory and speed, so that I don't have to collect the ids (there can be billions of them), maybe something like the RDD subtractByKey but for DataFrames.
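For reference, this is a minimal sketch of the subtractByKey semantics I'm after, written at the RDD level; note that it converts both DataFrames to pair RDDs, which is exactly what I'd like to avoid for df_b (see the PS below):

# Sketch of the RDD-level behavior I want (not what I'd run in
# practice, since it also converts df_b to an RDD):
rdd_a = df_a.rdd.map(lambda r: (r["id"], r))  # pair RDD keyed by id
rdd_b = df_b.rdd.map(lambda r: (r["id"], r))
rdd_c = rdd_b.subtractByKey(rdd_a)            # keep rows of B whose id is not in A
df_c = sql_context.createDataFrame(rdd_c.values())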
PS: I can map df_a to an RDD, but I don't want to map df_b to an RDD.