51

Not sure why I'm having a difficult time with this, it seems so simple considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas though since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver’s memory in pyspark.

I have 2 dataframes: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure if I should use filter(), join(), or sql For example:

df1:
+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   348|         2|[225, 2235, 2225]   |
|   567|         1|[1110, 1150]        |
|   595|         1|[1150, 1150, 1150]  |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

df2:
+------+----------+---------+
|userid|   group  |   pick  |
+------+----------+---------+
|   348|         2|     2270|
|   595|         1|     2125|
+------+----------+---------+

Result I want:
+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   567|         1|[1110, 1150]        |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

EDIT: I've tried many join() and filter() functions, I believe the closest I got was:

cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks) # Result has 7 rows

I tried a bunch of different join types, and I also tried different

cond values:
    cond = ((df1.userid == df2.userid) & (df2.group == df2.group)) # result has 7 rows
    cond = ((df1.userid != df2.userid) & (df2.group != df2.group)) # result has 2 rows

However, it seems like the joins are adding additional rows, rather than deleting.

I'm using python 2.7 and spark 2.1.0

0

1 Answer 1

101

Left anti join is what you're looking for:

df1.join(df2, ["userid", "group"], "leftanti")

but the same thing can be done with left outer join:

(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.