
I am trying to apply a groupBy on a set of IDs, each of which is stamped with one or more (usually more) codes. The codes are all in the same column. I then want to grab only those IDs that have Code1 but not Code2.

The dataframe looks as follows:

| ID  | Code  |
| --- | ----- |
| 123 | Code1 |
| 123 | Code2 |
| 234 | Code1 |
| 567 | Code1 |
| 567 | Code2 |
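
For reproducibility, the sample data could be built along these lines (a sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [(123, 'Code1'), (123, 'Code2'), (234, 'Code1'),
     (567, 'Code1'), (567, 'Code2')],
    ['ID', 'Code'],
)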

I know this is feasible with a left_anti join as well, but I would like to be able to do it in a groupBy statement too.

Does anybody have any advice?

1 Answer


You can group by ID and collect a list of Code values, then check for the presence of Code1 and Code2:

import pyspark.sql.functions as F

# For each ID, gather all of its codes into a list, then flag the IDs
# whose list contains Code1 but not Code2
df2 = df.groupBy('ID').agg(
    (
        F.array_contains(F.collect_list('Code'), 'Code1') &
        ~F.array_contains(F.collect_list('Code'), 'Code2')
    ).alias('flag')
)

df2.show()
+---+-----+
| ID| flag|
+---+-----+
|234| true|
|567|false|
|123|false|
+---+-----+

df2.filter('flag').show()
+---+----+
| ID|flag|
+---+----+
|234|true|
+---+----+
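
For comparison, the left_anti join route mentioned in the question could look like this (a sketch; ids_with_code2 is just an illustrative name):

# IDs that appear with Code2 at all
ids_with_code2 = df.filter(F.col('Code') == 'Code2').select('ID')

# Keep the Code1 rows whose ID never appears with Code2
df.filter(F.col('Code') == 'Code1') \
    .join(ids_with_code2, on='ID', how='left_anti') \
    .select('ID') \
    .show()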
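
If you want to stay within a single groupBy but avoid building the list twice, conditional aggregation should work as well (a variation, not part of the original answer; df3 is an illustrative name):

# Count Code1 / Code2 occurrences per ID instead of collecting lists
df3 = df.groupBy('ID').agg(
    (
        (F.sum((F.col('Code') == 'Code1').cast('int')) > 0) &
        (F.sum((F.col('Code') == 'Code2').cast('int')) == 0)
    ).alias('flag')
)

df3.filter('flag').show()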