
I am trying to apply a groupBy on a set of IDs, each of which is stamped with one or more (usually more) codes. The codes are all in the same column. I then want to grab only those IDs that have Code1 but not Code2.

The dataframe looks as follows:

| ID  | Code  |
| --- | ----- |
| 123 | Code1 |
| 123 | Code2 |
| 234 | Code1 |
| 567 | Code1 |
| 567 | Code2 |
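
For reproducibility, the sample data could be built along these lines (a sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [(123, 'Code1'), (123, 'Code2'), (234, 'Code1'),
     (567, 'Code1'), (567, 'Code2')],
    ['ID', 'Code'],
)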

I know this is feasible with a left_anti join as well, but I would like to be able to do it in a groupBy statement too.

Does anybody have any advice?

1 Answer


You can group by ID and collect a list of Code values, then check for the presence of Code1 and Code2:

import pyspark.sql.functions as F

# For each ID, gather all of its codes into a list, then flag the IDs
# whose list contains Code1 but not Code2
df2 = df.groupBy('ID').agg(
    (
        F.array_contains(F.collect_list('Code'), 'Code1') &
        ~F.array_contains(F.collect_list('Code'), 'Code2')
    ).alias('flag')
)

df2.show()
+---+-----+
| ID| flag|
+---+-----+
|234| true|
|567|false|
|123|false|
+---+-----+

df2.filter('flag').show()
+---+----+
| ID|flag|
+---+----+
|234|true|
+---+----+
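
For comparison, the left_anti join route mentioned in the question could look like this (a sketch; ids_with_code2 is just an illustrative name):

# IDs that appear with Code2 at all
ids_with_code2 = df.filter(F.col('Code') == 'Code2').select('ID')

# Keep the Code1 rows whose ID never appears with Code2
df.filter(F.col('Code') == 'Code1') \
    .join(ids_with_code2, on='ID', how='left_anti') \
    .select('ID') \
    .show()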
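
If you want to stay within a single groupBy but avoid building the list twice, conditional aggregation should work as well (a variation, not part of the original answer; df3 is an illustrative name):

# Count Code1 / Code2 occurrences per ID instead of collecting lists
df3 = df.groupBy('ID').agg(
    (
        (F.sum((F.col('Code') == 'Code1').cast('int')) > 0) &
        (F.sum((F.col('Code') == 'Code2').cast('int')) == 0)
    ).alias('flag')
)

df3.filter('flag').show()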