PySpark : How to aggregate on a column with count of the different

Question

I want to aggregate on the Identifiant column, counting how many times each state occurs per Identifiant, with every state shown as its own column.

Identifiant	state
ID01	NY
ID02	NY
ID01	CA
ID03	CA
ID01	CA
ID03	NY
ID01	NY
ID01	CA
ID01	NY

I'd like to obtain this dataset:

Identifiant	NY	CA
ID01	3	3
ID02	1	0
ID03	1	1

blackbishop · Accepted Answer · 2022-02-10 13:16:16Z

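Group by Identifiant, pivot on the state column, and count. The na.fill(0) call replaces the nulls Spark returns for combinations that never occur (for example ID02 and CA):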

# One row per Identifiant, one count column per distinct state.
result = (df.groupBy("Identifiant")
          .pivot("state")
          .count()
          .na.fill(0)  # combinations that never occur are null; show them as 0
          )

result.show()
#+-----------+---+---+
#|Identifiant| CA| NY|
#+-----------+---+---+
#|       ID03|  1|  1|
#|       ID01|  3|  3|
#|       ID02|  0|  1|
#+-----------+---+---+

edited Feb 10, 2022 at 13:16

answered Feb 10, 2022 at 13:12


1 Comment

ScootCork Over a year ago

Nice, but you can simplify the code: df.groupby('Identifiant').pivot('State').count().fillna(0)
