How do I do this analysis in PySpark?
I'm not sure how to do this with groupBy:

Input

ID   Rating
AAA  1
AAA  2
BBB  3
BBB  2
AAA  2
BBB  2

Output

ID   Rating  Frequency
AAA  1       1
AAA  2       2
BBB  2       2
BBB  3       1

1 Answer

You can group by both the ID and Rating columns, then count the rows in each group:

import pyspark.sql.functions as F

# Count rows per (ID, Rating) pair, sorted for readability
df2 = df.groupBy('ID', 'Rating').agg(F.count('*').alias('Frequency')).orderBy('ID', 'Rating')
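
For reference, here's a quick end-to-end check on the sample data from the question; this is a minimal sketch assuming an active SparkSession bound to the name spark:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample input from the question
df = spark.createDataFrame(
    [('AAA', 1), ('AAA', 2), ('BBB', 3), ('BBB', 2), ('AAA', 2), ('BBB', 2)],
    ['ID', 'Rating'],
)

df2 = df.groupBy('ID', 'Rating').agg(F.count('*').alias('Frequency')).orderBy('ID', 'Rating')
df2.show()
# +---+------+---------+
# | ID|Rating|Frequency|
# +---+------+---------+
# |AAA|     1|        1|
# |AAA|     2|        2|
# |BBB|     2|        2|
# |BBB|     3|        1|
# +---+------+---------+

Equivalently, df.groupBy('ID', 'Rating').count() produces the same counts in a column named count, which you can rename with withColumnRenamed('count', 'Frequency').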