How do I do this analysis in PySpark?
I'm not sure how to do this with groupBy:

Input

ID   Rating
AAA  1
AAA  2
BBB  3
BBB  2
AAA  2
BBB  2

Output

ID   Rating  Frequency
AAA  1       1
AAA  2       2
BBB  2       2
BBB  3       1

1 Answer

You can group by both the ID and Rating columns, then count the rows in each group:

import pyspark.sql.functions as F

# Count rows per (ID, Rating) pair, sorted for readability
df2 = df.groupBy('ID', 'Rating').agg(F.count('*').alias('Frequency')).orderBy('ID', 'Rating')
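
For reference, here's a quick end-to-end check on the sample data from the question; this is a minimal sketch assuming an active SparkSession bound to the name spark:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample input from the question
df = spark.createDataFrame(
    [('AAA', 1), ('AAA', 2), ('BBB', 3), ('BBB', 2), ('AAA', 2), ('BBB', 2)],
    ['ID', 'Rating'],
)

df2 = df.groupBy('ID', 'Rating').agg(F.count('*').alias('Frequency')).orderBy('ID', 'Rating')
df2.show()
# +---+------+---------+
# | ID|Rating|Frequency|
# +---+------+---------+
# |AAA|     1|        1|
# |AAA|     2|        2|
# |BBB|     2|        2|
# |BBB|     3|        1|
# +---+------+---------+

Equivalently, df.groupBy('ID', 'Rating').count() produces the same counts in a column named count, which you can rename with withColumnRenamed('count', 'Frequency').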