
I have event data per device_id, where each event is sometimes successful and sometimes unsuccessful.

device_id status
1 Successful
1 UnSuccessful
1 UnSuccessful
1 UnSuccessful
1 Successful
2 Successful
2 UnSuccessful
2 UnSuccessful

Is there a way to do a group by and get the counts for each device_id in a single row, like this:

device_id success_count unsuccessful_count
1 2 3
2 1 2

I have been trying several ways using group by, but I haven't been able to get the success_count and unsuccessful_count for a device_id in a single row.

1 Answer

You need to group your data by device id and then pivot by status and count:

df.groupBy("device_id").pivot("status").count()
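A minimal, self-contained sketch of that approach (the sample data below mirrors the question; the SparkSession setup is assumed and not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the question.
data = [
    (1, "Successful"), (1, "UnSuccessful"), (1, "UnSuccessful"),
    (1, "UnSuccessful"), (1, "Successful"),
    (2, "Successful"), (2, "UnSuccessful"), (2, "UnSuccessful"),
]
df = spark.createDataFrame(data, ["device_id", "status"])

# One row per device_id, one column per distinct status value,
# each cell holding the number of rows with that status.
result = df.groupBy("device_id").pivot("status").count()
result.show()
# (row order may vary)
# +---------+----------+------------+
# |device_id|Successful|UnSuccessful|
# +---------+----------+------------+
# |        1|         2|           3|
# |        2|         1|           2|
# +---------+----------+------------+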

2 Comments

I somehow get a ClassCastException: Py4JJavaError: An error occurred while calling o1604.showString. : java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')... Any idea why this might happen?
I'm accepting the answer, since the issue above is specific to my use case and the solution does what I asked for.
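
As a side note, and not part of the accepted answer: a conditional aggregation is an assumed alternative that produces the exact success_count / unsuccessful_count column names from the question without pivoting (reusing the df built in the sketch above):

from pyspark.sql import functions as F

# Count only the rows matching each status; F.when(...) is null when the
# condition is false, and F.count ignores nulls.
counts = df.groupBy("device_id").agg(
    F.count(F.when(F.col("status") == "Successful", True)).alias("success_count"),
    F.count(F.when(F.col("status") == "UnSuccessful", True)).alias("unsuccessful_count"),
)
counts.show()
# (row order may vary)
# +---------+-------------+------------------+
# |device_id|success_count|unsuccessful_count|
# +---------+-------------+------------------+
# |        1|            2|                 3|
# |        2|            1|                 2|
# +---------+-------------+------------------+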
