I have a dataframe of title and bin:

+---------------------+-------------+
|                Title|          bin|        
+---------------------+-------------+
|  Forrest Gump (1994)|            3|
|  Pulp Fiction (1994)|            2|
|   Matrix, The (1999)|            3|
|     Toy Story (1995)|            1|                     
|    Fight Club (1999)|            3|
+---------------------+-------------+

How do I count the occurrences of each bin value and put each count into its own column of a new DataFrame using PySpark? For instance:

+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|      
+------------+------------+------------+
|           1|           1|           3|
+------------+------------+------------+

Is this possible? Could someone please help me with this?

1 Answer
Group by bin and count, then pivot on the bin column, and rename the columns of the resulting DataFrame if you want:

import pyspark.sql.functions as F

df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))

df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])

df1.show()

#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+