I have a dataframe of title and bin:

+---------------------+-------------+
|                Title|          bin|        
+---------------------+-------------+
|  Forrest Gump (1994)|            3|
|  Pulp Fiction (1994)|            2|
|   Matrix, The (1999)|            3|
|     Toy Story (1995)|            1|                     
|    Fight Club (1999)|            3|
+---------------------+-------------+

How do I count the occurrences of each bin value and put each count into its own column of a new DataFrame using PySpark? For instance:

+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|      
+------------+------------+------------+
|           1|           1|           3|
+------------+------------+------------+

Is this possible? Could someone please help me with this?

1 Answer
Group by bin and count, then pivot on the bin column, and rename the columns of the resulting DataFrame if you want:

import pyspark.sql.functions as F

df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))

df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])

df1.show()

#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+