Pyspark question making count result into a dataframe

Question

I have a pyspark function that looks like this. \

spark.sql("select count(*) from student_table where student_id is NULL") \
spark.sql("select count(*) from student_table where student_scores is NULL") \
spark.sql("select count(*) from student_table where student_health is NULL")

I get a result that looks like \

+-------+|count(1)|\n-------+|    0|+-------+\n|count(1)|\n-------+|    100|+-------+\n|count(1)|\n-------+|  24145|

What I want to do is to make the result into a dataframe for each column by using pandas or pyspark function. The result should have each null value result for each column.
For example,

Thanks in advance if someone can help me out.

blackbishop · Accepted Answer · 2021-11-07 16:14:21Z

You could use union between the 3 queries but actually you can get all null counts for each column using one query:

spark.sql("""
    SELECT  SUM(INT(student_id IS NULL))     AS student_id_nb_null,
            SUM(INT(student_scores IS NULL)) AS student_scores_nb_null,
            SUM(INT(student_health IS NULL)) AS student_health_nb_null,
    FROM    student_table 
""").show()

#+------------------+----------------------+----------------------+
#|student_id_nb_null|student_scores_nb_null|student_health_nb_null|
#+------------------+----------------------+----------------------+
#|                 0|                   100|                 24145|
#+------------------+----------------------+----------------------+

Or by using DataFrame API with:

import pyspark.sql.functions as F    

df.agg(
    F.sum(F.col("student_id").isNull().cast("int")).alias("student_id_nb_null"),
    F.sum(F.col("student_scores").isNull().cast("int")).alias("student_scores_nb_null"),
    F.sum(F.col("student_health").isNull().cast("int")).alias("student_health_nb_null")
)

notNull · Accepted Answer · 2021-11-07 14:10:20Z

0

Use union all and add all your queries in one spark.sql.

Example:

spark.sql("""select "student_id" `column_name`,count(*) `null_result` from tmp where student_id is null \
union all \
select "student_scores" `column_name`,count(*) `null_result` from tmp where student_scores is null \
union all \
select "student_health" `column_name`,count(*) `null_result` from tmp where student_health is null""").\
show()

answered Nov 7, 2021 at 14:10

notNull

31.8k4 gold badges41 silver badges58 bronze badges

Collectives™ on Stack Overflow

Pyspark question making count result into a dataframe

2 Answers 2

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Related