0

I have a pyspark function that looks like this. \

spark.sql("select count(*) from student_table where student_id is NULL") \
spark.sql("select count(*) from student_table where student_scores is NULL") \
spark.sql("select count(*) from student_table where student_health is NULL")

I get a result that looks like \

+-------+|count(1)|\n-------+|    0|+-------+\n|count(1)|\n-------+|    100|+-------+\n|count(1)|\n-------+|  24145|

What I want to do is to make the result into a dataframe for each column by using pandas or pyspark function. The result should have each null value result for each column.
For example,

enter image description here

Thanks in advance if someone can help me out.

2 Answers 2

1

You could use union between the 3 queries but actually you can get all null counts for each column using one query:

spark.sql("""
    SELECT  SUM(INT(student_id IS NULL))     AS student_id_nb_null,
            SUM(INT(student_scores IS NULL)) AS student_scores_nb_null,
            SUM(INT(student_health IS NULL)) AS student_health_nb_null,
    FROM    student_table 
""").show()

#+------------------+----------------------+----------------------+
#|student_id_nb_null|student_scores_nb_null|student_health_nb_null|
#+------------------+----------------------+----------------------+
#|                 0|                   100|                 24145|
#+------------------+----------------------+----------------------+

Or by using DataFrame API with:

import pyspark.sql.functions as F    

df.agg(
    F.sum(F.col("student_id").isNull().cast("int")).alias("student_id_nb_null"),
    F.sum(F.col("student_scores").isNull().cast("int")).alias("student_scores_nb_null"),
    F.sum(F.col("student_health").isNull().cast("int")).alias("student_health_nb_null")
)
Sign up to request clarification or add additional context in comments.

Comments

0

Use union all and add all your queries in one spark.sql.

Example:

spark.sql("""select "student_id" `column_name`,count(*) `null_result` from tmp where student_id is null \
union all \
select "student_scores" `column_name`,count(*) `null_result` from tmp where student_scores is null \
union all \
select "student_health" `column_name`,count(*) `null_result` from tmp where student_health is null""").\
show()

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.