1

I have a dataframe with N columns and I want to create a new column with the number of columns that have a NULL value. I tried to create an UDF but it's not working because of I can't set an array of parameters.

val simpleData = Seq(
("row1", "NULL" , "NULL" , "NULL" , "NULL" , "NULL", "1"),
("row2", "1", "NULL", "2023", "NULL", "01", "NULL"))


val myDs = simpleData.toDF("row", "field1", "field2",  "field3", "field4", "field5", "field6")
myDs.show()

val windowcols = myDs.columns.filterNot(List("row").contains(_))

def countNullsUDF: UserDefinedFunction = udf { (values: List[String]) =>
  values.filter( value => value == "NULL").length
}

myDs.withColumn("columnsWithNull", countNullsUDF(windowcols)).show(10, false)

is it possible to pass it an Array of columns or similar? I didn't get it.

1 Answer 1

2

Actually what you did is almost correct. You can't pass a list of columns to a UDF, however, you can group all columns into one array column, and then pass that array column to your UDF:

import org.apache.spark.sql.functions.{array, col}

// ...

myDs.withColumn(
  "columnsWithNull",
  countNullsUDF(array(windowcols.map(col): _*))
).show(10, false)

+----+------+------+------+------+------+------+---------------+
|row |field1|field2|field3|field4|field5|field6|columnsWithNull|
+----+------+------+------+------+------+------+---------------+
|row1|NULL  |NULL  |NULL  |NULL  |NULL  |1     |5              |
|row2|1     |NULL  |2023  |NULL  |01    |NULL  |3              |
+----+------+------+------+------+------+------+---------------+

I only needed to transform your list of column names into a list of columns with the .map(col), and to use : _* to expand the list as array takes multiple parameters.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.