I'm trying to operate on a df with the following data:
+---+----------------------------------------------------+
|ka |readingsWFreq |
+---+----------------------------------------------------+
|列 |[[[列,つ],220], [[列,れっ],353], [[列,れつ],47074]] |
|制 |[[[制,せい],235579]] |
+---+----------------------------------------------------+
And the following structure:
root
|-- ka: string (nullable = true)
|-- readingsWFreq: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- furigana: struct (nullable = true)
| | | |-- _1: string (nullable = true)
| | | |-- _2: string (nullable = true)
| | |-- Occ: long (nullable = true)
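For reproducibility, here is a sketch of how a DataFrame with this schema can be built (the field names `furigana` and `Occ` come from the printed schema; the sample rows are from the table above, and the case class name `Reading` is just a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The tuple field encodes as struct<_1:string,_2:string>, matching the schema
case class Reading(furigana: (String, String), Occ: Long)

val df = Seq(
  ("列", Seq(Reading(("列", "つ"), 220L),
             Reading(("列", "れっ"), 353L),
             Reading(("列", "れつ"), 47074L))),
  ("制", Seq(Reading(("制", "せい"), 235579L)))
).toDF("ka", "readingsWFreq")
```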
My goal is to split readingsWFreq's values into three different columns. For that purpose I've tried to use UDFs as follows:
val uExtractK = udf((kWFreq:Seq[((String, String), Long)]) => kWFreq.map(_._1._1))
val uExtractR = udf((kWFreq:Seq[((String, String), Long)]) => kWFreq.map(_._1._2))
val uExtractN = udf((kWFreq:Seq[((String, String), Long)]) => kWFreq.map(_._2))
val df2 = df.withColumn("K", uExtractK('readingsWFreq))
.withColumn("R", uExtractR('readingsWFreq))
.withColumn("N", uExtractN('readingsWFreq))
.drop('readingsWFreq)
However, I'm getting an exception related to the input parameter of the UDFs:
[error] (run-main-0) org.apache.spark.sql.AnalysisException: cannot resolve
'UDF(readingsWFreq)' due to data type mismatch: argument 1 requires
array<struct<_1:struct<_1:string,_2:string>,_2:bigint>> type, however,
'`readingsWFreq`' is of
array<struct<furigana:struct<_1:string,_2:string>,Occ:bigint>> type.;;
My question is, how can I manipulate the dataframe so that it results in the following?
+---+------------+----------------+------------------+
|ka |K           |R               |N                 |
+---+------------+----------------+------------------+
|列 |[列, 列, 列]|[つ, れっ, れつ]|[220, 353, 47074] |
|制 |[制]        |[せい]          |[235579]          |
+---+------------+----------------+------------------+