
There is an array field in my dataset, like:

my_array:
[
{id: 1, value: x},
{id: 2, value: y}
]

How can I turn it into something like this:

my_struct: {
  1: {value: x},
  2: {value: y}
}

I have tried map_from_entries combined with transform, but I still get an array of structs as output.
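A minimal sketch of such an attempt (hypothetical, for illustration only; not necessarily the exact code) keeps the array shape, since transform on its own only rewrites the elements:

// Hypothetical attempt: transform rewrites each element, but the column
// is still an array of structs, not a single map
val attempt = df.withColumn(
  "my_struct",
  expr("transform(my_array, x -> named_struct('key', x.id, 'value', named_struct('value', x.value)))")
)
// my_struct: array<struct<key:int,value:struct<value:string>>>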

UPDATED

The dataset is read from JSON. The data looks like this:

{"id":1, ... "arrayOfStructs" : [{"name": "x", "key":"value"}, {"name": "y", "key":"value2"}]}

The output should be something like:

{"id":1, ... "structsOnly" : {"x": {"name": "x", "key":"value"}, "y": {"name": "y", "key":"value2"}}}
  • Curious about the ID numbers as column names. Are they the same across all the rows? Spark DF needs a well-defined schema and stable column names.

2 Answers


I think you want to use MapType, not StructType, in this case, since a struct requires you to know all the possible values of the field id up front. Something like this, using the transform + aggregate functions:

import org.apache.spark.sql.functions._

// 1. transform: wrap each struct into a single-entry map keyed by its name
// 2. aggregate: fold those maps into one, starting from an empty typed map
val df1 = df.withColumn(
    "structsOnly",
    expr("""aggregate(
              transform(arrayOfStructs, x -> map(x.name, x)),
              cast(map() as map<string,struct<name:string,key:string>>),
              (acc, x) -> map_concat(acc, x)
           )
    """)
  ).drop("arrayOfStructs")

df1.printSchema
//root
// |-- id: integer (nullable = false)
// |-- structsOnly: map (nullable = true)
// |    |-- key: string
// |    |-- value: struct (valueContainsNull = true)
// |    |    |-- name: string (nullable = true)
// |    |    |-- key: string (nullable = true)

df1.toJSON.show(false)
//+---------------------------------------------------------------------------------------+
//|value                                                                                  |
//+---------------------------------------------------------------------------------------+
//|{"id":1,"structsOnly":{"x":{"name":"x","key":"value"},"y":{"name":"y","key":"value2"}}}|
//+---------------------------------------------------------------------------------------+

Now, if you really want a struct type column, then you'll need to collect all the possible map keys and then construct the column like this:

// Collect the distinct map keys across all rows
// (needs spark.implicits._ for .as[Seq[String]])
val keys = df1.select(map_keys($"structsOnly")).as[Seq[String]].collect.flatten.distinct

// Promote each key to a named struct field
val df2 = df1.withColumn(
  "structsOnly",
  struct(keys.map(k => col("structsOnly").getField(k).as(k)): _*)
)
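Applied to the sample row, structsOnly should now be a struct with one field per collected key (a sketch of the expected shape, not re-run here):

df2.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- structsOnly: struct
//  |    |-- x: struct
//  |    |    |-- name: string
//  |    |    |-- key: string
//  |    |-- y: struct
//  |    |    |-- name: string
//  |    |    |-- key: string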



This may seem like a simple task at first glance, but it is not...

Using this as input:

case class Strct(id: Int, value: String)
val df = Seq(Seq(Strct(1, "x"), Strct(2, "y"))).toDF("my_array")

print(df.toJSON.head())
// {"my_array":[{"id":1,"value":"x"},{"id":2,"value":"y"}]}

df.printSchema()
// root
//  |-- my_array: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- id: integer (nullable = false)
//  |    |    |-- value: string (nullable = true)

I would first build a map column, serialize it to JSON to extract the schema, and then parse it back as a struct.

// Turn each {id, value} into {id, {value}}, fold the entries into one
// map keyed by id, and serialize the result to JSON
val json_col = to_json(aggregate(
    transform($"my_array", x => x.withField("value", x.dropFields("id"))),
    map().cast("map<int,struct<value:string>>"),
    (acc, x) => map_concat(acc, map_from_entries(array(x)))
))
// Infer a struct schema from the JSON, then parse the column back with it
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).alias("my_struct"))

Result:

print(df2.toJSON.head())
// {"my_struct":{"1":{"value":"x"},"2":{"value":"y"}}}

df2.printSchema()
// root
//  |-- my_struct: struct (nullable = true)
//  |    |-- 1: struct (nullable = true)
//  |    |    |-- value: string (nullable = true)
//  |    |-- 2: struct (nullable = true)
//  |    |    |-- value: string (nullable = true)

