How to translate a complex nested JSON structure into multiple columns in a Spark DataFrame

Question

I am learning Scala and am trying to extract a few columns from a large nested JSON file into a DataFrame. This is the gist of the JSON:

{
  "meta":
    {"a": 1, "b": 2},    // I want to ignore meta
  "objects":
  [
    {
      "caucus": "Progressive",
      "person":
        {
          "name": "Mary",
          "party": "Green Party",
          "age": 50,
          "gender": "female" // etc.
        }
    }, // etc.
  ]
}

Read in with Spark as is, the data looks like this:

    val df = spark.read.json("file")
    df.show()
+--------------------+--------------------+
|                meta|             objects|
+--------------------+--------------------+
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
+--------------------+--------------------+

Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.

I've messed around with explode() and have reproduced the schema as a StructType(), but am not sure how to deal with a nested structure like this in general.

Abdennacer Lachiheb · Accepted Answer · 2023-01-31 00:03:15Z

You can use ".*" on a column of type struct to transform it into multiple columns, one per field:

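import org.apache.spark.sql.functions.{col, explode}

val df = spark.read.json("file.json")
// explode turns each element of the objects array into its own row
df.select(col("meta"), explode(col("objects")).as("objects"))
  .select("meta.*", "objects.*")   // expand both structs into top-level columns
  .select("a", "b", "caucus", "person.*")
  .show(false)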


+---+---+-----------+---+------+----+-----------+
|a  |b  |caucus     |age|gender|name|party      |
+---+---+-----------+---+------+----+-----------+
|1  |2  |Progressive|50 |female|Mary|Green Party|
+---+---+-----------+---+------+----+-----------+
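Since the question asks to ignore meta and keep only name, party and caucus, here is a minimal sketch of the same ".*" technique narrowed to those columns. The object name StarExpansionSketch, the helper flatten and the in-memory sample record are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical wrapper object, only here to make the sketch self-contained.
object StarExpansionSketch {

  // One row per element of `objects`; `meta` is never selected, so it is dropped.
  def flatten(df: DataFrame): DataFrame =
    df.select(explode(col("objects")).as("obj"))
      .select("obj.*")                    // columns: caucus, person
      .select("caucus", "person.*")       // expand the person struct into its fields
      .select("name", "party", "caucus")  // keep only the requested columns

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("star-expansion-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample: one JSON document per string, mirroring the question.
    val sample = Seq(
      """{"meta": {"a": 1, "b": 2}, "objects": [{"caucus": "Progressive", "person": {"name": "Mary", "party": "Green Party", "age": 50, "gender": "female"}}]}"""
    ).toDS()

    flatten(spark.read.json(sample)).show(false)
    spark.stop()
  }
}
```

If the real file is pretty-printed across several lines rather than one JSON document per line, reading it will likely need spark.read.option("multiLine", true).json(...).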

Roberto Congiu · Accepted Answer · 2023-01-30 22:05:39Z

There's no generic way to handle this, because it depends on the shape of your data. In your case, you want to explode an array; by default the resulting column is named col and contains structs. You can then access the fields within each struct using dot notation, so to extract the fields you asked for you can do this:

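import org.apache.spark.sql.functions.explode_outer
import spark.implicits._  // enables the $"..." column syntax

df.select(explode_outer($"objects")).
  select(
     $"col.caucus",
     $"col.person.name",
     $"col.person.party").show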

+-----------+----+-----------+
|     caucus|name|      party|
+-----------+----+-----------+
|Progressive|Mary|Green Party|
+-----------+----+-----------+
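The question also mentions reproducing the schema as a StructType. As a variation on the approach above (not something this answer proposes), one option is to pass the reader a schema that lists only the fields you need, so meta, age and gender are never loaded, and then alias the columns to Name, Party and Caucus. The object name ExplicitSchemaSketch, the helper extract and the sample record are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Hypothetical wrapper object, only here to make the sketch self-contained.
object ExplicitSchemaSketch {

  // Only the needed fields; JSON fields missing from this schema are not read.
  val schema: StructType = new StructType()
    .add("objects", ArrayType(
      new StructType()
        .add("caucus", StringType)
        .add("person", new StructType()
          .add("name", StringType)
          .add("party", StringType))))

  def extract(spark: SparkSession, json: Dataset[String]): DataFrame =
    spark.read.schema(schema).json(json)
      .select(explode_outer(col("objects")))  // default output column name is "col"
      .select(
        col("col.person.name").as("Name"),
        col("col.person.party").as("Party"),
        col("col.caucus").as("Caucus"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explicit-schema-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample: one JSON document per string, mirroring the question.
    val sample = Seq(
      """{"meta": {"a": 1, "b": 2}, "objects": [{"caucus": "Progressive", "person": {"name": "Mary", "party": "Green Party", "age": 50, "gender": "female"}}]}"""
    ).toDS()

    extract(spark, sample).show(false)
    spark.stop()
  }
}
```

Supplying the schema up front also skips schema inference, which can matter on a large file, at the cost of having to keep the schema in sync with the data.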
