1

I am learning Scala, and am trying to filter a select few columns from a large nested json file to make into a DataFrame. This is the gist of the json:

{
  “meta”: 
    {“a”: 1, b: 2”}    // I want to ignore meta
  “objects”:
  [
    {
         “caucus”: “Progressive”,
     “person” : 
         {
          “name”: “Mary”,
          “party”: “Green Party”,
          “age”: 50,
          “gender”: “female” // etc..
         }
    }, // etc.
   ] 
}

Hence the data looks like this as is, read in with spark.

    val df = spark.read.json("file")
    df.show()
+--------------------+--------------------+
|                meta|             objects|
+--------------------+--------------------+
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
+--------------------+--------------------+

Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.

I've messed around with explode() and have reproduced the schema as a StructType(), but am not sure how to deal with a nested structure like this in general.

2 Answers 2

1

You can use ".*" on a column of type struct to tranform it it into multiple fields columns:

val df = spark.read.json("file.json")
df.select(col("meta"), explode(col("objects")).as("objects"))
  .select("meta.*", "objects.*")
  .select("a", "b", "caucus", "person.*")
  .show(false)


+---+---+-----------+---+------+----+-----------+
|a  |b  |caucus     |age|gender|name|party      |
+---+---+-----------+---+------+----+-----------+
|1  |2  |Progressive|50 |female|Mary|Green Party|
+---+---+-----------+---+------+----+-----------+
Sign up to request clarification or add additional context in comments.

Comments

1

There's no generic way to handle it because of course it depends on the shape of your data. In your case, you want to explode an array, which will create a column called col, that will contain structs. You can then access the fields within the struct using the dot notation, so to extract the fields you asked for you can do this:

df.select(explode_outer($"objects")).
  select(
     $"col.caucus", 
     $"col.person.name",
     $"col.person.party").show

+-----------+----+-----------+
|     caucus|name|      party|
+-----------+----+-----------+
|Progressive|Mary|Green Party|
+-----------+----+-----------+

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.