I am learning Scala, and am trying to filter a select few columns from a large nested json file to make into a DataFrame. This is the gist of the json:
{
“meta”:
{“a”: 1, b: 2”} // I want to ignore meta
“objects”:
[
{
“caucus”: “Progressive”,
“person” :
{
“name”: “Mary”,
“party”: “Green Party”,
“age”: 50,
“gender”: “female” // etc..
}
}, // etc.
]
}
Hence the data looks like this as is, read in with spark.
val df = spark.read.json("file")
df.show()
+--------------------+--------------------+
| meta| objects|
+--------------------+--------------------+
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
+--------------------+--------------------+
Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.
I've messed around with explode() and have reproduced the schema as a StructType(), but am not sure how to deal with a nested structure like this in general.