Spark Scala: storing nested JSON as a structured table

Question

I have a large number of nested JSON records, each with more than 200 keys, that I want to convert and store in a structured table.

  |-- ip_address: string (nullable = true)
  |-- xs_latitude: double (nullable = true)
  |-- Applications: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- b_als_o_isehp: string (nullable = true)
  |    |    |-- b_als_p_isehp: string (nullable = true)
  |    |    |-- b_als_s_isehp: string (nullable = true)
  |    |    |-- l_als_o_eventid: string (nullable = true)
                 ....

I read the JSON and, for each ip_address, try to get the data from its Applications array:

 {"ip_address": 1512199720,"Applications": [{"s_pd": -1,"s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1,"s_path": "root/hdfs", "p_pd": "temp1"},{"s_pd": -1,"s_path": "root/hdfs", "p_pd": "temp2"}]}

val data = spark.read.json("file:///root/users/data/s_json.json")
var appDf = data.withColumn("data", explode($"Applications")).select($"Applications.s_pd", $"Applications.s_path", $"Applications.p_pd", $"ip_address")
appDf.printSchema
// gives
root
  |-- s_pd: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- s_path: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- p_pd: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- ip_address: string (nullable = true)

Each record in the resulting DataFrame contains arrays that repeat the same values. How can I get the records in table format?

Off the top of my head, you could try appDf.select("ip_address", "xs_latitude", "Applications.*") to flatten out such a structure. Or is it arbitrarily deeply nested?
– Dominic Egger, Commented Apr 3, 2018 at 7:44

Community · Accepted Answer · 2020-06-20 09:12:55Z

Mistake

Your mistake is that you select the nested fields from the original Applications array column instead of from the exploded column, so each selected field comes back as an array holding the values from every element.
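To make the effect concrete, here is a minimal sketch that repeats the question's kind of select on a two-element sample. The object name, the helper `viaArray`, the local master setting, and the inline sample record are assumptions for illustration; it also assumes Spark 2.2 or later, where `spark.read.json` accepts a `Dataset[String]`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ArrayFieldSelection {
  // Hypothetical helper mirroring the question: the array is exploded into
  // "data", but the field is still read through the original array column.
  def viaArray(data: DataFrame): DataFrame =
    data
      .withColumn("data", explode(col("Applications")))
      .select(col("Applications.s_path"), col("ip_address"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("array-field-selection")
      .getOrCreate()
    import spark.implicits._

    // One record shaped like the sample in the question, cut to two elements.
    val sample = Seq(
      """{"ip_address": 1512199720, "Applications": [""" +
        """{"s_pd": -1, "s_path": "NA", "p_pd": "temp0"}, """ +
        """{"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"}]}"""
    ).toDS()

    val result = viaArray(spark.read.json(sample))
    result.printSchema()  // s_path is reported as an array column
    result.show(false)    // one row per element, each carrying the whole array

    spark.stop()
  }
}
```

The explode does produce one row per element, but because the select reads through Applications, every row still carries the full array. That is where the duplicated values come from.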

Solution

You have to select the fields from the exploded column, which is data:

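import org.apache.spark.sql.functions.explode
// explode yields one row per array element; read the fields from that new column
val appDf = data.withColumn("data", explode($"Applications"))
  .select($"ip_address", $"data.s_pd", $"data.s_path", $"data.p_pd")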

and you should get

+----------+----+---------+-----+
|ip_address|s_pd|s_path   |p_pd |
+----------+----+---------+-----+
|1512199720|-1  |NA       |temp0|
|1512199720|-1  |root/hdfs|temp1|
|1512199720|-1  |root/hdfs|temp2|
+----------+----+---------+-----+
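Since the question mentions more than 200 keys, listing every field by hand may be impractical. A possible variation, sketched here under assumptions, is to expand the exploded struct with `data.*`. The object name, the helper `flatten`, the local master setting, and the inline sample are hypothetical; it assumes Spark 2.2 or later for `spark.read.json` on a `Dataset[String]`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FlattenApplications {
  // Hypothetical helper: one output row per element of Applications, with
  // every field of the element struct promoted to a top-level column.
  def flatten(data: DataFrame): DataFrame =
    data
      .withColumn("data", explode(col("Applications")))
      .select(col("ip_address"), col("data.*"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("flatten-applications")
      .getOrCreate()
    import spark.implicits._

    // The sample record from the question, written as valid JSON.
    val sample = Seq(
      """{"ip_address": 1512199720, "Applications": [""" +
        """{"s_pd": -1, "s_path": "NA", "p_pd": "temp0"}, """ +
        """{"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"}, """ +
        """{"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp2"}]}"""
    ).toDS()

    val flat = flatten(spark.read.json(sample))
    flat.printSchema()
    flat.show(false)

    spark.stop()
  }
}
```

Note that explode drops rows whose array is null or empty; if those rows need to be kept, explode_outer (available since Spark 2.2) can be used instead. The star expansion only flattens one struct level, so deeper nesting would need the same step repeated.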

I hope the answer is helpful

edited Jun 20, 2020 at 9:12 by CommunityBot

answered Apr 3, 2018 at 8:11 by Anahcolus
