
I have a large number of nested JSON records, each with more than 200 keys, that I want to convert and store in a structured table.

  |-- ip_address: string (nullable = true)
  |-- xs_latitude: double (nullable = true)
  |-- Applications: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- b_als_o_isehp: string (nullable = true)
  |    |    |-- b_als_p_isehp: string (nullable = true)
  |    |    |-- b_als_s_isehp: string (nullable = true)
  |    |    |-- l_als_o_eventid: string (nullable = true)
                 ....

I read the JSON and want, for each ip_address, the data from its Applications array:

{"ip_address": 1512199720, "Applications": [{"s_pd": -1, "s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp2"}]}

val data = spark.read.json("file:///root/users/data/s_json.json")
var appDf = data.withColumn("data", explode($"Applications")).select($"Applications.s_pd", $"Applications.s_path", $"Applications.p_pd", $"ip_address")
appDf.printSchema
// gives
root
  |-- s_pd: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- s_path: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- p_pd: array (nullable = true)
  |    |-- element: string (containsNull = true)
  |-- ip_address: string (nullable = true)

Each DataFrame record contains arrays with duplicated values. How can I get the records in table format?

  • Off the top of my head, you could try appDf.select("ip_address", "xs_latitude", "Applications.*") to flatten out such a structure. Or is it arbitrarily deeply nested? Commented Apr 3, 2018 at 7:44

1 Answer

Mistake

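Your mistake is that you select the nested fields from the original Applications column, which is still an array of structs, instead of from the exploded column. Selecting a field through an array of structs returns an array holding that field from every element, which is why each column in your schema is an array.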
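
To see the difference side by side, here is a minimal sketch that compares the two selections. It assumes a local SparkSession and Spark 2.2 or later (for reading JSON from a Dataset[String]). The object name ArrayFieldAccessDemo and the shortened inline record are made up for illustration and are not part of the question's code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical demo object; the name is an assumption for illustration only.
object ArrayFieldAccessDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-field-access")
      .getOrCreate()
    import spark.implicits._

    // Shortened version of the question's sample record, one JSON object per line.
    val data = spark.read.json(Seq(
      """{"ip_address": 1512199720, "Applications": [{"s_pd": -1, "p_pd": "temp0"}, {"s_pd": -1, "p_pd": "temp1"}]}"""
    ).toDS())

    // Field access through the array column: the field is collected from
    // every element, so the resulting column is itself an array.
    data.select(col("Applications.s_pd")).printSchema()

    // Field access through the exploded column: "data" is a single struct
    // per row, so the resulting column holds one plain value per row.
    data.withColumn("data", explode(col("Applications")))
      .select(col("data.s_pd"))
      .printSchema()

    spark.stop()
  }
}
```

The first schema should show s_pd as an array column, as in the question, and the second should show it as a plain column.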

Solution

You need to select the fields from the exploded column, which is data:

// explode produces one row per array element, stored as a struct in "data"
val appDf = data.withColumn("data", explode($"Applications"))
  .select($"ip_address", $"data.s_pd", $"data.s_path", $"data.p_pd")

and you should get

+----------+----+---------+-----+
|ip_address|s_pd|s_path   |p_pd |
+----------+----+---------+-----+
|1512199720|-1  |NA       |temp0|
|1512199720|-1  |root/hdfs|temp1|
|1512199720|-1  |root/hdfs|temp2|
+----------+----+---------+-----+
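
Since the question mentions more than 200 keys, listing each field by hand may not be practical. One possible approach, sketched below, is to expand every field of the exploded struct with a star selection ("data.*"). The sketch assumes a local SparkSession and Spark 2.2 or later. The object FlattenApplications and its flatten helper are hypothetical names, and the inline record is a shortened copy of the question's sample.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper object; the names are assumptions for illustration only.
object FlattenApplications {

  // Explode the Applications array into one row per element, then expand
  // every field of the resulting struct into its own top-level column.
  def flatten(data: DataFrame): DataFrame =
    data
      .withColumn("data", explode(col("Applications")))
      .select(col("ip_address"), col("data.*"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-applications")
      .getOrCreate()
    import spark.implicits._

    // Shortened version of the question's sample record, one JSON object per line.
    val data = spark.read.json(Seq(
      """{"ip_address": 1512199720, "Applications": [{"s_pd": -1, "s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"}]}"""
    ).toDS())

    val flat = flatten(data)
    flat.printSchema()
    flat.show(false)

    spark.stop()
  }
}
```

The star only expands one level of the struct. If the elements of Applications contain further nested structs or arrays, those would need their own explode or star selection.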

I hope the answer is helpful
