
I have nested JSON that I need to convert into a flattened DataFrame, without explicitly defining or exploding any column names.

val df = sqlCtx.read.option("multiLine",true).json("test.json")

This is what my data looks like:

[
  {
    "symbol": "TEST3",
    "timestamp": "2019-05-07 16:00:00",
    "priceData": {
      "open": "1177.2600",
      "high": "1179.5500",
      "low": "1176.6700",
      "close": "1179.5500",
      "volume": "49478"
    }
  },
  {
    "symbol": "TEST4",
    "timestamp": "2019-05-07 16:00:00",
    "priceData": {
      "open": "189.5660",
      "high": "189.9100",
      "low": "189.5100",
      "close": "189.9100",
      "volume": "267986"
    }
  }
]
  • Which version of Spark are you using? Commented May 8, 2019 at 14:02
  • spark version = 2.3.0 Commented May 9, 2019 at 5:17

1 Answer

Here is one way using the DataFrameFlattener class implemented by Databricks:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

implicit class DataFrameFlattener(df: DataFrame) {
  def flattenSchema: DataFrame = {
    df.select(flatten(Nil, df.schema): _*)
  }

  protected def flatten(path: Seq[String], schema: DataType): Seq[Column] = schema match {
    // Recurse into nested structs, extending the field path as we go
    case s: StructType => s.fields.flatMap(f => flatten(path :+ f.name, f.dataType))
    // Leaf field: select it by its backtick-escaped path and alias it
    // with the dot-joined name (e.g. "priceData.close")
    case other => col(path.map(n => s"`$n`").mkString(".")).as(path.mkString(".")) :: Nil
  }
}

df.flattenSchema.show

And the output:

+---------------+--------------+-------------+--------------+----------------+------+-------------------+
|priceData.close|priceData.high|priceData.low|priceData.open|priceData.volume|symbol|          timestamp|
+---------------+--------------+-------------+--------------+----------------+------+-------------------+
|      1179.5500|     1179.5500|    1176.6700|     1177.2600|           49478| TEST3|2019-05-07 16:00:00|
|       189.9100|      189.9100|     189.5100|      189.5660|          267986| TEST4|2019-05-07 16:00:00|
+---------------+--------------+-------------+--------------+----------------+------+-------------------+
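Note that the flattened column names contain literal dots (the aliases produced by `flattenSchema`), so any later reference to them must be escaped with backticks, otherwise Spark parses the dot as struct-field access. A short sketch:

```scala
// The alias "priceData.close" is a flat column whose name contains a dot,
// so it must be wrapped in backticks when selected again:
df.flattenSchema.select("`priceData.close`", "symbol").show

// Without backticks, select("priceData.close") would be resolved against
// the flattened schema, where no struct named "priceData" exists anymore,
// and fail with an AnalysisException.
```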

Or you can just execute a normal select:

df.select(
  "priceData.close", 
  "priceData.high", 
  "priceData.low", 
  "priceData.open", 
  "priceData.volume", 
  "symbol", 
  "timestamp").show

Output:

+---------+---------+---------+---------+------+------+-------------------+
|    close|     high|      low|     open|volume|symbol|          timestamp|
+---------+---------+---------+---------+------+------+-------------------+
|1179.5500|1179.5500|1176.6700|1177.2600| 49478| TEST3|2019-05-07 16:00:00|
| 189.9100| 189.9100| 189.5100| 189.5660|267986| TEST4|2019-05-07 16:00:00|
+---------+---------+---------+---------+------+------+-------------------+