
I am very new to PySpark. I tried parsing the JSON file using the following code:

from pyspark.sql import SQLContext

# sc is the SparkContext provided by the pyspark shell
sqlContext = SQLContext(sc)
df = sqlContext.read.json("file:///home/malwarehunter/Downloads/122116-path.json")
df.printSchema()

The output is as follows.

root
 |-- _corrupt_record: string (nullable = true)

df.show()

The output looks like this:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|  "time1":"2...|
|  "time2":"201...|
|    "step":0.5,|
|          "xyz":[|
|                   {|
|      "student":"00010...|
|      "attr...|
|        [ -2.52, ...|
|        [ -2.3, -...|
|        [ -1.97, ...|
|        [ -1.27, ...|
|        [ -1.03, ...|
|        [ -0.8, -...|
|        [ -0.13, ...|
|        [ 0.09, -...|
|        [ 0.54, -...|
|        [  1.1, -...|
|        [ 1.34, 0...|
|        [ 1.64, 0...|
+--------------------+
only showing top 20 rows

The JSON file looks like this:

{
  "time1": "2016-12-16T00:00:00.000",
  "time2": "2016-12-16T23:59:59.000",
  "step": 0.5,
  "xyz": [
    {
      "student": "0001025D0007F5DB",
      "attr": [
        [ -2.52, -1.17 ],
        [ -2.3, -1.15 ],
        [ -1.97, -1.19 ],
        [ 10.16, 4.08 ],
        [ 10.23, 4.87 ],
        [ 9.96, 5.09 ]
      ]
    },
    {
      "student": "0001025D0007F5DC",
      "attr": [
        [ -2.58, -0.99 ],
        [ 10.12, 3.89 ],
        [ 10.27, 4.59 ],
        [ 10.05, 5.02 ]
      ]
    }
  ]
}

Could you help me parse this and create a DataFrame like the one shown below?

Output DataFrame required

1 Comment

The JSON appears to be multi-line per object. If so, this is not supported by Spark (it assumes a single line per object). Commented Jan 9, 2017 at 8:40

2 Answers


Spark >= 2.2:

You can use the multiLine argument of the JSON reader:

spark.read.json(path_to_input, multiLine=True)
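Here spark is a SparkSession, which the pyspark shell creates for you; in a standalone script you build it yourself. A minimal sketch using the file path from the question (the app name is just a placeholder):

from pyspark.sql import SparkSession

# The pyspark shell already provides `spark`; in a script, build it:
spark = SparkSession.builder.appName("multiline-json").getOrCreate()

df = spark.read.json(
    "file:///home/malwarehunter/Downloads/122116-path.json",
    multiLine=True)
df.printSchema()  # should now show time1, time2, step and the nested xyz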

Spark < 2.2:

There is an almost universal, but rather expensive, solution that can be used to read multiline JSON files:

  • Read the data using SparkContext.wholeTextFiles.
  • Drop the keys (file names).
  • Pass the result to DataFrameReader.json.

As long as there are no other problems with your data, it should do the trick:

spark.read.json(sc.wholeTextFiles(path_to_input).values())
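The required output DataFrame is not reproduced here, but if the goal is one row per student per [x, y] pair, a sketch like the following should work once the file has been loaded by either method above (the column names x and y, and the aliases entry and pair, are my choice):

from pyspark.sql.functions import col, explode

df = spark.read.json(sc.wholeTextFiles(path_to_input).values())

# Explode the xyz array into one row per student,
# then the attr array into one row per [x, y] pair.
flat = (df
    .select("time1", "time2", "step", explode("xyz").alias("entry"))
    .select("time1", "time2", "step",
            col("entry.student").alias("student"),
            explode("entry.attr").alias("pair"))
    .select("time1", "time2", "step", "student",
            col("pair")[0].alias("x"),
            col("pair")[1].alias("y")))

flat.show()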

1 Comment

How do I import spark?

I experienced a similar issue. When Spark reads a JSON file, it expects each line to be a separate JSON object, so it will fail if you try to load a pretty-printed JSON file. My workaround was to minify the JSON file before Spark read it.
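For example, a pretty-printed file can be minified with Python's standard json module before handing it to Spark (the file names here are just placeholders):

import json

# Load the pretty-printed file and rewrite it as a single line,
# which Spark's line-oriented JSON reader can parse.
with open("input.json") as src:
    data = json.load(src)

with open("input.min.json", "w") as dst:
    json.dump(data, dst, separators=(",", ":"))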

