1

I have a JSON file with musical acoustic features (it's about 1 GB). I am trying to read it into my pandas notebook using

dataf = "/home/work/my.json"
d = json.load(open(dataf, 'r'))

It keeps giving me an error saying

Extra data: line 2 column 1 (char 499)

I understand that the 499th char is the start of the next track but I have looked online and am unsure of how to get it to read in. Below is a sample of the data.

{"_id":{"$oid":"5b2cff21aecd2a723459cd65"},"id":1,"sp_id":"0XLOf9LhyazPX9Ld8jPiUq","danceability":0.7079999999999999627,"energy":0.60999999999999998668,"key":"2","loudness":-4.5220000000000002416,"mode":"1","speechiness":0.057399999999999999634,"acousticness":0.020400000000000001465,"instrumentalness":4.4499999999999997457e-06,"liveness":0.064100000000000004197,"valence":0.30499999999999999334,"tempo":123.0379999999999967,"time_signature":"4","track_uri":"spotify:track:0XLOf9LhyazPX9Ld8jPiUq"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd66"},"id":2,"sp_id":"7aF09WaavZAmAWuUeYxlYD","danceability":0.59299999999999997158,"energy":0.86799999999999999378,"key":"1","loudness":-3.5729999999999999538,"mode":"0","speechiness":0.29499999999999998446,"acousticness":0.182999999999999996,"instrumentalness":0.0,"liveness":0.36499999999999999112,"valence":0.49599999999999999645,"tempo":104.98799999999999955,"time_signature":"4","track_uri":"spotify:track:7aF09WaavZAmAWuUeYxlYD"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd67"},"id":3,"sp_id":"0tKcYR2II1VCQWT79i5NrW","danceability":0.5999999999999999778,"energy":0.81000000000000005329,"key":"0","loudness":-4.748999999999999666,"mode":"1","speechiness":0.047899999999999998135,"acousticness":0.0068300000000000001335,"instrumentalness":0.20999999999999999223,"liveness":0.15499999999999999889,"valence":0.29799999999999998712,"tempo":167.87999999999999545,"time_signature":"4","track_uri":"spotify:track:0tKcYR2II1VCQWT79i5NrW"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd68"},"id":4,"sp_id":"6TWSVHx6z6E42JiwloGv1k","danceability":0.50300000000000000266,"energy":0.91800000000000003819,"key":"11","loudness":-5.0099999999999997868,"mode":"1","speechiness":0.046399999999999996803,"acousticness":0.016199999999999999123,"instrumentalness":0.024400000000000001549,"liveness":0.18599999999999999867,"valence":0.41799999999999998268,"tempo":140.0,"time_signature":"4","track_uri":"spotify:track:6TWSVHx6z6E42JiwloGv1k"}
{"_id":{"$oid":"5b2cff21aecd2a723459cd69"},"id":5,"sp_id":"5QqyRUZeBE04yJxsD1OC0I","danceability":0.76000000000000000888,"energy":0.56100000000000005418,"key":"1","loudness":-8.6969999999999991758,"mode":"1","speechiness":0.13400000000000000799,"acousticness":0.018499999999999999084,"instrumentalness":1.9400000000000000604e-05,"liveness":0.19900000000000001021,"valence":0.12099999999999999645,"tempo":134.98300000000000409,"time_signature":"4","track_uri":"spotify:track:5QqyRUZeBE04yJxsD1OC0I"}

2 Answers

2

Your file won't parse because, taken as a whole, it is not valid JSON. The character the parser is complaining about comes right after the first newline character. Clearly the objects were dumped into the file one per line, and together they don't form a single valid JSON document. See:

>>> json.loads(s[:499])
{'_id': {'$oid': '5b2cff21aecd2a723459cd65'},
 'id': 1,
 'sp_id': '0XLOf9LhyazPX9Ld8jPiUq',
 'danceability': 0.708,
 'energy': 0.61,
 'key': '2',
 'loudness': -4.522,
 'mode': '1',
 'speechiness': 0.0574,
 'acousticness': 0.0204,
 'instrumentalness': 4.45e-06,
 'liveness': 0.0641,
 'valence': 0.305,
 'tempo': 123.038,
 'time_signature': '4',
 'track_uri': 'spotify:track:0XLOf9LhyazPX9Ld8jPiUq'}
>>> json.loads(s[499:973])
{'_id': {'$oid': '5b2cff21aecd2a723459cd66'},
 'id': 2,
 'sp_id': '7aF09WaavZAmAWuUeYxlYD',
 'danceability': 0.593,
 'energy': 0.868,
 'key': '1',
 'loudness': -3.573,
 'mode': '0',
 'speechiness': 0.295,
 'acousticness': 0.183,
 'instrumentalness': 0.0,
 'liveness': 0.365,
 'valence': 0.496,
 'tempo': 104.988,
 'time_signature': '4',
 'track_uri': 'spotify:track:7aF09WaavZAmAWuUeYxlYD'}

(s is your example input loaded into a string.) These objects are printed one after the other into the file. You either have to change the syntax so that it becomes a list of objects (add square brackets and commas), or parse the file line by line, calling json.loads on each line of the input.
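If the objects cannot be relied on to sit exactly one per line, a third option is to decode them one after another from the string. The following is a minimal sketch using `json.JSONDecoder.raw_decode` from the standard library; the helper name `iter_concatenated_json` and the shortened sample records are made up for illustration, and the approach assumes the whole text fits in memory.

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value from a string of whitespace-separated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index where it stopped.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Shortened, hypothetical records separated by a newline and by a space.
sample = '{"id": 1, "tempo": 123.038}\n{"id": 2, "tempo": 104.988} {"id": 3, "tempo": 167.88}'
records = list(iter_concatenated_json(sample))
print([r["id"] for r in records])  # [1, 2, 3]
```

Because the decoder reports where each object ends, this does not care whether the separator is a newline, a space, or nothing at all.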

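Now, don't quote me on this one, but hacking your input so that it becomes valid JSON is quite easy (the strip() call drops a trailing newline, which would otherwise turn into a dangling comma):

>>> len(json.loads('[' + s.strip().replace('\n', ',') + ']'))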
5

In case the file is huge you will probably not want to do the above hack and ensuing parsing in one sitting, due to the huge memory overhead that incurs. In this case I suggest parsing your file object by object. Assuming your file contains one object on each line, you only need

with open(infile) as f:
    dat = [json.loads(line) for line in f]

where infile is the path to your concatenated-JSON file. This will take a long time for a huge file, and the resulting list will occupy a lot of memory, but I'd expect the additional overhead used for parsing to be lower this way.
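If even the full list of parsed objects is too much for memory, the same line-by-line idea can be written as a generator, so that only one record is held at a time. This is only a sketch: the helper names `iter_json_lines` and `select_fields`, the temporary demo file, and the choice of fields are all assumptions for illustration, not part of the original data set.

```python
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed object per non-blank line of a line-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. after a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line of the file is malformed.
                raise ValueError(f"line {line_number}: {exc}") from exc


def select_fields(records, fields):
    """Keep only the named keys of each record to reduce memory use."""
    for record in records:
        yield {name: record.get(name) for name in fields}


# Demo on a small temporary file with hypothetical, shortened records.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"id": 1, "tempo": 123.038, "key": "2"}\n')
    tmp.write('\n')
    tmp.write('{"id": 2, "tempo": 104.988, "key": "1"}\n')
    demo_path = tmp.name

rows = list(select_fields(iter_json_lines(demo_path), ["id", "tempo"]))
os.remove(demo_path)
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`. pandas can also read this layout directly with `pd.read_json(path, lines=True)`, and its `chunksize` argument lets you iterate over the file in pieces; whether that is faster for a 1 GB file is something to measure rather than assume.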

1

It looks like these records were exported from a MongoDB database. What comes out is a sequence of JSON objects stored one per line, which means the file as a whole is not a valid JSON document, as pointed out by @Andras.

It seems like it would be a lot more efficient to read the data from MongoDB instead.

You can use PyMongo for this:

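import pandas as pd
from pymongo import MongoClient

# The connection string, database name and collection name are placeholders.
mdbClient = MongoClient('mongodb://localhost:27017/')
db = mdbClient['db']
collection = db['col']

# An empty filter returns every document in the collection.
results = collection.find({})
df = pd.DataFrame.from_records(results)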
