160

I am trying to load and parse a JSON file in Python, but I'm stuck trying to load the file:

import json
json_data = open('file')
data = json.load(json_data)

Yields:

ValueError: Extra data: line 2 column 1 - line 225116 column 1 (char 232 - 160128774)

I looked at 18.2. json — JSON encoder and decoder in the Python documentation, but it's pretty discouraging to read through this horrible-looking documentation.

First few lines (anonymized with randomized entries):

{"votes": {"funny": 2, "useful": 5, "cool": 1}, "user_id": "harveydennis", "name": "Jasmine Graham", "url": "http://example.org/user_details?userid=harveydennis", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 2, "cool": 4}, "user_id": "njohnson", "name": "Zachary Ballard", "url": "https://www.example.com/user_details?userid=njohnson", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 0, "cool": 4}, "user_id": "david06", "name": "Jonathan George", "url": "https://example.com/user_details?userid=david06", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 6, "useful": 5, "cool": 0}, "user_id": "santiagoerika", "name": "Amanda Taylor", "url": "https://www.example.com/user_details?userid=santiagoerika", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 8, "cool": 2}, "user_id": "rodriguezdennis", "name": "Jennifer Roach", "url": "http://www.example.com/user_details?userid=rodriguezdennis", "average_stars": 3.5, "review_count": 12, "type": "user"}

7 Answers

314

You have a JSON Lines format text file. You need to parse your file line by line:

import json

data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))

Each line contains valid JSON, but as a whole, it is not a valid JSON value as there is no top-level list or object definition.

Note that because the file contains one JSON document per line, you are saved the headaches of trying to parse it all in one go or of figuring out a streaming JSON parser. You can now opt to process each line separately before moving on to the next, saving memory in the process. You probably don't want to append each result to one list and then process everything if your file is really big.
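For example, a minimal sketch of that line-at-a-time style (process is a hypothetical placeholder for whatever you want to do with each record):

import json

with open('file') as f:
    for line in f:
        record = json.loads(line)   # one dictionary per line
        process(record)             # hypothetical: handle the record, then forget it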

If you have a file containing individual JSON objects with delimiters in between, see How do I use the 'json' module to read in one JSON object at a time? for how to parse out the individual objects using a buffered method.
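A rough sketch of that idea using json.JSONDecoder.raw_decode, which returns the parsed value together with the index where parsing stopped (this is not necessarily what the linked answer does, and it assumes the delimiter between objects is whitespace):

import json

decoder = json.JSONDecoder()

def iter_json_objects(text):
    # yield consecutive JSON values found in one string
    idx = 0
    while idx < len(text):
        while idx < len(text) and text[idx].isspace():
            idx += 1                # skip the delimiter between objects
        if idx == len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj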


11 Comments

+1 Maybe it is worth noting that if you do not need all objects at once, processing them one by one may be a more efficient approach. That way you do not need to store the whole dataset in memory, only a single piece of it.
@Pi_: you'll have a dictionary, so just access the fields as keys: data = json.loads(line); print data[u'votes']
@Pi_: print the result of json.loads() then or use the debugger to inspect.
@Pi_: no; don't confuse the JSON format with the python dict representation. You are seeing python dictionaries with strings now.
@user2441441: see the linked answer from the post here.
29

If you are using pandas and want to load the JSON file as a dataframe, you can use:

import pandas as pd
df = pd.read_json('file.json', lines=True)

And to convert it into a JSON array, you can use:

df.to_json('new_file.json', orient='records')  # orient='records' produces a JSON array
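If you would rather keep the original line-delimited layout when writing back out, to_json also accepts lines=True together with orient='records' (the file name here is just an example):

df.to_json('new_file.jsonl', orient='records', lines=True)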

2 Comments

This answer is the most pythonic in my view.
This answer basically requires you to install pandas, a heavy library with its own dependencies that you may not need in your project at all, just to do what should be a simple operation.
20

For those stumbling upon this question: the Python jsonlines library (much younger than this question) elegantly handles files with one JSON document per line. See https://jsonlines.readthedocs.io/
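A minimal usage sketch, assuming the same file and fields as in the question:

import jsonlines

with jsonlines.open('file') as reader:
    for obj in reader:            # each obj is one parsed JSON document
        print(obj['user_id'])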

Comments

2

Just like Martijn Pieters' answer, but maybe a bit more Pythonic and, most of all, it enables streaming of the data (see the second part of this answer):

import json

with open(filepath, "r") as f:
    data = list(map(json.loads, f))

The map(function, iterable) built-in returns an iterator that applies function to every item of iterable, yielding the results (cf. the map() Python docs).
And list transforms this iterator into... a list :)
But you can also use the iterator returned by map directly: it iterates over each of your JSON lines. Note that in that case you need to consume it inside the with open(filepath, "r") as f block: that is the strength of this approach. The JSON lines are not fully loaded into a list; they are streamed, and json.loads only reads and parses a line of the file when next(iterator) is called by the for loop.
It would give:

import json

with open(filepath, "r") as f:
    iterator_over_lines = map(json.loads, f)

    # You can call next() yourself, which is exactly what a for loop does:
    next_jsonline = next(iterator_over_lines)
    nextnext_jsonline = next(iterator_over_lines)

    # Or iterate just as you would over a list, but here the file is streamed:
    for jsonline in iterator_over_lines:
        print(jsonline)  # do something with each parsed line
    # json.loads is only called on each iteration,
    # which is why the file must stay open.

And I have nothing to add to Martijn's answer as far as explaining what a JSONL (line-by-line JSON) file is and why to use it!

Comments

0

That is ill-formatted. You have one JSON object per line, but they are not contained in a larger data structure (i.e. an array). You'll either need to reformat it so that it begins with [ and ends with ], with a comma between the objects, or parse it line by line as separate dictionaries.
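If you go the reformatting route, a minimal sketch that wraps the existing lines into a single top-level JSON array (the file names are just examples):

import json

with open('file') as f:
    records = [json.loads(line) for line in f]

# write the same data back out as one valid JSON array
with open('file_as_array.json', 'w') as out:
    json.dump(records, out)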

3 Comments

With a 50MB file the OP is probably better off dealing with the data line by line anyway. :-)
Whether the file is ill-formatted depends on one's point of view. If it was intended to be in the "JSON lines" format, it's valid. See: jsonlines.org
I love how browsers throw away 2500MB at a time, and people don't want to use 50MB to actually process something.
0

An add-on to @arunppsg's answer, but with multiprocessing to deal with a large number of files in a directory.

import numpy as np
import pandas as pd
import os
import multiprocessing as mp
import time

directory = 'your_directory'

def read_json(json_files):
    dfs = []
    for j in json_files:
        with open(os.path.join(directory, j)) as f:
            # if the file contains multiple JSON documents (one per line), set lines=True, otherwise False
            dfs.append(pd.read_json(f, lines=True))
    return pd.concat(dfs) if dfs else pd.DataFrame()

def parallelize_json(json_files, func):
    json_files_split = np.array_split(json_files, 10)  # 10 chunks of file names
    pool = mp.Pool(mp.cpu_count())
    df = pd.concat(pool.map(func, json_files_split))
    pool.close()
    pool.join()
    return df

if __name__ == '__main__':
    json_files = os.listdir(directory)  # names of the JSON files to read

    # start the timer
    start = time.time()

    # read all json files in parallel
    df = parallelize_json(json_files, read_json)

    # stop the timer and print the time taken to read all json files
    end = time.time()
    print(end - start)

Comments

0

This is 12 years late, but what the OP is using is called NDJSON; it is used in streaming applications and is a very common format nowadays. Other answers may be more computationally efficient, but here is another simple way to do it, in case someone finds it useful:

with open("data.ndjson", "r") as file:
    data = file.read().split()

data = list(map(lambda n: json.loads(n), data))

Comments
