160

I am trying to load and parse a JSON file in Python, but I'm stuck trying to load the file:

import json
json_data = open('file')
data = json.load(json_data)

Yields:

ValueError: Extra data: line 2 column 1 - line 225116 column 1 (char 232 - 160128774)

I looked at 18.2. json — JSON encoder and decoder in the Python documentation, but it's pretty discouraging to read through this horrible-looking documentation.

First few lines (anonymized with randomized entries):

{"votes": {"funny": 2, "useful": 5, "cool": 1}, "user_id": "harveydennis", "name": "Jasmine Graham", "url": "http://example.org/user_details?userid=harveydennis", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 2, "cool": 4}, "user_id": "njohnson", "name": "Zachary Ballard", "url": "https://www.example.com/user_details?userid=njohnson", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 0, "cool": 4}, "user_id": "david06", "name": "Jonathan George", "url": "https://example.com/user_details?userid=david06", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 6, "useful": 5, "cool": 0}, "user_id": "santiagoerika", "name": "Amanda Taylor", "url": "https://www.example.com/user_details?userid=santiagoerika", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 8, "cool": 2}, "user_id": "rodriguezdennis", "name": "Jennifer Roach", "url": "http://www.example.com/user_details?userid=rodriguezdennis", "average_stars": 3.5, "review_count": 12, "type": "user"}

7 Answers

314

You have a JSON Lines format text file. You need to parse your file line by line:

import json

data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))

Each line contains valid JSON, but as a whole, it is not a valid JSON value as there is no top-level list or object definition.

Note that because the file contains one JSON document per line, you are saved the headaches of trying to parse it all in one go or of figuring out a streaming JSON parser. You can now opt to process each line separately before moving on to the next, saving memory in the process. You probably don't want to append each result to one list and then process everything if your file is really big.
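For example, a minimal sketch of that line-at-a-time style (process is a hypothetical placeholder for whatever you want to do with each record):

import json

with open('file') as f:
    for line in f:
        record = json.loads(line)   # one dictionary per line
        process(record)             # hypothetical: handle the record, then forget it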

If you have a file containing individual JSON objects with delimiters in between, see How do I use the 'json' module to read in one JSON object at a time? for how to parse out the individual objects using a buffered method.
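A rough sketch of that idea using json.JSONDecoder.raw_decode, which returns the parsed value together with the index where parsing stopped (this is not necessarily what the linked answer does, and it assumes the delimiter between objects is whitespace):

import json

decoder = json.JSONDecoder()

def iter_json_objects(text):
    # yield consecutive JSON values found in one string
    idx = 0
    while idx < len(text):
        while idx < len(text) and text[idx].isspace():
            idx += 1                # skip the delimiter between objects
        if idx == len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj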


11 Comments

+1 Maybe it is worth noting that if you do not need all objects at once, processing them one by one may be a more efficient approach. That way you do not need to store the whole dataset in memory, only a single piece of it.
@Pi_: you'll have a dictionary, so just access the fields as keys: data = json.loads(line); print data[u'votes']
@Pi_: print the result of json.loads() then or use the debugger to inspect.
@Pi_: no; don't confuse the JSON format with the python dict representation. You are seeing python dictionaries with strings now.
@user2441441: see the linked answer from the post here.
29

If you are using pandas and want to load the JSON file as a dataframe, you can use:

import pandas as pd
df = pd.read_json('file.json', lines=True)

And to convert it into a JSON array, you can use:

df.to_json('new_file.json', orient='records')  # orient='records' produces a JSON array
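If you would rather keep the original line-delimited layout when writing back out, to_json also accepts lines=True together with orient='records' (the file name here is just an example):

df.to_json('new_file.jsonl', orient='records', lines=True)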

2 Comments

This answer is the most pythonic in my view.
This answer basically requires you to install pandas, a heavy library with its own dependencies that you may not need in your project at all, just to do what should be a simple operation.
20

For those stumbling upon this question: the Python jsonlines library (much younger than this question) elegantly handles files with one JSON document per line. See https://jsonlines.readthedocs.io/
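A minimal usage sketch, assuming the same file and fields as in the question:

import jsonlines

with jsonlines.open('file') as reader:
    for obj in reader:            # each obj is one parsed JSON document
        print(obj['user_id'])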

Comments

2

Just like Martijn Pieters' answer, but maybe a bit more Pythonic and, most of all, it enables streaming of the data (see the second part of this answer):

import json

with open(filepath, "r") as f:
    data = list(map(json.loads, f))

The map(function, iterable) built-in returns an iterator that applies function to every item of iterable, yielding the results (cf. the map() Python docs).
And list transforms this iterator into... a list :)
But you can also use the iterator returned by map directly: it iterates over each of your JSON lines. Note that in that case you need to consume it inside the with open(filepath, "r") as f block: that is the strength of this approach. The JSON lines are not fully loaded into a list; they are streamed, and json.loads only reads and parses a line of the file when next(iterator) is called by the for loop.
It would give:

import json

with open(filepath, "r") as f:
    iterator_over_lines = map(json.loads, f)

    # You can call next() yourself, which is exactly what a for loop does:
    next_jsonline = next(iterator_over_lines)
    nextnext_jsonline = next(iterator_over_lines)

    # Or iterate just as you would over a list, but here the file is streamed:
    for jsonline in iterator_over_lines:
        print(jsonline)  # do something with each parsed line
    # json.loads is only called on each iteration,
    # which is why the file must stay open.

And I have nothing to add to Martijn's answer as far as explaining what a JSONL (line-by-line JSON) file is and why to use it!

Comments

0

That is ill-formatted. You have one JSON object per line, but they are not contained in a larger data structure (i.e. an array). You'll either need to reformat it so that it begins with [ and ends with ], with a comma between the objects, or parse it line by line as separate dictionaries.
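If you go the reformatting route, a minimal sketch that wraps the existing lines into a single top-level JSON array (the file names are just examples):

import json

with open('file') as f:
    records = [json.loads(line) for line in f]

# write the same data back out as one valid JSON array
with open('file_as_array.json', 'w') as out:
    json.dump(records, out)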

3 Comments

With a 50MB file the OP is probably better off dealing with the data line by line anyway. :-)
Whether the file is ill-formatted depends on one's point of view. If it was intended to be in the "JSON lines" format, it's valid. See: jsonlines.org
I love how browsers throw away 2500MB at a time, and people don't want to use 50MB to actually process something.
0

An add-on to @arunppsg's answer, but with multiprocessing to deal with a large number of files in a directory.

import numpy as np
import pandas as pd
import os
import multiprocessing as mp
import time

directory = 'your_directory'

def read_json(json_files):
    dfs = []
    for j in json_files:
        with open(os.path.join(directory, j)) as f:
            # if the file contains multiple JSON documents (one per line), set lines=True, otherwise False
            dfs.append(pd.read_json(f, lines=True))
    return pd.concat(dfs) if dfs else pd.DataFrame()

def parallelize_json(json_files, func):
    json_files_split = np.array_split(json_files, 10)  # 10 chunks of file names
    pool = mp.Pool(mp.cpu_count())
    df = pd.concat(pool.map(func, json_files_split))
    pool.close()
    pool.join()
    return df

if __name__ == '__main__':
    json_files = os.listdir(directory)  # names of the JSON files to read

    # start the timer
    start = time.time()

    # read all json files in parallel
    df = parallelize_json(json_files, read_json)

    # stop the timer and print the time taken to read all json files
    end = time.time()
    print(end - start)

Comments

0

This is 12 years late, but what the OP is using is called NDJSON; it is used in streaming applications and is a very common format nowadays. Other answers may be more computationally efficient, but here is another simple way to do it, in case someone finds it useful:

with open("data.ndjson", "r") as file:
    data = file.read().split()

data = list(map(lambda n: json.loads(n), data))

Comments
