
I have a script that does some poor man's caching of data returned from an API to a flat file, as JSON objects: one result/JSON object per line.

The caching workflow is as follows:

Read in the entire cache file -> check line by line whether each entry is too old -> save the entries that aren't too old to a new list -> print the new fresh cache list to a file, and also use the new list as a filter so the API calls skip incoming data that is already cached.

By far the longest part of this process is the line-by-line staleness check. Here is the code:

print "Reading cache file into memory ---"
with open('cache', 'r') as f:
    cache_lines = f.readlines()

print "Turning cache lines into json and checking if they are stale or not ---"
for line in cache_lines:
    # Load the line back up as a json object
    try:
        json_line = json.loads(line)
    except Exception as e:
        print e
        continue  # skip lines that fail to parse instead of reusing the previous json_line

    # Get the delta to determine if data is stale.
    delta = meta_dict["timestamp_start"] - parser.parse(json_line['timestamp_start'])

    # If the data is still fresh then hold onto it
    if cache_timeout >= delta:
        fresh_cache.append(json_line)

It can take minutes depending on the size of the cache file. Is there a faster way to do this? I understand that reading the whole file in isn't ideal in the first place, but it was the easiest to implement.

2 Answers

1

Depending on the size of the file, it might cause memory issues; I don't know if that is the kind of issue you are encountering. The previous code could be rewritten like this:

now = meta_dict['timestamp_start']  # reference time to compare each entry against
with open('cache', 'r') as f:
    for line in f:  # iterate the file lazily instead of reading it all into memory first
        entry = json.loads(line)
        if now - parser.parse(entry['timestamp_start']) <= cache_timeout:
            fresh_cache.append(entry)

Also,

  • note that if you are using dateutil to parse dates, each call can be costly. If your format is known, you may want to use the standard parsing tools provided by datetime or dateutil instead
  • if your file is really big and fresh_cache would also have to be really big, you can write fresh entries to an intermediate file using another with statement (see the sketch right after this list).
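
A minimal sketch combining both points, reusing the question's meta_dict, cache_timeout and cache names (assumed to be defined elsewhere in the script) and the fixed timestamp format reported in the comments; cache.new is a hypothetical output path:

import datetime
import json

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # assumed fixed timestamp format

now = meta_dict["timestamp_start"]  # reference time to compare each entry against

# Stream the old cache and write fresh lines straight to a new file, so that
# neither the raw lines nor the fresh entries have to be held in memory.
with open('cache', 'r') as old, open('cache.new', 'w') as new:
    for raw in old:
        try:
            entry = json.loads(raw)
        except ValueError:
            continue  # skip lines that fail to parse
        started = datetime.datetime.strptime(entry['timestamp_start'], TS_FORMAT)
        if now - started <= cache_timeout:
            new.write(raw)  # raw still has its trailing newline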

4 Comments

Thanks for the input. I was hoping for some black magic but it looks like I'm out of luck. I'll try skipping the parser.parse call on every line and see if that helps.
you can try the simplejson library too, which is faster than the standard json library.
Reporting back - 1. simplejson had little to no effect. 2. Doing a manual datetime extraction had a great effect. It reduced the time from 8m11.578s to 2m55.681s. This replaced the parser.parse line from above: datetime.datetime.strptime(json_line['timestamp_start'], "%Y-%m-%d %H:%M:%S.%f")
@Thisisstackoverflow, I wonder if you could identify the timestamp part with a regex. It would avoid the JSON parsing and might improve speed; I'm not sure how big the gain would be.
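
For what it's worth, a rough sketch of that regex idea, assuming timestamp_start appears as a quoted key/value pair in every cached line and uses the fixed format above (the pattern is an untested assumption):

import datetime
import re

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
# Assumed shape of the field inside each cached line,
# e.g. ... "timestamp_start": "2017-01-01 12:00:00.000000" ...
TS_PATTERN = re.compile(r'"timestamp_start":\s*"([^"]+)"')

now = meta_dict["timestamp_start"]

with open('cache', 'r') as old, open('cache.new', 'w') as new:
    for raw in old:
        match = TS_PATTERN.search(raw)
        if not match:
            continue  # no timestamp found; treat the line as unusable
        started = datetime.datetime.strptime(match.group(1), TS_FORMAT)
        if now - started <= cache_timeout:
            new.write(raw)  # keep the line without ever calling json.loads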
0

Reporting back - 1. simplejson had little to no effect. 2. Doing a manual datetime extraction had a great effect. It reduced the time from 8m11.578s to 2m55.681s. This replaced the parser.parse line from above: datetime.datetime.strptime(json_line['timestamp_start'], "%Y-%m-%d %H:%M:%S.%f")

