
I have a script that does some poor man's caching of data returned from an API to a flat file, as JSON objects: one result/JSON object per line.

The caching workflow is as follows:

Read in the entire cache file -> check line by line whether each entry is too old -> save the entries that aren't too old to a new list -> print the new fresh cache list to a file, and also use the new list as a filter so the API calls skip incoming data that is already cached.

By far the longest part of this process is the line-by-line staleness check. Here is the code:

print "Reading cache file into memory ---"
with open('cache', 'r') as f:
    cache_lines = f.readlines()

print "Turning cache lines into json and checking if they are stale or not ---"
for line in cache_lines:
    # Load the line back up as a json object
    try:
        json_line = json.loads(line)
    except Exception as e:
        print e
        continue  # skip lines that fail to parse instead of reusing the previous json_line

    # Get the delta to determine if data is stale.
    delta = meta_dict["timestamp_start"] - parser.parse(json_line['timestamp_start'])

    # If the data is still fresh then hold onto it
    if cache_timeout >= delta:
        fresh_cache.append(json_line)

It can take minutes depending on the size of the cache file. Is there a faster way to do this? I understand that reading the whole file in isn't ideal in the first place, but it was the easiest to implement.

2 Answers

1

Depending on the size of the file, it might cause memory issues; I don't know if that is the kind of issue you are encountering. The previous code could be rewritten like this:

now = meta_dict['timestamp_start']  # reference time to compare each entry against
with open('cache', 'r') as f:
    for line in f:  # iterate the file lazily instead of reading it all into memory first
        entry = json.loads(line)
        if now - parser.parse(entry['timestamp_start']) <= cache_timeout:
            fresh_cache.append(entry)

Also,

  • note that if you are using dateutil to parse dates, each call can be costly. If your format is known, you may want to use the standard parsing tools provided by datetime or dateutil instead
  • if your file is really big and fresh_cache would also have to be really big, you can write fresh entries to an intermediate file using another with statement (see the sketch right after this list).
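
A minimal sketch combining both points, reusing the question's meta_dict, cache_timeout and cache names (assumed to be defined elsewhere in the script) and the fixed timestamp format reported in the comments; cache.new is a hypothetical output path:

import datetime
import json

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # assumed fixed timestamp format

now = meta_dict["timestamp_start"]  # reference time to compare each entry against

# Stream the old cache and write fresh lines straight to a new file, so that
# neither the raw lines nor the fresh entries have to be held in memory.
with open('cache', 'r') as old, open('cache.new', 'w') as new:
    for raw in old:
        try:
            entry = json.loads(raw)
        except ValueError:
            continue  # skip lines that fail to parse
        started = datetime.datetime.strptime(entry['timestamp_start'], TS_FORMAT)
        if now - started <= cache_timeout:
            new.write(raw)  # raw still has its trailing newline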

4 Comments

Thanks for the input. I was hoping for some black magic but it looks like I'm out of luck. I'll try skipping the parser.parse call on every line and see if that helps.
you can try the simplejson library too, which is faster than the standard json library.
Reporting back - 1. simplejson had little to no effect. 2. Doing a manual datetime extraction had a great effect. It reduced the time from 8m11.578s to 2m55.681s. This replaced the parser.parse line from above: datetime.datetime.strptime(json_line['timestamp_start'], "%Y-%m-%d %H:%M:%S.%f")
@Thisisstackoverflow, I wonder if you could identify the timestamp part with a regex. It would avoid the JSON parsing and might improve speed; I'm not sure how big the gain would be.
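
For what it's worth, a rough sketch of that regex idea, assuming timestamp_start appears as a quoted key/value pair in every cached line and uses the fixed format above (the pattern is an untested assumption):

import datetime
import re

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
# Assumed shape of the field inside each cached line,
# e.g. ... "timestamp_start": "2017-01-01 12:00:00.000000" ...
TS_PATTERN = re.compile(r'"timestamp_start":\s*"([^"]+)"')

now = meta_dict["timestamp_start"]

with open('cache', 'r') as old, open('cache.new', 'w') as new:
    for raw in old:
        match = TS_PATTERN.search(raw)
        if not match:
            continue  # no timestamp found; treat the line as unusable
        started = datetime.datetime.strptime(match.group(1), TS_FORMAT)
        if now - started <= cache_timeout:
            new.write(raw)  # keep the line without ever calling json.loads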
0

Reporting back - 1. simplejson had little to no effect. 2. Doing a manual datetime extraction had a great effect. It reduced the time from 8m11.578s to 2m55.681s. This replaced the parser.parse line from above: datetime.datetime.strptime(json_line['timestamp_start'], "%Y-%m-%d %H:%M:%S.%f")

