
I have to parse through a really big file, modify its contents, and write that to another file. The file that I have right now is not that big in comparison to what it could be, but it's big nonetheless.

The file is 1.3 GB and contains about 7 million lines of this format:

8823192\t/home/pcastr/...

Where \t is a tab character. The number at the beginning is the apparent size of the path that follows.

I want an output file with lines looking like this (in csv format):

True,8823192,/home/pcastr/...

Where the first value is whether the path is a directory.

Currently, my code looks something like this:

with open(filepath, "r") as open_file:
    while True:
        line = open_file.readline()
        if line == "":  # Checks for the end of the file
            break
        size = line.split("\t")[0]
        path = line.strip().split("\t")[1]
        is_dir = os.path.isdir(path)

        streamed_file.write(unicode("{isdir},{size},{path}\n".format(isdir=is_dir, size=size, path=path)))

A caveat with this is that files like this WILL get tremendously big, so I need not only a fast solution but a memory-efficient one as well. I know that there is usually a trade-off between these two qualities.

  • I don't have a full answer, but you can use the file handle directly: for i in open_file. Commented Jul 23, 2018 at 15:06
  • Is the current solution with the 1.3 GB file slow? Commented Jul 23, 2018 at 15:12
  • Do you mean for i in open_file.readlines()? Would that read everything into memory, though? Commented Jul 23, 2018 at 15:17
  • @PaoloCastro No; file objects (or whatever open returns in Python 3) are iterators; you can read one line at a time by using the file as an iterator. It is rare to need to call readline explicitly (see the sketch after these comments). Commented Jul 23, 2018 at 15:18
  • I would take a slightly smaller file and time/comment out the different bits separately: 1. how much just reading the file costs, 2. the splitting, 3. writing the same line back to output, 4. formatting the string (with unicode?), 5. isdir. That should give you a sense of what's "costly". You can then concentrate on optimizing those bits first. Commented Jul 24, 2018 at 2:42
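A minimal sketch of what these comments suggest, using the same filepath as in the question: iterating the file object reads one line at a time, so the whole 1.3 GB file never has to sit in memory (unlike readlines()).

import os

with open(filepath, "r") as open_file:
    for line in open_file:  # the file object yields lines lazily, one at a time
        size, path = line.rstrip("\n").split("\t")
        is_dir = os.path.isdir(path)
        # ... format and write the CSV line as in the original code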

3 Answers


The biggest gain is likely to come from calling split only once per line:

size, path = line.strip().split("\t")
# or ...split("\t", 3)[0:2] if there are extra fields to ignore

You can at least simplify your code by treating the input file as an iterator and using the csv module. This might give you a speed-up as well, as it eliminates the need for an explicit call to split:

import csv
import os

with open(filepath, "r") as open_file:
    reader = csv.reader(open_file, delimiter="\t")
    writer = csv.writer(streamed_file)
    for size, path in reader:
        is_dir = os.path.isdir(path)
        writer.writerow([is_dir, size, path])

4 Comments

  • I will try this and get back to you when it finishes or when I have a good estimate of how long it takes.
  • This reduced the run time from 1.5 hours to 1 hour. This helps! Do you know of any more ways to optimize the code?
  • I wonder about the relative cost of isdir. Not that high, if the split optimization improves things by 30%, but still... are many entries likely to not exist? If /foo/ doesn't exist, neither will /foo/bar/. Would sorting help? Again, it doesn't seem to be the bottleneck.
  • The logic to determine whether or not to call isdir would likely be more expensive than just calling it. Sorting would require reading at least all the sort keys into memory, and unless you read all the data into memory, reading the entire input twice. There are micro-optimizations like saving opid = os.path.isdir before the loop to avoid the attribute lookups (see the sketch below), but I have a feeling this is a mostly I/O-bound loop, and most of the runtime is simply spent reading from and writing to disk.
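A sketch of what that last comment describes, combined with the csv-based loop from this answer; convert, in_path, and out_path are hypothetical names, and the "wb" output mode assumes Python 2.7 as in the question.

import csv
import os

def convert(in_path, out_path):
    isdir = os.path.isdir  # bind once, outside the loop, to skip repeated attribute lookups
    with open(in_path, "r") as in_file, open(out_path, "wb") as out_file:
        reader = csv.reader(in_file, delimiter="\t")
        writer = csv.writer(out_file)
        for size, path in reader:
            writer.writerow([isdir(path), size, path])

Whether the local binding is measurable is doubtful if the loop really is I/O-bound, as the comment suggests.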

Compressing the file before copying it through the network could speed up the processing of the data, because you get your data to your script faster.

Can you keep the input text file compressed on the remote target system? If yes, you could compress it using an algorithm that is supported in Python (the zlib, gzip, bz2, lzma, or zipfile modules).

If not, you could at least run a script on the remote storage system to compress the file. You would then read the file, decompress it in memory using one of the Python modules, and process each line.
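For instance, here is a minimal sketch assuming the file was gzip-compressed on the remote side and assuming Python 2.7 as in the question (on Python 3 you would open in text mode or decode the bytes); the ".gz" filename is only illustrative.

import gzip
import os

# gzip.open decompresses transparently, so the loop body stays the same;
# only the compressed data has to cross the network or be read from disk.
with gzip.open(filepath + ".gz", "rb") as open_file:
    for line in open_file:
        size, path = line.rstrip("\n").split("\t")
        is_dir = os.path.isdir(path)
        # ... write the CSV line as before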

3 Comments

  • I'm not familiar at all with compressing files. I'm assuming this would solve the memory usage part, but I worry that it will take a lot of computational time. Could you give me an example?
  • Compressing addresses the speed of copying 1.3 GB of data through the network. I will try to make a code example when I have more time. For example, the lzma module has an open method that can open a compressed file just like you would a normal file, so you could reuse your original code.
  • I tested this, but unfortunately it is just as slow as processing the file across the network (I don't know why). It is still taking roughly the same amount of time.

You might need mmap. Introduction and tutorial here.

As a simplification, it means you can treat files on disk as if they were in RAM, without actually reading the whole file into RAM.
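As a rough sketch (assuming Python 2.7 as in the question), reading the input through mmap could look like this; the operating system pages the file in on demand rather than reading it up front.

import mmap
import os

with open(filepath, "rb") as f:
    # Map the file read-only; pages are loaded on demand instead of
    # pulling all 1.3 GB into RAM at once.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for line in iter(mm.readline, ""):  # mmap objects are not line-iterable directly
            size, path = line.rstrip("\n").split("\t")
            is_dir = os.path.isdir(path)
            # ... write the CSV line as before
    finally:
        mm.close()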

4 Comments

  • I will explore this option. Thank you.
  • I have the feeling this is showing up with a rocket launcher to a sword fight.
  • Even using mmap, you have the choice between using slicing and calling a read method, and the two are not equivalent in terms of performance.
  • Well, rocket launcher or not, if it makes the code faster and more efficient, I'll take it. That is, if it's in the Python 2.7 standard library. Reading the documentation, it doesn't look all that complicated.
