
I have to parse through a really big file, modify its contents, and write that to another file. The file that I have right now is not that big in comparison to what it could be, but it's big nonetheless.

The file is 1.3 GB and contains about 7 million lines of this format:

8823192\t/home/pcastr/...

Where \t is a tab character. The number at the beginning is the apparent size of the path that follows.

I want an output file with lines looking like this (in csv format):

True,8823192,/home/pcastr/...

Where the first value is whether the path is a directory.

Currently, my code looks something like this:

with open(filepath, "r") as open_file:
    while True:
        line = open_file.readline()
        if line == "":  # Checks for the end of the file
            break
        size = line.split("\t")[0]
        path = line.strip().split("\t")[1]
        is_dir = os.path.isdir(path)

        streamed_file.write(unicode("{isdir},{size},{path}\n".format(isdir=is_dir, size=size, path=path)))

A caveat with this is that files like this WILL get tremendously big, so I need not only a fast solution but a memory-efficient one as well. I know that there is usually a trade-off between these two qualities.

  • I don't have a full answer, but you can use the file handle directly: for i in open_file. Commented Jul 23, 2018 at 15:06
  • Is the current solution with the 1.3 GB file slow? Commented Jul 23, 2018 at 15:12
  • Do you mean for i in open_file.readlines()? Would that read everything into memory, though? Commented Jul 23, 2018 at 15:17
  • @PaoloCastro No; file objects (or whatever open returns in Python 3) are iterators; you can read one line at a time by using the file as an iterator. It is rare to need to call readline explicitly (see the sketch after these comments). Commented Jul 23, 2018 at 15:18
  • I would take a slightly smaller file and time/comment out the different bits separately: 1. how much just reading the file costs, 2. the splitting, 3. writing the same line back to output, 4. formatting the string (with unicode?), 5. isdir. That should give you a sense of what's "costly". You can then concentrate on optimizing those bits first. Commented Jul 24, 2018 at 2:42
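A minimal sketch of what these comments suggest, using the same filepath as in the question: iterating the file object reads one line at a time, so the whole 1.3 GB file never has to sit in memory (unlike readlines()).

import os

with open(filepath, "r") as open_file:
    for line in open_file:  # the file object yields lines lazily, one at a time
        size, path = line.rstrip("\n").split("\t")
        is_dir = os.path.isdir(path)
        # ... format and write the CSV line as in the original code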

3 Answers


The biggest gain is likely to come from calling split only once per line:

size, path = line.strip().split("\t")
# or ...split("\t", 3)[0:2] if there are extra fields to ignore

You can at least simplify your code by treating the input file as an iterator and using the csv module. This might give you a speed-up as well, as it eliminates the need for an explicit call to split:

import csv
import os

with open(filepath, "r") as open_file:
    reader = csv.reader(open_file, delimiter="\t")
    writer = csv.writer(streamed_file)
    for size, path in reader:
        is_dir = os.path.isdir(path)
        writer.writerow([is_dir, size, path])

4 Comments

  • I will try this and get back to you when it finishes or when I have a good estimate of how long it takes.
  • This reduced the run time from 1.5 hours to 1 hour. This helps! Do you know of any more ways to optimize the code?
  • I wonder about the relative cost of isdir. Not that high, if the split optimization improves things by 30%, but still... are many entries likely to not exist? If /foo/ doesn't exist, neither will /foo/bar/. Would sorting help? Again, it doesn't seem to be the bottleneck.
  • The logic to determine whether or not to call isdir would likely be more expensive than just calling it. Sorting would require reading at least all the sort keys into memory, and unless you read all the data into memory, reading the entire input twice. There are micro-optimizations like saving opid = os.path.isdir before the loop to avoid the attribute lookups (see the sketch below), but I have a feeling this is a mostly I/O-bound loop, and most of the runtime is simply spent reading from and writing to disk.
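A sketch of what that last comment describes, combined with the csv-based loop from this answer; convert, in_path, and out_path are hypothetical names, and the "wb" output mode assumes Python 2.7 as in the question.

import csv
import os

def convert(in_path, out_path):
    isdir = os.path.isdir  # bind once, outside the loop, to skip repeated attribute lookups
    with open(in_path, "r") as in_file, open(out_path, "wb") as out_file:
        reader = csv.reader(in_file, delimiter="\t")
        writer = csv.writer(out_file)
        for size, path in reader:
            writer.writerow([isdir(path), size, path])

Whether the local binding is measurable is doubtful if the loop really is I/O-bound, as the comment suggests.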

Compressing the file before copying it through the network could speed up the processing of the data, because you get your data to your script faster.

Can you keep the input text file compressed on the remote target system? If yes, you could compress it using an algorithm that is supported in Python (the zlib, gzip, bz2, lzma, or zipfile modules).

If not, you could at least run a script on the remote storage system to compress the file. You would then read the file, decompress it in memory using one of the Python modules, and process each line.
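For instance, here is a minimal sketch assuming the file was gzip-compressed on the remote side and assuming Python 2.7 as in the question (on Python 3 you would open in text mode or decode the bytes); the ".gz" filename is only illustrative.

import gzip
import os

# gzip.open decompresses transparently, so the loop body stays the same;
# only the compressed data has to cross the network or be read from disk.
with gzip.open(filepath + ".gz", "rb") as open_file:
    for line in open_file:
        size, path = line.rstrip("\n").split("\t")
        is_dir = os.path.isdir(path)
        # ... write the CSV line as before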

3 Comments

  • I'm not familiar at all with compressing files. I'm assuming this would solve the memory usage part, but I worry that it will take a lot of computational time. Could you give me an example?
  • Compressing addresses the speed of copying 1.3 GB of data through the network. I will try to make a code example when I have more time. For example, the lzma module has an open method that can open a compressed file just like you would a normal file, so you could reuse your original code.
  • I tested this, but unfortunately it is just as slow as processing the file across the network (I don't know why). It is still taking roughly the same amount of time.

You might need mmap. Introduction and tutorial here.

As a simplification, it means you can treat files on disk as if they were in RAM, without actually reading the whole file into RAM.
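As a rough sketch (assuming Python 2.7 as in the question), reading the input through mmap could look like this; the operating system pages the file in on demand rather than reading it up front.

import mmap
import os

with open(filepath, "rb") as f:
    # Map the file read-only; pages are loaded on demand instead of
    # pulling all 1.3 GB into RAM at once.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for line in iter(mm.readline, ""):  # mmap objects are not line-iterable directly
            size, path = line.rstrip("\n").split("\t")
            is_dir = os.path.isdir(path)
            # ... write the CSV line as before
    finally:
        mm.close()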

4 Comments

  • I will explore this option. Thank you.
  • I have the feeling this is showing up with a rocket launcher to a sword fight.
  • Even using mmap, you have the choice between using slicing and calling a read method, and the two are not equivalent in terms of performance.
  • Well, rocket launcher or not, if it makes the code faster and more efficient, I'll take it. That is, if it's in the Python 2.7 standard library. Reading the documentation, it doesn't look all that complicated.
