I have to parse through a really big file, modify its contents, and write that to another file. The file that I have right now is not that big in comparison to what it could be, but it's big nonetheless.
The file is 1.3 GB and contains about 7 million lines of this format:
8823192\t/home/pcastr/...
Where \t is a tab character. The number at the beginning is the apparent size of the path that follows.
I want an output file with lines looking like this (in csv format):
True,8823192,/home/pcastr/...
Where the first value is whether the path is a directory.
Currently, my code looks something like this:
import os

with open(filepath, "r") as open_file:
    while True:
        line = open_file.readline()
        if line == "":  # an empty string signals the end of the file
            break
        size = line.split("\t")[0]
        path = line.strip().split("\t")[1]
        is_dir = os.path.isdir(path)
        # streamed_file is the already-open output file
        streamed_file.write(unicode("{isdir},{size},{path}\n".format(isdir=is_dir, size=size, path=path)))
A caveat with this is that files like this WILL get tremendously big, so I need not only a fast solution but a memory-efficient one as well. I know that there is usually a trade-off between these two qualities.
Could I simply do for line in open_file, or for line in open_file.readlines()? Would that read the whole file into memory, though?

File objects (what open returns in Python 3) are iterators; you can read one line at a time automatically by using the file object as an iterator. It is rare to need to call readline explicitly.
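For reference, a minimal sketch of that iterator-based approach, assuming in_path and out_path are placeholder names for the input and output files, and using the csv module for the output (which is not part of the original code):

    import csv
    import os

    with open(in_path, "r") as in_file, open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for line in in_file:  # the file object yields one line at a time, lazily
            size, path = line.rstrip("\n").split("\t", 1)
            # writerow converts the boolean to the string "True" or "False"
            writer.writerow([os.path.isdir(path), size, path])

Iterating this way keeps only the current line in memory, whereas readlines() would first load the entire 1.3 GB file into a list.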