
I need to read a big data file (~200 GB) line by line using a Python script.

I have tried the regular line-by-line methods; however, those methods use a large amount of memory. I want to be able to read the file chunk by chunk.

Is there a better way to load a large file line by line, say

a) by explicitly specifying the maximum number of lines to hold in memory at any one time? Or b) by loading it in chunks of, say, 1024 bytes, provided the last line of each chunk is loaded completely rather than truncated?

  • Two quick suggestions: you may want to explain why you need such a huge file, in case your use case overlaps with an existing library, and you should post some example code showing what you have tried. Commented Aug 20, 2014 at 18:42
  • This doesn't work for you? stackoverflow.com/questions/8009882/… Commented Aug 20, 2014 at 18:44
  • Is the file text or binary? For such a huge file, it is probably binary and you should use an idiom to read and process in appropriately sized binary chunks. Commented Aug 20, 2014 at 18:47
  • Simply reading line by line like for line in open('mybigfile'): does not use much memory (assuming the lines themselves aren't enormous). Have you tried this method? Commented Aug 20, 2014 at 21:04

2 Answers


Instead of reading it all at once, try reading it line by line:

with open("myFile.txt") as f:
    for line in f:
        # Do stuff with your line

Or, if you want to read N lines at a time:

with open("myFile.txt") as myfile:
    head = [next(myfile) for x in range(N)]  # use xrange on Python 2
    print(head)

To handle the StopIteration exception that comes from hitting the end of the file, a simple try/except works (although there are plenty of other ways):

try:
    head = [next(myfile) for x in range(N)]  # use xrange on Python 2
except StopIteration:
    rest_of_lines = [line for line in myfile]

Or you can read in those last lines however you want.
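As an alternative sketch (using itertools.islice from the standard library, which is not what the code above uses), you can avoid catching StopIteration altogether, because islice simply yields fewer lines, or none, once the file is exhausted:

from itertools import islice

N = 1000  # batch size; purely illustrative

with open("myFile.txt") as myfile:
    while True:
        head = list(islice(myfile, N))  # up to N lines; an empty list once the file is exhausted
        if not head:
            break
        # Do stuff with this batch of lines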


1 Comment

Your multi-line version raises StopIteration if you try to read past the end of the file.

To iterate over the lines of a file, do not use readlines. Instead, iterate over the file itself (you may find versions using xreadlines; it is deprecated and simply returns the file object itself):

with open(the_path, 'r') as the_file:
    for line in the_file:
        # Do stuff with the line

To read multiple lines at a time, you can use next on the file (it is an iterator), but you need to catch StopIteration, which indicates that there is no data left:

with open(the_path, 'r') as the_file:
    while True:
        the_lines = []
        done = False
        for i in range(number_of_lines): # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True # Reached end of file
                break
        # Do stuff with the lines
        if done:
            break # No data left

Of course, you can also load the file in chunks of a specified byte count:

with open(the_path, 'r') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if len(data) == 0:
            # All data is gone
            break
        # Do stuff with the data chunk
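If, as in option b) of the question, each chunk should also end on a complete line, one possible sketch is to carry the truncated tail of each chunk over to the next read (the carry-over logic is my own sketch, not part of the answer above; the_path is the same placeholder as before):

the_byte_count = 1024  # chunk size from the question; adjust as needed

with open(the_path, 'r') as the_file:
    leftover = ''
    while True:
        data = the_file.read(the_byte_count)
        if not data:
            if leftover:
                pass  # Do stuff with the final line (it had no trailing newline)
            break
        lines = (leftover + data).split('\n')
        leftover = lines.pop()  # possibly truncated last line; saved for the next chunk
        for line in lines:
            pass  # Do stuff with each complete line (newlines removed by split)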

1 Comment

You probably don't want xreadlines (even though it does what you want) as it is deprecated in modern versions of Python.
