
I need to read a big data file (~200 GB) line by line using a Python script.

I have tried the regular line-by-line methods; however, those methods use a large amount of memory. I want to be able to read the file chunk by chunk.

Is there a better way to load a large file line by line, say

a) by explicitly specifying the maximum number of lines to hold in memory at any one time? Or b) by loading it in chunks of, say, 1024 bytes, provided the last line of each chunk is loaded completely rather than truncated?

  • Two quick suggestions: you may want to explain why you need such a huge file, in case your use case overlaps with an existing library, and you should post some example code showing what you have tried. Commented Aug 20, 2014 at 18:42
  • This doesn't work for you? stackoverflow.com/questions/8009882/… Commented Aug 20, 2014 at 18:44
  • Is the file text or binary? For such a huge file, it is probably binary and you should use an idiom to read and process in appropriately sized binary chunks. Commented Aug 20, 2014 at 18:47
  • Simply reading line by line like for line in open('mybigfile'): does not use much memory (assuming the lines themselves aren't enormous). Have you tried this method? Commented Aug 20, 2014 at 21:04

2 Answers


Instead of reading it all at once, try reading it line by line:

with open("myFile.txt") as f:
    for line in f:
        # Do stuff with your line

Or, if you want to read N lines at a time:

with open("myFile.txt") as myfile:
    head = [next(myfile) for x in range(N)]  # use xrange on Python 2
    print(head)

To handle the StopIteration exception that comes from hitting the end of the file, a simple try/except works (although there are plenty of other ways):

try:
    head = [next(myfile) for x in range(N)]  # use xrange on Python 2
except StopIteration:
    rest_of_lines = [line for line in myfile]

Or you can read in those last lines however you want.
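As an alternative sketch (using itertools.islice from the standard library, which is not what the code above uses), you can avoid catching StopIteration altogether, because islice simply yields fewer lines, or none, once the file is exhausted:

from itertools import islice

N = 1000  # batch size; purely illustrative

with open("myFile.txt") as myfile:
    while True:
        head = list(islice(myfile, N))  # up to N lines; an empty list once the file is exhausted
        if not head:
            break
        # Do stuff with this batch of lines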


1 Comment

Your multi-line version raises StopIteration if you try to read past the end of the file.

To iterate over the lines of a file, do not use readlines. Instead, iterate over the file itself (you may find versions using xreadlines; it is deprecated and simply returns the file object itself):

with open(the_path, 'r') as the_file:
    for line in the_file:
        # Do stuff with the line

To read multiple lines at a time, you can use next on the file (it is an iterator), but you need to catch StopIteration, which indicates that there is no data left:

with open(the_path, 'r') as the_file:
    while True:
        the_lines = []
        done = False
        for i in range(number_of_lines): # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True # Reached end of file
                break
        # Do stuff with the lines
        if done:
            break # No data left

Of course, you can also load the file in chunks of a specified byte count:

with open(the_path, 'r') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if len(data) == 0:
            # All data is gone
            break
        # Do stuff with the data chunk
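If, as in option b) of the question, each chunk should also end on a complete line, one possible sketch is to carry the truncated tail of each chunk over to the next read (the carry-over logic is my own sketch, not part of the answer above; the_path is the same placeholder as before):

the_byte_count = 1024  # chunk size from the question; adjust as needed

with open(the_path, 'r') as the_file:
    leftover = ''
    while True:
        data = the_file.read(the_byte_count)
        if not data:
            if leftover:
                pass  # Do stuff with the final line (it had no trailing newline)
            break
        lines = (leftover + data).split('\n')
        leftover = lines.pop()  # possibly truncated last line; saved for the next chunk
        for line in lines:
            pass  # Do stuff with each complete line (newlines removed by split)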

1 Comment

You probably don't want xreadlines (even though it does what you want) as it is deprecated in modern versions of Python.
