
One of the answers to this question says that the following is a good way to read a large binary file without reading the whole thing into memory first:

    with open(image_filename, 'rb') as content:
        for line in content:
            pass  # do anything you want

I thought the whole point of specifying 'rb' was that line endings are ignored, so how can for line in content work?

Is this the most "Pythonic" way to read a large binary file or is there a better way?
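
For what it's worth, a minimal sketch (the filename is hypothetical) shows what actually happens: iterating in 'rb' mode still splits on the 0x0A byte; it just hands back raw bytes with no newline translation or decoding:

    # 'image.png' is a hypothetical name; any binary file works
    with open('image.png', 'rb') as f:
        line = f.readline()   # reads up to and including the next 0x0A byte
        print(type(line))     # <class 'bytes'>, not str
        print(line[-1:])      # b'\n', unless the file contains no 0x0A byte

So the loop runs, but each "line" is an arbitrary-length run of bytes that happens to end in 0x0A, which is why the answers below read fixed-size chunks instead.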

6 Comments

  • I just posted your question as a comment below the answer in that question. That seems better than asking a new question.
  • Ah, thanks. What should I do with this question?
  • Well, it's too late to delete it, since someone answered.
  • Possibly a duplicate.
  • Well, all the answers are helpful. I can't accept an answer for 4 more minutes, though; my apologies if this should have been a comment.

3 Answers


I would write a simple helper function to read in the chunks you want:

def read_in_chunks(infile, chunk_size=1024):
    while True:
        chunk = infile.read(chunk_size)
        if chunk:
            yield chunk
        else:
            # The chunk was empty, which means we're at the end
            # of the file
            return

Then use it as you would for line in file, like so:

with open(fn, 'rb') as f:
    for chunk in read_in_chunks(f):
        pass  # do your stuff with each chunk...
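
As a concrete (and hypothetical) use of the helper, here's a chunked checksum; hashlib and the function name are my additions, not part of the original answer:

    import hashlib

    def sha256_of_file(fn, chunk_size=1024):
        # hash a large file without ever holding it fully in memory
        digest = hashlib.sha256()
        with open(fn, 'rb') as f:
            for chunk in read_in_chunks(f, chunk_size):
                digest.update(chunk)   # each chunk is a bytes object
        return digest.hexdigest()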

BTW: I asked THIS question 5 years ago and this is a variant of an answer at that time...


You can also do:

from functools import partial

with open(fn, 'rb') as f:
    for chunk in iter(partial(f.read, numBytes), b''):
        pass  # process the chunk

Note that partial lives in functools, not collections, and in binary mode read returns bytes, so the end-of-file sentinel is b'', not ''.

4 Comments

I am reading that question now. I guess this is kind of a duplicate of that (sorry, I didn't see it). As a follow-up, how do you determine the right chunk_size?
What are the characteristics of each chunk? How will you process it? Is the file too big to read in one go? When you have for record in file: there is usually some record-like relationship between each record and the whole file. You need to say more.
5 years ago you were a "Python newbie"?
Indeed I was. Perl was my weapon before that and C before that.

for line in fh will split on newlines regardless of how you open the file.

Often with binary files you consume them in chunks:

CHUNK_SIZE = 1024
for chunk in iter(lambda: fh.read(CHUNK_SIZE), b''):
    do_something(chunk)

(Note the b'' sentinel: in Python 3 a binary read returns bytes, so a plain '' would never match and the loop would never end.)



Binary mode means that the line endings aren’t converted and that bytes objects are read (in Python 3); the file will still be read by “line” when using for line in f. I’d use read to read in consistent chunks instead, though.

with open(image_filename, 'rb') as f:
    # iter(callable, sentinel) – yield f.read(4096) until b'' appears
    for chunk in iter(lambda: f.read(4096), b''):
        ...  # process the chunk

3 Comments

Why the size of 4096?
Because you have to pick a size... it doesn't matter which. (Great answer, minitech.)
Well, which does matter, just not by much: you must have that much memory free to use all at once. Otherwise, why chunk? Just slurp up the whole file. The problem with minitech or Joran trying to tell you how big it should be is that they don't know your system requirements, environment, or use case. When in doubt, try it out. Powers of 2 are popular because they're easy for the system to manage.
