
I am making a program that should be able to encode any type of file using the Huffman algorithm. It all works, but using it on large files is too slow (at least I think it is). When I tried to open a 120MB mp4 file to unpack it, it took about 210s just to read the file, not to mention the large chunk of memory it used to do so. I thought unpacking with struct would be efficient, but it isn't. Isn't there a more efficient way to do it in Python? I need to read any file byte by byte and then pass it to the huffman method as a string.

import struct
import time

import numpy as np

if __name__ == "__main__":
    start = time.time()
    with open(r'D:\mov.mp4', 'rb') as f:
        dataL = f.read()
    data = np.zeros(len(dataL), 'uint8')

    # unpack one byte at a time -- this Python-level loop is the slow part
    # (dataL[i:i+1] yields a one-byte bytes object on Python 2 and 3 alike)
    for i in range(len(dataL)):
        data[i] = struct.unpack('B', dataL[i:i+1])[0]

    data.tostring()

    end = time.time()
    print("Original file read: ")
    print(end - start)

    encoded, table = huffman_encode(data)
4 Comments
  • Have you tried using PyPy? Commented Oct 30, 2015 at 8:48
  • That's strange. It took less than a minute to read a 3GB file on my computer. Is your D drive removable or a network drive (that could explain the slowness), or extremely fragmented? Commented Oct 30, 2015 at 9:02
  • Using the exact same code? I can try it when I get home on my main desktop, but I still need it to work on my laptop too. Btw, the laptop has an i5 2410M, a 5400rpm HDD and 8GB RAM. The D drive is just a regular partition, and the system is a freshly installed Windows 10. I don't think there should be such a huge difference, though. When I run it, it uses just 30% of my CPU but 4GB of RAM, which I guess is not optimal. And I haven't tried PyPy; I would rather stick to regular Python. Commented Oct 30, 2015 at 9:09
  • Yes, using a simple f.read(). My system is a little more performant. I see the bottleneck being the HDD (for your read op); there isn't/shouldn't be much CPU work. Also, I had 4+ GB of RAM free, so when loading the file its contents could be kept in memory instead of spilling into the swap file, which would make it slower. (The timing sketch below separates the read from the conversion loop.) Commented Oct 30, 2015 at 10:25
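
To pin down which side dominates, a minimal timing split like the following can help (a sketch built from the question's own code and path; raw[i:i+1] is used so the byte slicing works on both Python 2 and 3):

    import struct
    import time

    import numpy as np

    start = time.time()
    with open(r'D:\mov.mp4', 'rb') as f:
        raw = f.read()
    print("raw f.read(): %.3fs" % (time.time() - start))

    start = time.time()
    data = np.zeros(len(raw), 'uint8')
    for i in range(len(raw)):
        # raw[i:i+1] is a one-byte bytes object on both Python 2 and 3
        data[i] = struct.unpack('B', raw[i:i+1])[0]
    print("struct loop:  %.3fs" % (time.time() - start))

If the first number is small and the second is large, the Python loop rather than the disk is the bottleneck.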

1 Answer


Your approach loads the file into a Python object, creates an empty NumPy array, and then fills that array byte by byte with a Python-level loop.

Let's cut out the middlemen:

import time

import numpy as np

if __name__ == "__main__":
    start = time.time()
    # read the whole file straight into a uint8 array in one call
    data = np.fromfile(r'd:\mov.mp4', dtype=np.uint8, count=-1)
    end = time.time()
    print("Original file read: ")
    print(end - start)
    encoded, table = huffman_encode(data)

What to do with 'data' depends on what kind of input your huffman_encode(data) expects. I would try to avoid using strings.
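
For instance, if huffman_encode only needs per-symbol frequencies to build its table, the uint8 array can be consumed directly, with no string conversion at all. A minimal sketch (the frequency counting is standard NumPy; how the counts feed into the tree building depends on your huffman_encode):

    import numpy as np

    data = np.fromfile(r'd:\mov.mp4', dtype=np.uint8)

    # One vectorized pass over the array: freqs[b] is the number of
    # times byte value b occurs, for all 256 possible byte values.
    freqs = np.bincount(data, minlength=256)

A Huffman coder can build its tree from freqs and then walk the array once more for the actual encoding pass.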

Documentation on the call is here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html
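
As a side note, if the bytes are already in memory from a plain f.read(), np.frombuffer converts them to a uint8 array in a single call (a sketch, assuming the file fits comfortably in RAM):

    import numpy as np

    with open(r'd:\mov.mp4', 'rb') as f:
        raw = f.read()

    # View the bytes object as a uint8 array without a per-byte loop.
    # The result is read-only; call .copy() on it if it must be modified.
    data = np.frombuffer(raw, dtype=np.uint8)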

  • I would be interested to hear the speed differences in the comments :)

1 Comment

Thank you for your advice. The same 120MB file, which previously took 72s to read (the 210s figure was from my laptop, so it was a little longer there), now takes 0.0629s! And memory consumption went down from 4GB to 200MB :) So this is blazing fast now. Now I just have to figure out why the encoding itself takes so long.
