
I am making a program that should be able to encode any type of file using the Huffman algorithm. It all works, but using it on large files is too slow (at least I think it is). When I tried to open a 120MB mp4 file to unpack it, it took about 210s just to read the file, not to mention the large chunk of memory it used to do so. I thought unpacking with struct would be efficient, but it isn't. Isn't there a more efficient way to do it in Python? I need to read any file byte by byte and then pass it to the huffman method as a string.

import struct
import time

import numpy as np

if __name__ == "__main__":
    start = time.time()
    with open(r'D:\mov.mp4', 'rb') as f:
        dataL = f.read()
    data = np.zeros(len(dataL), 'uint8')

    # unpack one byte at a time -- this Python-level loop is the slow part
    # (dataL[i:i+1] yields a one-byte bytes object on Python 2 and 3 alike)
    for i in range(len(dataL)):
        data[i] = struct.unpack('B', dataL[i:i+1])[0]

    data.tostring()

    end = time.time()
    print("Original file read: ")
    print(end - start)

    encoded, table = huffman_encode(data)
4 Comments
  • Have you tried using PyPy? Commented Oct 30, 2015 at 8:48
  • That's strange. It took less than a minute to read a 3GB file on my computer. Is your D drive removable or a network drive (that could explain the slowness), or extremely fragmented? Commented Oct 30, 2015 at 9:02
  • Using the exact same code? I can try it when I get home on my main desktop, but I still need it to work on my laptop too. Btw, the laptop has an i5 2410M, a 5400rpm HDD and 8GB RAM. The D drive is just a regular partition, and the system is a freshly installed Windows 10. I don't think there should be such a huge difference, though. When I run it, it uses just 30% of my CPU but 4GB of RAM, which I guess is not optimal. And I haven't tried PyPy; I would rather stick to regular Python. Commented Oct 30, 2015 at 9:09
  • Yes, using a simple f.read(). My system is a little more performant. I see the bottleneck being the HDD (for your read op); there isn't/shouldn't be much CPU work. Also, I had 4+ GB of RAM free, so when loading the file its contents could be kept in memory instead of spilling into the swap file, which would make it slower. (The timing sketch below separates the read from the conversion loop.) Commented Oct 30, 2015 at 10:25
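
To pin down which side dominates, a minimal timing split like the following can help (a sketch built from the question's own code and path; raw[i:i+1] is used so the byte slicing works on both Python 2 and 3):

    import struct
    import time

    import numpy as np

    start = time.time()
    with open(r'D:\mov.mp4', 'rb') as f:
        raw = f.read()
    print("raw f.read(): %.3fs" % (time.time() - start))

    start = time.time()
    data = np.zeros(len(raw), 'uint8')
    for i in range(len(raw)):
        # raw[i:i+1] is a one-byte bytes object on both Python 2 and 3
        data[i] = struct.unpack('B', raw[i:i+1])[0]
    print("struct loop:  %.3fs" % (time.time() - start))

If the first number is small and the second is large, the Python loop rather than the disk is the bottleneck.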

1 Answer


Your approach loads the file into a Python object, creates an empty NumPy array, and then fills that array byte by byte with a Python-level loop.

Let's cut out the middlemen:

import time

import numpy as np

if __name__ == "__main__":
    start = time.time()
    # read the whole file straight into a uint8 array in one call
    data = np.fromfile(r'd:\mov.mp4', dtype=np.uint8, count=-1)
    end = time.time()
    print("Original file read: ")
    print(end - start)
    encoded, table = huffman_encode(data)

What to do with 'data' depends on what kind of input your huffman_encode(data) expects. I would try to avoid using strings.
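
For instance, if huffman_encode only needs per-symbol frequencies to build its table, the uint8 array can be consumed directly, with no string conversion at all. A minimal sketch (the frequency counting is standard NumPy; how the counts feed into the tree building depends on your huffman_encode):

    import numpy as np

    data = np.fromfile(r'd:\mov.mp4', dtype=np.uint8)

    # One vectorized pass over the array: freqs[b] is the number of
    # times byte value b occurs, for all 256 possible byte values.
    freqs = np.bincount(data, minlength=256)

A Huffman coder can build its tree from freqs and then walk the array once more for the actual encoding pass.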

Documentation on the call is here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html
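
As a side note, if the bytes are already in memory from a plain f.read(), np.frombuffer converts them to a uint8 array in a single call (a sketch, assuming the file fits comfortably in RAM):

    import numpy as np

    with open(r'd:\mov.mp4', 'rb') as f:
        raw = f.read()

    # View the bytes object as a uint8 array without a per-byte loop.
    # The result is read-only; call .copy() on it if it must be modified.
    data = np.frombuffer(raw, dtype=np.uint8)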

  • I would be interested to hear the speed differences in the comments :)

1 Comment

Thank you for your advice. The same 120MB file, which previously took 72s to read (the 210s figure was from my laptop, so it was a little longer there), now takes 0.0629s! And memory consumption went down from 4GB to 200MB :) So this is blazing fast now. Now I just have to figure out why the encoding itself takes so long.
