7

I have a very large big-endian binary file, and I know how many numbers it contains. I found a solution for reading a big-endian file using struct, and it works perfectly if the file is small:

    import struct

    data = []
    file = open('some_file.dat', 'rb')

    for i in range(numcount):
        data.append(struct.unpack('>f', file.read(4))[0])

But this code is very slow when the file size is more than ~100 MB. My current file is 1.5 GB and contains 399,513,600 float numbers. With this file, the above code takes about 8 minutes.

I found another solution, that works faster:

    datafile = open('some_file.dat', 'rb').read()
    f_len = ">" + "f" * numcount   #numcount = 399513600

    numbers = struct.unpack(f_len, datafile)

This code runs in about 1.5 minutes, but that is still too slow for me. Earlier I wrote code with the same functionality in Fortran, and it ran in about 10 seconds.

In Fortran I open the file with a "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as a string and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?

1
  • I have also had some bad experiences with struct; reading a file of ~1 GB at once (your second example) totally maxes out the memory on my laptop (8 GB), which then of course makes everything very slow. Reading it in chunks was the solution in my case. Commented Nov 3, 2016 at 9:35

3 Answers

7

You can use numpy.fromfile to read the file, and specify that the type is big-endian by using > in the dtype parameter:

    numpy.fromfile(filename, dtype='>f')

There is an array.fromfile method too, but unfortunately I cannot see any way to control endianness with it, so depending on your use case it might let you avoid the dependency on a third-party library, or it might be useless.
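
As a concrete illustration, a minimal usage sketch might look like the following; the file name and element count are taken from the question, and the astype conversion is an optional extra step, not something the answer requires:

    import numpy as np

    filename = 'some_file.dat'   # file name from the question
    numcount = 399513600         # number of floats from the question

    # '>f4' means big-endian 32-bit float; count limits how many values are read.
    data = np.fromfile(filename, dtype='>f4', count=numcount)

    # Optionally convert to the machine's native byte order for later processing.
    data = data.astype(np.float32)

    print(len(data), data[:5])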


1

The following approach gave a good speed-up for me:

    import struct
    import time


    block_size = 4096  # number of floats to read per block
    start = time.time()

    with open('some_file.dat', 'rb') as f_input:
        data = []

        while True:
            block = f_input.read(block_size * 4)
            # Unpack however many complete floats this block contains.
            data.extend(struct.unpack('>{}f'.format(len(block) // 4), block))

            if len(block) < block_size * 4:
                break  # a short or empty read means end of file

    print("Time taken: {:.2f}".format(time.time() - start))
    print("Length", len(data))

Rather than using >fffffff… you can specify a count, e.g. >1000f. The script reads the file 4096 floats (16,384 bytes) at a time; when the amount read is less than a full block, it unpacks however many floats it actually got and then exits the loop.

From the struct - Format Characters documentation:

A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.
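
For example, a quick check you can run in an interactive session shows that the two notations describe the same layout:

    import struct

    values = (1.0, 2.0, 3.0, 4.0)
    packed = struct.pack('>4f', *values)

    # '>4f' and '>ffff' are interchangeable.
    assert struct.unpack('>4f', packed) == struct.unpack('>ffff', packed)
    assert struct.calcsize('>4f') == struct.calcsize('>ffff') == 16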

1 Comment

This approach works well for relatively "small" files (1-2 GB). Given the title of the question, though, how would you go about reading a much larger file (20-30 GB) in smaller chunks without running out of memory? (I know it must involve some partitioning of the file.)
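
Not part of the original answer, but one way to adapt the chunked approach so that values are processed as a stream rather than accumulated in one list might look like the sketch below; the file name is the one from the question, and it assumes the file contains only 4-byte floats:

    import struct

    def iter_floats(filename, floats_per_block=4096):
        """Yield batches of big-endian floats without loading the whole file."""
        with open(filename, 'rb') as f_input:
            while True:
                block = f_input.read(floats_per_block * 4)
                if not block:
                    break
                yield struct.unpack('>{}f'.format(len(block) // 4), block)

    # Consume each batch as it is produced; here we only count the values.
    total = 0
    for batch in iter_floats('some_file.dat'):
        total += len(batch)
    print(total)
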
0
    def read_big_endian(filename):
        all_text = ""
        with open(filename, "rb") as template:
            template.read(2)  # skip the first 2 bytes, the FF FE byte-order mark
            while True:
                dchar = template.read(2)
                if len(dchar) < 2:  # end of file
                    break
                all_text += chr(dchar[0])  # keep the low byte of each 2-byte character
        return all_text


    def save_big_endian(filename, text):
        with open(filename, "wb") as fic:
            fic.write(b"\xff\xfe")  # first 2 bytes are FF FE
            for letter in text:
                # Works for characters up to U+00FF.
                fic.write(letter.encode("latin-1") + b"\x00")

I used this to read .rdp files.
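
A possible round trip with these helpers, using a made-up file name, would be:

    # Hypothetical usage of the helpers above; 'connection.rdp' is a placeholder name.
    text = read_big_endian('connection.rdp')
    save_big_endian('copy.rdp', text)
    assert read_big_endian('copy.rdp') == text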
