7

I have a very large big-endian binary file, and I know how many numbers it contains. I found a solution for reading a big-endian file using struct, and it works perfectly if the file is small:

    import struct

    data = []
    file = open('some_file.dat', 'rb')

    for i in range(numcount):
        data.append(struct.unpack('>f', file.read(4))[0])

But this code is very slow when the file size is more than ~100 MB. My current file is 1.5 GB and contains 399,513,600 float numbers. With this file, the above code takes about 8 minutes.

I found another solution, that works faster:

    datafile = open('some_file.dat', 'rb').read()
    f_len = ">" + "f" * numcount   #numcount = 399513600

    numbers = struct.unpack(f_len, datafile)

This code runs in about 1.5 minutes, but that is still too slow for me. Earlier I wrote code with the same functionality in Fortran, and it ran in about 10 seconds.

In Fortran I open the file with a "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as a string and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?

1
  • I have also had some bad experiences with struct; reading a file of ~1 GB at once (your second example) totally maxes out the memory on my laptop (8 GB), which then of course makes everything very slow. Reading it in chunks was the solution in my case. Commented Nov 3, 2016 at 9:35

3 Answers

7

You can use numpy.fromfile to read the file, and specify that the type is big-endian by using > in the dtype parameter:

    numpy.fromfile(filename, dtype='>f')

There is an array.fromfile method too, but unfortunately I cannot see any way to control endianness with it, so depending on your use case it might let you avoid the dependency on a third-party library, or it might be useless.
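
As a concrete illustration, a minimal usage sketch might look like the following; the file name and element count are taken from the question, and the astype conversion is an optional extra step, not something the answer requires:

    import numpy as np

    filename = 'some_file.dat'   # file name from the question
    numcount = 399513600         # number of floats from the question

    # '>f4' means big-endian 32-bit float; count limits how many values are read.
    data = np.fromfile(filename, dtype='>f4', count=numcount)

    # Optionally convert to the machine's native byte order for later processing.
    data = data.astype(np.float32)

    print(len(data), data[:5])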


1

The following approach gave a good speed-up for me:

    import struct
    import time


    block_size = 4096  # number of floats to read per block
    start = time.time()

    with open('some_file.dat', 'rb') as f_input:
        data = []

        while True:
            block = f_input.read(block_size * 4)
            # Unpack however many complete floats this block contains.
            data.extend(struct.unpack('>{}f'.format(len(block) // 4), block))

            if len(block) < block_size * 4:
                break  # a short or empty read means end of file

    print("Time taken: {:.2f}".format(time.time() - start))
    print("Length", len(data))

Rather than using >fffffff… you can specify a count, e.g. >1000f. The script reads the file 4096 floats (16,384 bytes) at a time; when the amount read is less than a full block, it unpacks however many floats it actually got and then exits the loop.

From the struct - Format Characters documentation:

A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.
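
For example, a quick check you can run in an interactive session shows that the two notations describe the same layout:

    import struct

    values = (1.0, 2.0, 3.0, 4.0)
    packed = struct.pack('>4f', *values)

    # '>4f' and '>ffff' are interchangeable.
    assert struct.unpack('>4f', packed) == struct.unpack('>ffff', packed)
    assert struct.calcsize('>4f') == struct.calcsize('>ffff') == 16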

1 Comment

This approach works well for relatively "small" files (1-2 GB). Given the title of the question, though, how would you go about reading a much larger file (20-30 GB) in smaller chunks without running out of memory? (I know it must involve some partitioning of the file.)
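
Not part of the original answer, but one way to adapt the chunked approach so that values are processed as a stream rather than accumulated in one list might look like the sketch below; the file name is the one from the question, and it assumes the file contains only 4-byte floats:

    import struct

    def iter_floats(filename, floats_per_block=4096):
        """Yield batches of big-endian floats without loading the whole file."""
        with open(filename, 'rb') as f_input:
            while True:
                block = f_input.read(floats_per_block * 4)
                if not block:
                    break
                yield struct.unpack('>{}f'.format(len(block) // 4), block)

    # Consume each batch as it is produced; here we only count the values.
    total = 0
    for batch in iter_floats('some_file.dat'):
        total += len(batch)
    print(total)
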
0
    def read_big_endian(filename):
        all_text = ""
        with open(filename, "rb") as template:
            template.read(2)  # skip the first 2 bytes, the FF FE byte-order mark
            while True:
                dchar = template.read(2)
                if len(dchar) < 2:  # end of file
                    break
                all_text += chr(dchar[0])  # keep the low byte of each 2-byte character
        return all_text


    def save_big_endian(filename, text):
        with open(filename, "wb") as fic:
            fic.write(b"\xff\xfe")  # first 2 bytes are FF FE
            for letter in text:
                # Works for characters up to U+00FF.
                fic.write(letter.encode("latin-1") + b"\x00")

I used this to read .rdp files.
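
A possible round trip with these helpers, using a made-up file name, would be:

    # Hypothetical usage of the helpers above; 'connection.rdp' is a placeholder name.
    text = read_big_endian('connection.rdp')
    save_big_endian('copy.rdp', text)
    assert read_big_endian('copy.rdp') == text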
