
I want to read a binary file in Python, the exact layout of which is stored in the binary file itself.

The file contains a sequence of two-dimensional arrays, with the row and column dimensions of each array stored as a pair of integers preceding its contents. I want to successively read all of the arrays contained within the file.

I know this can be done with f = open("myfile", "rb") and f.read(numberofbytes), but that is quite clumsy, because I would then need to convert the output into meaningful data structures myself. I would like to use numpy's np.fromfile with a custom dtype, but I have not found a way to read part of the file, leave it open, and then continue reading with a modified dtype.
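For example, the manual conversion might look like this (a sketch assuming the C++ program wrote 4-byte ints and 8-byte doubles in native byte order):

import struct

with open("myfile", "rb") as f:
    # read the two dimension integers, then the raw array bytes
    rows, cols = struct.unpack("ii", f.read(8))
    values = struct.unpack("%dd" % (rows * cols), f.read(rows * cols * 8))
    # values is a flat tuple that still has to be reshaped by hand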

I know I could combine f.seek(numberofbytes, os.SEEK_SET) with multiple np.fromfile calls, but this would mean a lot of unnecessary jumping around in the file, as in the sketch below.
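That is, manually tracking a byte offset and seeking before every read; something like this (the dtypes are my assumptions about what the C++ program wrote):

import os
import numpy as np

offset = 0
with open("myfile", "rb") as f:
    f.seek(offset, os.SEEK_SET)
    dims = np.fromfile(f, dtype=np.int32, count=2)
    offset += dims.nbytes

    f.seek(offset, os.SEEK_SET)
    arr = np.fromfile(f, dtype=np.double, count=int(np.prod(dims))).reshape(dims)
    offset += arr.nbytes
    # ...and repeat the seek/read bookkeeping for every array in the file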

In short, I want MATLAB's fread (or at least something like C++'s ifstream::read).

What is the best way to do this?

  • Could you describe the file format? It's hard to recommend a particular approach without knowing anything about the file itself. Commented Jul 3, 2015 at 23:17
  • It's a raw binary file containing matrices of doubles written by a C++ program, along with integers that describe the sizes of the matrices. Commented Jul 3, 2015 at 23:20
  • Does a single file contain multiple arrays, or is there just one array per file? Are the array dimensions given in a header at the start of the file? Could you describe the header? Commented Jul 3, 2015 at 23:27
  • It's a known number of arrays with unknown dimensions. Before each array, there are two integers describing the dimensions of the array. Commented Jul 3, 2015 at 23:32

1 Answer


You can pass an open file object to np.fromfile. Since it advances the file position as it reads, you can read the dimensions of the first array, then read the array contents (again using np.fromfile), and repeat the process for additional arrays within the same file, with no explicit seeking required.

For example:

import numpy as np
import os

def iter_arrays(fname, array_ndim=2, dim_dtype=np.int64, array_dtype=np.double):
    # dim_dtype must match the integer type the writer used for the dimensions

    with open(fname, 'rb') as f:
        fsize = os.fstat(f.fileno()).st_size

        # while we haven't yet reached the end of the file...
        while f.tell() < fsize:

            # get the dimensions for this array
            dims = np.fromfile(f, dim_dtype, array_ndim)

            # get the array contents
            yield np.fromfile(f, array_dtype, np.prod(dims)).reshape(dims)

Example usage:

# write some random arrays to an example binary file
x = np.random.randn(100, 200)
y = np.random.randn(300, 400)

with open('/tmp/testbin', 'wb') as f:
    # write each shape with the same dtype that iter_arrays expects
    np.array(x.shape, dtype=np.int64).tofile(f)
    x.tofile(f)
    np.array(y.shape, dtype=np.int64).tofile(f)
    y.tofile(f)

# read the contents back
x1, y1 = iter_arrays('/tmp/testbin')

# check that they match the input arrays
assert np.allclose(x, x1) and np.allclose(y, y1)

If the arrays are large, you might consider using np.memmap with the offset= parameter in place of np.fromfile to get the contents of the arrays as memory-maps rather than loading them into RAM.
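For instance, a minimal sketch of that variant (iter_memmaps is a hypothetical name; it assumes the same file layout and dtypes as iter_arrays above):

import os
import numpy as np

def iter_memmaps(fname, array_ndim=2, dim_dtype=np.int64, array_dtype=np.double):
    offset = 0
    fsize = os.path.getsize(fname)
    while offset < fsize:
        # read just the header integers for this array
        dims = np.fromfile(fname, dim_dtype, array_ndim, offset=offset)
        offset += dims.nbytes
        # map the array contents instead of loading them into memory
        yield np.memmap(fname, dtype=array_dtype, mode='r',
                        offset=offset, shape=tuple(dims))
        offset += int(np.prod(dims)) * np.dtype(array_dtype).itemsize

Each yielded array is then backed by the file on disk and only read into memory as you index into it.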
