7

I have a binary file written from C structs that I want to parse in Python. I know the exact format and layout of the binary, but I am confused about how to use Python's struct module to unpack this data.

Would I have to traverse the whole binary unpacking a certain number of bytes at a time based on what the members of the struct are?

C File Format:

typedef struct {
  int data1;
  int data2;
  int data4;
} datanums;

typedef struct {
  datanums numbers;
  char *name;
} personal_data;

Let's say the binary file contains personal_data structs repeated one after another.

4
  • It would be useful if you could give us an example of the file format. Commented May 4, 2015 at 4:52
  • The short answer is, yes, if your file is a sequence of the same struct over and over again, you unpack (or, maybe simpler, unpack_from) one copy of the struct at a time until you're done. Commented May 4, 2015 at 4:57
  • That personal_data struct isn't something you can pack directly in a file. Or at least not usefully, because the char * is just going to get stored as 8 bytes that don't mean anything outside the original program's memory space. So you still need to find out how the C program is actually writing the data. Commented May 4, 2015 at 5:23
  • For example, it might store the string itself null-terminated in place of the pointer. Or it might store all the strings in one block and all the structs in another, replacing the pointers with file offsets. Or plenty of other possibilities. Commented May 4, 2015 at 5:25
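
To make the first of those possibilities concrete, here is a minimal sketch. It assumes, purely for illustration, that the writer stores the three ints as little-endian 4-byte values followed directly by the name as a null-terminated ASCII string, with no padding. Neither that layout nor the helper name parse_personal_data comes from the question, so check how the C program really writes the data before relying on it.

```python
import struct

# Assumed layout (not confirmed by the question): three little-endian
# 4-byte ints, then the name as a null-terminated byte string.
NUMS = struct.Struct('<3i')

def parse_personal_data(data):
    """Yield (data1, data2, data4, name) tuples from a bytes object."""
    offset = 0
    while offset + NUMS.size <= len(data):
        data1, data2, data4 = NUMS.unpack_from(data, offset)
        offset += NUMS.size
        end = data.find(b'\0', offset)  # locate the terminating NUL
        if end == -1:
            break  # truncated record: no terminator found
        name = data[offset:end].decode('ascii')
        offset = end + 1  # skip past the NUL
        yield data1, data2, data4, name

# Build two sample records in memory and parse them back.
sample = NUMS.pack(1, 2, 3) + b'alice\0' + NUMS.pack(4, 5, 6) + b'bob\0'
records = list(parse_personal_data(sample))
```

Because the name has a variable length, the records no longer have a fixed size, so a single format string cannot describe the whole record.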

2 Answers

5

Assuming the layout is a static binary structure that can be described by a simple struct pattern, and the file is just that structure repeated over and over again, then yes, "traverse the whole binary unpacking a certain number of bytes at a time" is exactly what you'd do.

For example:

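import struct

record = struct.Struct('>HB10cL')

def read_records(path):
    # Usage: for fields in read_records('myfile.bin'): ...
    with open(path, 'rb') as f:
        while True:
            buf = f.read(record.size)
            if len(buf) < record.size:
                break  # end of file (or a truncated final record)
            yield record.unpack(buf)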

If you're worried about the efficiency of only reading 17 bytes at a time and you want to wrap that up by buffering 8K at a time or something… well, first make sure it's an actual problem worth optimizing; then, if it is, loop over unpack_from instead of unpack. Something like this (untested, top-of-my-head code):

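def read_records_buffered(path):
    buf, offset = b'', 0
    with open(path, 'rb') as f:
        while True:
            # Refill when fewer than one whole record remains unread.
            if len(buf) - offset < record.size:
                buf, offset = buf[offset:] + f.read(8192), 0
                if len(buf) < record.size:
                    break  # end of file (or a truncated final record)
            yield record.unpack_from(buf, offset)
            offset += record.size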

Or, even simpler, as long as the file isn't too big to fit in your virtual address space, just mmap the whole thing and call unpack_from on the mmap itself:

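import mmap

def read_records_mmap(path):
    with open(path, 'rb') as f:
        # mmap needs the file descriptor, not the file object.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for offset in range(0, len(m) - record.size + 1, record.size):
                yield record.unpack_from(m, offset)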
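
If the data is small enough to hold in memory as a single bytes object, another option in Python 3.4 and later is iter_unpack, which does the offset bookkeeping for you. The sketch below reuses the same made-up format string. The helper name unpack_all is hypothetical, and dropping any trailing partial record is an assumption about what you want, since iter_unpack raises an error when the buffer length is not a multiple of the record size.

```python
import struct

record = struct.Struct('>HB10cL')  # same made-up 17-byte format as above

def unpack_all(data):
    """Return every complete record in data, ignoring trailing bytes."""
    usable = len(data) - len(data) % record.size
    return list(record.iter_unpack(data[:usable]))

# Two sample records plus one stray trailing byte.
chars = [b'x'] * 10
sample = record.pack(1, 2, *chars, 3) + record.pack(4, 5, *chars, 6) + b'\xff'
rows = unpack_all(sample)
```

Each row is a flat tuple, so the 10c part shows up as ten separate one-byte values between the first two integers and the last one.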

4 Comments

Is '>HB10cL' the format of the struct in the binary, using the format string types specified in docs.python.org/2/library/struct.html#struct-format-strings?
@user1224478: Exactly. I just made up a silly random format string, since you didn't tell us the actual format you want. (If you're not familiar with the Struct class API and would rather use struct.unpack(fmt, buf) and struct.calcsize(fmt) instead of record.unpack(buf) and record.size… well, they're doing the exact same thing under the covers, so feel free to do so.)
I now understand the formatting, but what if I want to read the binary in this order: 4 bytes, 4 bytes, 16 bytes, 2 bytes, 2 bytes, 256 bytes. Is it best to read the 16 bytes and 256 bytes in smaller increments and join them together?
@user1224478: You can read 4, 4, 16, 2, 2, 256, or you can join them all together into one structure and read 284 bytes at a time, whichever seems clearer in your code—but there's no reason to break those 16 and 256 up into smaller increments (unless the struct patterns for those are too complicated for you to make sense of).
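
As a rough illustration of the "one structure" approach from the last comment: the field types below are guesses chosen only so the sizes come out to 4, 4, 16, 2, 2 and 256 bytes. The real types, signedness and byte order depend on the file, and the name layout is hypothetical.

```python
import struct

# Hypothetical field types for the 4, 4, 16, 2, 2, 256 byte layout;
# '<' means little-endian with no padding, so the sizes add up exactly.
layout = struct.Struct('<ii16sHH256s')

# Round-trip one sample record; 's' fields are padded with NUL bytes.
sample = layout.pack(1, 2, b'id', 3, 4, b'payload')
a, b, short_blob, c, d, long_blob = layout.unpack(sample)
```

Here the 16-byte and 256-byte fields each come back as a single bytes object, so there is nothing to join afterwards.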
2

You can unpack a few at a time. Let's start with this example:

In [44]: a = struct.pack("iiii", 1, 2, 3, 4)

In [45]: a
Out[45]: '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00'

If you're using a string, you can just use a subset of it, or use unpack_from:

In [49]: struct.unpack("ii",a[0:8])
Out[49]: (1, 2)
In [55]: struct.unpack_from("ii",a,0)
Out[55]: (1, 2)
In [56]: struct.unpack_from("ii",a,4)
Out[56]: (2, 3)

If you're using a buffer, you'll need to use unpack_from.
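
A small sketch of that, assuming Python 3, where unpack_from accepts any object that supports the buffer protocol, such as a bytearray or a memoryview. The explicit '<' prefix is my addition to pin the ints to 4 bytes each.

```python
import struct

buf = bytearray(struct.pack('<iiii', 1, 2, 3, 4))  # a mutable buffer
view = memoryview(buf)  # a zero-copy view over the same bytes

# Read two ints at byte offset 0, then two more at byte offset 8.
first_pair = struct.unpack_from('<ii', buf, 0)
last_pair = struct.unpack_from('<ii', view, 8)
```

The offset is given in bytes, not in fields, which is why the second call uses 8 to skip the first two 4-byte ints.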
