
I have a binary file that contains 32-bit floats. I need to be able to read certain sections of the file into a list or other array-like structure. In other words, I need to read a specific number of bytes (specific number of float32s) at a time into my data structure, then use seek() to seek to another point in the file and do the same thing again.

In pseudocode:

new_list = []

with open('my_file.data', 'rb') as file_in:
    for idx, offset in enumerate(offset_values):
        # seek in the file by the offset
        # read n float32 values into new_list[idx][:]

What is the most efficient/least confusing way to do this?

  • Use numpy.memmap to memory-map the file as a numpy array with dtype numpy.float32. Commented Nov 21, 2018 at 19:08

2 Answers


You can convert bytes to and from 32-bit float values using the struct module:

import random
import struct

FLOAT_SIZE = 4
NUM_OFFSETS = 5
filename = 'my_file.data'

# Create some random offsets.
offset_values = [i*FLOAT_SIZE for i in range(NUM_OFFSETS)]
random.shuffle(offset_values)

# Create a test file
with open(filename, 'wb') as file:
    for offset in offset_values:
        file.seek(offset)
        value = random.random()
        print('writing value:', value, 'at offset', offset)
        file.write(struct.pack('f', value))

# Read sections of file back at offset locations.

new_list = []
with open(filename, 'rb') as file:
    for offset in offset_values:
        file.seek(offset)
        buf = file.read(FLOAT_SIZE)
        value = struct.unpack('f', buf)[0]
        print('read value:', value, 'at offset', offset)
        new_list.append(value)

print('new_list =', new_list)

Sample output:

writing value: 0.0687244786128608 at offset 8
writing value: 0.34336034914481284 at offset 16
writing value: 0.03658244351244533 at offset 4
writing value: 0.9733690320097427 at offset 12
writing value: 0.31991994765615206 at offset 0
read value: 0.06872447580099106 at offset 8
read value: 0.3433603346347809 at offset 16
read value: 0.03658244386315346 at offset 4
read value: 0.9733690023422241 at offset 12
read value: 0.3199199438095093 at offset 0
new_list = [0.06872447580099106, 0.3433603346347809, 0.03658244386315346,
            0.9733690023422241, 0.3199199438095093]

Note that the values read back differ slightly from the ones written. Python floats are 64-bit internally, so some precision is lost when they are packed as 32-bit values, and converting them back to 64 bits does not restore it.
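To make the precision note concrete, here is a minimal sketch of the 64-bit to 32-bit round trip. The helper name to_float32 is hypothetical and only used for illustration. A value such as 0.5 is exactly representable in 32 bits and survives unchanged, while 0.1 does not.

```python
import struct

def to_float32(x):
    """Round-trip a Python float (64-bit) through a 32-bit float."""
    return struct.unpack('f', struct.pack('f', x))[0]

half = to_float32(0.5)   # exactly representable in 32 bits
tenth = to_float32(0.1)  # not exactly representable, so it changes

print(half == 0.5)
print(tenth == 0.1)
print(abs(tenth - 0.1))  # size of the rounding error
```

When comparing values read from such a file against 64-bit originals, a tolerance-based check such as math.isclose is usually safer than ==.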


4 Comments

This looks very promising. What if I need to read multiple floats at a time (i.e. a whole line of values into a line of my list)? Would I use a for loop containing struct.unpack('f', buf)[0] to run the struct.unpack operation as many times as values I need from the line?
@questionable_code: Yes, you could do it in a for loop, but it would be much more efficient to use the struct.unpack() function to do it since it's capable of unpacking multiple values each time it's called if you give it the proper format string (i.e. '4f' for four of them). Note that strictly-speaking there are no "lines" in a binary file, so to use it that way after a seek() to the beginning of the group, you would then need to read in the desired number of FLOAT_SIZE bytes into the buf buffer.
What if the number of values I need is variable? How would I write the format string for that?
@questionable_code: The required format string could easily be constructed on-the-fly if you know the number of 32-bit floats expected at each offset.
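The on-the-fly format string mentioned in the comments can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: the helper read_floats, the temporary test file, and the (offset, count) pairs are hypothetical, and the '<' prefix assumes the file is little-endian (drop it to use native byte order as in the answer).

```python
import os
import struct
import tempfile

FLOAT_SIZE = struct.calcsize('<f')  # 4 bytes per 32-bit float

def read_floats(file, offset, count):
    """Seek to a byte offset and unpack `count` float32 values in one call."""
    file.seek(offset)
    buf = file.read(count * FLOAT_SIZE)
    if len(buf) != count * FLOAT_SIZE:
        raise ValueError('not enough data at offset %d' % offset)
    # Build the format string for this particular count, e.g. '<3f'.
    return list(struct.unpack('<%df' % count, buf))

# Hypothetical test file holding the values 0.0 .. 9.0 as float32.
path = os.path.join(tempfile.mkdtemp(), 'my_file.data')
with open(path, 'wb') as f:
    f.write(struct.pack('<10f', *[float(i) for i in range(10)]))

# Assumed (byte offset, number of floats) pairs.
requests = [(8, 3), (0, 2), (28, 1)]

with open(path, 'rb') as f:
    new_list = [read_floats(f, offset, count) for offset, count in requests]

print(new_list)
```

Each entry of new_list is then a list whose length matches the count requested at that offset, which covers the variable-length case raised in the comments.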

The binary data in your input file can be mapped into virtual memory using mmap. From there, you can wrap the buffer in a numpy array, if desired. One note: the numpy dtype must match the byte order used in the file (this example assumes native byte order; '<f4' or '>f4' selects little- or big-endian explicitly). The resulting array contains the float values rather than the raw bytes.

import mmap
import numpy as np
import os

new_list = []

with open('my_file.data', 'rb') as file_in:
    size_bytes = os.fstat(file_in.fileno()).st_size
    m = mmap.mmap(file_in.fileno(), length=size_bytes, access=mmap.ACCESS_READ)
    arr = np.frombuffer(m, np.dtype('float32'), offset=0)
    for idx, offset in enumerate(offset_values):
        new_list.append(arr[offset//4])  # offset is in bytes; each float32 occupies 4 bytes

I tested this with an n=10000 array of random floats, converted to bytes:

import random
import struct

a = b''  # struct.pack() returns bytes, so accumulate into a bytes object
for i in range(10000):
    a += struct.pack('<f', random.uniform(0, 1000))

Then I read this "a" variable into the numpy array, as you would with the binary information from file.

>>> arr = np.frombuffer(a, np.dtype('float32'), offset=0)
>>> arr[500]
634.24408
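Since the question asks for several values per offset, the same idea extends to slices. The following is a sketch only, using numpy.memmap (as suggested in the comment on the question) in place of mmap plus frombuffer. The test file, the byte offsets, and n are assumptions made up for the example, and the offsets are assumed to be multiples of 4.

```python
import os
import tempfile

import numpy as np

# Hypothetical test file: 100 float32 values, 0.0 .. 99.0, native byte order.
path = os.path.join(tempfile.mkdtemp(), 'my_file.data')
np.arange(100, dtype=np.float32).tofile(path)

offset_values = [40, 0, 200]  # assumed byte offsets, each a multiple of 4
n = 3                         # assumed number of floats to read per offset

# Map the whole file read-only; nothing is read until it is indexed.
arr = np.memmap(path, dtype=np.float32, mode='r')

new_list = []
for offset in offset_values:
    start = offset // 4  # convert the byte offset to an element index
    new_list.append(arr[start:start + n].tolist())

print(new_list)
del arr  # drop the reference so the mapping can be released
```

Slicing a memory-mapped array only touches the pages that back the requested elements, so this stays cheap even for large files.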
