
I have very large datasets that are stored in binary files on the hard disk. Here is an example of the file structure:

File Header

149 Byte ASCII Header

Record Start

4 Byte Int - Record Timestamp

Sample Start

2 Byte Int - Data Stream 1 Sample
2 Byte Int - Data Stream 2 Sample
2 Byte Int - Data Stream 3 Sample
2 Byte Int - Data Stream 4 Sample

Sample End

There are 122,880 samples per record and 713 records per file, which yields a total size of 700,910,521 bytes. The sample rate and number of records do vary from file to file, so I have to detect both per file in code.
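(Size check: 149 header bytes + 713 records × (4 + 122,880 × 4 × 2) bytes per record = 149 + 713 × 983,044 = 700,910,521 bytes.)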

Currently the code I use to import this data into arrays works like this:

from time import clock
from numpy import zeros , int16 , int32 , hstack , array , savez
from struct import unpack
from os.path import getsize

start_time = clock()
file_size = getsize(input_file)

with open(input_file,'rb') as openfile:
  input_data = openfile.read()

header = input_data[:149]
record_size = int(header[23:31])                      # record size in bytes, stored as ASCII text in the header
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2         # samples per record: drop the 4-byte timestamp, 4 streams, 2 bytes each

time_series = zeros(0,dtype=int32)
t_series = zeros(0,dtype=int16)
x_series = zeros(0,dtype=int16)
y_series = zeros(0,dtype=int16)
z_series = zeros(0,dtype=int16)

for record in xrange(number_of_records):

  time_stamp = array( unpack( '<l' , input_data[ 149 + (record * record_size) : 149 + (record * record_size) + 4 ] ) , dtype = int32 )
  unpacked_record = unpack( '<' + str(sample_rate * 4) + 'h' , input_data[ 149 + (record * record_size) + 4 : 149 + ( (record + 1) * record_size ) ] ) 

  record_t = zeros(sample_rate , dtype=int16)
  record_x = zeros(sample_rate , dtype=int16)
  record_y = zeros(sample_rate , dtype=int16)
  record_z = zeros(sample_rate , dtype=int16)

  for sample in xrange(sample_rate):

    record_t[sample] = unpacked_record[ ( sample * 4 ) + 0 ]
    record_x[sample] = unpacked_record[ ( sample * 4 ) + 1 ]
    record_y[sample] = unpacked_record[ ( sample * 4 ) + 2 ]
    record_z[sample] = unpacked_record[ ( sample * 4 ) + 3 ]

  time_series = hstack ( ( time_series , time_stamp ) )
  t_series = hstack ( ( t_series , record_t ) )
  x_series = hstack ( ( x_series , record_x ) )
  y_series = hstack ( ( y_series , record_y ) )
  z_series = hstack ( ( z_series , record_z ) )

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, time=time_series)
end_time = clock()
print 'Total Time',end_time - start_time,'seconds'

This currently takes about 250 seconds per 700 MB file, which to me seems very high. Is there a more efficient way I could do this?

Final Solution

Using the numpy fromfile method with a custom dtype cut the runtime to 9 seconds, 27x faster than the original code above. The final code is below.

from numpy import savez, dtype , fromfile 
from os.path import getsize
from time import clock

start_time = clock()
file_size = getsize(input_file)

openfile = open(input_file,'rb')
header = openfile.read(149)
record_size = int(header[23:31])
number_of_records = ( file_size - 149 ) / record_size
sample_rate = ( ( record_size - 4 ) / 4 ) / 2

# one record: a little-endian 4-byte timestamp followed by a (sample_rate, 4) block of little-endian 2-byte samples
record_dtype = dtype( [ ( 'timestamp' , '<i4' ) , ( 'samples' , '<i2' , ( sample_rate , 4 ) ) ] )

data = fromfile(openfile , dtype = record_dtype , count = number_of_records )
time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

savez(output_file, t=t_series , x=x_series ,y=y_series, z=z_series, fid=time_series)

end_time = clock()

print 'It took',end_time - start_time,'seconds'
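For completeness, the saved arrays can be read back later with numpy.load; a minimal sketch, assuming output_file is the same path that was passed to savez above:

from numpy import load

archive = load(output_file + '.npz')   # savez appends '.npz' if the path has no extension; adjust if yours already had it
t_series = archive['t']
x_series = archive['x']
y_series = archive['y']
z_series = archive['z']
time_series = archive['fid']           # keyword names match the savez call above
archive.close()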
3 Comments
  • Is it medical data? EDF? If you don't know what I'm talking about, nevermind... ;o) Anyway, take a look at my answer, which I use to open medical data binary files according to this question: stackoverflow.com/q/5804052/401828 . There is an interesting discussion there. Commented Sep 27, 2011 at 13:40
  • No the data is geophysical. I saw your question while researching before posting. Your data consists of nothing but short ints, where I unfortunately have the 4 byte int timestamps scattered throughout the stream. Commented Sep 27, 2011 at 17:04
  • For what it's worth, many operations on numpy structured arrays are much, much slower than on regular numpy arrays. Import time might be faster, but calculations might take 10-100 times longer :( Commented May 8, 2015 at 17:39

4 Answers


Some hints: don't unpack each record with the struct module in a pure-Python loop. Instead, describe one full record with a Numpy structured data type and read the whole file in a single numpy.fromfile call.

Something like this (untested, but you get the idea):

import numpy as np

file = open(input_file, 'rb')
header = file.read(149)

# ... parse the header as you did ...

record_dtype = np.dtype([
    ('timestamp', '<i4'), 
    ('samples', '<i2', (sample_rate, 4))
])

data = np.fromfile(file, dtype=record_dtype, count=number_of_records)
# NB: count can be omitted -- it just reads the whole file then

time_series = data['timestamp']
t_series = data['samples'][:,:,0].ravel()
x_series = data['samples'][:,:,1].ravel()
y_series = data['samples'][:,:,2].ravel()
z_series = data['samples'][:,:,3].ravel()

3 Comments

Total time now is 9.07 seconds including savez! Thank you. I will update the question with the final code.
Great answer, excellent use of numpy built-in functionality! +1
How do you know if you have a header, and how long it is?

One glaring inefficiency is the use of hstack in a loop:

  time_series = hstack ( ( time_series , time_stamp ) )
  t_series = hstack ( ( t_series , record_t ) )
  x_series = hstack ( ( x_series , record_x ) )
  y_series = hstack ( ( y_series , record_y ) )
  z_series = hstack ( ( z_series , record_z ) )

On every iteration, this allocates a slightly bigger array for each of the series and copies all the data read so far into it. This involves lots of unnecessary copying and can potentially lead to bad memory fragmentation.

I'd accumulate the values of time_stamp in a list and do one hstack at the end, and would do exactly the same for record_t etc.
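A minimal sketch of that pattern, reusing the names from the question's code (only the timestamp and t series are shown; x, y and z work the same way):

time_chunks = []
t_chunks = []

for record in xrange(number_of_records):
    # ... unpack time_stamp and record_t exactly as in the question ...
    time_chunks.append(time_stamp)
    t_chunks.append(record_t)

# a single concatenation per series at the very end
time_series = hstack(time_chunks)
t_series = hstack(t_chunks)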

If that doesn't bring sufficient performance improvements, I'd comment out the body of the loop and start bringing things back in one at a time, to see where exactly the time is spent.

3 Comments

Okay well implementing this has cut the time down to 110 seconds!! Thanks, I am going to try implementing some of the other suggestions as well.
Also, of the 110 seconds now, about 40 are for the savez function, which I cannot optimize. Though for comparison, loading the .npz takes only 20 seconds.
I must have been wrong about savez as using the fromfile method with a custom dtype the time including savez dropped to 9 seconds.

Numpy supports mapping binary data from disk directly into array-like objects via numpy.memmap. You might be able to memmap the file and extract the data you need via offsets.
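A rough sketch of what that might look like (untested; it borrows record_dtype and number_of_records from the header parsing in the question):

import numpy as np

# Lazily map the records instead of reading the whole file into memory;
# offset=149 skips the ASCII header.
data = np.memmap(input_file, dtype=record_dtype, mode='r',
                 offset=149, shape=(number_of_records,))

time_series = np.array(data['timestamp'])                 # copy the field into ordinary memory
t_series = np.array(data['samples'][:, :, 0]).ravel()     # likewise for each data stream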

For endianness correctness, just call the byteswap method on what you have read in. You can use a conditional expression to check the endianness of the host system:

import struct
import numpy as np

if struct.pack('=f', np.pi) == struct.pack('>f', np.pi):
  # Host is big-endian, so swap the little-endian file data in place
  arrayName.byteswap(True)

2 Comments

I have looked at that, but there appears to be no way to specify the endian-ness of the data. The code needs to work under both windows and unix so the endian-ness needs to be explicitly stated.
You can use numpy.byteswap to set the correct endianness of the data in-place, if required. See edit.

I have got satisfactory results with a similar problem (multi-resolution, multi-channel binary data files) by using array and struct.unpack. In my problem, I wanted continuous data for each channel, but the file had an interval-oriented structure instead of a channel-oriented structure.

The "secret" is to read the whole file first, and only then distribute the known-sized slices to the desired containers (on the code below, self.channel_content[channel]['recording'] is an object of type array):

import os
from array import array

f = open(somefilename, 'rb')
fullsamples = array('h')    # one flat array of 2-byte samples for the whole file
fullsamples.fromfile(f, os.path.getsize(somefilename)/2 - f.tell())
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(fullsamples[position:position+samples])
        position += samples

Of course, I cannot state this is better or faster than other answers provided, but at least it is something you might evaluate.

Hope it helps!
