
+++ WARNING, THE FOLLOWING CONTAINS VERY UGLY PROGRAMMING +++

+++ PLEASE HELP!!! +++

Hey, I have been playing around with my read-in routines for quite a long time now and I still haven't figured out a good and fast way!

I have something like this: a huge binary file that I want to slice down into a NumPy array!

I created this structure to read in a certain number of bytes with fromfile:

    import numpy as np

    mydt = np.dtype([
        ('col1', np.uint64),
        ('col2', np.int32),
        ('cols3_56', np.float32, (53,)),
    ])

I read that like this:

    data_block = np.fromfile(openfile, dtype=mydt, count=ntimes)

What I am getting out is something like this:

[(88000031189210L, 1, [-1000.0, -1000.0, -1000.0, -2.0, -2.0, -2.0, 65004000.0, 0.0, 760680000.0, 0.0, 0.12124349921941757, 0.04971266910433769, 2328.39990234375, 0.00013795999984722584, 0.0, 0.0, -1.0, -1.0, -1.0, 65004000.0, -1.0, 760680000.0, 0.0, 0.0, -1.0, 825680000.0, 0.0, -1.0, -1.0, -1.0, 157630.0, 0.0, 756310.0, 0.0, -1.0, -1.0, 0.0, 5.250500202178955, 0.0, 5.250500202178955, -13.602999687194824, -16.760000228881836, -17.283000946044922, -16.95800018310547, -17.513999938964844, -17.57200050354004, -13.657999992370605, -16.77199935913086, -17.291000366210938, -16.9689998626709, -17.520999908447266, -17.57200050354004, 1.0]), [(88......1L, 1, [-1000.0, ....]), ....

Then I extend my array with this data block:

    data_block_array.extend(data_block)

... and I do this millions of times ...
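Put together, the whole read step looks roughly like the sketch below. The file name and the loop structure are placeholders of mine (`mydt` and `ntimes` are as above); collecting the blocks in a plain list and concatenating once at the end is just one way to do the accumulation:

    import numpy as np

    blocks = []  # plays the role of data_block_array
    with open("huge_file.bin", "rb") as openfile:  # placeholder file name
        while True:
            # read up to ntimes records per pass
            data_block = np.fromfile(openfile, dtype=mydt, count=ntimes)
            if data_block.size == 0:
                break
            blocks.append(data_block)
    # one concatenate at the end instead of growing the array piecewise
    data_block_array = np.concatenate(blocks)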

I now want to access two things:

  • the 2nd element in the structure above (in this example "1") for the entire data array, which is a couple of million times the record shown above
  • the 8th element (the 12th in total) in the 53-column data block for the entire array, again millions of sub-structures!

I got that working by looping over a counter:

    i = 0
    while i < count:
        self.data_array[i, element1] = data_block_array[i][1]
        self.data_array[i, element8] = data_block_array[i][2][13]
        i += 1

which is incredibly slow ... I would like to find a very fast and easy way to filter my data like that and extract the columns I am interested in. I'd appreciate some advice and insights!


1 Answer


You can try numpy.memmap:

    import numpy as np

    mydt = np.dtype([
        ('col1', np.uint64),
        ('col2', np.int32),
        ('cols3_56', np.float32, (53,)),
    ])

    # write some random records to disk so we have a file to map
    data = np.zeros(1000, dtype=mydt)
    tmp = data.view(np.float32)
    tmp[:] = np.random.rand(len(tmp))
    data.tofile("tmp.dat")

    # map the file instead of reading it all into memory
    mm = np.memmap("tmp.dat", mydt, "r")
    assert np.all(data["col2"] == np.asarray(mm["col2"]))
    assert np.all(data["cols3_56"][7] == np.asarray(mm["cols3_56"][7]))