
I have one big file saved using numpy in append mode, i.e., it contains maybe 5000 arrays, each with a shape like [1, 224, 224, 3], written like this:

import numpy as np

filepath = 'hello'
for ndarray in some_loop:              # pseudocode: one array per iteration
    ...
    with open(filepath, 'ab') as f:    # reopen in append mode and write one more array
        np.save(f, ndarray)

I need to load the data from this file, maybe all arrays at once, or maybe in a generator-like fashion, e.g. reading the first 100, then the next 100, and so on. Is there a proper way to do this? Right now I only know that np.load gives me one array per call, and I don't know how to read, say, arrays 100 to 199.

The question "loading arrays saved using numpy.save in append mode" talks about something related, but it does not seem to be what I want.

  • If you recorded the file sizes during creation, you might be able to use seek to jump ahead to the right spot. But as the link shows, this is an undocumented feature, and you are basically on your own. Commented Jun 29, 2018 at 3:45
  • Thanks @hpaulj, but unfortunately I did not record the sizes, as the data is stored dynamically and recording them would be a little complicated... Is there a better way to solve this? Commented Jun 29, 2018 at 4:01
  • Are you interested in loading individual arrays? Or concatenating arrays along an axis? Commented Jun 29, 2018 at 4:16
  • I want to concatenate them into one array, but I don't know how many arrays each file contains. @HanAltae-Tran Commented Jun 29, 2018 at 4:22
  • Do you control the save loop? That looks like a good place to fix this, especially by switching the order of the loop and the with (see the sketch after these comments). Commented Jun 29, 2018 at 4:29
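A minimal sketch of that last suggestion, assuming the writer can be changed: open the file once, outside the loop, so every np.save call appends to the same handle (some_iterable here is a placeholder for whatever produces the arrays):

import numpy as np

filepath = 'hello'
with open(filepath, 'wb') as f:        # open once; 'wb' starts a fresh file
    for ndarray in some_iterable:      # placeholder source of [1, 224, 224, 3] arrays
        np.save(f, ndarray)            # each call appends one more array to the open handle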

2 Answers


One solution, although it is ugly and can only load all the arrays in the file at once (and thus risks an out-of-memory error), is the following:

import numpy as np

a = []
with open(filepath, 'rb') as f:
    while True:
        try:
            a.append(np.load(f))   # read the next array from the file
        except:                    # np.load fails once the end of the file is reached
            break
np.stack(a)                        # combine everything into one array along a new axis
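If loading everything at once is a concern, the same loop can be wrapped in a generator that yields a fixed number of arrays per pass, which covers the "first 100, then the next 100" use case without holding the whole file in memory. This is only a sketch under the same assumptions; the helper name load_batches and the batch size are made up here:

import numpy as np

def load_batches(filepath, batch_size=100):
    """Yield lists of up to batch_size arrays, read sequentially from the file."""
    with open(filepath, 'rb') as f:
        batch = []
        while True:
            try:
                batch.append(np.load(f))
            except Exception:              # end of file (or a read error) stops the loop
                break
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:                          # flush the final, possibly smaller batch
            yield batch

for batch in load_batches('hello', batch_size=100):
    chunk = np.concatenate(batch)          # e.g. 100 arrays of [1, 224, 224, 3] -> [100, 224, 224, 3]

Note that the file is still read strictly from the beginning; the generator only limits how much sits in memory at any one time.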


This is more of a hack (given your situation).

Anyway, here is the code that creates the file with np.save in append mode:

import numpy as np

numpy_arrays = [np.array([1, 2, 3]), np.array([0, 9])]

print(numpy_arrays[0], numpy_arrays[1])
print(type(numpy_arrays[0]), type(numpy_arrays[1]))
for numpy_array in numpy_arrays:
    with open("./my-numpy-arrays.bin", 'ab') as f:
        np.save(f, numpy_array)

[1 2 3] [0 9]
<class 'numpy.ndarray'> <class 'numpy.ndarray'>

... and here is the code that catches the end-of-file error (and any other errors) while looping through:

with open("./my-numpy-arrays.bin", 'rb') as f:
    while True:
        try:
            numpy_array = np.load(f)
            print(numpy_array)
        except:
            break

[1 2 3]
[0 9]

Not very pretty but ... it works.

4 Comments

Thanks @Edward. This is one solution, but it carries the risk of an out-of-memory error. Think about an extremely large numpy array file ... (like 100 GB? Although I am not using such a large file ...)
Since you do not save the offset of each array in the file, the only thing we can do is scan through it this way. Practically speaking, if I were in a similar situation, I would either split the data into multiple files (if space permits) or build an index of the starting position of each member array and seek to those positions for subsequent accesses (see the sketch after these comments). Either way, you cannot avoid reading the file at least once. My loop does not grow, as it reuses the same variable.
Thanks @Edward. You are right, it will not grow, but if, say, I want to load 100 arrays at a time, I might need to re-read the first 100 arrays every time I load the next 100, which is very inefficient. Maybe you are right that it is impossible to do what I want without saving things properly in advance. Thanks again.
A bare except is in general a bad idea; you might want to catch EOFError instead (see github.com/numpy/numpy/pull/23105)
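A minimal sketch of the offset-index idea from these comments, assuming the writer can be changed: record f.tell() before each np.save, store those byte offsets next to the data, and later seek straight to any recorded position. The index filename, some_iterable, and the slice bounds are made up for illustration:

import numpy as np

# While writing: remember where each array starts
offsets = []
with open('hello', 'wb') as f:
    for ndarray in some_iterable:                # placeholder source of arrays
        offsets.append(f.tell())                 # byte position where this array begins
        np.save(f, ndarray)
np.save('hello-index.npy', np.array(offsets))    # keep the index alongside the data

# Later: read arrays 100..199 without re-reading the first 100
offsets = np.load('hello-index.npy')
with open('hello', 'rb') as f:
    f.seek(offsets[100])                         # jump straight to array number 100
    batch = [np.load(f) for _ in range(100)]     # the arrays are stored back to back
chunk = np.concatenate(batch)

The one-time cost is writing (or rewriting) the file with the index; after that, any contiguous range of arrays can be loaded without scanning from the start.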
