
I will have many NumPy arrays stored in .npz files, which are being saved with the savez_compressed function.

I am splitting the information across many arrays because otherwise the functions I am using crash due to memory issues. The data is not sparse.

I will need to join all that information into one single array (to be able to process it with some routines) and store it on disk (to process it many times with different parameters).

Arrays won't fit into RAM+swap memory.

How can I merge them into a single array and save it to disk?

I suspect that I should use mmap_mode, but I do not see exactly how. Also, I imagine there could be performance issues if I do not reserve contiguous disk space at first.

I have read this post but I still cannot figure out how to do it.


EDIT

Clarification: I have written many functions to process similar data, and some of them require an array as an argument. In some cases I could pass them only part of this large array by slicing, but it is still important to have all the information in such an array.

This is because of the following: the arrays contain time-ordered information (from physical simulations). Among the arguments of the functions, the user can set the initial and final time to process. The user can also set the size of the processing chunk (which is important because it affects performance, but the allowed chunk size depends on the computational resources). Because of this, I cannot store the data as separate chunks.

The way in which this particular array (the one I am trying to create) is built is not important as long as it works.

  • You can't mmap compressed arrays. I think the current np.load implementation just ignores mmap_mode if you try. Commented Jun 7, 2018 at 17:09
  • Thank you for the information. Commented Jun 7, 2018 at 17:15
  • Must you merge them into a single array, or can you just load them chunk by chunk, process them chunk by chunk, and write them out chunk by chunk? Commented Jun 7, 2018 at 17:16
  • @user1420303: I see. You can still use chunked data by looking at the time range, finding the corresponding chunks, loading those, and slicing the first and last chunks if necessary. It's a little more logic, but it prevents you from running out of memory. You could even abstract it into some kind of streaming collection class allowing transparent array indexing and hiding that logic (a sketch of this idea follows these comments). Commented Jun 7, 2018 at 17:36
  • @user1420303: I was unaware of that feature. That seems reasonable, yes. I'm not entirely sure how it can be done, however. Commented Jun 7, 2018 at 17:52
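
A minimal sketch of the chunk-lookup idea from the comment above (file names, array keys and time intervals are hypothetical; it assumes each .npz stores its time stamps next to the data, with time along axis 0):

import numpy as np

# Hypothetical index mapping each file to the time interval it covers;
# in practice this could be kept in a small metadata file.
chunk_index = [('part0.npz', 0.0, 10.0),
               ('part1.npz', 10.0, 20.0),
               ('part2.npz', 20.0, 30.0)]

def iter_time_range(t0, t1):
    """Yield only the pieces of the stored arrays that fall in [t0, t1)."""
    for fname, t_start, t_end in chunk_index:
        if t_end <= t0 or t_start >= t1:
            continue  # chunk lies completely outside the requested range
        with np.load(fname) as data:
            times = data['time']    # assumed 1-D array of time stamps
            values = data['array']  # assumed data array, time along axis 0
            mask = (times >= t0) & (times < t1)
            yield values[mask]

# The processing routine then sees one manageable piece at a time:
for piece in iter_time_range(5.0, 25.0):
    ...  # process piece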

2 Answers


You should be able to load the files chunk by chunk into a np.memmap array:

import numpy as np

data_files = ['file1.npz', 'file2.npz', ...]

# If you do not know the final size beforehand you need to
# go through the chunks once first to check their sizes
rows = 0
cols = None
dtype = None
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        rows += chunk.shape[0]
        cols = chunk.shape[1]
        dtype = chunk.dtype

# Once the size is known, create the memmap and write the chunks
merged = np.memmap('merged.buffer', dtype=dtype, mode='w+', shape=(rows, cols))
idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)

However, as pointed out in the comments, working along a dimension that is not the fastest-varying one will be very slow.
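
If the merged file should also be reusable later without loading it fully into RAM, a small variation (a sketch, assuming the same data_files list and the rows, cols and dtype computed above, with a single 2-D array stored under the key 'array' in each file) is to write into a proper .npy file with np.lib.format.open_memmap. That preallocates the whole file and writes the npy header, so the result can be reopened with np.load(..., mmap_mode='r'):

import numpy as np

# rows, cols, dtype and data_files as computed in the code above.
# open_memmap preallocates a .npy file of the final size and returns
# a writable memory-mapped array backed by that file.
merged = np.lib.format.open_memmap('merged.npy', mode='w+',
                                   dtype=dtype, shape=(rows, cols))

idx = 0
for data_file in data_files:
    with np.load(data_file) as data:
        chunk = data['array']
        merged[idx:idx + len(chunk)] = chunk
        idx += len(chunk)

merged.flush()  # make sure everything is written out to disk

# Later runs can map the merged array without loading it into memory:
merged = np.load('merged.npy', mmap_mode='r')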


4 Comments

Thank you for your answer. It gives me some ideas. I do not get how the code loads the multiple preexisting npz files.
@user1420303 With data.iteritems you go through the arrays in the file, and with sorted(data.keys()) you go through the array names (I'm assuming they should be sorted alphabetically, but it could be something else).
Right. As I understand the code, it reads 'one' npz file with multiple arrays inside and merges them. I need to read 'many' npz files, each one containing one array, and merge them.
@user1420303 Ah, I see, okay, I was not understanding correctly. I have changed it now.
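
For completeness, iterating over every array stored inside a single .npz file, as the earlier revision of this answer did, looks roughly like this (the file name is hypothetical):

import numpy as np

with np.load('file1.npz') as data:
    # data behaves like a dictionary of arrays; keys() gives the array
    # names, sorted here so the chunks are written in a defined order
    for name in sorted(data.keys()):
        chunk = data[name]
        # copy chunk into the memmap as in the answer above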

This is an example of how to write 90 GB of easily compressible data to disk. The most important points are mentioned here: https://stackoverflow.com/a/48405220/4045774

The write/read speed should be in the range of 300-500 MB/s on a normal HDD.

Example

import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c
import time

def read_the_arrays():
  #Easily compressible data
  #A lot smaller than your actual array; I do not have that much RAM
  return np.arange(10*int(15E3)).reshape(10,int(15E3))

def writing(hdf5_path):
  # As we are writing whole chunks here this isn't really needed, but
  # if you forget to set a large enough chunk cache size when not writing or reading
  # whole chunks, the performance will be extremely bad (chunks can only be read or written as a whole)
  f = h5c.File(hdf5_path, 'w',chunk_cache_mem_size=1024**2*1000) #1000 MB cache size
  dset = f.create_dataset("your_data", shape=(int(15E5),int(15E3)),dtype=np.float32,chunks=(10000,100),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)

  #Lets write to the dataset
  for i in range(0,int(15E5),10):
    dset[i:i+10,:]=read_the_arrays()

  f.close()

def reading(hdf5_path):
  f = h5c.File(hdf5_path, 'r',chunk_cache_mem_size=1024**2*1000) #1000 MB cache size
  dset = f["your_data"]

  #Read chunks
  for i in range(0,int(15E3),10):
    data=np.copy(dset[:,i:i+10])
  f.close()

hdf5_path='Test.h5'
t1=time.time()
writing(hdf5_path)
print(time.time()-t1)
t1=time.time()
reading(hdf5_path)
print(time.time()-t1)

4 Comments

Thank you. The write speed is just fine. I need to think about the code; I am not familiar with HDF5. Q: You do 'dset = f.create_dataset' and then 'dset[i:i+10,:]=read_the_arrays()' many times. The whole array is never in RAM, right?
Yes, the read_the_arrays() function should simply imitate the reading process from your npz files. So the maximum RAM usage should be the size of one input array plus the chunk cache size, which I have set to 1000 MB. This can also be lower, but if you have too little cache, the performance will decrease drastically.
Nice, that makes it simple for me to solve another problem (some time values are repeated in the arrays, that is, there is a little overlap). Do you think that the final .h5 file can simply be converted to .npz?
If repeated values occur within one chunk, they will be handled well by the compression algorithm. Converting an HDF5 file that is too big to fit in memory into a compressed NumPy file is possible, but not so straightforward (it means writing the data in chunks to a zip file).
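
To illustrate that last point, a compressed .npz can in principle be filled chunk by chunk using the standard zipfile module together with np.lib.format, so the full array never has to be in memory. This is only a rough sketch under the assumption that the dataset has the name and layout used in the answer above; the function name and chunk size are hypothetical:

import zipfile
import numpy as np
import h5py as h5

def hdf5_to_npz(hdf5_path, npz_path, rows_per_chunk=10):
    with h5.File(hdf5_path, 'r') as f, \
         zipfile.ZipFile(npz_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        dset = f['your_data']
        # Stream the data into one .npy member of the zip archive;
        # force_zip64 is required because the uncompressed member is huge.
        with zf.open('your_data.npy', 'w', force_zip64=True) as out:
            header = {'descr': np.lib.format.dtype_to_descr(dset.dtype),
                      'fortran_order': False,
                      'shape': dset.shape}
            np.lib.format.write_array_header_1_0(out, header)
            for i in range(0, dset.shape[0], rows_per_chunk):
                out.write(np.ascontiguousarray(dset[i:i + rows_per_chunk, :]).tobytes())

np.load on the resulting file would then expose the array under the key 'your_data', just like a file written by savez_compressed.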
