
I have two huge NumPy arrays, each of shape (7, 960000, 200). I want to concatenate them with np.concatenate((arr1, arr2), axis=1) so that the final shape is (7, 1920000, 200). The problem is that they already fill up my RAM, and there is not enough room left to do the concatenation, so the process is killed. The same happens with np.stack. So I thought of making a new array that points to the two arrays in order and has the same effect as combining them; the result should be contiguous as well.

So, how can I do this? And is there a better way to combine them than the idea I suggested?

  • Does this answer your question? Concatenate Numpy arrays without copying Commented May 24, 2022 at 17:38
  • This isn't really possible. Arrays are stored in single contiguous blocks of memory, and you would have to define a whole new class if you wanted to perform operations on a list of two arrays (and that would defeat the purpose of an array, which is to be indexed very efficiently). As in the question linked in the other comment, preallocating is the best solution if possible. Commented May 24, 2022 at 17:38
  • np.stack uses concatenate; it just tweaks the dimensions. Same for the other stacks. An array that 'points' to other arrays must be object dtype, and is essentially a list. They won't be contiguous. Commented May 24, 2022 at 17:39
  • You mention one solution. The only alternative is to play with virtual memory, for example by memory-mapping the array to a storage device so that the array no longer has to fit in RAM. Note that this can be much slower than working in RAM, especially for non-contiguous accesses or on HDDs. Commented May 24, 2022 at 18:35
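
The preallocation suggested in the comments can be sketched as follows. This assumes the data can be loaded or computed one half at a time instead of already sitting in two full arrays; load_part is a hypothetical stand-in for whatever produces each half, and the small shape is only for illustration.

```python
import numpy as np

def load_part(k, shape):
    # Hypothetical loader: stands in for reading or computing one half.
    return np.full(shape, float(k))

part_shape = (7, 6, 4)  # small stand-in for (7, 960000, 200)
n = part_shape[1]

# Allocate the final array once, then fill each half in place along axis 1.
final = np.empty((part_shape[0], 2 * n, part_shape[2]))
for k in range(2):
    final[:, k * n:(k + 1) * n, :] = load_part(k, part_shape)

print(final.shape)  # (7, 12, 4)
```

With this approach the peak memory is the final array plus one half, rather than both halves plus the final array.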

1 Answer


NumPy's numpy.memmap() creates a memory-mapped array backed by a binary file on disk, which can be accessed as if it were a single in-memory array. This solution saves the individual arrays you are working with as separate .npy files and then combines them into a single binary file.

import numpy as np
import os

size = (7,960000,200)

# We assume arrays a and b share the same shape. If they do not,
# see https://stackoverflow.com/questions/50746704/how-to-merge-very-large-numpy-arrays
# for an explanation of how to create the new shape

a = np.ones(size) # uses ~16 GB RAM
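a = np.transpose(a, (1,0,2)) # move the concatenation axis to the front
# a.shape is a tuple and cannot be modified in place, so build the merged shape explicitly
shape = (a.shape[0] * 2,) + a.shape[1:]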
dtype = a.dtype

np.save('a.npy', a)
a = None # allows for data to be deallocated by garbage collector

b = np.ones(size) # uses ~16 GB RAM
b = np.transpose(b, (1,0,2))
np.save('b.npy', b) # save b (a is already None at this point)
b = None

# Once the size is known, create the memmap and write the chunks
data_files = ['a.npy', 'b.npy']
merged = np.memmap('merged.dat', dtype=dtype, mode='w+', shape=shape)
i = 0
for file in data_files:
    # mmap_mode='r' reads the .npy file lazily instead of loading it all into RAM
    chunk = np.load(file, mmap_mode='r')
    merged[i:i+len(chunk)] = chunk
    i += len(chunk)

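merged.flush() # make sure all writes reach the disk
merged = np.transpose(merged, (1,0,2)) # view with the original axis order, i.e. (7, 1920000, 200)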

# Delete temporary numpy .npy files
os.remove('a.npy')
os.remove('b.npy')
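
If both arrays are already in memory, a variant without the transposes and temporary .npy files is to create the memmap with the final shape and assign each array to its own slice along axis 1. The following is a minimal sketch under that assumption; the small shape, the temporary directory and the file name are placeholders for illustration, and arr1 and arr2 stand in for the real arrays.

```python
import os
import tempfile
import numpy as np

part_shape = (7, 6, 4)  # small stand-in for (7, 960000, 200)
arr1 = np.zeros(part_shape)
arr2 = np.ones(part_shape)

n = arr1.shape[1]
final_shape = (arr1.shape[0], 2 * n, arr1.shape[2])

# Hypothetical location for the backing file.
path = os.path.join(tempfile.mkdtemp(), 'merged.dat')
merged = np.memmap(path, dtype=arr1.dtype, mode='w+', shape=final_shape)

# Slice along axis 1 (not axis 0) so each source fills its own half.
merged[:, :n, :] = arr1
merged[:, n:, :] = arr2
merged.flush()  # push pending writes to disk

print(merged.shape)  # (7, 12, 4)
```

Slicing along axis 0 of a memmap that has the final shape is what produces "could not broadcast" errors, because the slice then covers the wrong dimension. Once the data is written, arr1 and arr2 can be deleted to free RAM.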

6 Comments

Thanks for your reply. I like the idea. I tried it but I have some questions/problems with it. For the questions: 1) Arrays a and b are the arrays I want to store, right? If so, I don't need to create new arrays or give them a size, since my arrays are already there and have my values? Do I get that right? 2) np.memmap initializes the memory map, but it is empty until the loop fills it with the values from the .npy files, right?
Yeah, just use arr1 instead of a and arr2 instead of b. Also yeah, np.memmap() isn't usable until you fill it with the contents of arr1 and arr2, but afterwards it's basically a massive array with the contents of arr1 and arr2 inside of it
I just created two 16 GB arrays a and b to test it out on my system, you wouldn't use them or set the size yourself since you already have your arrays
For the problems: Both arrays have the shape of (7,960000,200), and it was mentioned that the shape of the memmap should be the final shape. The shape required is (7,1920000,200) which is the shape you would get when concatenating them on axis=1. But you have put the shape of 1 of them, not the final. Regardless of which shape I choose, I get errors. If I choose the shape of one of them, I get "could not broadcast input array from shape (7,960000,200) into shape (0,960000,200)", and when using final shape "could not broadcast input array from shape (7,960000,200) into shape (7,1920000,200)"
And two final questions. Will it combine both arrays as if I had concatenated them? And is the merged variable now the one I use when manipulating the array? Can I use any np operations (using predefined methods) on it as if it were loaded into memory?
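
On the last two questions: np.memmap is a subclass of ndarray, so NumPy functions generally accept it, but any operation that produces a full-size result still allocates that result in RAM. A hedged sketch of reopening the file read-only and reducing it in chunks is shown below; the small shape, the chunk size and the file location are assumptions for illustration, and the file is created in the snippet only so that it runs on its own.

```python
import os
import tempfile
import numpy as np

shape = (7, 12, 4)  # small stand-in for (7, 1920000, 200)
path = os.path.join(tempfile.mkdtemp(), 'merged.dat')  # hypothetical location

# Create a small file to read back; in practice it already exists.
out = np.memmap(path, dtype=np.float64, mode='w+', shape=shape)
out[:] = np.arange(np.prod(shape), dtype=np.float64).reshape(shape)
out.flush()
del out

# Reopen read-only. dtype and shape must be passed again because
# the raw binary file stores neither of them.
merged = np.memmap(path, dtype=np.float64, mode='r', shape=shape)

# Reduce in chunks along axis 1 so only one chunk is read at a time.
chunk = 5
total = 0.0
for start in range(0, shape[1], chunk):
    total += merged[:, start:start + chunk, :].sum()

print(total)
```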