
I have a huge dataset on which I wish to perform PCA. I am limited by RAM and the computational efficiency of PCA, so I switched to using IncrementalPCA.

Dataset size: (140000, 3504)

The documentation states: "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory."

This sounds really good, but I am unsure how to take advantage of it.

I tried loading one memmap, hoping IncrementalPCA would access it in chunks, but my RAM blew up. My code below ends up using a lot of RAM:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)

When I say "my RAM blew", the Traceback I see is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

How can I improve this without compromising accuracy by reducing the batch size?


My ideas for diagnosis:

I looked at the sklearn source code, and in the fit() function (source code) I can see the following. This makes sense to me, but I am still unsure what is wrong in my case.

for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch])
return self

Edit: Worst case, I will have to write my own code for iterative PCA that batch-processes by reading and closing .npy files, but that would defeat the purpose of the facility that is already there.

Edit 2: It would help a lot if I could somehow delete a batch of the memmap file once it has been processed.

Edit 3: If IncrementalPCA.fit() really processes in batches, it should not exhaust my RAM. I am posting the whole code to make sure I am not making a mistake in flushing the memmap completely to disk beforehand:

temp_train_data = X_train[1000:]
temp_labels = y[1000:]
# np.empty has no flush(); this has to be a memmap (filename illustrative)
out = np.memmap('out.mmap', dtype=np.int64, mode='w+', shape=(200001, 3504))
for index, row in enumerate(temp_train_data):
    actual_index = index + 1000
    data = X_train[actual_index - 1000:actual_index + 1].ravel()
    __, cd_i = pywt.dwt(data, 'haar')
    out[index] = cd_i
out.flush()
pca_obj = IncrementalPCA()
clf = pca_obj.fit(out)

Surprisingly, I note that out.flush() does not free my memory. Is there a way to use del out to free the memory completely, and then pass just a reference to the file to IncrementalPCA.fit()?
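What I have in mind is something like this (untested; the filename 'features.mmap' is made up, and I have shrunk the shape here for illustration):

```python
import numpy as np

shape = (1000, 3504)   # shrunk here; my real array is (200001, 3504)

out = np.memmap('features.mmap', dtype=np.int64, mode='w+', shape=shape)
# ... fill out[index] = cd_i row by row as in the loop above ...
out.flush()            # push dirty pages to disk
del out                # close the map and drop the Python reference

# Reopen read-only and pass this object straight to IncrementalPCA.fit():
out = np.memmap('features.mmap', dtype=np.int64, mode='r', shape=shape)
```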

  • I believe this answer will give you some hints Commented Aug 27, 2015 at 9:44
  • 1) Why don't you specify the desired number of features after the transformation? After transforming, you will probably get a dataset of the same size, but in RAM (because it was generated by transform), so it consumes a large amount of memory. Specify the n_components parameter in the constructor. 2) Even if you specify n_components (which must be less than or equal to the number of features in the dataset), it may still not fit into memory, because you are trying to compute the transformed dataset in one go. You may need to transform it in batches, calling the transform method on every batch and saving the transformed data to disk. Commented Aug 27, 2015 at 12:12
  • @Olologin Thanks for the comment! I tried just clf = IncrementalPCA().fit(X_train_mmap) and it crashes my RAM. I am keen to retain 98% of the variance. Commented Aug 27, 2015 at 16:19
  • @AbhishekBhatia, try to use IncrementalPCA(n_components=1).fit(X_train_mmap). Does it complete successfully? Commented Aug 27, 2015 at 17:19
  • You are missing some really crucial information in the question, making it hard for @Olologin to help. Firstly, you say your RAM "blew". You should include the full traceback. It would also be helpful to say that the MemoryError occurs instantly on the call to fit, not after some heavy processing. I know you are trying to help by saving space and showing you have thought about where the problem might be, but always include the traceback. Can you check that the traceback I have edited into your question matches what you see? Commented Aug 28, 2015 at 11:23

2 Answers


You have hit a problem with sklearn in a 32-bit environment. I presume you are using np.float16 because you are in a 32-bit environment and you need it to be able to create the memmap object without numpy throwing errors.

In a 64-bit environment (tested with Python 3.3 64-bit on Windows), your code just works out of the box. So, if you have a 64-bit computer available, install 64-bit Python along with 64-bit numpy, scipy, and scikit-learn, and you are good to go.

Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on github here, but it is not easy to patch. The fundamental problem is that, within the library, if your dtype is float16, a copy of the whole array into memory is triggered. The details of this are below.

So, I hope you have access to a 64-bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch-process it, a rather larger task...
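For what it's worth, that manual fallback would look roughly like this (an untested sketch: the shapes are shrunk here for illustration, and the chunk size and n_components are arbitrary choices of mine; substitute (140000, 3504) and your own values):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Shrunken stand-ins for the real (140000, 3504) array.
n_samples, n_features, chunk = 2000, 50, 500

# mode='w+' creates the file; use mode='r' to reopen existing data.
X = np.memmap('my_array.mmap', dtype=np.float16, mode='w+',
              shape=(n_samples, n_features))
X[:] = np.random.rand(n_samples, n_features)

ipca = IncrementalPCA(n_components=10)
for start in range(0, n_samples, chunk):
    # Cast each small chunk to float64 ourselves, so only
    # chunk * n_features * 8 bytes are ever resident at once.
    ipca.partial_fit(np.asarray(X[start:start + chunk], dtype=np.float64))

print(ipca.components_.shape)  # (10, 50)
```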

N.B. It's really good to see you going to the source to diagnose your problem :) However, if you look at the line where the code fails (from the traceback), you will see that the for batch in gen_batches loop that you found is never reached.


Detailed diagnosis:

The actual error generated by OP code:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)

is

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

The call to check_array (code link) uses dtype=np.float, but the original array has dtype=np.float16. Even though check_array() defaults to copy=False and passes this through to np.array(), the flag is ignored (as per the numpy docs) in order to satisfy the requirement for a different dtype; therefore np.array makes a copy.
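This copy-on-cast behaviour is easy to demonstrate with numpy alone (a small illustration, independent of sklearn):

```python
import numpy as np

# A float16 array standing in for the memmap'd training data.
a = np.zeros((4, 3), dtype=np.float16)

# Converting the dtype forces a fresh float64 allocation,
# no matter how the original array was backed.
b = np.asarray(a, dtype=np.float64)
print(np.shares_memory(a, b))   # False: a full copy was made

# With a matching dtype, no copy happens.
c = np.asarray(a, dtype=np.float16)
print(np.shares_memory(a, c))   # True: same buffer
```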

This could be solved in the IncrementalPCA code by ensuring that the dtype was preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it only pushed the MemoryError further along the chain of execution.
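The dtype-preserving idea amounts to something like the following (a numpy-only illustration of the intended logic; the helper name is mine, not the actual sklearn patch):

```python
import numpy as np

def preserve_float_dtype(X):
    # Keep an existing float dtype instead of forcing float64,
    # so check_array-style validation would not need to copy X.
    if X.dtype in (np.float16, np.float32, np.float64):
        return X.dtype
    return np.dtype(np.float64)

a = np.zeros((4, 3), dtype=np.float16)
print(preserve_float_dtype(a))                          # float16: no copy needed
print(preserve_float_dtype(a.astype(np.int64)))         # float64: cast required
```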

The same copying problem occurs when the code calls linalg.svd() from the main scipy code, and this time the error occurs during a call to gesdd(), a wrapped native LAPACK function. Thus, I do not think there is a way to patch this (at least not an easy way; at minimum it means altering code in core scipy).


3 Comments

This is a good remark. It's probably better to memory-map data stored with dtype=np.float64 to begin with in that case. Alternatively, it's also possible to explicitly unroll the fitting by calling partial_fit manually with explicitly loaded data on the go and forget about memory mapping.
@ogrisel yes - you can start with dtype=np.float64, but then the OP (if in a 32 bit environment) can't create a memmap with that many members. At least I think that's the case, I'm struggling a little to understand the exact use case. I think you're right to suggest chunking up the data 'manually' and passing it to partial_fit.
@AbhishekBhatia Does this answer the question? If so, I'd appreciate an accept. I know it probably doesn't enable you to do what you want, but it does explain in detail why it fails. See the linked github issue for any progress. Note that the patches I suggest there will allow you to use fit(), but not fit_transform().

Does the following alone trigger the crash?

X_train_mmap = np.memmap('my_array.mmap', dtype=np.float16,
                         mode='w+', shape=(n_samples, n_features))
clf = IncrementalPCA(n_components=50).fit(X_train_mmap)

If not then you can use that model to transform (project your data iteratively) to a smaller data using batches:

from sklearn.utils import gen_batches

X_projected_mmap = np.memmap('my_result_array.mmap', dtype=np.float16,
                             mode='w+', shape=(n_samples, clf.n_components))
for batch in gen_batches(n_samples, clf.batch_size_):
    X_batch_projected = clf.transform(X_train_mmap[batch])
    X_projected_mmap[batch] = X_batch_projected

I have not tested this code, but I hope you get the idea.

3 Comments

Nice idea! This tends to work better, but why specify the number of components? How does that affect the memory?
Is there a way to completely flush a memmap from memory in Python and somehow just keep a reference to the file? I notice memmap_object.flush() and del memmap_object have different effects. I want to pass just that reference to IncrementalPCA if possible.
Unfortunately, even with the smallest possible (probably ridiculous) n_components=1, the call to fit in IncrementalPCA will likely fail when the data array is this size, even when using a memmap. See my answer below for why (maybe the OP can verify that setting n_components=1 fails; it fails on my machine given the same setup). Hopefully the OP has access to a 64-bit computer with plenty of RAM.
