
I am loading a CSV file via numpy.loadtxt into a numpy array. My data has about 1 million records and 87 columns. While object.nbytes is only 177159666 bytes, it actually takes much more memory: I get a MemoryError while training a decision tree with scikit-learn, and after reading the data the available memory on my system drops by 1.8 GB. I am working on a Linux machine with 3 GB of memory. So does object.nbytes return the real memory usage of a numpy array?

import numpy as np
train = np.loadtxt('~/Py_train.csv', delimiter=',', skiprows=1, dtype='float16')
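
A quick sanity check with the figures above shows that nbytes reports only the raw data buffer, i.e. rows x columns x itemsize, and nothing more:

import numpy as np

# 177159666 bytes / (87 columns * 2 bytes per float16) = 1,018,159 rows,
# which matches the "about 1 million records" mentioned above.
rows, cols = 1_018_159, 87                  # figures taken from the question
itemsize = np.dtype('float16').itemsize     # 2 bytes
print(rows * cols * itemsize)               # 177159666, the same as train.nbytes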
  • So, is there a question that you have? Commented Aug 2, 2012 at 15:01
  • Here's a related question: stackoverflow.com/questions/11527964/… . Basically, np.loadtxt takes up LOTS of memory because it first stores the data in lists and then converts those to an ndarray (increasing memory usage by a factor of 3 or 4 at least). If you know the size, you might want to consider pre-allocating the array and parsing it yourself (a sketch of that approach follows below). Also, don't be afraid to look at the source for np.loadtxt. It's reasonably comprehensible. Commented Aug 2, 2012 at 15:03
  • @Marcin, just updated my question. Commented Aug 2, 2012 at 15:05
  • Thanks, @mgilson. Now I can understand the large peak memory usage. Do you find the nbytes attribute for ndarray accurate for estimating its memory usage? Commented Aug 2, 2012 at 15:20
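
Here is a minimal sketch of the pre-allocation approach suggested in the comments, assuming the row and column counts are known in advance (the helper name and file path are illustrative, not from the original post):

import csv
import numpy as np

def load_csv_preallocated(path, n_rows, n_cols, dtype='float16'):
    # Parse straight into a pre-allocated array, avoiding np.loadtxt's
    # intermediate Python lists and their 3-4x memory overhead.
    out = np.empty((n_rows, n_cols), dtype=dtype)
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)                                   # skip the header row
        for i, row in enumerate(reader):
            out[i] = np.asarray(row, dtype=out.dtype)  # parse the string fields
    return out

# train = load_csv_preallocated('Py_train.csv', n_rows=1_018_159, n_cols=87)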

3 Answers


I had a similar problem when trying to create a large 400,000 x 100,000 matrix. Fitting all of that data into an ndarray is impossible.

However, the big insight was that most of the values in the matrix are empty, so the data can be represented as a sparse matrix. Sparse matrices are useful because they store only the non-empty entries and therefore need far less memory. I used scipy.sparse's sparse matrix implementation, and I was able to fit this large matrix in memory.

Here is my implementation:

https://github.com/paolodm/Kaggle/blob/master/mdschallenge/buildmatrix.py
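
For context, here is a minimal sketch of the general scipy.sparse pattern (not the linked script itself): build the matrix in a write-friendly format, then convert it to CSR for computation.

import numpy as np
from scipy import sparse

rows, cols = 400_000, 100_000
mat = sparse.lil_matrix((rows, cols), dtype=np.float32)   # supports cheap incremental writes

# ... fill in only the non-zero entries, e.g. while parsing the input file:
mat[0, 5] = 1.0
mat[1, 42] = 3.5

mat = mat.tocsr()   # efficient arithmetic, and usable as scikit-learn input
print(mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes)  # bytes actually stored

Only the non-zero values and their indices are kept, which is why a mostly-empty 400,000 x 100,000 matrix fits comfortably in memory.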




You can probably get better performance by using numpy.fromiter:

In [30]: numpy.fromiter((tuple(row) for row in csv.reader(open('/tmp/data.csv'))), dtype='i4,i4,i4')
Out[30]: 
array([(1, 2, 3), (4, 5, 6)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

where

$ cat /tmp/data.csv 
1,2,3
4,5,6

Alternatively, I strongly suggest you use pandas: it is built on top of numpy and has many utility functions for statistical analysis.
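
As a minimal sketch of the pandas route (assuming a float32 downcast is acceptable; the file name matches the question):

import numpy as np
import pandas as pd

# read_csv uses a C parser and writes straight into typed columns, so peak
# memory typically stays well below np.loadtxt's list-based approach.
df = pd.read_csv('Py_train.csv', dtype=np.float32)
train = df.to_numpy()        # plain ndarray, ready for scikit-learn
print(train.nbytes)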



I just had the same problem:

My saved .npy file is 752 MB on disk, and arr.nbytes = 701289568 (~669 MB); but np.load takes 2.7 GB of memory, i.e. about 4x the actual data size.

https://github.com/numpy/numpy/issues/17461

and it turns out:

the data array contains a mix of strings (a small number of them) and numbers (the vast majority), so it is stored with dtype=object, i.e. as an array of 8-byte pointers.

Each of those 8-byte slots points to a Python object, and that object takes at least 24 bytes plus the space for the number or the string itself.

So in memory each value costs (8-byte pointer + 24+ byte object) ~= 4x the mostly 8-byte doubles stored in the file.

NOTE: np.save() and np.load() are not symmetric for such arrays:

-- np.save() writes the numeric values as scalar data, so the file size on disk is consistent with the data size the user has in mind, and is small;

-- np.load() re-creates every value as a PyObject, inflating memory usage to roughly 4x what the user expects.

The same applies to other file formats, e.g. CSV files.

Conclusion: do not mix types (strings stored as np.object alongside numbers) in a numpy array. Use a homogeneous numeric dtype, e.g. np.double; then the array in memory takes about the same space as the dumped file on disk.
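
A minimal sketch illustrating the point: nbytes on an object array counts only the 8-byte pointers, not the boxed Python objects they refer to (exact per-object sizes vary by platform and Python version):

import sys
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)
boxed = values.astype(object)            # every element becomes a Python float

print(values.nbytes)                     # 8000000 bytes of raw data
print(boxed.nbytes)                      # still 8000000 -- only the pointers are counted
print(sys.getsizeof(boxed[0]))           # ~24 bytes per boxed float, invisible to nbytes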

