
The file contains 2,000,000 rows; each row contains 208 comma-separated columns, like this:

0.0863314058048,0.0208767447842,0.03358010485,0.0,1.0,0.0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0

The program reads this file into a NumPy ndarray, so I expected it to consume about 2,000,000 * 208 * 8 B = 3.2 GB of memory. However, while reading the file, the program actually consumes about 20 GB.
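A quick check of that arithmetic using the float64 item size (a minimal sketch):

import numpy as np

# rows * columns * bytes per float64
expected = 2000000 * 208 * np.dtype(np.float64).itemsize
print(expected / 1e9)  # ~3.3e9 bytes, i.e. roughly the 3.2 GB estimated above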

I am confused: why does my program consume so much more memory than expected?

  • Can you show the exact line of code that reads the data from the file? It is hard to answer if we have to guess. Commented Oct 26, 2014 at 6:04
  • @BasSwinckels thank you, I use np.loadtxt() to read the data. Saullo Castro has pointed out the problem and explained it roughly. Commented Oct 26, 2014 at 9:43

2 Answers


I'm using NumPy 1.9.0, and the memory inefficiency of np.loadtxt() and np.genfromtxt() seems to be directly related to the fact that they store the data in temporary Python lists (a rough illustration of that overhead follows the links below):

  • see here for np.loadtxt()
  • and here for np.genfromtxt()
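For a sense of scale, here is a minimal sketch of the per-row overhead (the numbers are approximate and depend on the Python build); each parsed value becomes a full Python float object in a list before it ever reaches the array:

import sys
import numpy as np

row = [i * 0.1 for i in range(208)]           # one parsed row as distinct Python floats
as_array = np.array(row, dtype=np.float64)    # the same row as a float64 array

# A Python float is a full object (~24 bytes) plus an 8-byte list slot,
# versus a flat 8 bytes per value inside the ndarray.
list_size = sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)
print(list_size)         # roughly 6-7 KB for one row
print(as_array.nbytes)   # 1664 bytes (208 * 8)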

By knowing the shape of your array beforehand, you can write a file reader that consumes an amount of memory very close to the theoretical amount (3.2 GB in this case), by storing the data directly with the corresponding dtype:

import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    with open(path) as f:
        nrows = sum(1 for line in f)                  # first pass: count the rows
        f.seek(0)
        ncols = len(next(f).split(delimiter))         # peek at the first row for the column count
        out = np.empty((nrows, ncols), dtype=dtype)   # allocate the full array up front
        f.seek(0)
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)            # numpy converts the strings to dtype
    return out
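Called on the file from the question, this should stay close to the theoretical footprint (the file name is just a placeholder):

data = read_large_txt('data.txt', delimiter=',', dtype=np.float64)
print(data.shape)         # (2000000, 208)
print(data.nbytes / 1e9)  # ~3.3 GB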

9 Comments

Having seen the sample row, there could be a vast memory saving if a sparse matrix were used instead, couldn't there? (see the sketch after these comments)
@user3666197 Surely yes, but that would require a more complex reader function...
Sure. The OP's issue seems to be memory-bound, so this was a suggestion to trade a potentially blocking memory-bound problem for some extra CPU-bound effort, which would make both the input itself and the further processing feasible on even larger datasets. My gut sense is that the OP is not looking for a one-liner or a few SLOCs, but for a feasible approach to reading and processing similar batches of data with numpy comfort, and would therefore pay the cost of a somewhat smarter input pre-processor.
@user3666197 I've tested it here, and the problem with np.loadtxt() and np.genfromtxt() is that, not knowing the shape, they are forced to use temporary lists and list.append() (see here and here).
That was not in question, Saullo; your answer addresses the input-processor issue. Excuse my remark, it only touched on a more efficient matrix representation for the dataset.
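For reference, a minimal sketch of the sparse-matrix idea from these comments (not the answer's approach; it assumes scipy is available and that the rows really are mostly zeros, as in the sample):

import numpy as np
from scipy import sparse

def read_sparse_txt(path, delimiter=',', dtype=np.float64, chunk=100000):
    # Parse the file in dense chunks, convert each chunk to CSR and stack them,
    # so that only the non-zero entries are kept in memory at the end.
    blocks = []
    rows = []
    with open(path) as f:
        for line in f:
            rows.append(np.array(line.split(delimiter), dtype=dtype))
            if len(rows) == chunk:
                blocks.append(sparse.csr_matrix(np.vstack(rows)))
                rows = []
    if rows:
        blocks.append(sparse.csr_matrix(np.vstack(rows)))
    return sparse.vstack(blocks, format='csr')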

I think you should try pandas to handle big data (text files). pandas is like an Excel in Python, and it internally uses NumPy to represent the data.

HDF5 is another way to store big data, as an HDF5 binary file.

This question gives some ideas about how to handle big files: "Large data" work flows using pandas
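A minimal sketch of what that could look like for the file in the question (the file names are placeholders, and to_hdf needs the PyTables package):

import numpy as np
import pandas as pd

# pandas' C parser is typically more memory-frugal than np.loadtxt;
# giving it an explicit dtype avoids type inference.
df = pd.read_csv('data.txt', header=None, dtype=np.float64)
data = df.values                # a (2000000, 208) float64 ndarray

# Optionally store it as HDF5 for faster reloads later.
df.to_hdf('data.h5', key='data', mode='w')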

1 Comment

I have not used pandas; thank you for your advice, I will learn it.
