
The file contains 2,000,000 rows; each row contains 208 comma-separated columns, like this:

0.0863314058048,0.0208767447842,0.03358010485,0.0,1.0,0.0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0

The program reads this file into a NumPy ndarray, so I expected it to consume about 2,000,000 * 208 * 8 B = 3.2 GB of memory. However, while reading the file, the program actually consumes about 20 GB.
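A quick check of that arithmetic using the float64 item size (a minimal sketch):

import numpy as np

# rows * columns * bytes per float64
expected = 2000000 * 208 * np.dtype(np.float64).itemsize
print(expected / 1e9)  # ~3.3e9 bytes, i.e. roughly the 3.2 GB estimated above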

I am confused: why does my program consume so much more memory than expected?

  • Can you show the exact line of code that reads the data from the file? It is hard to answer if we have to guess. Commented Oct 26, 2014 at 6:04
  • @BasSwinckels thank you, I use np.loadtxt() to read the data. Saullo Castro has pointed out the problem and explained it roughly. Commented Oct 26, 2014 at 9:43

2 Answers


I'm using NumPy 1.9.0, and the memory inefficiency of np.loadtxt() and np.genfromtxt() seems to be directly related to the fact that they store the data in temporary Python lists (a rough illustration of that overhead follows the links below):

  • see here for np.loadtxt()
  • and here for np.genfromtxt()
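For a sense of scale, here is a minimal sketch of the per-row overhead (the numbers are approximate and depend on the Python build); each parsed value becomes a full Python float object in a list before it ever reaches the array:

import sys
import numpy as np

row = [i * 0.1 for i in range(208)]           # one parsed row as distinct Python floats
as_array = np.array(row, dtype=np.float64)    # the same row as a float64 array

# A Python float is a full object (~24 bytes) plus an 8-byte list slot,
# versus a flat 8 bytes per value inside the ndarray.
list_size = sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)
print(list_size)         # roughly 6-7 KB for one row
print(as_array.nbytes)   # 1664 bytes (208 * 8)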

By knowing the shape of your array beforehand, you can write a file reader that consumes an amount of memory very close to the theoretical amount (3.2 GB in this case), by storing the data directly with the corresponding dtype:

import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    with open(path) as f:
        nrows = sum(1 for line in f)                  # first pass: count the rows
        f.seek(0)
        ncols = len(next(f).split(delimiter))         # peek at the first row for the column count
        out = np.empty((nrows, ncols), dtype=dtype)   # allocate the full array up front
        f.seek(0)
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)            # numpy converts the strings to dtype
    return out
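Called on the file from the question, this should stay close to the theoretical footprint (the file name is just a placeholder):

data = read_large_txt('data.txt', delimiter=',', dtype=np.float64)
print(data.shape)         # (2000000, 208)
print(data.nbytes / 1e9)  # ~3.3 GB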

9 Comments

Having seen the sample row, there could be a vast memory saving if a sparse matrix were used instead, couldn't there? (see the sketch after these comments)
@user3666197 Surely yes, but that would require a more complex reader function...
Sure. The OP's issue seems to be memory-bound, so this was a suggestion to trade a potentially blocking memory-bound problem for some extra CPU-bound effort, which would make both the input itself and the further processing feasible on even larger datasets. My gut sense is that the OP is not looking for a one-liner or a few SLOCs, but for a feasible approach to reading and processing similar batches of data with numpy comfort, and would therefore pay the cost of a somewhat smarter input pre-processor.
@user3666197 I've tested it here, and the problem with np.loadtxt() and np.genfromtxt() is that, not knowing the shape, they are forced to use temporary lists and list.append() (see here and here).
That was not in question, Saullo; your answer addresses the input-processor issue. Excuse my remark, it only touched on a more efficient matrix representation for the dataset.
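For reference, a minimal sketch of the sparse-matrix idea from these comments (not the answer's approach; it assumes scipy is available and that the rows really are mostly zeros, as in the sample):

import numpy as np
from scipy import sparse

def read_sparse_txt(path, delimiter=',', dtype=np.float64, chunk=100000):
    # Parse the file in dense chunks, convert each chunk to CSR and stack them,
    # so that only the non-zero entries are kept in memory at the end.
    blocks = []
    rows = []
    with open(path) as f:
        for line in f:
            rows.append(np.array(line.split(delimiter), dtype=dtype))
            if len(rows) == chunk:
                blocks.append(sparse.csr_matrix(np.vstack(rows)))
                rows = []
    if rows:
        blocks.append(sparse.csr_matrix(np.vstack(rows)))
    return sparse.vstack(blocks, format='csr')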

I think you should try pandas to handle big data (text files). pandas is like an Excel in Python, and it internally uses NumPy to represent the data.

HDF5 is another way to store big data, as an HDF5 binary file.

This question gives some ideas about how to handle big files: "Large data" work flows using pandas
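A minimal sketch of what that could look like for the file in the question (the file names are placeholders, and to_hdf needs the PyTables package):

import numpy as np
import pandas as pd

# pandas' C parser is typically more memory-frugal than np.loadtxt;
# giving it an explicit dtype avoids type inference.
df = pd.read_csv('data.txt', header=None, dtype=np.float64)
data = df.values                # a (2000000, 208) float64 ndarray

# Optionally store it as HDF5 for faster reloads later.
df.to_hdf('data.h5', key='data', mode='w')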

1 Comment

I have not used pandas; thank you for your advice, I will learn it.
