
I have a 3-column file of about 28 GB. I would like to read it with Python and put its content into a list of 3D tuples. Here's the code I'm using:

f = open(filename)
col1 = [float(l.split()[0]) for l in f]
f.seek(0)
col2 = [float(l.split()[1]) for l in f]
f.seek(0)
col3 = [float(l.split()[2]) for l in f]
f.close()
rowFormat = [col1,col2,col3]
tupleFormat = zip(*rowFormat)
for ele in tupleFormat:
    pass    ### do something with ele

There's no 'break' command in the for loop, meaning that I actually read the whole content of the file. While the script is running, I notice from the 'htop' command that it takes 156 GB of virtual memory (VIRT column) and almost the same amount of resident memory (RES column). Why is my script using 156 GB when the file size is only 28 GB?

  • Even an int is an object with a header and takes up more space than you might expect. Maybe you can use numpy.loadtxt()? Commented Apr 6, 2016 at 19:52
  • Why do you read the file three times? Commented Apr 6, 2016 at 19:57
  • Why do you need it all in memory at the same time? Commented Apr 6, 2016 at 19:58
  • Also, floats seem to take 24 bytes of space per instance (check with sys.getsizeof(float(0))). Commented Apr 6, 2016 at 19:58
  • I did expect the process to use more memory than the actual size of my file, but I'm surprised that it uses about 6 times the size of the file! This is annoying because I don't think I have access to clusters with more than 200 GB of memory. Commented Apr 6, 2016 at 19:59

3 Answers


Python objects have a lot of overhead, e.g., a reference count and other bookkeeping. That means that a Python float is more than 8 bytes. On my 32-bit Python version, it is

>>> import sys
>>> print(sys.getsizeof(float(0)))
16

A list has its own overhead and then requires 4 bytes per element to store a reference to that object. So 100 floats in a list actually take up a size of

>>> a = map(float, range(100))
>>> sys.getsizeof(a) + sys.getsizeof(a[0])*len(a)
2036

Now, a numpy array is different. It has a little bit of overhead, but the raw data under the hood are stored as they would be in C.

>>> import numpy as np
>>> b = np.array(a)
>>> sys.getsizeof(b)
848
>>> b.itemsize    # number of bytes per element
8

So in a list, a Python float effectively requires 20 bytes (16 for the object plus 4 for the list's reference to it), compared to 8 for numpy. And 64-bit Python versions require even more.
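
For reference, the same checks on a typical 64-bit CPython build give larger numbers still (the figures below are what 64-bit builds commonly report and can vary slightly between versions):

>>> import sys
>>> sys.getsizeof(float(0))   # float object on a typical 64-bit build
24
>>> # each list slot additionally stores an 8-byte pointer to the float,
>>> # so every value kept in a list costs roughly 32 bytes before any
>>> # container overhead is counted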

So really, if you must load A LOT of data into memory, numpy is one way to go. Looking at the way you load the data, I assume it's in text format with 3 floats per row, split by an arbitrary number of spaces. In that case, you could simply use numpy.genfromtxt():

data = np.genfromtxt(fname, autostrip=True)

You could also look at more options here, e.g., mmap, but I don't know enough about it to say whether it'd be more appropriate for you.
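
For the mmap route, one option is numpy.memmap: if the text is converted once to a flat binary file of float64 values, the result can be addressed as an (N, 3) array that is paged in from disk on demand rather than loaded whole. A minimal sketch, assuming a hypothetical binary file 'data.bin' produced beforehand and a known row count:

import numpy as np

# one-off conversion (can also be done in chunks to limit peak memory):
# data = np.genfromtxt(fname, autostrip=True)
# data.tofile('data.bin')                # raw float64 values, row-major

n_rows = 10**6                            # hypothetical: number of rows in the file
mm = np.memmap('data.bin', dtype=np.float64, mode='r', shape=(n_rows, 3))
for row in mm:                            # rows are read from disk as needed
    pass                                  # do something with the (x, y, z) triple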


2 Comments

Well, using 'np.loadtxt' made my process use less memory, so I guess it's a solution. But I have a question: would it be faster to read the variable 'tupleFormat' from a 'pickle' representation compared with opening the file and reading it? I notice that the pickle representation of 'tupleFormat' is at least twice the size of my original file, but maybe reading it will still be faster than parsing the text file?
@dada What do you want to do with that data? Reading/loading off the hard drive is a slow process. If you need to load/check a lot of values many times, you're better off loading them into memory once, unless memory is the physical restriction.
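
Regarding the pickle question in the first comment: once the data is a numpy array, one common alternative is to cache it in numpy's own binary .npy format, which skips the text parsing on later runs and stores roughly 8 bytes per value. A sketch (the cache file name is just an example):

import numpy as np

data = np.genfromtxt(fname, autostrip=True)   # slow text parse, done once
np.save('cache.npy', data)                    # compact binary cache

# on later runs, load the cache instead of re-parsing the 28 GB of text
data = np.load('cache.npy')                   # or mmap_mode='r' to avoid loading it all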

You need to read it line by line lazily using a generator. Try this:

col1 = []
col2 = []
col3 = []

rowFormat = [col1, col2, col3]

with open('test', 'r') as f:
    for line in f:
        parts = line.split()
        col1.append(float(parts[0]))
        col2.append(float(parts[1]))
        col3.append(float(parts[2]))
        # if possible do something here to start seeing results immediately

tupleFormat = zip(*rowFormat)
for ele in tupleFormat:
    pass    ### do something with ele

You can add your logic in the for loop so you don't wait for the whole process to finish.
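
If the per-element work really can happen on the fly, a further variation of the same idea is to wrap the parsing in a generator function, so that nothing is accumulated in lists at all (a sketch, reusing the 'test' file name from above):

def read_triples(path):
    # yield one (col1, col2, col3) tuple per line without storing them all
    with open(path) as f:
        for line in f:
            parts = line.split()
            yield float(parts[0]), float(parts[1]), float(parts[2])

for ele in read_triples('test'):
    pass    # do something with ele; memory use stays roughly constant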

5 Comments

Wouldn't this still read it in one go instead of lazily? I mean, to become a generator, wouldn't he need to yield the current value for each line?
Why would this reduce the amount of memory used by the process running my script?
Yes it will. 3rd option in the link above - "If the file is line-based, the file object is already a lazy generator of lines".
Sorry, just saw you asked why. It will reduce the amount of memory because you will be reading line by line into memory. This is different from what you did, because the list comprehension you used, [x for x in f], iterates over the whole file and reads an entire column into each variable. And not only that, you are also reading the file 3 times to do it. What I'm suggesting is an improvement, although it seems using numpy - as Reti43 suggested - could be an even better solution.

Can you get by w/o storing every tuple? I.e. can "do something" happen as you read in the file? If so... try this:

#!/usr/bin/env python
import fileinput
for line in fileinput.FileInput('test.dat'):
    ele = tuple(float(x) for x in line.strip().split())
    # Replace 'print' with your "do something".
    # ele is a tuple of the three floats on the line.
    print ele

If not, maybe you can save some memory by choosing either the column format or the list of tuples format, but not BOTH, for example....

#!/usr/bin/env python
import fileinput
elements = []
for line in fileinput.FileInput('test.dat'):
    elements.append(tuple((float(x) for x in line.strip().split())))

for ele in elements:
    pass   # do something
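
If it is the column format you need, one way to keep memory down is the standard-library array module, which stores each column as packed C doubles (8 bytes per value) instead of a list of Python float objects; a sketch under that assumption:

#!/usr/bin/env python
from array import array
import fileinput

col1, col2, col3 = array('d'), array('d'), array('d')   # packed C doubles
for line in fileinput.FileInput('test.dat'):
    x, y, z = (float(v) for v in line.split())
    col1.append(x)
    col2.append(y)
    col3.append(z)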

