
I'm using this little module to dump objects to disk and I'm observing monotonically increasing memory usage. It happens with unicode strings but not with integers; is there something I'm doing wrong?

When I do:

>>> from utils.diskfifo import DiskFifo
>>> df=DiskFifo()
>>> for i in xrange(1000000000):
...     df.append(i)

Memory consumption is stable

but when I do:

>>> while True:
...     a={'key': u'value', 'key2': u'value2'}
...     df.append(a)

Memory consumption goes through the roof. Any hints? The module is below...


import tempfile
import cPickle

class DiskFifo:
    def __init__(self):
        self.fd = tempfile.TemporaryFile()
        self.wpos = 0
        self.rpos = 0
        self.pickler = cPickle.Pickler(self.fd)
        self.unpickler = cPickle.Unpickler(self.fd)
        self.size = 0

    def __len__(self):
        return self.size

    def extend(self, sequence):
        map(self.append, sequence)

    def append(self, x):
        self.fd.seek(self.wpos)
        self.pickler.dump(x)
        self.wpos = self.fd.tell()
        self.size = self.size + 1

    def next(self):
        try:
            self.fd.seek(self.rpos)
            x = self.unpickler.load()
            self.rpos = self.fd.tell()
            return x

        except EOFError:
            raise StopIteration

    def __iter__(self):
        self.rpos = 0
        return self
  • Why not use shelve? docs.python.org/library/shelve.html Commented Jul 28, 2011 at 9:55
  • How are you measuring memory consumption? Are you aware that Python rarely (almost never) returns memory to the OS? Commented Jul 28, 2011 at 10:02
  • @S.Lott sort of, but then it should stabilize at some point right? one thing is not returning and the other is leaking... Commented Jul 28, 2011 at 10:05
  • @piotr: a 'leak' is when the memory is still claimed but is inaccessible to the application. If Python can still use the memory but hasn't decided to free it, say it's lying stale in a cache somewhere, then it isn't a leak. Commented Jul 28, 2011 at 10:08
  • When you do for i in xrange('1000000000') you'll get a TypeError. Commented Jul 28, 2011 at 10:26

2 Answers


The Pickler object stores every object it has seen in its memo, so it doesn't have to pickle the same thing twice. You want to skip this (so that references to your objects aren't kept alive inside the pickler) and clear the memo before dumping:

def append(self, x):
    self.fd.seek(self.wpos)
    self.pickler.clear_memo()
    self.pickler.dump(x)
    self.wpos = self.fd.tell()
    self.size = self.size + 1

Source: http://docs.python.org/library/pickle.html#pickle.Pickler.clear_memo

Edit: You can actually watch the size of the memo go up as you pickle your objects by using the following append function:

def append(self, x):
    self.fd.seek(self.wpos)
    print len(self.pickler.memo)
    self.pickler.dump(x)
    self.wpos = self.fd.tell()
    self.size = self.size + 1

5 Comments

That doesn't explain the increase in memory, since the object being pickled again is the same.
Yes, it does. When you call self.pickler.dump(x), the pickler object does something like self.memo.append(x). As you go through your while True: loop in your example code, you are creating thousands of objects which your pickler object is keeping references to, meaning they are kept in memory and not gotten rid of by the GC. Calling self.pickler.clear_memo() essentially causes the pickler to do self.memo = [], getting rid of any references to the objects and allowing the GC to get rid of them. (See the sketch after these comments for a quick way to watch this happen.)
@poitr - I've edited my answer with some code which will allow you to watch the size of the memo increase as you pickle things.
This made a huge difference in memory consumption, from 1.5G to 11M
Just bear in mind that this might cause your pickle to be larger - nothing will be pickled by reference.
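
For anyone who wants to see this interactively, here is a quick sketch (mine, not from the answer) of the memo growing and then being cleared, using cPickle as in the question:

import cPickle
import cStringIO

buf = cStringIO.StringIO()
pickler = cPickle.Pickler(buf)

for i in range(3):
    # every dumped object (and its contents) is remembered in the memo,
    # so references to it stay alive after dump() returns
    pickler.dump({'key': u'value-%d' % i})
    print len(pickler.memo)   # grows with each dump()

pickler.clear_memo()
print len(pickler.memo)       # back to 0; the dumped objects can now be collected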

To add to @combatdave's answer:

I just bypassed the memo caching in pickle entirely, since clearing the memo on the reader side seems impossible and the growth looked like an unavoidable memory leak. Pickle streaming seems to be designed for reading and writing moderately sized files, not for unbounded streams of data.

Instead I just used the following simple utility functions:

import pickle
import struct


def framed_pickle_write(obj, stream):
    # pickle.dumps uses a fresh pickler (and memo) on every call; the
    # payload is written with a 4-byte big-endian length prefix
    serial_obj = pickle.dumps(obj)
    length = struct.pack('>I', len(serial_obj))
    stream.write(length)
    stream.write(serial_obj)


def framed_pickle_read(stream):
    # read the 4-byte length header, then exactly that many payload bytes
    data = stream.read(4)
    length, = struct.unpack('>I', data)
    serial_obj = stream.read(length)
    return pickle.loads(serial_obj)
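
As a rough usage sketch (not part of the original answer - the FramedDiskFifo name and the EOF handling are my own), these helpers can replace the shared Pickler/Unpickler pair in the question's DiskFifo:

import struct
import tempfile

class FramedDiskFifo(object):
    # assumes framed_pickle_write / framed_pickle_read from above are in scope
    def __init__(self):
        self.fd = tempfile.TemporaryFile()
        self.wpos = 0
        self.rpos = 0
        self.size = 0

    def __len__(self):
        return self.size

    def append(self, x):
        self.fd.seek(self.wpos)
        framed_pickle_write(x, self.fd)   # fresh pickle every call, no memo growth
        self.wpos = self.fd.tell()
        self.size += 1

    def __iter__(self):
        self.rpos = 0
        return self

    def next(self):  # __next__ in Python 3
        self.fd.seek(self.rpos)
        try:
            x = framed_pickle_read(self.fd)
        except struct.error:              # fewer than 4 header bytes left: end of data
            raise StopIteration
        self.rpos = self.fd.tell()
        return x

Because each append pickles independently, there is no shared memo to grow, at the cost of re-pickling shared sub-objects on every write (as the comment on the accepted answer notes).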
