
I'm having a strange issue: a memory leak happens while reading from a file in Python 2.

I spent several hours trying to eliminate the leaking code and failed. Then I isolated it and wrote a minimal program that reproduces the same leak. Here it is:

import sys
import re

@profile
def leak(filename):
    residual_line = []  # tokens carried over from the previous line
    with open(filename) as f:
        for line in f:
            splitted_line = re.split(r' |\n', line)
            del line
            filtered_line = filter(lambda x: x != '', splitted_line)
            del splitted_line
            filtered_line = residual_line + filtered_line
            for x in range(0, len(filtered_line)):
                a = 5
            residual_line = filtered_line
            del filtered_line
    del residual_line

@profile
def main():
    filename = sys.argv[1]
    leak(filename)
    sys.exit(0)

main()

I'm profiling it with the memory_profiler module, and here's the profiling output:

[screenshot of memory_profiler output]

As you can see, an allocation happens on line 8 but is never released. The leak is 31 KiB, while the file I'm reading is only 3.4 kB. If I double the file size, the leak grows to 70 KiB; doubling again gives 160 KiB, so the leak seems to scale with the file size.

I hope someone can find the leak. Thanks in advance.

  • I don't know whether it's related to your problem, but it looks like you have mixed tabs and spaces. That's a frequent source of bugs. Commented Dec 20, 2015 at 23:45
  • Yeah, I know, but it's not related. Commented Dec 20, 2015 at 23:47
  • You should fix the indentation and retest it anyway. A minimal example shouldn't have unrelated problems like bad indentation. Also, does the leak go away if you remove the loop over range(0, len(filtered_line)), or if you use residual_line += filtered_line instead of the weird and inefficient way you're currently adding to residual_line? Commented Dec 20, 2015 at 23:52
  • The file iterator reads in chunks, so it's not surprising that memory goes up. But it should be cleaned up when the with block exits, between lines 17 and 18. Maybe the profiler isn't so good at figuring that out (see the sketch after these comments). Commented Dec 20, 2015 at 23:56
  • I checked with psutil as well; the result is the same. Commented Dec 21, 2015 at 0:02
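A small sketch for the chunked-read comment above (Python 2 only; 'somefile.txt' is a hypothetical placeholder for any multi-line text file). The file iterator buffers a whole read-ahead chunk at a time, which is why memory can rise by more than one line's worth:

with open('somefile.txt') as f:
    first = next(f)     # logically consumes just the first line...
    print(len(first))   # ...which is typically a few dozen bytes,
    print(f.tell())     # yet tell() usually reports a much larger offset:
                        # the iterator has buffered a whole read-ahead chunk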

1 Answer


I doubt there is any memory leak; I think your concern comes from a misunderstanding of how Python's memory allocation works.

Introduction to Memory Allocation

Memory allocation works at several levels in Python. There's the system's own allocator, which is what shows up when you check memory use with the Windows Task Manager or ps. Then there's the C runtime's memory allocator (malloc), which gets memory from the system allocator and hands it out in smaller chunks to the application. Finally, there's Python's own object allocator, which is used for objects up to 256 bytes. This allocator grabs large blocks of memory from the C allocator and chops them into smaller pieces using an algorithm carefully tuned for Python.

Here's an example that demonstrates this: I create an object, delete it, and then run the garbage collector:

import gc

@profile
def create_list():
    lst = list(range(1000000))
    del lst
    gc.collect()


if __name__ == '__main__':
    create_list()

The results are as follows:

└──> python -m memory_profiler test.py
Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     3   20.414 MiB    0.000 MiB   @profile
     4                             def create_list():
     5   51.461 MiB   31.047 MiB       lst = list(range(1000000))
     6   43.828 MiB   -7.633 MiB       del lst
     7   26.793 MiB  -17.035 MiB       gc.collect()

Notice that most of the memory is never returned to the system; it is kept by either the Python allocator or the C allocator.
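You can watch the same effect from outside the profiler. Here is a minimal sketch using psutil (assuming it is installed; exact numbers will vary by platform and allocator) that tracks the process's resident set size:

import gc
import psutil  # assumed installed: pip install psutil

proc = psutil.Process()  # the current process

def rss_mib():
    # resident set size as reported by the OS, in MiB
    return proc.memory_info().rss / (1024.0 * 1024.0)

print('baseline:  %.1f MiB' % rss_mib())
lst = list(range(1000000))
print('allocated: %.1f MiB' % rss_mib())
del lst
gc.collect()
print('collected: %.1f MiB' % rss_mib())

On many systems the final figure stays well above the baseline, because the freed blocks are kept by the allocator for reuse rather than returned to the OS.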

If I do the same with a file, I get a similar result:

import gc

@profile
def iter_file():
    with open('/path/to/somefile') as f:
        for line in f:
            del line
            gc.collect()


if __name__ == '__main__':
    iter_file()

And here are the results:

└──> python -m memory_profiler test.py
Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     3   20.496 MiB    0.000 MiB   @profile
     4                             def iter_file():
     5   20.496 MiB    0.000 MiB       with open('/path/to/somefile') as f:
     6   20.801 MiB    0.305 MiB           for line in f:
     7   20.613 MiB   -0.188 MiB               del line
     8   20.613 MiB    0.000 MiB               gc.collect()

In short, don't worry about memory leaks: just make sure variables go out of scope, and Python will handle it for you.

This also means you do not have to delete every variable you create: as soon as a variable is rebound or goes out of scope, the object it referenced is garbage collected automatically and its memory returned to the allocator, and potentially to the system.
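A small sketch illustrating this (Payload is just a hypothetical stand-in for any sizeable object): simply rebinding a name is enough for CPython to reclaim the old object:

import gc
import weakref

class Payload(object):
    # hypothetical stand-in for any sizeable object
    pass

obj = Payload()
tracker = weakref.ref(obj)  # observe the object without keeping it alive

obj = Payload()             # rebinding the name drops the old object
gc.collect()                # only needed if reference cycles are involved
print(tracker() is None)    # True: the first Payload has been reclaimed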


2 Comments

It seems you are right; I can't do anything about it.
I know it seems horrifying from a C/C++ perspective, but it's actually part of the beauty of Python. As everything goes out of scope (from leaving a function to deleting a reference), the memory is automatically cleaned up and the objects removed. If you need immediate cleanup of a particularly large object, you can call gc.collect(), as shown above, albeit at a performance cost. It means that rather than cleaning up after every struct on the heap, char array, etc., everything is cleaned up for you, so you can just write code.
