
I have a reasonably large file (~4 GB on disk) that I want to access with Python's mmap module to gain some familiarity with memory maps. I have a 64-bit system and am running something similar to the example below. When I run it, I notice that the process's memory consumption continually increases. I've profiled it with pympler and nothing stands out. Can someone point me to resources that describe what's going on under the hood and how to correct this, so I can scan through the file without this "memory leak" consuming all my memory? Thanks!

import mmap

# Open in binary mode; mmap works on the underlying file descriptor.
with open("/path/to/large.file", "rb") as j:
    mm = mmap.mmap(j.fileno(), 0, access=mmap.ACCESS_READ)

# Scan forward through the mapping, printing the offset of each match.
pos = 0
for i in range(mm.size()):
    new_pos = mm.find(b"10", pos)
    if new_pos == -1:
        # No more matches; stop rather than wrapping back to the start.
        break
    print(new_pos)
    pos = new_pos + 1

EDIT: The file looks something like this:

0000001, data
0000002, more data
...
...

With sequential values like this in the first field, there will be a lot of hits for find(b"10").
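
For reference, here is a minimal sketch of one way to watch the process's peak resident set size from inside the scan, using the standard resource module (the path is a placeholder and the print interval is arbitrary):

import mmap
import resource

def peak_rss_kb():
    # Peak resident set size of this process; ru_maxrss is reported in kB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

with open("/path/to/large.file", "rb") as j:
    mm = mmap.mmap(j.fileno(), 0, access=mmap.ACCESS_READ)

pos = 0
hits = 0
while True:
    new_pos = mm.find(b"10", pos)
    if new_pos == -1:
        break
    pos = new_pos + 1
    hits += 1
    if hits % 1_000_000 == 0:
        # Report how much of the process is resident as the scan progresses.
        print(f"{hits} hits, peak RSS {peak_rss_kb()} kB")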

3 Comments
  • How much is the memory consumption increasing by? Python memory management is extremely difficult to optimise; the CPython interpreter almost never releases memory that it allocates. Commented Jul 19, 2020 at 23:41
  • @IainShelvington I let it get to ~50% of my available memory before killing it (having started at ~0% a few seconds earlier). Commented Jul 19, 2020 at 23:52
  • I'm debugging a similar problem, but my usage pattern is even simpler. I'm writing and reading fixed-length binary structs to disk using mmap. I'm not using find(), I instead use a separate catalog of offsets to look up and retrieve data. As I write new entries, the memory usage increases without bound. I'm going to do an experiment where I substitute binary file objects and simple write()+seek()+read() operations. It will likely kill the performance of my library, but I'll be able to ascertain whether or not the memory leak is caused by mmap. I'll report back here later. Commented Jul 23, 2020 at 13:36

1 Answer


Gather a live core of your running process (for example, with gdb's gcore) and use chap (open source software available at https://github.com/vmware/chap) to analyze that core.

Here are some chap commands that are relevant for this use case:

describe used

This will describe all the used allocations (whether allocated by Python or by native code), but won't tell you directly about any regions that have been mmapped.

describe free

This will show allocations that have been freed but for which the associated space has not been given back to the operating system.

describe writable
describe readonly

These will tell you about larger regions that are writable or read-only, respectively. In your case, since you specified ACCESS_READ for the mapping, that mapping, if still present, would show up in the output of "describe readonly" as an unknown region, or as part of such a region.


2 Comments

Thanks for the suggestion! I'll take a look ASAP; it seems like a really good way to profile this.
You are welcome. If you have questions about the output, feel free to post them here.
