
This question comes from my curiosity about how Kubernetes handles resource requests and limits, especially memory constraints defined for pods.

I understand that Kubernetes uses cgroups under the hood to enforce these limits, and I'm trying to dig deeper into what actually happens at the kernel level when a process inside a cgroup exceeds its memory limit (memory.max, in cgroups v2).

Specifically, I'm wondering:

  • What criteria does the Linux kernel use to decide whether to deny a memory allocation or to trigger the OOM killer?
  • Is there a threshold or condition that determines which action is taken?

I'm exploring this in the context of containerized workloads (e.g., Kubernetes pods), but I'm interested in the general kernel behavior regardless of orchestration.

Thanks in advance!

1 Answer


With the default memory overcommit policy (vm.overcommit_memory=0), the kernel doesn't deny a memory allocation unless a single request asks for more memory than the total of physical memory plus swap present in the system.
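
For example, a process can usually reserve far more virtual memory than its cgroup's memory.max without any error, because nothing is charged to the cgroup at allocation time; the charge happens on first use. A minimal sketch in C (the 8 GiB request and the assumption of a 64-bit system are example choices, nothing Kubernetes-specific):

    /* Minimal sketch: with vm.overcommit_memory=0 (the default heuristic), a large
     * allocation usually succeeds even inside a cgroup whose memory.max is much
     * smaller, because only virtual address space is reserved here. The 8 GiB
     * request is an arbitrary example value. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t size = 8UL * 1024 * 1024 * 1024;  /* 8 GiB of virtual memory */
        void *p = malloc(size);                  /* typically backed by an untouched mmap() */

        if (p == NULL) {
            /* Expected only if the single request exceeds roughly RAM + swap,
             * or if vm.overcommit_memory=2 (never overcommit) is in effect. */
            perror("malloc");
            return 1;
        }
        printf("reserved %zu bytes of virtual memory without touching them\n", size);
        free(p);
        return 0;
    }

With vm.overcommit_memory=2 (never overcommit), the same malloc can fail up front instead of succeeding here and being caught later by the cgroup.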

When the process actually touches the allocated memory, the kernel maps a free physical page into the process's virtual address space and charges it to the cgroup. If that charge would exceed the cgroup's memory.max limit, the kernel goes into direct reclaim and tries to free the least recently used clean pages from the cgroup's file cache, or to swap out (if swap is enabled) its least recently used anonymous pages. If it cannot reclaim enough pages, the OOM killer is invoked as a last resort to free memory.
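
To see this path in action, the following sketch touches a large allocation one page at a time. It assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup and a test cgroup whose memory.max is well below the allocation size (the demo name, the 256 MiB limit and the 1 GiB allocation are example values):

    /* Minimal sketch: run inside a cgroup v2 group with a small memory.max, e.g.
     *   mkdir /sys/fs/cgroup/demo
     *   echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/demo/memory.max
     *   echo $$ > /sys/fs/cgroup/demo/cgroup.procs
     * (example path and limit). Touching each page forces the kernel to map real
     * pages and charge them to the cgroup; once the charge would exceed memory.max
     * and reclaim cannot free enough, the OOM killer terminates the process. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        size_t size = 1UL * 1024 * 1024 * 1024;      /* 1 GiB, above the example limit */
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        char *p = malloc(size);

        if (p == NULL) {
            perror("malloc");
            return 1;
        }
        for (size_t off = 0; off < size; off += page) {
            p[off] = 1;                              /* first touch: page fault + cgroup charge */
            if (off % (64UL * 1024 * 1024) == 0)
                printf("touched %zu MiB\n", off >> 20);  /* progress until the OOM kill */
        }
        printf("touched all %zu bytes\n", size);     /* reached only if the limit was never hit */
        free(p);
        return 0;
    }

When the cgroup's charge can no longer be kept under memory.max by reclaim, the process is killed with SIGKILL and the kernel log (dmesg) records a memory-cgroup OOM kill for it.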

I omit memory.high, memory.low, memory.min for simplicity. They are not used in Kubernetes unless Quality-of-Service for Memory Resources is enabled.

  • Yes, that is correct, with one important note: by default it is the largest process that gets OOM-killed, not the allocating one.
  • When the OOM killer is triggered in a cgroup, it is the largest process in that cgroup that gets killed, not the largest in the system. Killing the last process that page-faulted would be equivalent to killing a random process and wouldn't solve the problem of low memory.
  • @SottoVoce If the cgroup has a hard memory limit, then yes, hitting that limit can trigger the OOM killer for just that cgroup. By default it will kill the biggest memory user in the cgroup, though it's also possible to configure it to kill the entire cgroup instead (essentially the equivalent of the vm.panic_on_oom sysctl, but for just that cgroup instead of the whole system).
  • @NicolaSergio The out-of-memory situation only happens when the pages are actually used, not when they are allocated (unless the system is set to never overcommit memory). It's not only possible but fairly common for a large share of allocated memory never to be used, usually because the program does its own memory management internally. In other words, having 1 GiB of virtual memory allocated does not automatically mean you are using 1 GiB of memory; the cgroup limits cap what you use, not what you ask for.
  • @NicolaSergio kubectl top uses the Kubernetes Metrics API, which in turn uses cAdvisor, which gets its information from the cgroup memory.* files. The memory they report is called the "working set", which is supposed to be non-reclaimable. It is calculated as memory.current - inactive_file from memory.stat; active_file is not subtracted even though that memory can also be reclaimed. memory.current is the memory that is actually mapped to physical pages and has been used at least once. (A small sketch of this calculation follows these comments.)
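
To make that working-set calculation concrete, here is a small sketch that reads the cgroup v2 files directly and computes memory.current - inactive_file; the /sys/fs/cgroup/demo path is an example value and should be replaced with the pod's actual cgroup directory:

    /* Minimal sketch: compute "working set" = memory.current - inactive_file for one
     * cgroup, mirroring what the comment above describes for cAdvisor / kubectl top.
     * The cgroup path below is an example assumption. */
    #include <stdio.h>
    #include <string.h>

    #define CGROUP "/sys/fs/cgroup/demo"

    static long long read_current(void)
    {
        long long v = -1;
        FILE *f = fopen(CGROUP "/memory.current", "r");
        if (f) {
            if (fscanf(f, "%lld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    static long long read_stat(const char *key)
    {
        char name[64];
        long long v, found = -1;
        FILE *f = fopen(CGROUP "/memory.stat", "r");
        if (!f)
            return -1;
        while (fscanf(f, "%63s %lld", name, &v) == 2) {
            if (strcmp(name, key) == 0) {
                found = v;
                break;
            }
        }
        fclose(f);
        return found;
    }

    int main(void)
    {
        long long current = read_current();                     /* bytes charged to the cgroup */
        long long inactive_file = read_stat("inactive_file");   /* easily reclaimable file cache */

        if (current < 0 || inactive_file < 0) {
            fprintf(stderr, "failed to read cgroup files under %s\n", CGROUP);
            return 1;
        }
        printf("memory.current = %lld bytes\n", current);
        printf("inactive_file  = %lld bytes\n", inactive_file);
        printf("working set    = %lld bytes\n", current - inactive_file);
        return 0;
    }

The two inputs can also be cross-checked by hand with cat on memory.current and memory.stat.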
