Kernel development
Brief items
Kernel release status
The current development kernel is 4.0-rc3, released on March 8. "Back on track with a Sunday afternoon release schedule, since there was nothing particularly odd going on this week, and no last-minute bugs that I knew of and wanted to get fixed holding things up."
Stable updates: 3.19.1, 3.18.9, 3.14.35, and 3.10.71 were all released on March 7.
The kernel's code of conflict
A brief "code of conflict" was merged into the kernel's documentation directory for the 4.0-rc3 release. The idea is to describe the parameters for acceptable discourse without laying down a lot of rules; it also names the Linux Foundation's technical advisory board as a body to turn to in case of unacceptable behavior. This document has been explicitly acknowledged by a large number of prominent kernel developers.Sasha Levin picks up 3.18 maintenance
By the normal schedule, the 3.18 stable update series is due to come to an end about now. In this case, though, Sasha Levin has decided to pick up the maintenance for this kernel, so updates will continue coming through roughly the end of 2016.
Kernel development news
Progress on persistent memory
It has been "the year of persistent memory" for several years now, Matthew Wilcox said with a chuckle to open his plenary session at the 2015 Storage, Filesystem, and Memory Management summit in Boston on March 9. Persistent memory refers to devices that can be accessed like RAM, but will permanently store any data written to them. The good news is that there are some battery-backed DIMMs already available, but those have a fairly small capacity at this point (8GB, for example). There are much larger devices coming, 400GB was mentioned, but it is not known when they will be shipping. From Wilcox's talk, it is clear that the two different classes of devices will have different use cases, so they may be handled differently by the kernel.
It is good news that there are "exciting new memory products" in development, he said, but it may still be some time before we see them on the market. He is not sure that we will see them this year, for example. It turns out that development delays sometimes happen when companies are dealing with "new kinds of physics".
![Matthew Wilcox](https://static.lwn.net/images/2015/lsf-wilcox-sm.jpg)
Christoph Hellwig jumped in early in the talk to ask whether Wilcox's employer, Intel, would be releasing its driver for persistent memory devices anytime soon. Wilcox was obviously unhappy with the situation, but said that the driver cannot be released until the ACPI specification for how the device describes itself to the system is published. That is part of ACPI 6, which will be released "when ACPI gets around to it"; as soon as that happens, Intel will release its driver.
James Bottomley noted that there is a process within UEFI (which oversees ACPI) to release portions of specifications if there is general agreement by the participants to do so. He encouraged Intel to take advantage of that process.
Another attendee asked whether it was possible to write a driver today that would work with all of the prototype devices tested but wouldn't corrupt any of the other prototypes that had not been tested. Wilcox said no; at this point that isn't the case. "It is frustrating", he said.
Persistent memory and struct page
He then moved on to a topic he thought would be of interest to the memory-management folks in attendance. With a 4KB page size, and a struct page for each page, the 400GB device he mentioned would require 6GB just to track those pages in the kernel. That is probably too much space to "waste" for those devices. But if the kernel tracks the memory with page structures, it can be treated as normal memory. Otherwise, some layer, like a block device API, will be needed to access the device.
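That 6GB figure is simple arithmetic; the sketch below works it out, assuming the usual 64-byte struct page found on 64-bit kernels (the exact size varies with configuration):

```c
#include <stdio.h>

int main(void)
{
	/* Assumptions: 4KB pages and a 64-byte struct page (typical on 64-bit) */
	unsigned long long device_bytes = 400ULL << 30;	/* 400GB device */
	unsigned long long page_size    = 4096;
	unsigned long long struct_page  = 64;

	unsigned long long pages    = device_bytes / page_size;
	unsigned long long overhead = pages * struct_page;

	printf("pages:    %llu\n", pages);	/* about 105 million pages */
	printf("overhead: %.2f GB\n",
	       overhead / (double)(1ULL << 30));	/* roughly 6.25 GB */
	return 0;
}
```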
Wilcox has been operating under the assumption that those kinds of devices won't use struct page. On the other hand, Boaz Harrosh (who was not present at the summit) has been pushing patches for other, smaller devices, and those patches do use struct page. That makes sense for that use case, Wilcox said, but it is not the kind of device he has been targeting.
Those larger devices have wear characteristics that are akin to those of NAND flash, but it isn't "5000 cycles and the bit is dead". The devices have wear lifetimes of 10⁷ or 10⁸ cycles. In terms of access times, some are even faster than DRAM, he said.
Ted Ts'o suggested that the different capacity devices might need to be treated differently. Dave Chinner agreed, saying that the battery-backed devices are effectively RAM, while the larger devices are storage, which could be handled as block devices.
Wilcox said he has some preliminary patches to replace calls to get_user_pages() for these devices with a new call, get_user_sg(), which gets a scatter/gather list, rather than pages. That way, there is no need to have all those page structures to handle these kinds of devices. Users can treat the device as a block device. They can put a filesystem on it and use mmap() for data access.
That led to a discussion about how to handle a truncate() on a file that has been mapped with mmap(). Wilcox thinks that Unix, and thus Linux, has the wrong behavior in that scenario. If a program accesses memory that is no longer part of the mapped file due to the truncation, it gets a SIGSEGV. Instead, he thinks that the truncate() call should be made to wait until the memory is unmapped.
Making truncate() wait is trivial to implement, Peter Zijlstra said, but it certainly changes the current behavior. He suggested adding a flag to mmap() to request this mode of operation. That should reduce the surprise factor as it makes the behavior dependent on what is being mapped. Ts'o said that he didn't think the kernel could unconditionally block truncate operations for hours or days without breaking some applications.
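The behavior in question is easy to see from user space. The following sketch (a standalone demonstration, not code from any of the patches discussed) maps a file, truncates it out from under the mapping, then touches the no-longer-backed page; on current kernels the result is a fatal signal rather than a delayed truncate():

```c
/* Minimal illustration: truncate a file while it is mmap()ed, then touch
 * the page that lost its backing.  On current kernels the access generates
 * a fatal signal instead of making truncate() wait. */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void handler(int sig)
{
	/* Not strictly async-signal-safe, but good enough for a demo. */
	printf("got signal %d accessing the truncated mapping\n", sig);
	_exit(0);
}

int main(void)
{
	int fd = open("demo-file", O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 || ftruncate(fd, 4096) < 0)
		exit(1);

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		exit(1);

	signal(SIGBUS, handler);
	signal(SIGSEGV, handler);

	p[0] = 'x';		/* works: the page is backed by the file      */
	ftruncate(fd, 0);	/* truncate the file out from under the map   */
	p[0] = 'y';		/* faults: the backing page is gone           */

	printf("access after truncate succeeded?\n");
	return 0;
}
```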
Getting back to the question of the drivers, Ts'o asked what decisions needed to be made and by when. The battery-backed devices are out there now, so patches to support them should go in soon, one attendee said. Hellwig said that it would make sense to have Harrosh's driver and the Intel driver in the kernel. People could then choose the one that made sense for their device. In general, that was agreeable, but the driver for the battery-backed devices still needs some work before it will be ready to merge. Bottomley noted that means that the group has decided to have two drivers, "one that needs cleaning up and one we haven't seen".
New instructions
Wilcox turned to three new instructions, announced by Intel for its upcoming processors, that can be used to better support persistent memory and other devices. The first is clflushopt, a version of the cache-line flush (clflush) instruction with relaxed ordering requirements; its main benefit is that it is faster than clflush. Cache-line writeback (clwb) is another; it writes the cache line back to memory but leaves the (now clean) line in the cache. The third is pcommit, which acts as a sort of barrier to ensure that any prior cache flushes or writebacks actually get to memory.
The effect of pcommit is global for all cores. The idea is to do all of the flushes, then pcommit; when it is done, all that data will have been written. On current processors, there is no way to be sure that everything has been stored. He said that pcommit support still needs to be added to DAX, the direct access block layer for persistent memory devices that he developed.
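In code, the ordering Wilcox described would look something like the sketch below. It is only an illustration of that sequence (write back each cache line, fence, pcommit, fence again); it assumes an assembler that knows the new mnemonics and a CPU that implements them, and real code would check CPUID for the corresponding feature bits first:

```c
/* Sketch of the flush-then-commit ordering described above; a minimal
 * illustration only, under the toolchain and CPU assumptions stated in
 * the text. */
#include <stddef.h>

static inline void clwb(void *addr)
{
	asm volatile("clwb %0" : "+m" (*(volatile char *)addr));
}

static inline void pcommit(void)
{
	asm volatile("pcommit");
}

static inline void sfence(void)
{
	asm volatile("sfence" ::: "memory");
}

/* Make a range of persistent memory durable. */
static void pmem_persist(void *addr, size_t len)
{
	char *p;

	for (p = addr; p < (char *)addr + len; p += 64)
		clwb(p);	/* write back each (64-byte) cache line */
	sfence();		/* order the write-backs...             */
	pcommit();		/* ...then commit them to the media     */
	sfence();		/* wait for the commit to complete      */
}
```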
Ts'o asked about other processors that don't have support for those kinds of instructions, but Wilcox didn't have much of an answer for that. He works for Intel, so other vendors will have to come up with their own solutions there.
There was also a question about adding a per-CPU commit, which Wilcox said was under internal discussion. But Bottomley thought that if there were more complicated options, that could just lead to more problems. Rik van Riel noted that the scheduler could move the process to a new CPU halfway through a transaction anyway, so the target CPU wouldn't necessarily be clear. In answer to another question, Wilcox assured everyone that the flush operations would not be slower than existing solutions for SATA, SAS, and others.
Error handling
His final topic was error handling. There is no status register that gives error indications when you access a persistent memory device, since it is treated like memory. An error causes a machine check, which typically results in a reboot. But if the problem persists, it could just result in another reboot when the device is accessed again, which will not work all that well.
To combat this, there will be a log of errors for the device that can be consulted at startup. It will record the block device address where problems occur and filesystems will need to be able to map that back to a file and offset, which is "not the usual direction for a filesystem". Chinner spoke up to say that XFS would have this feature "soon". Ts'o seemed to indicate ext4 would also be able to do it.
But "crashing is not a great error discovery technique", Ric Wheeler said; it is "moderately bad" for enterprise users to have to reboot their systems that way. But handling the problems when an mmap() is done for that bad region in the device is not easy either. Several suggestions were made (a signal from the mmap() call or when the page table entry is created, for example), but any of them mean that user space needs to be able to handle the errors.
In addition, Chris Mason said that users are going to expect to be able to mmap() a large file that has one bad page and still access all of the other pages from the file. That may not be reasonable, but is what they will expect. At that point, the discussion ran out of time without reaching any real conclusion on error handling.
[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Allowing small allocations to fail
As Michal Hocko noted at the beginning of his session at the 2015 Linux Storage, Filesystem, and Memory Management Summit, the news that the memory-management code will normally retry small allocations indefinitely rather than returning a failure status came as a surprise to many developers. Even so, this behavior is far from new; it was first added to the kernel in 2001. At that time, only order-0 (single-page) allocations were treated that way, but, as the years went by, that limit was raised repeatedly; in current kernels, anything that is order-3 (eight pages) or less will not normally be allowed to fail. The code to support this mode of operation has become more complex over time as well.
Relatively late in the game, the __GFP_NOFAIL flag was added to specifically annotate the places in the kernel where failure-proof allocations are needed, but the "too small to fail" behavior has never been removed from other allocation operations. After 14 years, Michal said, there will certainly be many places in the code that depend on these semantics. That is unfortunate, since the failure-proof mode is error-prone and unable to deal with real-world situations like infinite retry loops outside of the allocator, locking conflicts, and out-of-memory (OOM) situations. The result is occasional lockups as described in this article.
There have been various attempts to get around the problem, such as adding
timeouts to the OOM killer (see this
article), but Michal thinks such approaches are "not nice." The proper
way to handle that kind of out-of-memory problem is to simply fail
allocation requests when the necessary resources are not available. Most
of the kernel already has code to check for and deal with such situations;
beyond that, the memory-management code should not be attempting to dictate
the failure strategy to the rest of the kernel.
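Handling such a failure is nothing exotic; the sketch below shows ordinary kernel idiom (hypothetical code, not anything from Michal's patch): check for a NULL return and pass the error up, or explicitly opt back into the old semantics with __GFP_NOFAIL:

```c
#include <linux/err.h>
#include <linux/slab.h>

struct widget {
	int id;
	char name[32];
};

/* Ordinary callers check for failure and pass the error up the stack. */
static struct widget *widget_create(int id)
{
	struct widget *w = kzalloc(sizeof(*w), GFP_KERNEL);

	if (!w)
		return ERR_PTR(-ENOMEM);	/* the caller must cope */
	w->id = id;
	return w;
}

/* A caller that truly cannot fail must say so explicitly; the allocator
 * will then retry on its behalf rather than returning NULL. */
static struct widget *widget_create_nofail(int id)
{
	struct widget *w = kzalloc(sizeof(*w), GFP_KERNEL | __GFP_NOFAIL);

	w->id = id;		/* never NULL when __GFP_NOFAIL is used */
	return w;
}
```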
Changing the allocator's behavior is relatively easy; the harder question is how to make such a change without introducing all kinds of hard-to-debug problems. The current code has worked for 14 years, so there will be many paths in the kernel that rely on it. Changing its behavior will certainly expose bugs.
Michal posted a patch just before the summit demonstrating the approach to the problem that he is proposing. That patch adds a new sysctl knob that controls how many times the allocator should retry a failed attempt before returning a failure status; setting it to zero disables retries entirely, while a setting of -1 retains the current behavior. There is a three-stage plan for the use of this knob. In the first stage, the default setting would be for indefinite retries, leaving the kernel's behavior unchanged. Developers and other brave people, though, would be encouraged to set the value lower. The hope is to find and fix the worst of the resulting bugs in this stage.
In the second stage, an attempt would be made to get distributors to change the default value. In the third and final stage, the default would be changed in the upstream kernel itself. Even in this stage, where, in theory, the bugs have been found, the knob would remain in place so that especially conservative users could keep the old behavior.
Michal opened up the discussion by asking if the assembled developers thought this was the right approach. Rik van Riel said that most kernel code can handle allocation failure just fine, but a lot of those allocations happen in system calls. In such cases, the failures will be passed back to user space; that is likely to break applications that have never seen certain system calls fail in this way before.
Ted Ts'o added that the kernel would most likely be stuck in the first stage for a very long time. As soon as distributions start changing the allocator's behavior, their phones will start ringing off the hook. In the ext4 filesystem, he has always been nervous about passing out-of-memory failures back to user space because of the potential for application problems. If the system call interface does that instead it won't be his fault, he said, but things will still break.
Peter Zijlstra observed that ENOMEM is a valid return from a system call. Ted agreed, but said that, after all these years, applications will break anyway, and then users will be knocking at his door. He went on to say that in large data-center settings (Google, for example) where the same people control both kernel and user space it should be possible to find and fix the resulting bugs. But just fixing the bugs in open-source programs is going to be a long process. In the end, he said, such a change is going to have to provide a noticeable benefit to users — a much more robust kernel, say — or we will be torturing them for no reason.
Andrew Morton protested that the code we have now seems to work almost all of the time. Given that the reported issues are quite rare, he asked, what problem are we actually trying to solve? Andrea Arcangeli noted that he'd observed lockups and that the OOM killer's relatively unpredictable behavior does not help. He tried turning off the looping in the memory allocator and got errors out of the ext4 filesystem instead. It was a generally unpleasant situation.
Andrew suggested that making the OOM killer work better might be a better place to focus energy, but Dave Chinner disagreed, saying that it was an attempt to solve the wrong problem. Rather than fix the OOM killer, it would be better to not use it at all. We should, he said, take a step back and ask how we got into the OOM situation in the first place. The problem is that the system has been overcommitted. Michal said that overcommitting of resources was just the reality of modern systems, but Dave insisted that we need to look more closely at how we manage our resources.
Andrew returned to the question of improving the OOM killer. Perhaps, he said, it could be made to understand lock dependencies and avoid potential deadlock situations. Rik suggested that was easier said than done, though; for example, an OOM-killed process may need to acquire new locks in order to exit. There will be no way for the OOM killer to know what those locks might be prior to choosing a victim to kill. Andrew acknowledged the difficulties but insisted that not enough time has gone into making the OOM killer work better. Ted said that OOM killer improvements were needed regardless of any other changes; since the allocator's default behavior cannot be changed for years, we will be stuck with the OOM killer for some time.
Michal was nervous about the prospect of messing with the OOM killer. We don't, he said, want to go back to the bad old days when its behavior was far more random than it is now. Dave said, though, that it is not possible to have a truly deterministic OOM killer if the allocation layers above it are not deterministic. It will behave differently every time it is tested. Until things are solidified in the allocator, the OOM killer is, he said, not the place to put effort.
The session wound down with Michal saying that starting to test kernels that fail small allocations will be helpful even if the distributors do not change the default for a long time. Dave said that he would turn off looping in the xfstests suite by default. There was some talk about the best values to use, but it seems it matters little as long as the indefinite looping is turned off. Expect to see a number of interesting bugs once this testing begins.
[Your editor would like to thank LWN subscribers for funding his travel to LSFMM 2015.]
Improving huge page handling
The "huge page" feature found in most contemporary processors enables access to memory with less stress on the translation lookaside buffer (TLB) and, thus, better performance. Linux has supported the use of huge pages for some years through both the hugetlbfs and transparent huge pages features, but, as was seen in the two sessions held during the memory-management track at LSFMM 2015, there is still considerable room for improvement in how this support is implemented.
Kirill Shutemov started off by describing his proposed changes to how
reference counting for transparent huge pages is handled. This patch set
was described in detail in this article
last November and has not changed significantly since. The key part of the
patch is that it allows a huge page to be simultaneously mapped in the PMD
(huge page) and PTE (regular page) modes. It is, as he acknowledged, a
large patch set, and there are still some bugs, so it is not entirely
surprising that this work has not been merged yet.
One remaining question has to do with partial unmapping of huge pages. When a process unmaps a portion of a huge page, the expected behavior is to split that page up and return the individual pages corresponding to the freed region back to the system. It is also possible, though, to split up the mapping while maintaining the underlying memory as a huge page. That keeps the huge page together and allows it to be quickly remapped if the process decides to do so. But that also means that no memory will actually be freed, so it is necessary to add the huge page to a special list where it can be truly split up should the system experience memory pressure.
Deferred splitting also helps the system to avoid another problem: currently there is a lot of useless splitting of huge pages when a process exits. There was some talk of trying to change munmap() behavior at exit time, but it is not as easy as it seems, especially since the exiting process may not hold the only reference to any given huge page.
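The situation being described arises from ordinary user-space code; a minimal (and hypothetical) example might look like the sketch below, which maps an anonymous region, asks for a huge page, then unmaps a single 4KB piece of it:

```c
/* Sketch: create a (hopefully) transparent-huge-page-backed region, then
 * unmap a single 4KB page from the middle of it.  Whether the kernel
 * splits the huge page immediately or defers that work is exactly the
 * question discussed above. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)	/* 2MB, the x86-64 huge page size */
#define PAGE_SIZE	4096UL

int main(void)
{
	/* Over-allocate so a 2MB-aligned address can be found inside. */
	size_t len = 2 * HPAGE_SIZE;
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return 1;

	char *huge = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
	madvise(huge, HPAGE_SIZE, MADV_HUGEPAGE);	/* hint: use a huge page */

	huge[0] = 1;		/* fault in (ideally) one 2MB page */

	/* Give one 4KB page in the middle back to the kernel. */
	munmap(huge + 16 * PAGE_SIZE, PAGE_SIZE);
	return 0;
}
```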
Hugh Dickins, the co-leader of the session, pointed out that there is one
complication with
regard to
Kirill's patch set: he is not the only one who is working with simultaneous
PMD and PTE mappings of huge pages. Hugh recently posted a patch set of his own adding transparent huge page
support to the tmpfs filesystem. This work contains a number of the
elements needed for full support
for huge pages in the page cache (which is also an eventual goal of
Kirill's patches). But Hugh's approach is rather different, leading to
some concern in the user community; in the end, only one of these patch
sets is likely to be merged.
Hugh's first goal is to provide a more flexible alternative for users of the hugetlbfs filesystem. But his patches diverge from the current transparent huge page implementation (and Kirill's patches) in a significant way: they completely avoid the use of "compound pages," the mechanism used to bind individual pages into a huge page. Compound pages, he said, were a mistake to use with transparent huge pages; they are too inflexible for that use case. Peter Zijlstra suggested that, if this is really the case, Hugh should look at moving transparent huge pages away from compound pages; Hugh expressed interest but noted that available time was in short supply.
Andrea Arcangeli (the original author of the transparent huge pages feature) asked Hugh to explain the problems with compound pages. Hugh responded that the management of page flags is getting increasingly complicated when huge pages are mapped in the PTE mode. So he decided to do everything in tmpfs with ordinary 4KB pages. Kirill noted that this approach makes tmpfs more complex, but Hugh thought that was an appropriate place for the complexity to be.
When it comes to bringing huge page support to the page cache, though, it's not clear where the complexity should be. Hugh dryly noted that filesystem developers already have enough trouble with the memory-management subsystem without having to deal with more complex interfaces for huge page support. He was seemingly under the impression that there is not a lot of demand for this support from the filesystem side. Btrfs developer Chris Mason said, though, that he would love to find ways to reduce overhead on huge-memory systems, and that huge pages would help. Matthew Wilcox added that there are users even asking for filesystem support with extra-huge (1GB) pages.
Rik van Riel jumped in to ask if there were any specific questions that needed to be answered in this session. Hugh returned to the question of whether filesystems need huge page support and, if so, what form it should take, but not much discussion of that point ensued. There was some talk of Hugh's tmpfs work; he noted that one of the hardest parts was support for the mlock() system call. There is a lot of tricky locking involved; he was proud to have gotten it working.
In a brief return to huge page support in the page cache, it was noted that Kirill's reference-counting work can simplify things considerably; Andrea said it was attractive in many ways.
There was some talk of what to do when an application calls madvise() on a portion of a huge page with the MADV_DONTNEED command. It would be nice to recover the memory, but that involves an expensive split of the page. Failure to do so can create problems; they have been noted in particular with the jemalloc implementation of malloc(). See this page for a description of these issues.
Even if a page is split when madvise(MADV_DONTNEED) is called on a portion of it, there is a concern that the kernel might come around and "collapse" it back into a huge page. But Andrea said this should not be a problem; the kernel will only collapse memory into huge pages if the memory around those pages is in use. But, in any case, he said, user space should be taught to use 2MB pages whenever possible. Trying to optimize for 4KB pages on current systems is just not worth it and can, as in the jemalloc case, create problems of its own.
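The allocator pattern that causes trouble looks roughly like the hedged sketch below: fault in a (hopefully) huge page, then tell the kernel that one 4KB piece of it is no longer needed:

```c
/* Sketch of the allocator pattern discussed above: fault in a huge page,
 * then MADV_DONTNEED a 4KB piece of it.  The kernel must either split the
 * huge page (expensive) or leave it intact (no memory is freed). */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)
#define PAGE_SIZE	4096UL

int main(void)
{
	char *raw = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return 1;

	/* Align to a 2MB boundary so a huge page can be used at all. */
	char *p = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));
	madvise(p, HPAGE_SIZE, MADV_HUGEPAGE);
	memset(p, 0, HPAGE_SIZE);	/* fault the whole huge page in */

	/* "I'm done with this 4KB" -- the problematic partial discard. */
	madvise(p + 4 * PAGE_SIZE, PAGE_SIZE, MADV_DONTNEED);
	return 0;
}
```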
The developers closed out this session by agreeing to look more closely at both approaches. There is a lot of support for the principles behind Kirill's work; Hugh, for his part, complained that he had not gotten any feedback on his patch set yet. While the patches are under review, Kirill will look into extending his work to the tmpfs filesystem, and Hugh will push toward support for anonymous transparent huge pages.
Compaction
The topic of huge pages returned on the second day, when Vlastimil Babka ran a session focused primarily on the costs of compaction. The memory compaction code moves pages around to create large, physically contiguous regions of free memory. These regions can be used to support large allocations in general, but they are especially useful for the creation of huge pages.
The problem comes in when a process incurs a page fault, and the kernel
attempts to resolve it by allocating a huge page. That task can involve
running compaction which, since it takes a while, can create significant
latencies for the faulting process. The cost can, in fact, outweigh the
performance benefits of using huge pages in the first place. There are
ways of mitigating this cost, but, Vlastimil wondered, might it be better
to avoid allocating huge pages in response to faults in the first place?
After all, it is not really known whether the process needs the entire huge page
or not; it's possible that much of that memory might be wasted. It seems
that this happens, once again, with the jemalloc library.
Since it is not possible to predict the benefit of supplying huge pages at fault time, Vlastimil said, it might be better to do a lot less of that. Instead, transparent huge pages should mostly be created in the khugepaged daemon, which can look at memory utilization and collapse pages in the background. Doing so requires redesigning khugepaged, which was mainly meant to be a last resort filling in huge pages when other methods fail. It scans slowly, and can't really tell if a process will benefit from huge pages; in particular, it does not know if the process will spend a lot of time running. It could be that the process mostly lurks waiting for outside events, or it may be about to exit.
His approach is to improve khugepaged by moving the scanning work that looks for huge page opportunities into process context. At certain times, such as on return from a system call, each process would scan a bit of its memory and, perhaps, collapse some pages into huge pages. It would tune itself automatically based partially on success rate, but also simply based on the fact that a process that runs more often will do more scanning. Since there is no daemon involved, there are no extra wakeups; if a system is wholly idle, there will be no page scanning done.
Andrea protested, though, that collapsing pages in khugepaged is far more expensive than allocating huge pages at fault time. To collapse a page, the kernel must migrate (copy) all of the individual small pages over to the new huge page that will contain them; that takes a while. If the huge page is allocated at page fault time, this work is not needed; the entire huge page can be faulted in at once. There might be a place for process-context scanning to create huge pages before they are needed, but it would be better, he said, to avoid collapsing pages whenever possible.
Vlastimil suggested allocating huge pages at fault time but only mapping the specific 4KB page that faulted; the kernel could then observe utilization and collapse the page in-place if warranted. But Andrea said that would needlessly deprive processes of the performance benefits that come from the use of huge pages. If we're going to support this feature in the kernel, we should use it fully.
Andi Kleen said that running memory compaction in process context is a bad idea; it takes away opportunities for parallelism. Compaction scanning should be done in a daemon process so that it can run on a separate core; to do otherwise would be to create excessive latency for the affected processes. Andrea, too, said that serializing scanning with execution was the wrong approach; he suggested putting that work into a workqueue instead. But Mel Gorman said he would rather see the work done in process context so that it can be tied to the process's activity.
At about this point the conversation wound down without having come to any firm conclusions. In the end, this is the sort of issue that is resolved over time with working code.
User-space page fault handling
Andrea Arcangeli's userfaultfd() patch set has been in development for a couple of years now; it has the look of one of those large memory-management changes that takes forever to find its way into the mainline. The good news in this case was announced at the beginning of this session in the memory-management track of the 2015 Linux Storage, Filesystem, and Memory Management Summit: there is now the beginning of an agreement with Linus that the patches are in reasonable shape. So we may see this code merged relatively soon.
The userfaultfd() patch set, in short, allows for the handling of
page faults in user space. This seemingly crazy feature was originally
designed for the migration of virtual machines running under KVM. The
running guest can move to a new host while leaving its memory behind,
speeding the migration. When that guest starts faulting in the missing
pages, the user-space mechanism can pull them across the net and store them
in the guest's address space. The result is quick migration without the
need to put any sort of page-migration protocol into the kernel.
Andrea was asked whether the kernel, rather than implementing the file-descriptor-based notification mechanism, could just use SIGBUS signals to indicate an access to a missing page. That will not work in this case, though. It would require massively increasing the number of virtual memory areas (VMAs) maintained in the kernel for the process, could cause system calls to fail, and doesn't handle the case of in-kernel page faults resulting from get_user_pages() calls. What's really needed is for a page fault to simply block the faulting process while a separate user-space process (the "monitor") is notified to deal with the issue.
Pavel Emelyanov stood up to talk about his use case for this feature, which is the live migration of containers using the checkpoint-restore in user space (CRIU) mechanism. While the KVM-based use case involves having the monitor running as a separate thread in the same process, the CRIU case requires that the monitor be running in a different process entirely. This can be managed by sending the file descriptor obtained from userfaultfd() over a socket to the monitor process.
There are, Pavel said, a few issues that come up when userfaultfd() is used in this mode. The user-space fault handling doesn't follow a fork() (it remains attached to the parent process only), so faults in the child process will just be resolved with zero-filled pages. If the target process moves a VMA in its virtual address space with mremap(), the monitor will see the new virtual addresses and be confused by them. And, after a fork, existing memory goes into the copy-on-write mode, making it impossible to populate pages in both processes. The conversation did not really get into possible solutions for these problems, though.
Andrea talked a bit about the userfaultfd() API, which has evolved in the past months. There is now a set of ioctl() calls for performing the requisite operations. The UFFDIO_REGISTER call is used to tell the kernel about a range of virtual addresses for which faults will be handled in user space. Currently the system only deals with page-not-present faults. There are plans, though, to deal with write-protect faults as well. That would enable the tracking of dirtied pages which, in turn, would allow live snapshotting of processes or the active migration of pages back to a "memory node" elsewhere on the network.
With regard to the potential live-snapshotting feature, most of the needed mechanism is already there. There is one little problem in that, should the target modify a page that is currently resident on the swap device, the resulting swap-in fault will make the page writable. So userfaultfd() will miss the write operation and the page will not be copied. Some changes to the swap code will be needed to add a write-protect bit to swap entries before this feature will work properly.
Earlier versions of the patch introduced a remap_anon_pages() system call that would be used to slot new pages into the target process's address space. In the current version, that operation has been turned into another ioctl() operation. Actually, there is more than one; there are now options to either copy a page into the target process or to remap the page directly. Zero-copy operation has a certain naive appeal, but it turns out that the associated translation lookaside buffer (TLB) flush is more expensive than simply copying the data. So the remap option is of limited use and unlikely to make it upstream.
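Putting the pieces described above together, the monitor side might look something like the following sketch. It follows the proposed interface (the UFFDIO_* names and structure layouts here are taken from the posted patches' <linux/userfaultfd.h> and could still change before merging):

```c
/* Rough sketch of a user-space fault monitor.  "area" is the region
 * registered for user-space fault handling; "page" holds the data used
 * to resolve each fault.  A real monitor would run this loop in a
 * separate thread or process. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int monitor(void *area, unsigned long len,
		   void *page, unsigned long page_size)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	if (uffd < 0)
		return -1;

	/* Handshake on the API version. */
	struct uffdio_api api = { .api = UFFD_API };
	if (ioctl(uffd, UFFDIO_API, &api) < 0)
		return -1;

	/* Ask for user-space handling of not-present faults in the region. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;

	for (;;) {
		struct uffd_msg msg;

		/* The faulting thread sleeps until its fault is resolved. */
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			return -1;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the fault by copying a page into place; this
		 * also wakes the faulting thread. */
		struct uffdio_copy copy = {
			.dst  = msg.arg.pagefault.address & ~(page_size - 1),
			.src  = (unsigned long)page,
			.len  = page_size,
			.mode = 0,
		};
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
}
```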
Andrew Lutomirski worried that this feature was adding "weird semantics" to memory management. Might it be better, he said, to set up userfaultfd() as a sort of device that could then be mapped into memory with mmap()? That would isolate the special-case code and not change how "normal memory" behaves. The problem is that doing things this way would cause the affected memory range to lose access to many other useful memory-management features, including swapping, transparent huge pages, and more. It would, Pavel said, put "weird VMAs" into a process that really just "wants to live its own life" after migration.
As the discussion headed toward a close, Andrea suggested that userfaultfd() could perhaps be used to implement the long-requested "volatile ranges" feature. First, though, there is a need to finalize the API for this feature and get it merged; it is currently blocking the addition of the post-copy migration feature to KVM.
Fixing the contiguous memory allocator
Normally, kernel code goes far out of its way to avoid the need to allocate large, physically contiguous regions of memory, for a simple reason: the memory fragmentation that results as the system runs can make such regions hard to find. But some hardware requires these regions to operate properly; low-end camera devices are a common example. The kernel's contiguous memory allocator (CMA) exists to meet this need, but, as two sessions dedicated to CMA during the 2015 Linux Storage, Filesystem, and Memory Management Summit showed, there are a number of problems still to be worked out.
CMA works by reserving a zone of memory for large allocations. But the device needing large buffers is probably not active at all times; keeping that memory idle when the device does not need it would be wasteful. So the memory-management code will allow other parts of the kernel to allocate memory from the CMA zone, but only if those allocations are marked as being movable. That allows the kernel to move things out of the way should the need for a large allocation arise.
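None of this machinery is visible to the driver that ultimately needs the buffer; it simply asks for a large, physically contiguous allocation and, on a kernel configured with CMA, that request is satisfied from the reserved area. A minimal sketch (hypothetical driver code, not taken from any driver discussed in the session):

```c
#include <linux/dma-mapping.h>

#define CAMERA_BUF_SIZE	(16 << 20)	/* 16MB contiguous capture buffer */

/* Hypothetical example: a camera driver grabbing its frame buffer.  On a
 * kernel built with CMA, large dma_alloc_coherent() requests like this
 * one are carved out of the reserved (but otherwise reusable) CMA area. */
static void *camera_alloc_framebuf(struct device *dev, dma_addr_t *dma)
{
	return dma_alloc_coherent(dev, CAMERA_BUF_SIZE, dma, GFP_KERNEL);
}

static void camera_free_framebuf(struct device *dev, void *buf, dma_addr_t dma)
{
	dma_free_coherent(dev, CAMERA_BUF_SIZE, buf, dma);
}
```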
Laura Abbott started off the session by noting that there are a number of
problems with CMA, relating to both the reliability of large allocations
and the performance of the system as a whole. There are a couple of
proposals out there to fix it — guaranteed
CMA by SeongJae Park and ZONE_CMA from
Joonsoo Kim — but no consensus on how to proceed. Joonsoo helped to lead
the session, as did Gioh Kim.
Peter Zijlstra asked for some details on what the specific problems are. A big one appears to be the presence of pinned pages in the CMA region. All it takes is one unmovable page to prevent the allocation of a large buffer, which is why pinned pages are not supposed to exist in the CMA area. It turns out that pages are sometimes allocated as movable, but then get pinned afterward. Many of these pins are relatively short-lived, but sometimes they can stay around for quite a while. Even relatively short-lived pins can be a problem, though; delaying the startup of a device like a camera can appear as an outright failure to the user.
One particular offender, according to Gioh, appears to be the ext4 filesystem which, among other things, is putting superblocks (which are pinned for as long as the associated filesystem is mounted) in movable memory. Other code is doing similar things, though. The solution in these cases is relatively straightforward: find the erroneous code and fix it. The complication here, according to Hugh Dickins, is that a filesystem may not know that a page will need to be pinned at the time it is allocated.
Mel Gorman suggested that, whenever a page changes state in a way that could
block a CMA allocation, it should be migrated immediately. Even something
as transient as pinning a dirty page for writeback could result in that
page being shifted out of the CMA area. It would be relatively simple to
put hooks into the memory-management code to do the necessary migrations.
The various implementations of get_user_pages() would be one
example; the page fault handler when a page is first dirtied would be
another. A warning could be added when get_page() is called to pin a
page in the CMA area to call attention to other problematic uses.
This approach, it was thought, could help to avoid the need for more
complex solutions within CMA itself.
Of course, that sort of change could lead to lots of warning noise for cases when pages are pinned for extremely short periods of time. Peter suggested adding a variant of get_page() to annotate those cases. Dave Hansen suggested, instead, that put_page() could be instrumented to look at how long the page was pinned and issue warnings for excessive cases.
The second class of problems has to do with insufficient utilization of the CMA area when the large buffers are not needed. Mel initially answered that CMA was simply meant to work that way and that it would not be possible to relax the constraints on the use of the CMA area without breaking it. It eventually became clear that the situation is a bit more subtle than that, but that had to wait until the second session on the following day.
It took a while to get to the heart of the problem on the second day, but Joonsoo finally described it as something like the following. The memory-management code tries to avoid allocations from the CMA area entirely whenever possible. As soon as the non-CMA part of memory starts to fill, though, it becomes necessary to allocate movable pages from the CMA area. But, at that point, memory looks tight, so kswapd starts running and reclaiming memory. The newly reclaimed memory, probably being outside of the CMA area, will be preferentially used for new allocations. The end result is that memory in the CMA area goes mostly unused, even when the system is under memory pressure.
Gioh talked about his use case, in which Linux is embedded in televisions.
There is a limited amount of memory in a TV; some of it must be reserved
for the processing of 3D or high-resolution streams. When that is not
being done, though, it is important to be able to utilize that memory for
other purposes. But the kernel is not making much use of that memory when
it is available; this is just the problem described by Joonsoo.
Joonsoo's solution involves adding a new zone (ZONE_CMA) to the memory-management subsystem. Moving the CMA area into a separate zone makes it relatively easy to adjust the policies for allocation from that area without, crucially, adding more hooks to the allocator's fast paths. But, as Mel said, there are disadvantages to this approach. Adding a zone will change how page aging is done, making it slower and more cache-intensive since there will be more lists to search. These costs will be paid only on systems where CMA is enabled so, he said, it is ultimately a CMA issue, but people should be aware that those costs will exist. That is the reason that a separate zone was not used for CMA from the beginning.
Dave suggested combining ZONE_CMA with ZONE_MOVABLE, which is also meant for allocations that can be relocated on demand. The problem there, according to Joonsoo, is that memory in ZONE_MOVABLE can be taken offline, while memory for CMA should not be unpluggable in that way. Putting CMA memory into its own zone also makes it easier to control allocation policies and to create statistics on the utilization of CMA memory.
The session ended with Mel noting that there did not appear to be any formal objections to the ZONE_CMA plan. But, he warned, the CMA developers, by going down that path, will be trading one set of problems for another. Since the tradeoff only affects CMA users, it will be up to them to decide whether it is worthwhile.