NUMA in a hurry
The kernel's behavior on non-uniform memory access (NUMA) systems is, by most accounts, suboptimal; processes tend to get separated from their memory, leading to lots of cross-node traffic and poor performance. Until now, the work to improve this situation has been a story of two competing patch sets; it recently appeared that one of them may be set to be merged as the result of decisions made outside of the community's view. But nothing in memory management is ever simple, so it should be unsurprising that the NUMA scheduling discussion has become more complicated.
On November 6, memory management hacker Mel Gorman, who had not contributed code of his own toward NUMA scheduling so far, posted a new patch series called "Foundation for automatic NUMA balancing," or "balancenuma" for short. He pointed out that there were objections to both of the existing approaches to NUMA scheduling and that it was proving hard to merge the best from each. So his objective was to add enough infrastructure to the memory management subsystem to make it easy to experiment with different NUMA placement policies. He also implemented a placeholder policy of his own:
In short, the MORON policy works by instantly migrating pages whenever a cross-node reference is detected using the NUMA hinting mechanism. Mel's second version, posted one week later, fixes a number of problems, adds the "home node" concept (which tries to keep processes and their memory on a single "home" NUMA node), and adds some statistics gathering to implement a "CPU follows memory" policy that can move a process to a new home node if it appears that better memory locality would result.
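The migrate-on-reference idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Mel's code: the `Task` and `ToyNumaMemory` classes and the `hinting_fault()` method are hypothetical names invented for illustration, and a dictionary lookup stands in for the kernel's NUMA hinting faults.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a process; not a kernel structure."""
    name: str
    node: int  # NUMA node the task is currently running on


class ToyNumaMemory:
    """Toy model that tracks which node each page lives on."""

    def __init__(self, page_nodes):
        self.page_nodes = dict(page_nodes)  # page id -> node id
        self.migrations = 0

    def hinting_fault(self, task, page):
        """Model one hinting fault; return True if the page was migrated."""
        if self.page_nodes[page] == task.node:
            return False  # local reference, nothing to do
        # Migrate on reference: no thresholds and no history, just move
        # the page to the node of whichever task touched it last.
        self.page_nodes[page] = task.node
        self.migrations += 1
        return True
```

If two tasks on different nodes take turns touching the same page in this model, the page is migrated on every access after the first. That kind of ping-ponging is one reason to treat such a policy as a placeholder; ideas like a home node or access statistics are ways of damping it.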
Andrea Arcangeli, author of the AutoNUMA approach, said that balancenuma "looks OK" and that AutoNUMA could be built on top of it. Ingo Molnar, instead, was less accepting, saying "I've picked up a number of cleanups from your series and propagated them into tip:numa/core tree." He later added a request that Mel rebase his work on top of the numa/core tree. He clearly did not see the patch set as a "foundation" on which to build. A new numa/core patch set was posted on November 13.
Peter Zijlstra, meanwhile, has posted an "enhanced NUMA scheduling with adaptive affinity" patch set. This one does away with the "home node" concept altogether; instead, it looks at memory access patterns to determine where a process's memory lives and who that memory might be shared with. Based on that information, the CPU affinity mechanism is used to move processes to the appropriate nodes. Peter says:
This patch set has not gotten a lot of review comments, and it does not appear to have been folded into the numa/core series as of this writing.
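As a rough illustration of placement driven by access patterns, the following Python sketch groups tasks that touch the same pages and assigns each group to the CPUs of the node holding most of its accessed memory. It is an assumption-laden toy, not Peter's algorithm: `place_tasks()`, its sampled `(task, page)` input, and the `node_cpus` map are all hypothetical, and the real patch set works inside the scheduler rather than on sample lists.

```python
from collections import Counter, defaultdict


def place_tasks(accesses, page_nodes, node_cpus):
    """Choose a CPU set per task from observed page accesses (toy model).

    accesses:   iterable of (task, page) samples
    page_nodes: page id -> node currently holding the page
    node_cpus:  node id -> set of CPU ids on that node
    """
    per_task = defaultdict(Counter)  # task -> access count per node
    users = defaultdict(set)         # page -> tasks that touched it
    for task, page in accesses:
        per_task[task][page_nodes[page]] += 1
        users[page].add(task)

    # Union-find: tasks that share any page end up in the same group.
    parent = {task: task for task in per_task}

    def find(task):
        while parent[task] != task:
            task = parent[task]
        return task

    for tasks in users.values():
        first, *rest = sorted(tasks)
        for other in rest:
            parent[find(other)] = find(first)

    # Sum the per-node access counts over each sharing group.
    totals = defaultdict(Counter)
    for task, counts in per_task.items():
        totals[find(task)].update(counts)

    placement = {}
    for task in per_task:
        counts = totals[find(task)]
        # Most-accessed node wins; ties go to the lowest node id.
        node = min(counts, key=lambda n: (-counts[n], n))
        placement[task] = set(node_cpus[node])
    return placement
```

The returned CPU sets are the sort of thing a user-space experiment could hand to an affinity call such as Python's Linux-only `os.sched_setaffinity()`; whether that resembles what the kernel patches do internally is not something this sketch claims.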
What will happen in 3.8?
The numa/core approach remains in linux-next, which is meant to be the final stage for code that is headed for merging. And, indeed, Ingo has reiterated that he plans to merge this code for the 3.8 cycle, saying "numa/core sums up the consensus so far". The use of that language might rightly raise some eyebrows; when there are between two and four competing patch sets (depending on how one counts) aimed at the same problem, the term "consensus" does not usually come to mind. It seems, in fact, that this consensus does not yet exist.
Andrew Morton has been overtly grumpy; the existence of numa/core in linux-next has made the management of his tree (which is based on linux-next) difficult — his tree needs to be ready for the 3.8 merge window where, he thinks, numa/core should not be under consideration:
Hugh Dickins, a developer who is not normally associated with this sort of discussion, chimed in as well:
If we had wanted to push in a good solution a little prematurely, we would surely have chosen Andrea's AutoNUMA months ago, despite efforts to block it; and maybe we shall still want to go that way.
Please, forget about v3.8, cut this branch out of linux-next, and seek consensus around getting it right for v3.9.
Rik van Riel agreed, saying "Having unreviewed (some of it NAKed) code sitting in tip.git and you trying to force it upstream is not the right way to go." He also suggested that, if anything should be considered for merging in 3.8, it would be Mel's foundation patches.
And that is where the discussion stands as of this writing. There is a lot of uncertainty about what might happen with NUMA scheduling in 3.8, meaning that, most likely, nothing will happen at all. It is highly unlikely that Linus would merge the numa/core set in the face of the above complaints; he would be far more likely to sit back and tell the developers involved to work out something they can all agree with. So this is a discussion that might go on for a while yet.
Making changes to the memory management subsystem is a famously hard thing to do, especially when the changes are as large as those being considered here. But there is another factor that is complicating this particular situation. As the term "NUMA scheduling" suggests, this is not just a memory management problem. The path to improved NUMA performance will require coordinated changes to — and greater integration between — the memory management subsystem and the CPU scheduler. It's telling that the developers on one side of this divide are primarily associated with scheduler development, while those on the other side are mostly memory management folks. Each camp is, in a sense, invading the other's turf in an attempt to create a comprehensive solution to the problem; it is not surprising that some disagreements have emerged.
Also implicit in this situation is that Linus is unlikely to attempt to resolve the disagreement by decree. There are too many developers and too many interrelated core subsystems involved. So some sort of rough consensus will have to be found. Your editor's explicitly unreliable prediction is that little NUMA-related work will be merged in the 3.8 development cycle. Under pressure from several directions, the developers involved will figure out how to resolve their biggest differences in the next few months. The resulting code will likely be at least partially merged for 3.9; that is later than many would wish, but the end result is likely to be better than would be seen with a patch set rushed into 3.8.
Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license