A CLOCK-Pro page replacement implementation
The reigning algorithm used in most systems is a variant of the least-recently-used (LRU) scheme. If a page has not been used in a long time, the reasoning goes, it probably will not be needed again in the near future; pages which have not been used for a while are thus candidates for eviction from main memory. In practice, tracking the usage of every page would impose an unacceptable amount of overhead, and is not done. Instead, the VM subsystem scans sequentially through the "active list" of pages in use, marking them as "inactive." Pages on the inactive list are candidates for eviction. Some of those pages will certainly be needed soon, however, with the result that they will be referenced before that eviction takes place. When this happens, the affected pages are put back on the active list at the "recently used" end. As long as pages stay in the inactive list for a reasonable time before eviction, this algorithm approximates a true LRU scheme.
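As a rough illustration of the two-list scheme (this is not the kernel's code; it ignores ordering within each list, and all names here are invented), consider:

```c
#include <stdio.h>

#define NPAGES 8

/* 1 = on the active list, 0 = on the inactive list. */
static int on_active[NPAGES];
static int referenced[NPAGES];

/* One scanning pass: unreferenced active pages are demoted to the
 * inactive list; referenced inactive pages are rescued back. */
static void scan(void)
{
	for (int i = 0; i < NPAGES; i++) {
		if (on_active[i] && !referenced[i])
			on_active[i] = 0;	/* now an eviction candidate */
		else if (!on_active[i] && referenced[i])
			on_active[i] = 1;	/* rescued before eviction */
		referenced[i] = 0;		/* clear for the next pass */
	}
}

int main(void)
{
	on_active[0] = on_active[1] = 1;
	referenced[0] = 1;		/* page 0 is in active use */
	scan();
	printf("page 0 active: %d, page 1 active: %d\n",
	       on_active[0], on_active[1]);	/* prints 1, 0 */
	return 0;
}
```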
This mechanism tends to fall apart with certain types of workloads, however. Actions like initializing a huge array, reading a large file (for streaming media playback, for example), starting OpenOffice, or walking through a large part of the filesystem can fill main memory with pages which are unlikely to be used again anytime soon - at the expense of the pages the system actually needs. Pages from files start in the inactive list and may, at least, be shoved back out relatively quickly, but anonymous memory pages go straight to the active list. Many Linux users are familiar with the occasional sluggish response which can come after the active list has been flushed in this way; with some workloads, this behavior can be a constant thing, and the system will consistently perform poorly.
Rik van Riel has recently posted a set of patches aimed at improving the performance of the VM subsystem under contemporary loads. The algorithm implemented is based on CLOCK-Pro, developed by Song Jiang, Feng Chen, and Xiaodong Zhang. CLOCK-Pro attempts to move beyond the LRU approach by tracking how often pages are accessed and tweaking the behavior of the VM code to match. At its core, CLOCK-Pro tries to ensure that pages in the inactive list are referenced less frequently than those on the active list. It thus differs from LRU schemes, which prioritize the most recently accessed pages even if those particular pages are almost never used by the application. Consider, as an example, the diagram to the right showing access patterns for two pages. At the time t1 marked by the red line, an LRU algorithm would prefer page 2 over page 1, even though the latter is more likely to be used again in the near future.
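For a concrete (and entirely hypothetical) trace illustrating the difference; these numbers are not taken from the article's diagram:

```c
#include <stdio.h>

int main(void)
{
	/* Invented access times (in ticks) for two pages. */
	int page1[] = { 0, 4, 8 };	/* used every four ticks */
	int page2[] = { 9 };		/* used once */
	int t1 = 10;

	/* LRU compares only the most recent accesses (8 vs. 9)... */
	printf("at t1=%d, LRU keeps page %d\n", t1,
	       page1[2] > page2[0] ? 1 : 2);	/* keeps page 2 */

	/* ...but page 1's inter-reference distance of four ticks says it
	 * will be needed again (at t=12), while page 2 may never be. */
	printf("page 1 reuse distance: %d ticks\n", page1[2] - page1[1]);
	return 0;
}
```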
Implementing CLOCK-Pro requires that the kernel keep track of pages which have recently been evicted from main memory. To this end, Rik's patches create a new data structure which tries to perform this tracking without adding much extra overhead. There is a new kernel function:
int do_remember_page(struct address_space *mapping, unsigned long index);
The VM code will, when moving a page out of main memory, first call do_remember_page() with the relevant information. This function maintains a data structure which looks a little like the following:
![Cheezy nonresident page diagram](https://static.lwn.net/images/ns/kernel/nonresident_page.png)
When a page is to be remembered, a hash value is generated from the mapping and index parameters; this value will be used as an index into the nonres_table array. Each hash bucket contains a fixed number of entries for nonresident pages. do_remember_page() treats the hash bucket like a circular buffer; it will use the hand index to store a cookie representing the page (a separate hash, essentially) in the next available slot, possibly overwriting information which was there before. The size of the entire data structure is chosen so that it can remember approximately as many evicted pages as there are pages of real memory in the system. The cost of the structure is one 32-bit word for each remembered page.
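A minimal sketch of how such a structure might look; the sizes, names, and hash functions below are illustrative guesses, not the patch's actual code:

```c
#include <stdint.h>

#define NR_BUCKETS		1024	/* sized to remember ~one RAM's worth */
#define ENTRIES_PER_BUCKET	28	/* fixed number of slots per bucket */

struct nonres_bucket {
	uint32_t slot[ENTRIES_PER_BUCKET];	/* one 32-bit word per page */
	unsigned int hand;			/* next slot to overwrite */
};

static struct nonres_bucket nonres_table[NR_BUCKETS];

/* Toy hashes; the real code would use something more careful. */
static unsigned int bucket_hash(void *mapping, unsigned long index)
{
	return (unsigned int)(((unsigned long)mapping ^ index) * 2654435761u);
}

static uint32_t page_cookie(void *mapping, unsigned long index)
{
	return (uint32_t)(((unsigned long)mapping + index) * 0x9e3779b9u);
}

static void sketch_remember_page(void *mapping, unsigned long index)
{
	struct nonres_bucket *b =
		&nonres_table[bucket_hash(mapping, index) % NR_BUCKETS];

	/* Treat the bucket as a circular buffer: store the cookie at the
	 * hand, overwriting whatever was there, and advance the hand. */
	b->slot[b->hand] = page_cookie(mapping, index);
	b->hand = (b->hand + 1) % ENTRIES_PER_BUCKET;
}
```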
At some point in the future, the kernel will find itself faulting a page into memory. It can then see if it has seen that page before with a call to:
int recently_evicted(struct address_space *mapping, unsigned long index);
A non-negative return value indicates that the given page was found in the nonresident page cache, and had, indeed, been evicted not all that long ago. The return value is actually an estimate of the page's "distance" - a value which is taken by seeing how far the page's entry is from the current value of the hand index (in a circular buffer sense) and scaling it by the size of the array. In a rough sense, the distance is the number of pages which have been evicted since the page of interest was pushed out.
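Continuing the illustrative sketch above, the lookup might compute that distance like this (again, a guess at the shape of the logic, not the real implementation):

```c
/* Companion lookup for the sketch above: returns an estimated eviction
 * distance, or -1 if the page has not been seen (or was overwritten). */
static int sketch_recently_evicted(void *mapping, unsigned long index)
{
	struct nonres_bucket *b =
		&nonres_table[bucket_hash(mapping, index) % NR_BUCKETS];
	uint32_t cookie = page_cookie(mapping, index);
	unsigned int i;

	for (i = 0; i < ENTRIES_PER_BUCKET; i++) {
		if (b->slot[i] != cookie)
			continue;
		/* How far behind the hand this slot sits, in the circular
		 * sense; scaling by the number of buckets estimates how
		 * many pages were evicted system-wide since this one. */
		unsigned int behind = (b->hand + ENTRIES_PER_BUCKET - 1 - i)
			% ENTRIES_PER_BUCKET;
		return (int)(behind * NR_BUCKETS);
	}
	return -1;
}
```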
Whenever a page is faulted in, the kernel computes a distance for the oldest page in the active list; this distance is an estimate taken from how long ago the oldest page would have been scanned (at the current rate). This distance is compared to the distance of the newly-faulted page (which is scaled relative to the total number of recently evicted pages) to get a sense for whether this page (which had been evicted) has been accessed more frequently than the oldest in-memory page. If so, the kernel concludes that the wrong pages are in memory; in response, it will decrease the maximum desired size of the active list to make room for the more-frequently accessed pages which are languishing in secondary storage. The kernel will also, in this case, add the just-faulted page directly to the active list, on the theory that it will be useful for a while.
If, instead, pages being faulted in are more "distant" than in-core pages, the VM subsystem concludes that it is doing the right thing. In this situation, the size of the active list will be slowly increased (up to a maximum limit). More distant pages are faulted in to the inactive list, meaning that they are more likely to be evicted again in the near future.
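Putting the pieces together, the fault-path policy described in the last two paragraphs might look roughly like the following; every helper, constant, and variable here is hypothetical:

```c
/* Invented state standing in for the real VM bookkeeping. */
static long active_list_target = 1000;	/* desired active-list size */
#define ACTIVE_LIST_MAX 100000		/* made-up ceiling */

static long estimate_oldest_active_distance(void)
{
	/* The patch estimates how long ago the oldest active page would
	 * have been scanned at the current rate; a constant stands in. */
	return 50000;
}

/* Empty stand-ins for the real list-manipulation code. */
static void add_to_active_list(void *mapping, unsigned long index) { }
static void add_to_inactive_list(void *mapping, unsigned long index) { }

static void sketch_on_fault(void *mapping, unsigned long index)
{
	int distance = sketch_recently_evicted(mapping, index);

	if (distance >= 0 && distance < estimate_oldest_active_distance()) {
		/* The evicted page was hotter than the coldest in-memory
		 * page: shrink the active list and fast-track the page. */
		if (active_list_target > 0)
			active_list_target--;
		add_to_active_list(mapping, index);
	} else {
		/* The VM appears to be keeping the right pages; let the
		 * active list grow slowly, up to its limit. */
		if (active_list_target < ACTIVE_LIST_MAX)
			active_list_target++;
		add_to_inactive_list(mapping, index);
	}
}
```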
Your editor applied the patch to a vanilla 2.6.12 kernel and ran some highly scientific tests: a highly parallel kernel make while simultaneously running a large "grep -r" to read large amounts of file data into the page cache. The patched kernel adds a file (/proc/refaults) which summarizes the results from the nonresident page cache; after this experiment it looked like this:
| Refault distance | Hits |
|---|---|
| 0 - 4096 | 138 |
| 4096 - 8192 | 108 |
| 8192 - 12288 | 93 |
| 12288 - 16384 | 88 |
| 16384 - 20480 | 86 |
| 20480 - 24576 | 84 |
| 24576 - 28672 | 59 |
| 28672 - 32768 | 48 |
| 32768 - 36864 | 53 |
| 36864 - 40960 | 46 |
| 40960 - 45056 | 43 |
| 45056 - 49152 | 46 |
| 49152 - 53248 | 39 |
| 53248 - 57344 | 39 |
| 57344 - 61440 | 39 |
| New/Beyond 61440 | 11227 |
This histogram shows that the vast majority of pages brought into the system had never been seen before; they would be mainly the result of the large grep. A much smaller number of pages - a few hundred - had very small distances. If the patch is working right, those pages (being, one hopes, important things like the C compiler) would be fast-tracked into the active list while the large number of unknown pages would be hustled back out of main memory relatively quickly.
As it turns out, the patch doesn't work right quite yet. Much of the structure is in place, but the desired results are not yet being seen. These details will presumably be worked out before too long. Only at that point will it be possible to benchmark the new paging code and decide whether it truly performs better or not. One never knows ahead of time with virtual memory code; the proof, as they say, is in the paging.
[Thanks to Rik van Riel for his review of a previous draft of this article.]
| Index entries for this article | |
|---|---|
| Kernel | CLOCK-Pro |
| Kernel | Memory management/Page replacement algorithms |
Posted Aug 18, 2005 7:42 UTC (Thu) by lacostej (guest, #2760)

Once again a great article. Thanks.

Posted Aug 25, 2005 11:58 UTC (Thu) by forthy (guest, #1525)
I proposed something like this probably five years ago. Well, I only started a discussion and didn't write any code, and the discussion was turned down because the Linux kernel maintainer at the time thought the VM was "good enough" (I don't think so).
The main point of the scheme I proposed is a multi-level list of pages, not just the two levels we have now. Pages which have been accessed recently move up one level; pages which have not move down one level. That way, frequently used pages end up at the top, while unused pages quickly sink to the bottom, making them candidates for eviction. Page scanning should depend on page age, i.e. recently allocated pages should be scanned more often than old pages.
New pages, however, don't start at the top, but (maybe depending on how
they were allocated) somewhere in between.
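A minimal sketch of the multi-level scheme the comment describes, under invented names; this is not kernel code, just the idea made concrete:

```c
#define NR_LEVELS	8
#define START_LEVEL	3	/* new pages begin somewhere in between */

struct ml_page {
	int level;		/* 0 = next in line for eviction */
	int referenced;
};

/* One scan of a page: referenced pages are promoted a level, idle
 * pages sink a level toward eviction. */
static void ml_scan(struct ml_page *p)
{
	if (p->referenced && p->level < NR_LEVELS - 1)
		p->level++;	/* recently used: move up */
	else if (!p->referenced && p->level > 0)
		p->level--;	/* idle: sink toward the bottom */
	p->referenced = 0;
}
```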
My other pet peeve is the page size. My rule of thumb for random data access is that access time and transfer time should be about the same. By that measure, 4k pages, which take 10ms to access and 0.1ms to transfer, are way too small: the swap-in bandwidth is about 400KB/s. You can write a small memory hog which evicts most of the other programs in a second (writing out sequential pages is easy), and then wait a minute for them to demand-page back in (there is no easy read-in of sequential pages).
At the time the page size decision was made (~1985), the 4k size did make sense by my rule of thumb (the same 10ms access time, but well under 1MB/s of transfer). Fortunately, it is only a few years until the next larger native page size on x86 (2MB) will fit the rule of thumb well, and then it should be used.
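As a quick check of that arithmetic (the numbers below are the comment's own approximations, not measurements):

```c
#include <stdio.h>

/* Rule of thumb: access time equals transfer time when
 * page_size = bandwidth * access_time. */
int main(void)
{
	double access_s = 0.010;	/* ~10ms disk access, then and now */
	double bw_1985 = 0.4e6;		/* well under 1MB/s in 1985 */
	double bw_2005 = 40e6;		/* 4KB in 0.1ms => 40MB/s in 2005 */

	printf("balanced page size, 1985: ~%.0f KB\n",
	       bw_1985 * access_s / 1024);	/* ~4 KB */
	printf("balanced page size, 2005: ~%.0f KB\n",
	       bw_2005 * access_s / 1024);	/* ~391 KB, heading for 2MB */
	return 0;
}
```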
Posted Aug 30, 2005 23:20 UTC (Tue) by nickb (guest, #32173)

It's good to see Rik back on the memory management task again. His work in the 2.3 kernels was certainly responsible for many of the stability issues of early 2.4 releases. Finally the bullet was bitten and his code removed. I hope we don't see a repeat of this in the 2.6 kernels.

Posted Aug 31, 2005 19:03 UTC (Wed) by riel (subscriber, #3142)

Kernels get unstable when the maintainer only merges experimental code and drops bugfixes on the floor. Luckily these days the kernel has a decent process for making sure that doesn't happen, with patches filtering through Andrew Morton's -mm tree first, and Linus having tools to be able to merge the patches he needs to merge. During the early 2.4 days, all of the bugfixes needed to get the VM stable were in the -ac kernels, but due to a lack of tools and process they never made it into Linus' tree...
Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds