What to expect with the 2.6 VM

Mel Gorman (mel@csn.ul.ie)
Tue, 1 Jul 2003 02:39:47 +0100 (IST)


Hi,

I'm writing a small paper on the 2.6 VM for a conference. It is based on
2.5.73 and is basically an introduction to some of the new features that are
up and coming. As it's only an introduction, it's a little light on heavy
detail. Below is the parts of the paper which I thought would be of general
interest (I left out stuff like the abstract, introduction, conclusion and
bibliography). Any comments, feedback or pointings out of big features I've
missed are appreciated.

Linux Virtual Memory Management
Some Things to Expect In 2.6

Describing Memory
=================

The 2.6 VM has put a lot of effort into being scalable to enterprise level
architectures while trying not to compromise on performance on low to mid
level computers. Consequently, a number of the changes are aimed at
reducing spin lock contention and keeping operations local to the CPU
where possible.

Physical nodes for NUMA architectures are described in a very similar
manner to 2.4 which just two minor differences. The first difference is
that the starting location of the node is now identified by a Page Frame
Number (PFN) instead of as a physical address which is an offset of pages
within the virtual mem_map array. This solved the problem of where
architecture used addressing extensions such as Intel's Physical Address
Extensions (PAE) which cannot be addressed by a 32 number. The second is
that each node now has its own wait queue for the page replacement daemon
kswapd which is discussed further in a later section.

The principal changes made to how zones are represented is related to the
page replacement policy. The LRU lists for pages and the associated locks
are now per-zone where they were global with 2.4. As LRU lists could grow
quite large, especially if IO intensive applications were running, the old
LRU list lock could heavily contended. This resulted in the LRU list being
locked while kswapd performed a linear search for pages in the zone of
interest. As the lists are now per-zone, and an instance of kswapd runs
for each node, the contention is greatly reduced.

The second noticeable change is aimed at improving CPU cache usage for
which Linux now uses a few simple tricks. The first utilises a padding in
the structures to ensure that zone>lock and
zone>lru_lock use separate lines [Sea] in the CPU cache. To
further alleviate lock contention, a list of pages, called a pageset, is
maintained for each CPU in the system and for each zone with
per_cpu_pageset array which is discussed later with Section 7.

Pages are represented in almost the same fashion as 2.4 except for one
innocent looking difference. In 2.4, process pages tables were traversed
to locate the struct page that represented the page mapped by the process.
However, there was no mechanism that would give a list of Page Table
Entries (PTE) that had mapped a particular page. This is now addressed by
what is referred to as Reverse Mapping1. PTEs which map a page can now be
found by traversing a chain of PTEs which is headed by a list stored in
the struct page. This is discussed later in Section 3.

Reverse Page Table Mapping
==========================

One of the most important introductions in the 2.6 kernel is Reverse
Mapping commonly referred to as rmap. In 2.4, there is a one way mapping
from a PTE to a struct page which is sufficient for process addressing.
However, this presents a problem for the page replacement algorithm when
dealing with shared pages are used as there is no way in 2.4 to acquire a
list of PTEs which map a particular struct page without traversing the
page tables for every running process in the system.

In 2.6, a new field is introduced to the struct page called union pte.
When a page is shared between two or more processes, a PTE chain is
created and linked to this field. PTE chain elements are managed by the
slab allocator and each node is able to locate up to NRPTE number of PTEs
where NRPTE is related to the L1 cache size of the target architecture.
Once NRPTE PTEs have been mapped, a new node is added to the PTE chain.

Using these chains, it is possible for all PTEs mapping a struct page to
be located and swapped out to backing storage. This means that it is now
unnecessary for whole processes to be swapped out when memory is tight and
most pages in the LRU lists are shared as was the case with 2.4. The 2.6
VM is able to make much better page replacement decisions as the LRU list
ordering is obeyed.

PTEs in High Memory
===================

In 2.4, Page Table Entries (PTEs) must be allocated from ZONE_ NORMAL as
the kernel needs to address them directly for page table traversal. In a
system with many tasks or with large mapped memory regions, this can place
significant pressure on ZONE_ NORMAL so 2.6 has the option of allocating
PTEs from high memory.

Allocating from high memory is a compile time option as there is a
significant penalty associated with high memory PTEs. As the PTEs can no
longer be addressed directly by the kernel, kmap() must be used to map the
high memory page into lower memory. It must be later unmapped with
kunmap() placing a limit on the number of PTEs that can be addressed by a
CPU at a time.

To reflect the move of PTEs to high memory, the API related to PTEs has
changed. The allocation function pte_alloc() has been replaced by two
functions. pte_alloc_kernel() is for kernel PTEs and are always allocated
from ZONE_ NORMAL. pte_alloc_map() for userspace allocations.

As the PTEs need to be mapped before addressing, page table walking has
changed slightly. In 2.4, the PTE could be referenced directly but in 2.6,
pte_offset_map() must be called before it may be used and pte_unmap()
called as quickly as possible to unmap it again.

The overhead of mapping the PTEs from high memory should not be ignored.
Generally, only one PTE may be mapped at a time for a CPU with the rare
exception when pte_offset_map_nested() is used. This can be a heavy
penalty when all PTEs need to be examined such as when a memory region is
being unmapped. It has been proposed to have a Userspace Kernel Virtual
Area (UKVA) which is a region in the kernel address space private to a
process. This UKVA could be used to map high memory PTEs into low memory
but it is currently a proposal and unlikely to be merged by 2.6.

At time of writing, a patch has been submitted which implemented PMDs in
high memory. The implementation is essentially the same as PTEs in high
memory.

Huge TLB Filesystem
===================

Most modern architectures support more than one page size. For example,
the IA-32 architecture supports 4KiB pages or 4MiB pages but Linux only
used large pages for mapping the actual kernel image. As TLB slots are a
scarce resource, it is desirable to be able to take advantages of the
large pages especially on machines with large amounts of physical memory.

In 2.6, Linux allows processes to use large pages, referred to as huge
pages. The number of available huge pages is configured by the system
administrator via the /proc/sys/vm/nr_hugepages proc interface. As the
success of the allocation depends on the availability of physically
contiguous memory, the allocation should be made during system startup.

The root of the implementation is a Huge TLB Filesystem (hugetlbfs) which
is a pseudo-filesystem implemented in fs/hugetlbfs/inode.c and based on
ramfs. The basic idea is that any file that exists in the filesystem is
backed by huge pages. This filesystem is initialised and registered as an
internal filesystem at system start-up.

There is two ways that huge pages may be accessed by a process. The first
is by using shmget() to setup a shared region backed by huge pages and the
second is the call mmap() on a file opened in the huge page filesystem.

When a shared memory region should be backed by huge pages, the process
should call shmget() and pass SHM_HUGETLB as one of the flags. This
results in a file being created in the root of the internal filesystem.
The name of the file is determined by an atomic counter called
hugetlbfs_counter which is incremented every time a shared region is
setup.

To create a file backed by huge pages, a filesystem of type hugetlbfs must
first be mounted by the system administrator. Once the filesystem is
mounted, files can be created as normal with the system call open(). When
mmap() is called on the open file, the hugetlbfs registered mmap()
function creates the appropriate VMA for the process.

Huge TLB pages have their own function for the management of page tables,
address space operations and filesystem operations. The names of the
functions for page table management can all be seen in $<$linux/hugetlb.h
$>$ and they are named very similar to their ``normal'' page equivalents.

Non-Linear Populating of Virtual Areas
======================================

In 2.4, a VMA backed by a file would be populated in a linear fashion.
This can be optionally changed in 2.6 with the introduction of the
MAP_POPULATE flag to mmap() and the new system call remap_file_pages().
This system call allows arbitrary pages in an existing VMA to be remapped
to an arbitrary location on the backing file. This is mainly of interest
to database servers which previously simulated this behavior by the
creation of many VMAs.

On page-out, the non-linear address for the file is encoded within the PTE
so that it can be installed again correctly on page fault. How it is
encoded is architecture specific so two macros are defined called
pgoff_to_pte() and pte_to_pgoff() for the task.

Physical Page Allocation
========================

Physical page allocation is still based on the buddy algorithm but it has
been extended to resemble a lazy buddy algorithm which delays the
splitting and coalescing of buddies until it is necessary [BL89]. To
implement this, kernel 2.6 has per-cpu page lists of 0 order pages, the
most common type of allocation and free. These pagesets contain two lists
for hot and cold pages where hot pages have been recently used and can
still be expected to be present in the CPU cache. For an allocation, the
pageset for the running CPU will be first checked and if pages are
available, they will be allocated. To determine when the pageset should be
emptied or filled, two watermarks are in place. When the low watermark is
reached, a batch of pages will be allocated and placed on the list. When
the high watermark is reached, a batch of pages will be freed at the same
time. Higher order allocations are treated essentially the same as they
were in 2.4.

The implication of this change is straight forward, the number of times
the spinlock protecting the buddy lists must be acquired is drastically
reduced. Higher order allocations are relatively rare in Linux so the
optimisation is for the common case. This change will be noticeable on
large number of CPU machines but will make little difference to single
CPUs.

The second major change is largely cosmetic with function declarations,
flags and defines bring moved to new header files. One important part of
the shuffling is that the distinction between NUMA and UMA architectures
no longer exists for page allocations. In 2.4, there was an explicit
function which implemented a node-local allocation policy. In 2.6,
architectures are expected to provide a CPU to NUMA node mapping. They
then export a macro called NODE_DATA() which takes the NID, returned by
numa_node_id() as a parameter. With UMA architectures, this will always
return the static node descriptor contig_page_data, but with NUMA
architectures, the node ``closest'' to the running CPU will be used. This
is still a node-local allocation policy but without the two separate
allocation schemes. This change is largely a ``cleaniness'' issue reducing
the amount of special case code that deals with UMA and NUMA architectures
separately.

The last major change is the introduction of three new GFP flags called
__GFP_NOFAIL, __GFP_REPEAT and __GFP_NORETRY. The three flags determine
how hard the VM works to satisfy a given allocation and were introduced as
the flags are implemented in many different parts of the kernel. The
NOFAIL flag requires teh VM to constantly retry an allocation until it
succeeds. The REPEAT flag is intended to retry a number of times before
failing, although currently it is implemented like NOFAIL. The last,
NORETRY says that an allocation will fail immediately and return.

Delayed Coalescing
==================

2.6 extends the buddy algorithm to resemble a lazy buddy algorithm [BL89]
which delays the splitting and coalescing of buddies until it is
necessary. The delay is implemented only for 0 order allocations with the
per-cpu pagesets. Higher order allocations, which are rare anyway, will
require the interrupt safe spinlock to be held and there will be no delay
in the splits or coalescing. With 0 order allocations, splits will be
delayed until the low watermark is reached in the per-cpu set and
coalescing will be delayed until the high watermark is reached.

It is interesting to note that there is a possibility that high order
allocations will fail if many of the pagesets are just below the high
watermark. There is also the problem that buddies could exist in different
pagesets leading to fragmentation problems.

Emergency Memory Pools
======================

In 2.4, the high memory manager was the only subsystem that maintained
emergency pools of pages. In 2.6, memory pools are implemented as a
generic concept when a minimum amount of ``stuff'' needs to be reserved
for when memory is tight. ``Stuff'' in this case can be any type of object
such as pages in the case of the high memory manager or, more frequently,
some object managed by the slab allocator.

Pools are created with mempool_create() and a number of parameters are
provided. They are the minimum number of objects that should be reserved
(min_nr), an allocator function for the object type (alloc_fn()), a free
function (free_fn()) and optional private data that is passed to the
allocate and free functions.

In most cases, the allocate and free functions used are
mempool_alloc_slab() and mempool_free_slab() for reserving objects managed
by the slab allocator. In this case, the private data based is a pointer
to the slab cache descriptor.

In the case of the high memory manager, two pools of pages are created. On
page pool is for normal use and the second page pool is for use with ISA
devices that must allocate from ZONE_ DMA. The memory pools replace the
emergency pool code that exists in 2.4.

Aging Slab Allocator Objects
============================

Many of the changes in the slab allocator for 2.6 are either cosmetic or
are related to the reduction of lock contention. However, one significant
new feature is the introduction of slab shrinker callback.

In 2.4, the function kmem_cache_reap() was called in low memory situations
which selected a cache to delete all free slabs from and free objects in
the per-cpu object cache. In 2.6, caches can register a shrinker callback
with set_shrinker() and kmem_cache_reap() has been dispensed with.

set_shrinker() populates a struct with a pointer to the callback and a
``seeks' weight which indicates how hard it is to recreate the object.
During page reclaim, each shrinker is called twice. The first call passes
0 as a parameter which indicates that the callback should return how many
pages it expects it could free if it was called properly. A basic
heuristic is applied to determine if it is worth the cost of using the
callback. If it is, it is called a second time with a parameter indicating
how many objects to free.

Page Replacement
================

In the 2.4 kernel, a global daemon called kswapd acted as the page
replacement daemon. It was responsible for keeping a reserve of memory
free and only when memory was tight would processes have to free pages on
their own behalf. In 2.6, a kswapd daemon exists for each node in the
system. Each daemon is called kswapdN where N is the node ID. The
presumption is that there will be a one-to-few mapping between CPUs and
nodes and Linux wishes to avoid a penalty of a kswapd daemon trying to
free pages on a remote node.

Manipulating LRU Lists
======================

In 2.4, a spinlock is acquired when removing pages from the LRU list. This
made the lock very heavily contended so in 2.6, operations involving the
LRU lists take place via struct pagevec structures. This allows pages to
be added or removed from the LRU lists in batches of up to PAGEVEC_SIZE
numbers of pages.

When removing pages, the zone>lru_lock lock is acquired and
the pages placed on a temporary list. Once the list of pages to remove is
assembled, shrink_list() is called to perform the actual freeing of pages
which can now perform most of it's task without needing the
zone>lru_lock spinlock.

When adding the pages back, a new page vector struct is initialised with
pagevec_init(). Pages are added to the vector with pagevec_add() and then
committed to being placed on the LRU list in bulk with pagevec_release().

Footnotes

... Mapping1
Reverse mapping is commonly referred to as rmap although the
phrase rmap is sometimes a little overloaded.

-- 
Mel Gorman
MSc Student, University of Limerick
http://www.csn.ul.ie/~mel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/