Simple But Effective Techniques for NUMA Memory Management
Bolosky, Fitzgerald, Scott
Overview
- How do we manage memory on a NUMA machine, where each processor has a fast,
unshared, local memory, and there is a slow, shared, global memory?
- the management could live in hardware, the OS, libraries or the compiler,
or under explicit application control
- hardware is too expensive; application control unnecessarily burdens
programmers
- this paper puts management in the OS
Implementation
- used Mach OS: memory management is divided into machine-independent and
machine-dependent parts, separated by a well-defined pmap interface
- used the IBM ACE multiprocessor: up to 8 processors or 128 MB of global
memory (but not both at once); each processor has 8 MB of local memory, and
global memory is about twice as slow as local memory
- took the existing pmap (machine-dependent) layer and divided it into
two modules: the pmap manager (which exports the pmap interface to the
machine-independent part of Mach) and the MMU interface
- wrote two other modules: the NUMA manager (which maintains consistency of
pages cached in local memories) and the NUMA policy module (which decides
whether a page should be placed in local or global memory); a rough sketch of
the resulting layering follows
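- as a C sketch (my own, not code from the paper; all names are invented),
the layering might look like this, with the pmap manager exporting the
standard pmap interface upward and calling down into the mmu interface, NUMA
manager, and NUMA policy modules:

    /* Hypothetical sketch of the module split; the real Mach/ACE code
     * differs in names and detail. */

    struct pmap;                        /* per-address-space map (opaque) */
    typedef unsigned long vm_offset_t;  /* a virtual or physical address */
    typedef int vm_prot_t;              /* e.g. read/write protection bits */

    /* pmap manager: exports the machine-independent pmap interface upward */
    void pmap_enter(struct pmap *pmap, vm_offset_t va, vm_offset_t pa,
                    vm_prot_t prot);

    /* mmu interface: raw, machine-dependent MMU operations */
    void mmu_map(int cpu, vm_offset_t va, vm_offset_t frame, vm_prot_t prot);
    void mmu_unmap(int cpu, vm_offset_t va);

    /* NUMA manager: keeps local cached copies of a global page consistent */
    void numa_cache_page(struct pmap *pmap, vm_offset_t va,
                         vm_offset_t global_page, vm_prot_t prot, int cpu);

    /* NUMA policy: decides whether a page belongs in local or global memory */
    enum placement { PLACE_LOCAL, PLACE_GLOBAL };
    enum placement cache_policy(vm_offset_t logical_page, vm_prot_t prot);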
The NUMA Manager
- local memories are used as a cache for global memory
- Mach is told that the available physical memory is only as large as the
global memory
- each (logical) page is in one of these states:
- read-only: may appear in 0 or more local memories, must have read-only
MMU protection in all
- local-writable: in exactly 1 local memory, may be writable
- global-writable: in global memory only, may be writable
- the NUMA manager consults the NUMA policy module through a single function,
cache_policy(logical_page, protection), which returns LOCAL or GLOBAL.
The actions then taken by the manager are summarized by a small FSM given in
Tables 1 and 2 in the paper (roughly sketched below).
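- a minimal sketch in C of how the manager might track cache state and handle
a write fault on a replicated page; the state names come from the notes above,
but the helper function and exact transitions are assumptions, not a copy of
the paper's tables:

    /* Illustrative per-page cache state and a simplified write-fault path. */

    enum cache_state {
        CS_READ_ONLY,        /* 0 or more local copies, all read-only */
        CS_LOCAL_WRITABLE,   /* exactly one local copy, may be writable */
        CS_GLOBAL_WRITABLE   /* global memory only, may be writable */
    };

    enum placement { PLACE_LOCAL, PLACE_GLOBAL };

    struct page_info {
        enum cache_state state;
        unsigned long    copy_mask;   /* which CPUs hold a local copy */
    };

    /* provided by the NUMA policy module */
    enum placement cache_policy(unsigned long logical_page, int want_write);

    /* hypothetical helper: drop the local copies on every CPU except 'keep'
     * (keep == -1 drops them all) */
    void invalidate_copies_except(struct page_info *p, int keep);

    /* write fault by 'cpu' on a page currently replicated read-only */
    void write_fault_on_read_only(struct page_info *p,
                                  unsigned long logical_page, int cpu)
    {
        if (cache_policy(logical_page, 1) == PLACE_LOCAL) {
            invalidate_copies_except(p, cpu);  /* keep one writable local copy */
            p->state = CS_LOCAL_WRITABLE;
        } else {
            invalidate_copies_except(p, -1);   /* no local copies at all */
            p->state = CS_GLOBAL_WRITABLE;     /* writes go to global memory */
        }
    }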
NUMA Policy
- initially, place each page in the local memory of whichever processor uses
it first
- read-only pages are replicated in many local memories
- privately writable pages are moved to the processor that writes them
- shared writable pages (at least 1 writer, at least 1 other reader or
writer) are moved between local caches as the manager keeps the caches
consistent
- these moves between caches are counted for each page; once a threshold
(a global constant, default 4) is exceeded, the page is placed in global
memory (i.e., removed from all local memory caches), where it remains until
it is freed (sketched below)
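- the default policy described above could be written roughly as follows (the
counter and threshold come from the notes; the function and field names are
invented):

    /* Keep a page local (replicated or migrated) until it has bounced
     * between local caches too often, then freeze it in global memory
     * until it is freed. */

    #define MOVE_THRESHOLD 4          /* global constant, default 4 */

    enum placement { PLACE_LOCAL, PLACE_GLOBAL };

    struct page_stats {
        int moves;                    /* cache-to-cache moves so far */
        int frozen_global;            /* once set, stays global until freed */
    };

    enum placement default_policy(struct page_stats *p)
    {
        if (p->frozen_global)
            return PLACE_GLOBAL;
        if (p->moves >= MOVE_THRESHOLD) {
            p->frozen_global = 1;     /* evict from all local caches for good */
            return PLACE_GLOBAL;
        }
        return PLACE_LOCAL;           /* replicate or migrate locally */
    }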
Changes to Machine-independent Part of Mach
- The pmap interface had to be slightly extended to handle the NUMA
architecture:
- two new calls: pmap_free_page, called when a physical page frame is
freed, starts a lazy cleanup of the frame; pmap_free_page_sync, called when
a new frame is allocated, waits for that cleanup to finish; these calls are
necessary to reset the cache state
- added a parameter for minimum allowed permissions to
pmap_enter, so that, for example, a shared-writable page could be
mapped read-only in order to get a write fault
- added a target processor argument to pmap_enter, so the NUMA
manager can know which processor should get the page
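- these extensions might be declared roughly like this (signatures are guesses
based on the notes above, not the actual Mach prototypes):

    struct pmap;
    typedef unsigned long vm_offset_t;
    typedef int vm_prot_t;

    /* called when a physical page frame is freed; starts a lazy cleanup of
     * the frame's cache state */
    void pmap_free_page(vm_offset_t frame);

    /* called when a new frame is allocated; waits for the lazy cleanup of
     * that frame (if any) to finish */
    void pmap_free_page_sync(vm_offset_t frame);

    /* pmap_enter gains a minimum-permission argument (so a shared-writable
     * page can be mapped read-only to force a later write fault) and a
     * target-processor argument, so the NUMA manager knows which local
     * memory should receive the page */
    void pmap_enter(struct pmap *pmap, vm_offset_t va, vm_offset_t pa,
                    vm_prot_t prot, vm_prot_t min_prot, int target_cpu);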
Other Notes
- the new pmap level itself must remain pinned in (global) memory
- "false sharing", where unshared objects used by different processors
happen to be located on the same memory page, can cause quite a bit of a
performance hit; compiler support may be helpful here
- how do you handle transient behaviour? maybe allow apps to hint at run
time that a certain object will or won't be shared soon
- how to deal with process migration? currently, pages just end up in global
memory (ick)
- they implemented a single policy, and found that it performed "well", but
they didn't compare it to any other schemes, either hardware or software
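- to make the false-sharing point concrete, a small C illustration (my own,
not from the paper; PAGE_SIZE and the GCC-style alignment attribute are
assumptions):

    /* False sharing at page granularity: these counters are private to
     * different processors but will usually land on the same page, so the
     * page looks shared-writable and gets pushed to slow global memory. */
    long counter_cpu0;
    long counter_cpu1;

    /* One fix (by hand or with compiler help): pad and align per-processor
     * data so each private object sits on its own page. */
    #define PAGE_SIZE 4096
    struct padded_counter {
        long value;
        char pad[PAGE_SIZE - sizeof(long)];
    };
    struct padded_counter counters[2] __attribute__((aligned(PAGE_SIZE)));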