Simple But Effective Techniques for NUMA Memory Management
Bolosky, Fitzgerald, Scott
Overview
- How do we manage memory on a NUMA machine, where each processor has a fast,
unshared, local memory, and there is a slow, shared, global memory?
- the management could live in hardware, the OS, libraries or the compiler,
or under explicit application control
- hardware is too expensive; application control unnecessarily burdens
programmers
- this paper puts management in the OS
Implementation
- used Mach OS: memory management is divided into machine-independent and
machine-dependent parts, separated by a well-defined pmap interface
- used the IBM ACE multiprocessor: up to 8 processors or 128 MB of global
memory (but not both at once); each processor has 8 MB of local memory, and
global memory is about twice as slow as local memory
- took the existing pmap (machine-dependent) layer and divided it into
two modules: the pmap manager (which exports the pmap interface to the
machine-independent part of Mach) and the MMU interface
- wrote two other modules: the NUMA manager (which maintains consistency of
pages cached in local memories) and the NUMA policy module (which decides
whether a page should be placed in local or global memory); a rough sketch of
the resulting layering follows
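- as a C sketch (my own, not code from the paper; all names are invented),
the layering might look like this, with the pmap manager exporting the
standard pmap interface upward and calling down into the mmu interface, NUMA
manager, and NUMA policy modules:

    /* Hypothetical sketch of the module split; the real Mach/ACE code
     * differs in names and detail. */

    struct pmap;                        /* per-address-space map (opaque) */
    typedef unsigned long vm_offset_t;  /* a virtual or physical address */
    typedef int vm_prot_t;              /* e.g. read/write protection bits */

    /* pmap manager: exports the machine-independent pmap interface upward */
    void pmap_enter(struct pmap *pmap, vm_offset_t va, vm_offset_t pa,
                    vm_prot_t prot);

    /* mmu interface: raw, machine-dependent MMU operations */
    void mmu_map(int cpu, vm_offset_t va, vm_offset_t frame, vm_prot_t prot);
    void mmu_unmap(int cpu, vm_offset_t va);

    /* NUMA manager: keeps local cached copies of a global page consistent */
    void numa_cache_page(struct pmap *pmap, vm_offset_t va,
                         vm_offset_t global_page, vm_prot_t prot, int cpu);

    /* NUMA policy: decides whether a page belongs in local or global memory */
    enum placement { PLACE_LOCAL, PLACE_GLOBAL };
    enum placement cache_policy(vm_offset_t logical_page, vm_prot_t prot);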
The NUMA Manager
- local memories are used as a cache for global memory
- Mach is told that the available physical memory is only as large as the
global memory
- each (logical) page is in one of these states:
- read-only: may appear in 0 or more local memories, must have read-only
MMU protection in all
- local-writable: in exactly 1 local memory, may be writable
- global-writable: in global memory only, may be writable
- the NUMA manager consults the NUMA policy module through a single function,
cache_policy(logical_page, protection), which returns LOCAL or GLOBAL.
The actions then taken by the manager are summarized by a small FSM given in
Tables 1 and 2 in the paper (roughly sketched below).
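- a minimal sketch in C of how the manager might track cache state and handle
a write fault on a replicated page; the state names come from the notes above,
but the helper function and exact transitions are assumptions, not a copy of
the paper's tables:

    /* Illustrative per-page cache state and a simplified write-fault path. */

    enum cache_state {
        CS_READ_ONLY,        /* 0 or more local copies, all read-only */
        CS_LOCAL_WRITABLE,   /* exactly one local copy, may be writable */
        CS_GLOBAL_WRITABLE   /* global memory only, may be writable */
    };

    enum placement { PLACE_LOCAL, PLACE_GLOBAL };

    struct page_info {
        enum cache_state state;
        unsigned long    copy_mask;   /* which CPUs hold a local copy */
    };

    /* provided by the NUMA policy module */
    enum placement cache_policy(unsigned long logical_page, int want_write);

    /* hypothetical helper: drop the local copies on every CPU except 'keep'
     * (keep == -1 drops them all) */
    void invalidate_copies_except(struct page_info *p, int keep);

    /* write fault by 'cpu' on a page currently replicated read-only */
    void write_fault_on_read_only(struct page_info *p,
                                  unsigned long logical_page, int cpu)
    {
        if (cache_policy(logical_page, 1) == PLACE_LOCAL) {
            invalidate_copies_except(p, cpu);  /* keep one writable local copy */
            p->state = CS_LOCAL_WRITABLE;
        } else {
            invalidate_copies_except(p, -1);   /* no local copies at all */
            p->state = CS_GLOBAL_WRITABLE;     /* writes go to global memory */
        }
    }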
NUMA Policy
- initially, place each page in the local memory of whichever processor uses
it first
- read-only pages are replicated in many local memories
- privately writable pages are moved to the processor that writes them
- shared writable pages (at least 1 writer, at least 1 other reader or
writer) are moved between local caches as the manager keeps the caches
consistent
- these moves between caches are counted for each page; once a threshold
(a global constant, default 4) is exceeded, the page is placed in global
memory (i.e., removed from all local memory caches), where it remains until
it is freed (sketched below)
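- the default policy described above could be written roughly as follows (the
counter and threshold come from the notes; the function and field names are
invented):

    /* Keep a page local (replicated or migrated) until it has bounced
     * between local caches too often, then freeze it in global memory
     * until it is freed. */

    #define MOVE_THRESHOLD 4          /* global constant, default 4 */

    enum placement { PLACE_LOCAL, PLACE_GLOBAL };

    struct page_stats {
        int moves;                    /* cache-to-cache moves so far */
        int frozen_global;            /* once set, stays global until freed */
    };

    enum placement default_policy(struct page_stats *p)
    {
        if (p->frozen_global)
            return PLACE_GLOBAL;
        if (p->moves >= MOVE_THRESHOLD) {
            p->frozen_global = 1;     /* evict from all local caches for good */
            return PLACE_GLOBAL;
        }
        return PLACE_LOCAL;           /* replicate or migrate locally */
    }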
Changes to Machine-independent Part of Mach
- The pmap interface had to be slightly extended to handle the NUMA
architecture:
- two new calls: pmap_free_page, called when a physical page frame is
freed, starts a lazy cleanup of the frame; pmap_free_page_sync, called when
a new frame is allocated, waits for that cleanup to finish; these calls are
necessary to reset the cache state
- added a parameter for minimum allowed permissions to
pmap_enter, so that, for example, a shared-writable page could be
mapped read-only in order to get a write fault
- added a target processor argument to pmap_enter, so the NUMA
manager can know which processor should get the page
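- these extensions might be declared roughly like this (signatures are guesses
based on the notes above, not the actual Mach prototypes):

    struct pmap;
    typedef unsigned long vm_offset_t;
    typedef int vm_prot_t;

    /* called when a physical page frame is freed; starts a lazy cleanup of
     * the frame's cache state */
    void pmap_free_page(vm_offset_t frame);

    /* called when a new frame is allocated; waits for the lazy cleanup of
     * that frame (if any) to finish */
    void pmap_free_page_sync(vm_offset_t frame);

    /* pmap_enter gains a minimum-permission argument (so a shared-writable
     * page can be mapped read-only to force a later write fault) and a
     * target-processor argument, so the NUMA manager knows which local
     * memory should receive the page */
    void pmap_enter(struct pmap *pmap, vm_offset_t va, vm_offset_t pa,
                    vm_prot_t prot, vm_prot_t min_prot, int target_cpu);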
Other Notes
- the new pmap level itself must remain pinned in (global) memory
- "false sharing", where unshared objects used by different processors
happen to be located on the same memory page, can cause quite a bit of a
performance hit; compiler support may be helpful here
- how do you handle transient behaviour? maybe allow apps to hint at run
time that a certain object will or won't be shared soon
- how to deal with process migration? currently, pages just end up in global
memory (ick)
- they implemented a single policy, and found that it performed "well", but
they didn't compare it to any other schemes, either hardware or software
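- to make the false-sharing point concrete, a small C illustration (my own,
not from the paper; PAGE_SIZE and the GCC-style alignment attribute are
assumptions):

    /* False sharing at page granularity: these counters are private to
     * different processors but will usually land on the same page, so the
     * page looks shared-writable and gets pushed to slow global memory. */
    long counter_cpu0;
    long counter_cpu1;

    /* One fix (by hand or with compiler help): pad and align per-processor
     * data so each private object sits on its own page. */
    #define PAGE_SIZE 4096
    struct padded_counter {
        long value;
        char pad[PAGE_SIZE - sizeof(long)];
    };
    struct padded_counter counters[2] __attribute__((aligned(PAGE_SIZE)));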