Efficient Virtual Memory for Big Memory Servers
A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, ISCA 2013
Summary
Many “big-memory” server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for paged VM: TLB misses can account for up to 51% of execution time, while the rich features of paged VM are not needed by most of these applications.
Proposal: Paged VM + Direct Segments, an alternative to huge pages. Map part of a process’s linear virtual address space with a direct segment, while page-mapping the rest of the virtual address space.
Motivation
Many “big-memory” server workloads pay a high cost for paged VM: they suffer frequent TLB misses while not requiring the rich features of paged VM.
Trend
- The amount of physical memory has grown from a few MBs to a few GBs, and now to a few TBs.
- TLB sizes have remained largely unchanged.
- Many “big-memory” workloads exhibit low access locality.
Higher memory capacity + roughly constant TLB size + low-locality access patterns = more TLB misses.
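A back-of-the-envelope calculation makes the mismatch concrete (the TLB size and memory size below are assumptions for illustration, not figures from the paper):

```c
/* Illustrative TLB-reach arithmetic; entry count and memory size are assumed. */
#include <stdio.h>

int main(void) {
    const double tlb_entries = 512;                    /* assumed L2 TLB entries */
    const double page_4k = 4.0 * 1024;                 /* 4 KB base pages        */
    const double page_2m = 2.0 * 1024 * 1024;          /* 2 MB large pages       */
    const double mem     = 96.0 * 1024 * 1024 * 1024;  /* 96 GB of DRAM          */

    printf("4 KB pages: TLB reach = %.0f MB (%.4f%% of memory)\n",
           tlb_entries * page_4k / (1024 * 1024),
           100.0 * tlb_entries * page_4k / mem);
    printf("2 MB pages: TLB reach = %.0f GB (%.4f%% of memory)\n",
           tlb_entries * page_2m / (1024.0 * 1024 * 1024),
           100.0 * tlb_entries * page_2m / mem);
    /* Either way the TLB covers only a tiny fraction of memory, so a
     * low-locality workload touching most of its data misses frequently. */
    return 0;
}
```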
Observation of Big-memory Workloads
- For the majority of their address space, big-memory workloads do not require swapping, fragmentation mitigation, or the fine-grained protection afforded by current virtual memory implementations. They allocate memory early and have stable memory usage.
- Big-memory workloads pay a high cost for paged VM: substantial performance is lost to TLB misses.
- Many big-memory workloads are long running, sized to match memory capacity, and have one (or a few) primary processes.
Solution: Paged VM + Direct Segments
Goal: enable fast and minimalist address translation through segmentation where possible, while defaulting to conventional page-based virtual memory where needed.
Proposal: direct-segment hardware that is used via a software primary region.
Hardware Support: Direct Segment
Idea: Translate a contiguous virtual address range directly onto a contiguous physical address range, eliminating TLB misses for that range. Any virtual address outside this range is mapped through conventional paging.
Implementation: Segmentation. BASE, LIMIT, and OFFSET registers are added per core. Direct segments are aligned to the base page size, so page-offset bits are omitted from these registers. A given virtual address of a process is translated either through the direct segment or through conventional page-based virtual memory, but never both.
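A minimal sketch of this per-address decision, assuming the BASE, LIMIT, and OFFSET registers described above (in hardware the range check happens alongside the TLB lookup; translate_via_paging is a placeholder for the conventional path):

```c
#include <stdint.h>

/* Per-core direct-segment registers; all are aligned to the base page size. */
typedef struct {
    uint64_t base;    /* BASE:   first virtual address covered by the segment    */
    uint64_t limit;   /* LIMIT:  first virtual address beyond the segment        */
    uint64_t offset;  /* OFFSET: added to a virtual address to form the physical */
} direct_segment_regs;

/* Placeholder for the conventional TLB lookup / page-table walk. */
static uint64_t translate_via_paging(uint64_t va) {
    /* ... TLB hit or hardware page walk ... */
    return va;
}

/* Each virtual address is translated by exactly one mechanism: the direct
 * segment if it falls in [BASE, LIMIT), conventional paging otherwise. */
static uint64_t translate(const direct_segment_regs *ds, uint64_t va) {
    if (va >= ds->base && va < ds->limit)
        return va + ds->offset;          /* no TLB access, no page walk */
    return translate_via_paging(va);     /* normal paged-VM path        */
}
```

Setting LIMIT equal to BASE makes the range empty, so a process that does not use a direct segment is translated entirely through paging.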
Software Support: Primary Region
- The OS provides a primary region abstraction that lets applications specify which portion of their memory does not benefit from paging.
- The OS provisions physical memory for the primary region and maps all or part of it through a direct segment by configuring the direct-segment registers (a usage sketch follows after this subsection).
Two approaches to manage physical memory:
- Create contiguous physical memory dynamically through periodic memory compaction.
- Use physical memory reservations, setting memory aside immediately after system startup.
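The paper leaves the exact user-level interface to the OS; as a purely hypothetical illustration (the MAP_PRIMARY flag below is invented for this sketch and is not the paper’s API), an application might mark its primary region once at startup and let the OS back it with a direct segment:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical flag: ask the OS to place this mapping in the primary region. */
#define MAP_PRIMARY 0x100000

int main(void) {
    size_t heap_size = 64ULL << 30;   /* 64 GB, sized to the machine's memory */

    /* One large, early, long-lived anonymous mapping: the OS may back it with
     * a direct segment (contiguous physical memory; BASE/LIMIT/OFFSET loaded
     * on context switch) or silently fall back to conventional paging. */
    void *heap = mmap(NULL, heap_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_PRIMARY, -1, 0);
    if (heap == MAP_FAILED)
        return 1;

    /* ... run the in-memory database / key-value cache / analytics from heap ... */
    return 0;
}
```

An unmodified kernel would of course reject the invented flag; the point is only that the application marks one large, stable allocation rather than managing its pages.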
Why Not Huge Pages?
- Large pages and their TLB support do not automatically scale to much larger memories. To support big-memory workloads, the size of large pages and/or the TLB hierarchy must continue to scale as memory capacity increases.
- Efficient TLB support for multiple page sizes is difficult. Because the address bits used to index the TLB for large pages are unknown until the translation completes, a split-TLB design is typically required, with separate sub-TLBs for different page sizes. This design can suffer from unpredictable performance when larger page sizes are used (see the sketch after this list).
- Available page sizes are few and far apart (e.g., 4 KB, 2 MB, and 1 GB on x86-64).
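To see why indexing is the problem, consider a toy set-associative TLB (the set count is an arbitrary assumption): the set index comes from the bits just above the page offset, which differ by page size.

```c
#include <stdint.h>

#define TLB_SETS 64   /* assumed number of sets in a set-associative TLB */

/* 4 KB pages: the page offset is 12 bits, so the index starts at bit 12. */
static unsigned set_index_4k(uint64_t va) {
    return (unsigned)((va >> 12) & (TLB_SETS - 1));
}

/* 2 MB pages: the page offset is 21 bits, so the index starts at bit 21. */
static unsigned set_index_2m(uint64_t va) {
    return (unsigned)((va >> 21) & (TLB_SETS - 1));
}

/* The same virtual address usually maps to different sets under the two
 * assumptions, and the correct page size is unknown until translation
 * completes -- hence separate sub-TLBs (or fully associative arrays)
 * per page size in most designs. */
```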
Results from Experiments
For all workloads examined (graph500, memcached, MySQL, NPB:BT, NPB:CG, GUPS), the percentage of time spent on TLB miss handling was reduced to less than 0.5%.
Questions
- For long-running workloads, can we gradually optimize the virtual-to-physical mapping by placing important data in the direct segment?
- Can multiple processes use direct segments concurrently? If so, how should the direct segment be managed?