Xen project Mailing List

[Xen-changelog] [xen-unstable] docs: Add tmem documentation

From: Xen patchbot-unstable <patchbot-unstable@xxxxxxxxxxxxxxxxxxx>

Date: Thu, 22 Apr 2010 04:30:24 -0700

Delivery-date: Thu, 22 Apr 2010 04:31:09 -0700

List-id: BK change log <xen-changelog.lists.xensource.com>

# HG changeset patch # User Keir Fraser <keir.fraser@xxxxxxxxxx> # Date 1271835091 -3600 # Node ID c7d7797656dfe6c41ab59954a7bba1a2748c6836 # Parent b36467432effd0c7a94c2677df978b509c0497f7 docs: Add tmem documentation Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx> --- docs/misc/tmem-internals.html | 798 ++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 798 insertions(+) diff -r b36467432eff -r c7d7797656df docs/misc/tmem-internals.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/docs/misc/tmem-internals.html Wed Apr 21 08:31:31 2010 +0100 @@ -0,0 +1,798 @@ +<h1>Transcendent Memory Internals in Xen</h1> + +by Dan Magenheimer, Oracle Corp. + +Draft 0.1 -- Updated: 20100324 +<h2>Overview</h2> + +This document focuses on the internal implementation of +Transcendent Memory (tmem) on Xen. It assumes +that the reader has a basic knowledge of the terminology, objectives, and +functionality of tmem and also has access to the Xen source code. +It corresponds to the Xen 4.0 release, with +patch added to support page deduplication (V2). + +The primary responsibilities of the tmem implementation are to: +<ul> +<li>manage a potentially huge and extremely dynamic +number of memory pages from a potentially large number of clients (domains) +with low memory overhead and proper isolation +<li>provide quick and efficient access to these +pages with as much concurrency as possible +<li>enable efficient reclamation and eviction of pages (e.g. 
when +memory is fully utilized) +<li>optionally, increase page density through compression and/or +deduplication +<li>where necessary, properly assign and account for +memory belonging to guests to avoid malicious and/or accidental unfairness +and/or denial-of-service +<li>record utilization statistics and make them available to management tools +</ul> +<h2>Source Code Organization</h2> + + +The source code in Xen that provides the tmem functionality +is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h. +The files tmem.c and tmem.h are intended to +be implementation- (and hypervisor-) independent and the other two files +provide the Xen-specific code. This +division is intended to make it easier to port tmem functionality to other +hypervisors, though at this time porting to other hypervisors has not been +attempted. Together, these four files +total fewer than 4000 lines of C code. + +Even ignoring the implementation-specific functionality, the +implementation-independent part of tmem has several dependencies on +library functionality (Xen source filenames in parentheses): +<ul> +<li> +a good fast general-purpose dynamic memory +allocator with bounded response time and efficient use of memory for a very +large number of sub-page allocations. To +achieve this in Xen, the bad old memory allocator was replaced with a +slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by +Nitin Gupta for compcache. +<li> +good tree data structure libraries, specifically +red-black trees (rbtree.c) and radix trees (radix-tree.c). +Code for these was borrowed from Linux and adapted for tmem and Xen. +<li> +good locking and list code. Both of these existed in Xen and required +little or no change. +<li> +optionally, a good fast lossless compression +library. The Xen implementation added to +support tmem uses LZO1X (lzo.c), also ported to Linux by Nitin Gupta. 
+</ul> + +More information about the specific functionality of these +libraries can easily be found through a search engine, via Wikipedia, or in the +Xen or Linux source logs so we will not elaborate further here. + +<h2>Prefixes/Abbreviations/Glossary</h2> + + +The tmem code uses several prefixes and abbreviations. +Knowledge of these will improve code readability: +<ul> +<li> +tmh == +transcendent memory host. Functions or +data structures that are defined by the implementation-specific code, i.e. the +Xen host code +<li> +tmemc +== transcendent memory control. +Functions or data structures that provide management tool functionality, +rather than core tmem operations. +<li> +cli or +client == client. +The tmem generic term for a domain or a guest OS. +</ul> + +When used in prose, common tmem operations are indicated +with a different font, such as <big><kbd>put</kbd></big> +and <big><kbd>get</kbd></big>. + +<h2>Key Data Structures</h2> + + +To manage a huge number of pages, efficient data structures +must be carefully selected. + +Recall that a tmem-enabled guest OS may create one or more +pools with different attributes. It then +<big><kbd>put</kbd></big>s and <big><kbd>get</kbd></big>s +pages to/from these pools, identifying each page +with a handle that consists of a pool_id, an +object_id, and a page_id (sometimes +called an index). +This suggests a few obvious core data +structures: +<ul> +<li> +When a guest OS first calls tmem, a client_t is created to contain +and track all uses of tmem by that guest OS. Among +other things, a client_t keeps pointers +to a fixed number of pools (16 in the current Xen implementation). +<li> +When a guest OS requests a new pool, a pool_t is created. +Some pools are shared and are kept in a +sharelist (sharelist_t) which points +to all the clients that are sharing the pool. +Since an object_id is 64 bits, +a pool_t must be able to keep track +of a potentially very large number of objects. 
+To do so, it maintains a number of parallel trees (256 in the current +Xen implementation) and a hash algorithm is applied to the object_id +to select the correct tree. +Each tree element points to an object. +Because an object_id usually represents an inode +(a unique file number identifier), and inode numbers +are fairly random, though often "clumpy", a red-black tree +is used. +<li> +When a guest first +<big><kbd>put</kbd></big>s a page to a pool with an as-yet-unused object_id, an +obj_t is created. Since a page_id is usually an index into a file, +it is often a small number, but may sometimes be very large (up to +32 bits). A radix tree is a good data structure to contain items +with this kind of index distribution. +<li> +When a page is +<big><kbd>put</kbd></big>, a page descriptor, or pgp_t, is created, which +among other things will point to the storage location where the data is kept. +In the normal case the pointer is to a pfp_t, which is an +implementation-specific datatype representing a physical pageframe in memory +(which in Xen is a "struct page_info"). +When deduplication is enabled, it points to +yet another data structure, a pcd_t +(see below). When compression is enabled +(and deduplication is not), the pointer points directly to the compressed data. +For reasons we will see shortly, each pgp_t that represents +an ephemeral page (that is, a page placed +in an ephemeral pool) is also placed +into two doubly-linked lists, one containing all ephemeral pages +<big><kbd>put</kbd></big> by the same client and one +containing all ephemeral pages across all clients ("global"). +<li> +When deduplication is enabled, multiple pgp_t's may need to point to +the same data, so another data structure (and level of indirection) is used +called a page content descriptor, or pcd_t. +Multiple page descriptors (pgp_t's) may point to the same pcd_t. 
+The pcd_t, in turn, points to either a pfp_t +(if a full page of data), directly to a +location in memory (if the page has been compressed or trailing zeroes have +been eliminated), or even a NULL pointer (if the page contained all zeroes and +trailing zero elimination is enabled). +</ul> + +The most apparent usage of this multi-layer web of data structures +is "top-down" because, in normal operation, the vast majority of tmem +operations invoked by a client are +<big><kbd>put</kbd></big>s and <big><kbd>get</kbd></big>s, which require the various +data structures to be walked starting with the client_t, then +a pool_t, then an obj_t, then a pgp_t. +However, there is another highly frequent tmem operation that is not +visible from a client: memory reclamation. +Since tmem attempts to use all spare memory in the system, it must +frequently free up, or evict, +pages. The eviction algorithm will be +explained in more detail later but, in brief, to free memory, ephemeral pages +are removed from the tail of one of the doubly-linked lists, which means that +all of the data structures associated with that page-to-be-removed must be +updated or eliminated and freed. As a +result, each data structure also contains a back-pointer +to its parent, for example every obj_t +contains a pointer to its containing pool_t. + +This complex web of interconnected data structures is updated constantly and +is thus extremely sensitive to careless code changes which, for example, may +result in unexpected hypervisor crashes or non-obvious memory leaks. +On the other hand, the code is fairly well +modularized so, once understood, it is relatively easy to switch out +one kind of data structure for another. +To catch problems as quickly as possible when debug is enabled, most of +the data structures are equipped with sentinels and many inter-function +assumptions are documented and tested dynamically +with assertions. 
+While these clutter and lengthen the tmem +code substantially, their presence has proven invaluable on many occasions. + +For completeness, we should also describe a key data structure in the Xen +implementation-dependent code: the tmh_page_list. For security and +performance reasons, pages that are freed due to tmem operations (such +as <big><kbd>get</kbd></big>) are not immediately put back into Xen's pool +of free memory (aka the Xen heap). +Tmem pages may contain guest-private data that must be scrubbed before +those memory pages are released for the use of other guests. +But if a page is immediately re-used inside of tmem itself, the entire +page is overwritten with new data, so need not be scrubbed. +Since tmem is usually the most frequent +customer of the Xen heap allocation code, it would be a waste of time to scrub +a page, release it to the Xen heap, and then immediately re-allocate it +again. So, instead, tmem maintains +currently-unused pages of memory on its own free list, tmh_page_list, +and returns the pages to Xen only when non-tmem Xen +heap allocation requests would otherwise fail. + +<h2>Scalability/Concurrency</h2> + +Tmem has been designed to be highly scalable. +Since tmem access is invoked similarly in +many ways to asynchronous disk access, a "big SMP" tmem-aware guest +OS can, and often will, invoke tmem hypercalls simultaneously on many different +physical CPUs. And, of course, multiple +tmem-aware guests may independently and simultaneously invoke tmem +hypercalls. While the normal frequency +of tmem invocations is rarely extremely high, some tmem operations such as data +compression or lookups in a very large tree may take tens of thousands of +cycles or more to complete. Measurements +have shown that normal workloads spend no more than about 0.2% (2% with +compression enabled) of CPU time executing tmem operations. 
+But those familiar with OS scalability issues +recognize that even this limited execution time can create concurrency problems +in large systems and result in poorly-scalable performance. + +A good locking strategy is critical to concurrency, but also +must be designed carefully to avoid deadlock and livelock problems. For +debugging purposes, tmem supports a "big kernel lock" which disables +concurrency altogether (enabled in Xen with "tmem_lock", but note +that this functionality is rarely tested and likely has bit-rotted). Infrequent +but invasive tmem hypercalls, such as pool creation or the control operations, +are serialized on a single read-write lock, called tmem_rwlock, +which must be held for writing. All other tmem operations must hold this lock +for reading, so frequent operations such as +<big><kbd>put</kbd></big>, <big><kbd>get</kbd></big>, and <big><kbd>flush</kbd></big> can execute simultaneously +as long as no invasive operations are occurring. + +Once a pool has been selected, there is a per-pool +read-write lock (pool_rwlock) which +must be held for writing if any transformative operations might occur within +that pool, such as when an obj_t is +created or destroyed. For the highly +frequent operation of finding an obj_t +within a pool, pool_rwlock must be held for reading. + +Once an object has been selected, there is a per-object +spinlock (obj_spinlock). +This is a spinlock rather than a read-write +lock because nearly all of the most frequent tmem operations (e.g. +<big><kbd>put</kbd></big>, <big><kbd>get</kbd></big>, and <big><kbd>flush</kbd></big>) +are transformative, in +that they add or remove a page within the object. +This lock is generally taken whenever an +object lookup occurs and released when the tmem operation is complete. + +Next, the per-client and global ephemeral lists are +protected by a single global spinlock (eph_lists_spinlock) +and the per-client persistent lists are also protected by a single global +spinlock (pers_list_spinlock). 
+And to complete the description of +implementation-independent locks, if page deduplication is enabled, all pages +for which the first byte matches are contained in one of 256 trees that are +protected by one of 256 corresponding read-write locks +(pcd_tree_rwlocks). + +In the Xen-specific code (tmem_xen.c), page frames (e.g. struct page_info) +that have been released are kept in a list (tmh_page_list) that +is protected by a spinlock (tmh_page_list_lock). +There is also an "implied" lock +associated with compression, which is likely the most time-consuming operation +in all of tmem (of course, only when compression is enabled): A compression +buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to +this buffer is returned to implementation-independent code and used without a +lock. + +The proper method to avoid deadlocks is to take and release +locks in a very specific predetermined order. +Unfortunately, since tmem data structures must simultaneously be +accessed "top-down" +(<big><kbd>put</kbd></big> and <big><kbd>get</kbd></big>) +and "bottom-up" +(memory reclamation), more complex methods must be employed: +A trylock mechanism is used (cf. tmem_try_to_evict_pgp()), +which takes the lock if it is available but returns immediately (rather than +spinning and waiting) if the lock is not available. +When walking the ephemeral list to identify +pages to free, any page that belongs to an object that is locked is simply +skipped. Further, if the page is the +last page belonging to an object, and the pool read-write lock for the pool the +object belongs to is not available (for writing), that object is skipped. +These constraints modify the LRU algorithm +somewhat, but avoid the potential for deadlock. 
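The trylock-and-skip walk described above can be illustrated with a minimal sketch. This is not the Xen code: struct obj, struct page, obj_trylock(), obj_unlock() and evict_one() are hypothetical names invented for the example, a C11 atomic_flag stands in for obj_spinlock, and the additional check on the pool read-write lock for the last page of an object is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for tmem's obj_t and pgp_t. */
struct obj {
    atomic_flag lock;   /* plays the role of obj_spinlock */
    int npages;         /* pages currently held by this object */
};

struct page {
    struct obj *obj;    /* back-pointer to the owning object */
    struct page *newer; /* next entry toward the head of the LRU list */
    bool evicted;
};

/* Take the lock only if it is free; never spin or wait. */
static bool obj_trylock(struct obj *o)
{
    return !atomic_flag_test_and_set(&o->lock);
}

static void obj_unlock(struct obj *o)
{
    atomic_flag_clear(&o->lock);
}

/*
 * Walk from the oldest page toward the newest and evict the first page
 * whose object lock can be taken without waiting. Returns the evicted
 * page, or NULL if every remaining candidate was busy.
 */
struct page *evict_one(struct page *oldest)
{
    for (struct page *p = oldest; p != NULL; p = p->newer) {
        if (p->evicted)
            continue;
        if (!obj_trylock(p->obj))
            continue;   /* object is busy: skip it rather than wait */
        p->evicted = true;
        p->obj->npages--;
        obj_unlock(p->obj);
        return p;
    }
    return NULL;
}
```

Because a busy object is skipped instead of waited on, the reclaiming path never blocks on a lock held by a top-down operation, at the price of evicting a page that is not strictly the oldest.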
+ +Unfortunately, a livelock was still discovered in this approach: +When memory is scarce and each client is +<big><kbd>put</kbd></big>ting a large number of pages +for exactly one object (and thus holding the object spinlock for that object), +memory reclamation takes a very long time to determine that it is unable to +free any pages, and so the time to do a +<big><kbd>put</kbd></big> (which eventually fails) becomes linear in the +number of pages in the object! To avoid +this situation, a workaround was added to always ensure a minimum amount of +memory (1MB) is available before any object lock is taken for the client +invoking tmem (see tmem_ensure_avail_pages()). +Other such livelocks (and perhaps deadlocks) +may be lurking. + +A last issue related to concurrency is atomicity of counters. +Tmem gathers a large number of +statistics. Some of these counters are +informational only, while some are critical to tmem operation and must be +incremented and decremented atomically to ensure, for example, that the number +of pages in a tree never goes negative if two concurrent tmem operations access +the counter exactly simultaneously. Some +of the atomic counters are used for debugging (in assertions) and perhaps need +not be atomic; fixing these may increase performance slightly by reducing +cache-coherency traffic. Similarly, some +of the non-atomic counters may yield strange results to management tools, such +as showing the total number of successful +<big><kbd>put</kbd></big>s as being higher than the number of +<big><kbd>put</kbd></big>s attempted. +These are left as exercises for future tmem implementors. + +<h2>Control and Manageability</h2> + + +Tmem has a control interface to, for example, set various +parameters and obtain statistics. All +tmem control operations funnel through do_tmem_control() +and other functions supporting tmem control operations are prefixed +with tmemc_. 
+ + +During normal operation, even if only one tmem-aware guest +is running, tmem may absorb nearly all free memory in the system for its own +use. Then if a management tool wishes to +create a new guest (or migrate a guest from another system to this one), it may +notice that there is insufficient "free" memory and fail the creation +(or migration). For this reason, tmem +introduces a new tool-visible class of memory -- freeable memory -- +and provides a control interface to access +it. All ephemeral memory and all pages on the tmh_page_list +are freeable. To properly access freeable +memory, a management tool must follow a sequence of steps: +<ul> +<li> +freeze +tmem: When tmem is frozen, all +<big><kbd>put</kbd></big>s fail, which ensures that no +additional memory may be absorbed by tmem. +(See tmemc_freeze_pools(), and +note that individual clients may be frozen, though this functionality may be +used only rarely.) +<li> +query freeable MB: If all freeable memory were released to the Xen +heap, this is the amount of memory (in MB) that would be freed. +See tmh_freeable_pages(). +<li> +flush: +Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g. +back to the Xen heap. This amount is +specified in KB. See tmemc_flush_mem() and tmem_relinquish_npages(). +<li> +At this point the management tool may allocate +the memory, e.g. using Xen's published interfaces. +<li> +thaw +tmem: This terminates the freeze, allowing tmem to accept +<big><kbd>put</kbd></big>s again. +</ul> + +Extensive tmem statistics are available through tmem's +control interface (see tmemc_list and +the separate source for the "xm tmem-list" command and the +xen-tmem-list-parse tool). To maximize +forward/backward compatibility with future tmem and tools versions, statistical +information is passed via an ASCII interface where each individual counter is +identified by an easily parseable two-letter ASCII sequence. 
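To illustrate why two-letter keys help with forward/backward compatibility, here is a minimal sketch of a tool-side lookup. The record layout (comma-separated "XX:value" fields), the function name stat_lookup() and the sample keys in the usage note are assumptions made for this example; they are not taken from tmemc_list or from the xen-tmem-list-parse tool.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up a two-letter counter in a comma-separated "XX:value" record.
 * Returns 0 and stores the value on success, -1 if the key is absent.
 * The record layout is an assumption made for this sketch.
 */
int stat_lookup(const char *record, const char *key, unsigned long *value)
{
    const char *p = record;

    if (strlen(key) != 2)
        return -1;
    while (*p != '\0') {
        /* A matching field starts with the two-letter key and a colon. */
        if (p[0] == key[0] && p[1] == key[1] && p[2] == ':') {
            *value = strtoul(p + 3, NULL, 10);
            return 0;
        }
        /* Unknown or uninteresting field: skip to the next comma. */
        p = strchr(p, ',');
        if (p == NULL)
            break;
        p++;
    }
    return -1;
}
```

With a made-up record such as "Tt:120,Te:7,Cf:0,", looking up "Te" stores 7. Fields the tool does not recognize are simply skipped, so a newer tmem can add counters without breaking an older parser.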
+ +<h2>Save/Restore/Migrate</h2> + + +Another piece of functionality that has a major impact on +the tmem code is support for save/restore of a tmem client and, highly related, +live migration of a tmem client. +Ephemeral pages, by definition, do not need to be saved or +live-migrated, but persistent pages are part of the state of a running VM and +so must be properly preserved. + +When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step +is for the tmem client to be frozen (see the manageability section). +Next, tmem API version information is +recorded (to avoid possible incompatibility issues as the tmem spec evolves in +the future). Then, certain high-level +tmem structural information specific to the client is recorded, including +information about the existing pools. +Finally, the contents of all persistent pages are recorded. + +For live-migration, the process is somewhat more complicated. +Ignoring tmem for a moment, recall that in +live migration, the vast majority of the VM's memory is transferred while the +VM is still fully operational. During +each phase, memory pages belonging to the VM that are changed are marked and +then retransmitted during a later phase. +Eventually only a small amount of memory remains, the VM is paused, the +remaining memory is transmitted, and the VM is unpaused on the target machine. + +The number of persistent tmem pages may be quite large, +possibly even larger than all the other memory used by the VM; so it is +unacceptable to transmit persistent tmem pages during the "paused" +phase of live migration. But if the VM +is still operational, it may be making calls to tmem: +A frozen tmem client will reject any +<big><kbd>put</kbd></big> operations, but tmem must +still correctly process <big><kbd>flush</kbd></big>es +(page and object), including implicit flushes due to duplicate +<big><kbd>put</kbd></big>s. 
+Fortunately, these operations can only +invalidate tmem pages, not overwrite tmem pages or create new pages. +So, when a live-migrate has been initiated, +the client is frozen. Then during the +"live" phase, tmem transmits all persistent pages, but also records +the handle of all persistent pages that are invalidated. +Then, during the "paused" phase, +only the handles of invalidated persistent pages are transmitted, resulting in +the invalidation on the target machine of any matching pages that were +previously transmitted during the "live" phase. + +For restore (and on the target machine of a live migration), +tmem must be capable of reconstructing the internal state of the client from +the saved/migrated data. However, it is +not the client itself that is <big><kbd>put</kbd></big>'ing +the pages but the management tools conducting the restore/migration. +This slightly complicates tmem by requiring +new API calls and new functions in the implementation, but the code is +structured so that duplication is minimized. +Once all tmem data structures for the client are reconstructed, all +persistent pages are recreated and, in the case of live-migration, all +invalidations have been processed and the client has been thawed, the restored +client can be resumed. + +Finally, tmem's data structures must be cluttered a bit to +support save/restore/migration. Notably, +a per-pool list of persistent pages must be maintained and, during live +migration, a per-client list of invalidated pages must be logged. +A reader of the code will note that these +lists are overlaid into space-sensitive data structures as a union, which may +be more error-prone but eliminates significant space waste. + +<h2>Miscellaneous Tmem Topics</h2> + + +Duplicate <big><kbd>puts</kbd></big>. 
+One interesting corner case that +significantly complicates the tmem source code is the possibility +of a duplicate +<big><kbd>put</kbd></big>, +which occurs when two +<big><kbd>put</kbd></big>s +are requested with the same handle but with possibly different data. +The tmem API addresses + +<big><kbd>put</kbd></big>-<big><kbd>put</kbd></big>-<big><kbd>get</kbd></big> +coherence explicitly: When a duplicate +<big><kbd>put</kbd></big> occurs, tmem may react in one of two ways: (1) The +<big><kbd>put</kbd></big> may succeed with the old +data overwritten by the new data, or (2) the +<big><kbd>put</kbd></big> may be failed with the original data flushed and +neither the old nor the new data accessible. +Tmem may not fail the +<big><kbd>put</kbd></big> and leave the old data accessible. + +When tmem has been actively working for an extended period, +system memory may be in short supply and it is possible for a memory allocation +for a page (or even a data structure such as a pgp_t) to fail. Thus, +for a duplicate +<big><kbd>put</kbd></big>, it may be impossible for tmem to temporarily +simultaneously maintain data structures and data for both the original +<big><kbd>put</kbd></big> and the duplicate +<big><kbd>put</kbd></big>. +When the space required for the data is +identical, tmem may be able to overwrite in place the old data with +the new data (option 1). But in some circumstances, such as when data +is being compressed, overwriting is not always possible and option 2 must be +performed. + +Page deduplication and trailing-zero elimination. +When page deduplication is enabled +("tmem_dedup" option to Xen), ephemeral pages for which the contents +are identical -- whether the pages belong +to the same client or different clients -- utilize the same pageframe of +memory. In Xen environments where +multiple domains have a highly similar workload, this can save a substantial +amount of memory, allowing a much larger number of ephemeral pages to be +used. 
Tmem page deduplication uses +methods similar to the KSM implementation in Linux [ref], but differences between +the two are sufficiently great that tmem does not directly leverage the +code. In particular, ephemeral pages in +tmem are never dirtied, so need never be copied-on-write. +Like KSM, however, tmem avoids hashing, +instead employing red-black trees +that use the entire page contents as the lookup +key. There may be better ways to implement this. + +Dedup'ed pages may optionally be compressed +("tmem_compress" and "tmem_dedup" Xen options specified), +to save even more space, at the cost of more time. +Additionally, trailing zero elimination (tze) may be applied to dedup'ed +pages. With tze, pages that contain a +significant number of zeroes at the end of the page are saved without the trailing +zeroes; an all-zero page requires no data to be saved at all. +In certain workloads that utilize a large number +of small files (and for which the last partial page of a file is padded with +zeroes), a significant space savings can be realized without the high cost of +compression/decompression. + +Both compression and tze significantly complicate memory +allocation. This will be discussed more below. + +Memory accounting. +Accounting is boring, but poor accounting may +result in some interesting problems. In +the implementation-independent code of tmem, most data structures, page frames, +and partial pages (e.g. for compression) are billed to a pool, +and thus to a client. Some infrastructure data structures, such as +pools and clients, are allocated with tmh_alloc_infra(), which does not +require a pool to be specified. Two other +exceptions are page content descriptors (pcd_t) +and sharelists (sharelist_t) which +are explicitly not associated with a pool/client by specifying NULL instead of +a pool_t. +(Note to self: +These should probably just use the tmh_alloc_infra() interface too.) 
+As we shall see, persistent pool pages and +data structures may need to be handled a bit differently, so the +implementation-independent layer calls a different allocation/free routine for +persistent pages (e.g. tmh_alloc_page_thispool()) +than for ephemeral pages (e.g. tmh_alloc_page()). + +In the Xen-specific layer, we +disregard the pool_t for ephemeral +pages, as we use the generic Xen heap for all ephemeral pages and data +structures. (Denial-of-service attacks +can be handled in the implementation-independent layer because ephemeral pages +are kept in per-client queues each with a counted length. +See the discussion on weights and caps below.) +However we explicitly bill persistent pages +and data structures against the client/domain that is using them. +(See the calls to the Xen routine alloc_domheap_pages() in tmem_xen.h; if +the first argument is a domain, the pages allocated are billed by Xen to that +domain.) This means that a Xen domain +cannot allocate even a single tmem persistent page when it is currently utilizing +its maximum assigned memory allocation! +This is reasonable for persistent pages because, even though the data is +not directly accessible by the domain, the data is permanently saved until +either the domain flushes it or the domain dies. + +Note that proper accounting requires (even for ephemeral pools) that the same +pool is referenced when memory is freed as when it was allocated, even if the +ownership of a pool has been moved from one client to another (cf. shared_pool_reassign()). +The underlying Xen-specific information may +not always enforce this for ephemeral pools, but incorrect alloc/free matching +can cause some difficult-to-find memory leaks and bent pointers. + +Page deduplication is not possible for persistent pools for +accounting reasons: Imagine a page that is created by persistent pool A, which +belongs to a domain that is currently well under its maximum allocation. 
+Then the pcd_t is matched by persistent pool B, which is +currently at its maximum. +Then the domain owning pool A is destroyed. +Is B beyond its maximum? +(There may be a clever way around this +problem. Exercise for the reader!) + +Memory allocation. The implementation-independent layer assumes +there is a good fast general-purpose dynamic memory allocator with bounded +response time and efficient use of memory for a very large number of sub-page +allocations. The old xmalloc memory +allocator in Xen was not a good match for this purpose, so was replaced by the +TLSF allocator. Note that the TLSF +allocator is used only for allocations smaller than a page (and, more +precisely, no larger than tmem_subpage_maxsize()); +full pages are allocated by Xen's normal heap allocator. + +After the TLSF allocator was integrated into Xen, more work +was required so that each client could allocate memory from a separate +independent pool. (See the call to xmem_pool_create() in +tmh_client_init().) +This allows the data structures allocated for the +purpose of supporting persistent pages to be billed to the same client as the +pages themselves. It also allows partial +(e.g. compressed) pages to be properly billed. +Further, when partial page allocations cause internal fragmentation, +this fragmentation can be isolated per-client. +And, when a domain dies, full pages can be freed, rather than only +partial pages. One other change was +required in the TLSF allocator: In the original version, when a TLSF memory +pool was allocated, the first page of memory was also allocated. +Since, for a persistent pool, this page would +be billed to the client, the allocation of the first page failed if the domain +was started at its maximum memory, and this resulted in a failure to create the +memory pool. To avoid this, the code was +changed to delay the allocation of the first page until first use of the memory +pool. + +Memory allocation interdependency. 
+As previously described, +pages of memory must be moveable back and forth between the Xen heap and the +tmem ephemeral lists (and page lists). +When tmem needs a page but doesn't have one, it requests one from the +Xen heap (either indirectly via xmalloc, or directly via Xen's alloc_domheap_pages()). +And when Xen needs a page but doesn't have +one, it requests one from tmem (via a call to tmem_relinquish_pages() in Xen's alloc_heap_pages() in page_alloc.c). +This leads to a potential infinite loop! +To break this loop, a new memory flag (MEMF_tmem) was added to Xen +to flag and disallow the loop. +See tmh_called_from_tmem() +in tmem_relinquish_pages(). +Note that the tmem_relinquish_pages() interface allows for memory requests of +order > 0 (multiple contiguous pages), but the tmem implementation disallows +any requests larger than a single page. + +LRU page reclamation. +Ephemeral pages generally age in +a queue, and the space associated with the oldest -- or least-recently-used -- page is reclaimed when tmem needs more +memory. But there are a few exceptions +to strict LRU queuing. First is when +removal from a queue is constrained by locks, as described above. +Second, when an ephemeral pool is shared, unlike a private ephemeral +pool, a +<big><kbd>get</kbd></big> +does not imply a +<big><kbd>flush</kbd></big>. +Instead, in a shared pool, a +<big><kbd>get</kbd></big> +results in the page being promoted to the front of the queue. +Third, when a page that is deduplicated (i.e. +is referenced by more than one pgp_t) +reaches the end of the LRU queue, it is marked as eviction attempted and promoted to the front of the queue; if it +reaches the end of the queue a second time, eviction occurs. +Note that only the pgp_t is evicted; the actual data is only reclaimed if there is no +other pgp_t pointing to the data. + +All of these modified-LRU algorithms deserve to be studied +carefully against a broad range of workloads. + +Internal fragmentation. 
+When +compression or tze is enabled, allocations between a half-page and a full-page +in size are very common and this places a great deal of pressure on even the +best memory allocator. Additionally, +problems may be caused for memory reclamation: When one tmem ephemeral page is +evicted, only a fragment of a physical page of memory might be reclaimed. +As a result, when compression or tze is +enabled, it may take a very large number of eviction attempts to free up a full +contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction +must be assumed to be able to fail. +While all memory allocation paths in tmem are resilient to failure, very +complex corner cases may eventually occur. +As a result, compression and tze are disabled by default and should be +used with caution until they have been tested with a much broader set of +workloads. (Note to self: The +code needs work.) + +Weights and caps. +Because +of the just-discussed LRU-based eviction algorithms, a client that uses tmem at +a very high frequency can quickly swamp tmem so that it provides little benefit +to a client that uses it less frequently. +To reduce the possibility of this denial-of-service, limits can be +specified via management tools that are enforced internally by tmem. +On Xen, the "xm tmem-set" command +can specify "weight=<weight>" or "cap=<cap>" +for any client. If weight is non-zero +for a client and the current percentage of ephemeral pages in use by the client +exceeds its share (as measured by the sum of weights of all clients), the next +page chosen for eviction is selected from the requesting client's ephemeral +queue, instead of the global ephemeral queue that contains pages from all +clients. (See client_over_quota().) +Setting a cap for a client is currently a no-op. + +Shared pools and authentication. +When tmem was first proposed to the Linux kernel mailing list +(LKML), there was concern expressed about security of shared ephemeral +pools. 
The initial tmem implementation only +required a client to provide a 128-bit UUID to identify a shared pool, and the +Linux-side tmem implementation obtained this UUID from the superblock of the +shared filesystem (in ocfs2). It was +pointed out on LKML that the UUID was essentially a security key and any +malicious domain that guessed it would have access to any data from the shared +filesystem that found its way into tmem. +Ocfs2 has only very limited security; it is assumed that anyone who can +access the filesystem bits on the shared disk can mount the filesystem and use +it. But in a virtualized data center, +higher isolation requirements may apply. +As a result, a Xen boot option -- "tmem_shared_auth" -- was +added. The option defaults to disabled, +but when it is enabled, management tools must explicitly authenticate (or may +explicitly deny) shared pool access to any client. +On Xen, this is done with the "xm +tmem-shared-auth" command. + +32-bit implementation. +There was some effort put into getting tmem working on a 32-bit Xen. +However, the Xen heap is limited in size on +32-bit Xen so tmem did not work very well. +There are still 32-bit ifdefs in some places in the code, but things may +have bit-rotted so using tmem on a 32-bit Xen is not recommended. + +IA-64 implementation. +The vast majority of the tmem +implementation is architecture-independent. +For tmem to run on Xen/ia64, it is believed that only one or two +routines need to be written. (See the +#ifdef __ia64__ at cli_mfn_to_va().) + +<h2>Known Issues</h2> + +Fragmentation. When tmem +is active, all physical memory becomes fragmented +into individual pages. However, the Xen +memory allocator allows memory to be requested in multi-page contiguous +quantities, called order>0 allocations. +(e.g. 2<sup>order</sup>, so +order==4 is sixteen contiguous pages.) +In some cases, a request for a larger order will fail gracefully if no +matching contiguous allocation is available from Xen. 
+As of Xen 4.0, however, there are several +critical order>0 allocation requests that do not fail gracefully. +Notably, when a domain is created, an +order==4 structure is required or the domain creation will fail. +And shadow paging requires many order==2 +allocations; if these fail, a PV live-migration may fail. +There are likely other such issues. + +But fragmentation can occur even without tmem if any domU does +any extensive ballooning; tmem just accelerates the fragmentation. +So the fragmentation problem must be solved +anyway. The best solution is to disallow +order>0 allocations altogether in Xen -- or at least ensure that any attempt +to allocate order>0 can fail gracefully, e.g. by falling back to a sequence +of single page allocations. However, this restriction may require a major rewrite +in some of Xen's most sensitive code. +(Note that order>0 allocations during Xen boot and early in domain0 +launch are safe and, if dom0 does not enable tmem, any order>0 allocation by +dom0 is safe, until the first domU is created.) + +Until Xen can be rewritten to be fragmentation-safe, a small hack +was added in the Xen page +allocator. (See the comment +"memory is scarce" in alloc_heap_pages().) +Briefly, a portion of memory is pre-reserved +for allocations where order>0 and order<9. +(Domain creation uses 2MB pages, but fails +gracefully, and there are no other known order==9 allocations or order>9 +allocations currently in Xen.) + +NUMA. Tmem assumes that +all memory pages are equal and any RAM page can store a page of data for any +client. This has potential performance +consequences in any NUMA machine where access to far memory is significantly slower than access to near memory. +On nearly all of today's servers, however, +access to far memory is still +much faster than access to disk or network-based storage, and tmem's primary performance +advantage comes from the fact that paging and swapping are reduced. 
+So, the current tmem implementation ignores +NUMA-ness; future tmem design for NUMA machines is an exercise left for the +reader. + +<h2>Bibliography</h2> + + +(needs work) +<a href="http://oss.oracle.com/projects/tmem">http://oss.oracle.com/projects/tmem</a>
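The weight rule described under "Weights and caps" can be illustrated with a small C sketch. This is not the code of client_over_quota(); the function name, the parameter names, and the cross-multiplied comparison are assumptions made for illustration only. It shows one way to test whether a client's fraction of ephemeral pages exceeds its fraction of the total weight without using division.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch (names are illustrative, not taken from tmem.c).
 * Returns true when the client's share of ephemeral pages exceeds its
 * share of the summed weights, i.e. when
 *     client_pages / total_pages > client_weight / total_weight.
 * The comparison is cross-multiplied to stay in integer arithmetic.
 */
static bool tmem_over_quota_sketch(uint64_t client_pages,
                                   uint64_t total_pages,
                                   uint32_t client_weight,
                                   uint32_t total_weight)
{
    /* A zero weight means no limit is enforced for this client. */
    if (client_weight == 0 || total_weight == 0 || total_pages == 0)
        return false;

    return client_pages * (uint64_t)total_weight >
           total_pages * (uint64_t)client_weight;
}
```

Under this sketch, a caller that gets true would pick the next eviction victim from the requesting client's own ephemeral queue, and otherwise from the global queue, as the text describes.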
