[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [Xen-changelog] [xen-unstable] docs: Add tmem documentation
# HG changeset patch # User Keir Fraser <keir.fraser@xxxxxxxxxx> # Date 1271835091 -3600 # Node ID c7d7797656dfe6c41ab59954a7bba1a2748c6836 # Parent b36467432effd0c7a94c2677df978b509c0497f7 docs: Add tmem documentation Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx> --- docs/misc/tmem-internals.html | 798 ++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 798 insertions(+) diff -r b36467432eff -r c7d7797656df docs/misc/tmem-internals.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/docs/misc/tmem-internals.html Wed Apr 21 08:31:31 2010 +0100 @@ -0,0 +1,798 @@ +<h1>Transcendent Memory Internals in Xen</h1> +<P> +by Dan Magenheimer, Oracle Corp.</p> +<P> +Draft 0.1 -- Updated: 20100324 +<h2>Overview</h2> +<P> +This document focuses on the internal implementation of +Transcendent Memory (tmem) on Xen. It assumes +that the reader has a basic knowledge of the terminology, objectives, and +functionality of tmem and also has access to the Xen source code. +It corresponds to the Xen 4.0 release, with +patch added to support page deduplication (V2). +<P> +The primary responsibilities of the tmem implementation are to: +<ul> +<li>manage a potentially huge and extremely dynamic +number of memory pages from a potentially large number of clients (domains) +with low memory overhead and proper isolation +<li>provide quick and efficient access to these +pages with as much concurrency as possible +<li>enable efficient reclamation and <i>eviction</i> of pages (e.g. when +memory is fully utilized) +<li>optionally, increase page density through compression and/or +deduplication +<li>where necessary, properly assign and account for +memory belonging to guests to avoid malicious and/or accidental unfairness +and/or denial-of-service +<li>record utilization statistics and make them available to management tools +</ul> +<h2>Source Code Organization</h2> + +<P> +The source code in Xen that provides the tmem functionality +is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h. +The files tmem.c and tmem.h are intended to +be implementation- (and hypervisor-) independent and the other two files +provide the Xen-specific code. This +division is intended to make it easier to port tmem functionality to other +hypervisors, though at this time porting to other hypervisors has not been +attempted. Together, these four files +total less than 4000 lines of C code. +<P> +Even ignoring the implementation-specific functionality, the +implementation-independent part of tmem has several dependencies on +library functionality (Xen source filenames in parentheses): +<ul> +<li> +a good fast general-purpose dynamic memory +allocator with bounded response time and efficient use of memory for a very +large number of sub-page allocations. To +achieve this in Xen, the bad old memory allocator was replaced with a +slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by +Nitin Gupta for compcache. +<li> +good tree data structure libraries, specifically +<i>red-black</i> trees (rbtree.c) and <i>radix</i> trees (radix-tree.c). +Code for these was borrowed for Linux and adapted for tmem and Xen. +<li> +good locking and list code. Both of these existed in Xen and required +little or no change. +<li> +optionally, a good fast lossless compression +library. The Xen implementation added to +support tmem uses LZO1X (lzo.c), also ported for Linux by Nitin Gupta. +</ul> +<P> +More information about the specific functionality of these +libraries can easily be found through a search engine, via wikipedia, or in the +Xen or Linux source logs so we will not elaborate further here. + +<h2>Prefixes/Abbreviations/Glossary</h2> + +<P> +The tmem code uses several prefixes and abbreviations. +Knowledge of these will improve code readability: +<ul> +<li> +<i>tmh</i> == +transcendent memory host. Functions or +data structures that are defined by the implementation-specific code, i.e. the +Xen host code +<li> +<i>tmemc</i> +== transcendent memory control. +Functions or data structures that provide management tool functionality, +rather than core tmem operations. +<li> +<i>cli </i>or +<i>client</i> == client. +The tmem generic term for a domain or a guest OS. +</ul> +<P> +When used in prose, common tmem operations are indicated +with a different font, such as <big><kbd>put</kbd></big> +and <big><kbd>get</kbd></big>. + +<h2>Key Data Structures</h2> + +<P> +To manage a huge number of pages, efficient data structures +must be carefully selected. +<P> +Recall that a tmem-enabled guest OS may create one or more +pools with different attributes. It then +<kbd>put</kbd></big>s and <kbd>get</kbd></big>s +pages to/from this pool, identifying the page +with a <i>handle</i> that consists of a <i>pool_id</i>, an <i> +object_id</i>, and a <i>page_id </i>(sometimes +called an <i>index</i>). +This suggests a few obvious core data +structures: +<ul> +<li> +When a guest OS first calls tmem, a <i>client_t</i> is created to contain +and track all uses of tmem by that guest OS. Among +other things, a <i>client_t</i> keeps pointers +to a fixed number of pools (16 in the current Xen implementation). +<li> +When a guest OS requests a new pool, a <i>pool_t</i> is created. +Some pools are shared and are kept in a +sharelist (<i>sharelist_t</i>) which points +to all the clients that are sharing the pool. +Since an <i>object_id</i> is 64-bits, +a <i>pool_t</i> must be able to keep track +of a potentially very large number of objects. +To do so, it maintains a number of parallel trees (256 in the current +Xen implementation) and a hash algorithm is applied to the <i>object_id</i> +to select the correct tree. +Each tree element points to an object. +Because an <i>object_id</i> usually represents an <i>inode</i> +(a unique file number identifier), and <i>inode</i> numbers +are fairly random, though often "clumpy", a <i>red-black tree</i> +is used. +<li> +When a guest first +<kbd>put</kbd></big>s a page to a pool with an as-yet-unused <i>object_id,</i> an +<i>obj_t</i> is created. Since a <i +>page_id</i> is usually an index into a file, +it is often a small number, but may sometimes be very large (up to +32-bits). A <i>radix tree</i> is a good data structure to contain items +with this kind of index distribution. +<li> +When a page is +<kbd>put</kbd></big>, a page descriptor, or <i>pgp_t</i>, is created, which +among other things will point to the storage location where the data is kept. +In the normal case the pointer is to a <i>pfp_t</i>, which is an +implementation-specific datatype representing a physical pageframe in memory +(which in Xen is a "struct page_info"). +When deduplication is enabled, it points to +yet another data structure, a <i>pcd_</i>t +(see below). When compression is enabled +(and deduplication is not), the pointer points directly to the compressed data. +For reasons we will see shortly, each <i>pgp_t</i> that represents +an <i>ephemeral</i> page (that is, a page placed +in an <i>ephemeral</i> pool) is also placed +into two doubly-linked linked lists, one containing all ephemeral pages +<kbd>put</kbd></big> by the same client and one +containing all ephemeral pages across all clients ("global"). +<li> +When deduplication is enabled, multiple <i>pgp_</i>t's may need to point to +the same data, so another data structure (and level of indirection) is used +called a page content descriptor, or <i>pcd_t</i>. +Multiple page descriptors (<i>pgp_t</i>'s) may point to the same <i>pcd_t</i>. +The <i>pcd_t</i>, in turn, points to either a <i>pfp_t</i> +(if a full page of data), directly to a +location in memory (if the page has been compressed or trailing zeroes have +been eliminated), or even a NULL pointer (if the page contained all zeroes and +trailing zero elimination is enabled). +</ul> +<P> +The most apparent usage of this multi-layer web of data structures +is "top-down" because, in normal operation, the vast majority of tmem +operations invoked by a client are +<kbd>put</kbd></big>s and <kbd>get</kbd></big>s, which require the various +data structures to be walked starting with the <i>client_t</i>, then +a <i>pool_t</i>, then an <i>obj_t</i>, then a <i>pgd_t</i>. +However, there is another highly frequent tmem operation that is not +visible from a client: memory reclamation. +Since tmem attempts to use all spare memory in the system, it must +frequently free up, or <i>evict</i>, +pages. The eviction algorithm will be +explained in more detail later but, in brief, to free memory, ephemeral pages +are removed from the tail of one of the doubly-linked lists, which means that +all of the data structures associated with that page-to-be-removed must be +updated or eliminated and freed. As a +result, each data structure also contains a <i>back-pointer</i> +to its parent, for example every <i>obj_t</i> +contains a pointer to its containing <i>pool_t</i>. +<P> +This complex web of interconnected data structures is updated constantly and +thus extremely sensitive to careless code changes which, for example, may +result in unexpected hypervisor crashes or non-obvious memory leaks. +On the other hand, the code is fairly well +modularized so, once understood, it is possible to relatively easily switch out +one kind of data structure for another. +To catch problems as quickly as possible when debug is enabled, most of +the data structures are equipped with <i>sentinels</i>and many inter-function +assumptions are documented and tested dynamically +with <i>assertions</i>. +While these clutter and lengthen the tmem +code substantially, their presence has proven invaluable on many occasions. +<P> +For completeness, we should also describe a key data structure in the Xen +implementation-dependent code: the <i>tmh_page_list</i>. For security and +performance reasons, pages that are freed due to tmem operations (such +as <kbd>get</kbd></big>) are not immediately put back into Xen's pool +of free memory (aka the Xen <i>heap</i>). +Tmem pages may contain guest-private data that must be <i>scrubbed</i> before +those memory pages are released for the use of other guests. +But if a page is immediately re-used inside of tmem itself, the entire +page is overwritten with new data, so need not be scrubbed. +Since tmem is usually the most frequent +customer of the Xen heap allocation code, it would be a waste of time to scrub +a page, release it to the Xen heap, and then immediately re-allocate it +again. So, instead, tmem maintains +currently-unused pages of memory on its own free list, <i>tmh_page_list</i>, +and returns the pages to Xen only when non-tmem Xen +heap allocation requests would otherwise fail. + +<h2>Scalablility/Concurrency</h2> + +<P>Tmem has been designed to be highly scalable. +Since tmem access is invoked similarly in +many ways to asynchronous disk access, a "big SMP" tmem-aware guest +OS can, and often will, invoke tmem hypercalls simultaneously on many different +physical CPUs. And, of course, multiple +tmem-aware guests may independently and simultaneously invoke tmem +hypercalls. While the normal frequency +of tmem invocations is rarely extremely high, some tmem operations such as data +compression or lookups in a very large tree may take tens of thousands of +cycles or more to complete. Measurements +have shown that normal workloads spend no more than about 0.2% (2% with +compression enabled) of CPU time executing tmem operations. +But those familiar with OS scalability issues +recognize that even this limited execution time can create concurrency problems +in large systems and result in poorly-scalable performance. +<P> +A good locking strategy is critical to concurrency, but also +must be designed carefully to avoid deadlock and <i>livelock</i> problems. For +debugging purposes, tmem supports a "big kernel lock" which disables +concurrency altogether (enabled in Xen with "tmem_lock", but note +that this functionality is rarely tested and likely has bit-rotted). Infrequent +but invasive tmem hypercalls, such as pool creation or the control operations, +are serialized on a single <i>read-write lock</i>, called tmem_rwlock, +which must be held for writing. All other tmem operations must hold this lock +for reading, so frequent operations such as +<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big> can execute simultaneously +as long as no invasive operations are occurring. +<P> +Once a pool has been selected, there is a per-pool +read-write lock (<i>pool_rwlock</i>) which +must be held for writing if any transformative operations might occur within +that pool, such as when an<i> obj_t</i> is +created or destroyed. For the highly +frequent operation of finding an<i> obj_t</i> +within a pool, pool_rwlock must be held for reading. +<P> +Once an object has been selected, there is a per-object +spinlock (<i>obj_spinlock)</i>. +This is a spinlock rather than a read-write +lock because nearly all of the most frequent tmem operations (e.g. +<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big>) +are transformative, in +that they add or remove a page within the object. +This lock is generally taken whenever an +object lookup occurs and released when the tmem operation is complete. +<P> +Next, the per-client and global ephemeral lists are +protected by a single global spinlock (<i>eph_lists_</i>spinlock) +and the per-client persistent lists are also protected by a single global +spinlock (<i>pers_list_spinlock</i>). +And to complete the description of +implementation-independent locks, if page deduplication is enabled, all pages +for which the first byte match are contained in one of 256 trees that are +protected by one of 256 corresponding read-write locks +(<i>pcd_tree_rwlocks</i>). +<P> +In the Xen-specific code (tmem_xen.c), page frames (e.g. struct page_info) +that have been released are kept in a list (<i>tmh_page_list</i>) that +is protected by a spinlock (<i>tmh_page_list_lock</i>). +There is also an "implied" lock +associated with compression, which is likely the most time-consuming operation +in all of tmem (of course, only when compression is enabled): A compression +buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to +this buffer is returned to implementation-independent code and used without a +lock. +<P> +The proper method to avoid deadlocks is to take and release +locks in a very specific predetermined order. +Unfortunately, since tmem data structures must simultaneously be +accessed "top-down" ( +<kbd>put</kbd></big> and <kbd>get</kbd></big>) +and "bottoms-up" +(memory reclamation), more complex methods must be employed: +A <i>trylock</i>mechanism is used (c.f. <i>tmem_try_to_evict_pgp()</i>), +which takes the lock if it is available but returns immediately (rather than +spinning and waiting) if the lock is not available. +When walking the ephemeral list to identify +pages to free, any page that belongs to an object that is locked is simply +skipped. Further, if the page is the +last page belonging to an object, and the pool read-write lock for the pool the +object belongs to is not available (for writing), that object is skipped. +These constraints modify the LRU algorithm +somewhat, but avoid the potential for deadlock. +<P> +Unfortunately, a livelock was still discovered in this approach: +When memory is scarce and each client is +<kbd>put</kbd></big>ting a large number of pages +for exactly one object (and thus holding the object spinlock for that object), +memory reclamation takes a very long time to determine that it is unable to +free any pages, and so the time to do a +<kbd>put</kbd></big> (which eventually fails) becomes linear to the +number of pages in the object! To avoid +this situation, a workaround was added to always ensure a minimum amount of +memory (1MB) is available before any object lock is taken for the client +invoking tmem (see <i>tmem_ensure_avail_pages()</i>). +Other such livelocks (and perhaps deadlocks) +may be lurking. +<P> +A last issue related to concurrency is atomicity of counters. +Tmem gathers a large number of +statistics. Some of these counters are +informational only, while some are critical to tmem operation and must be +incremented and decremented atomically to ensure, for example, that the number +of pages in a tree never goes negative if two concurrent tmem operations access +the counter exactly simultaneously. Some +of the atomic counters are used for debugging (in assertions) and perhaps need +not be atomic; fixing these may increase performance slightly by reducing +cache-coherency traffic. Similarly, some +of the non-atomic counters may yield strange results to management tools, such +as showing the total number of successful +<kbd>put</kbd></big>s as being higher than the number of +<kbd>put</kbd></big>s attempted. +These are left as exercises for future tmem implementors. + +<h2>Control and Manageability</h2> + +<P> +Tmem has a control interface to, for example, set various +parameters and obtain statistics. All +tmem control operations funnel through <i>do_tmem_control()</i> +and other functions supporting tmem control operations are prefixed +with <i>tmemc_</i>. + +<P> +During normal operation, even if only one tmem-aware guest +is running, tmem may absorb nearly all free memory in the system for its own +use. Then if a management tool wishes to +create a new guest (or migrate a guest from another system to this one), it may +notice that there is insufficient "free" memory and fail the creation +(or migration). For this reason, tmem +introduces a new tool-visible class of memory -- <i>freeable</i> memory -- +and provides a control interface to access +it. All ephemeral memory and all pages on the <i>tmh_page_list</i> +are freeable. To properly access freeable +memory, a management tool must follow a sequence of steps: +<ul> +<li> +<i>freeze</i> +tmem:When tmem is frozen, all +<kbd>put</kbd></big>s fail, which ensures that no +additional memory may be absorbed by tmem. +(See <i>tmemc_freeze_pools()</i>, and +note that individual clients may be frozen, though this functionality may be +used only rarely.) +<li> +<i>query freeable MB: </i>If all freeable memory were released to the Xen +heap, this is the amount of memory (in MB) that would be freed. +See <i>tmh_freeable_pages()</i>. +<li> +<i>flush</i>: +Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g. +back to the Xen heap. This amount is +specified in KB. See <i +>tmemc_flush_mem()</i> and <i +>tmem_relinquish_npages()</i>. +<li> +At this point the management tool may allocate +the memory, e.g. using Xen's published interfaces. +<li> +<i>thaw</i> +tmem: This terminates the freeze, allowing tmem to accept +<kbd>put</kbd></big>s again. +</ul> +<P> +Extensive tmem statistics are available through tmem's +control interface (see <i>tmemc_list </i>and +the separate source for the "xm tmem-list" command and the +xen-tmem-list-parse tool). To maximize +forward/backward compatibility with future tmem and tools versions, statistical +information is passed via an ASCII interface where each individual counter is +identified by an easily parseable two-letter ASCII sequence. + +<h2>Save/Restore/Migrate</h2> + +<P> +Another piece of functionality that has a major impact on +the tmem code is support for save/restore of a tmem client and, highly related, +live migration of a tmem client. +Ephemeral pages, by definition, do not need to be saved or +live-migrated, but persistent pages are part of the state of a running VM and +so must be properly preserved. +<P> +When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step +is for the tmem client to be frozen (see the manageability section). +Next, tmem API version information is +recorded (to avoid possible incompatibility issues as the tmem spec evolves in +the future). Then, certain high-level +tmem structural information specific to the client is recorded, including +information about the existing pools. +Finally, the contents of all persistent pages are recorded. +<P> +For live-migration, the process is somewhat more complicated. +Ignoring tmem for a moment, recall that in +live migration, the vast majority of the VM's memory is transferred while the +VM is still fully operational. During +each phase, memory pages belonging to the VM that are changed are marked and +then retransmitted during a later phase. +Eventually only a small amount of memory remains, the VM is paused, the +remaining memory is transmitted, and the VM is unpaused on the target machine. +<P> +The number of persistent tmem pages may be quite large, +possibly even larger than all the other memory used by the VM; so it is +unacceptable to transmit persistent tmem pages during the "paused" +phase of live migration. But if the VM +is still operational, it may be making calls to tmem: +A frozen tmem client will reject any +<big><kbd>put</kbd></big> operations, but tmem must +still correctly process <big><kbd>flush</kbd></big>es +(page and object), including implicit flushes due to duplicate +<big><kbd>put</kbd></big>s. +Fortunately, these operations can only +invalidate tmem pages, not overwrite tmem pages or create new pages. +So, when a live-migrate has been initiated, +the client is frozen. Then during the +"live" phase, tmem transmits all persistent pages, but also records +the handle of all persistent pages that are invalidated. +Then, during the "paused" phase, +only the handles of invalidated persistent pages are transmitted, resulting in +the invalidation on the target machine of any matching pages that were +previously transmitted during the "live" phase. +<P> +For restore (and on the target machine of a live migration), +tmem must be capable of reconstructing the internal state of the client from +the saved/migrated data. However, it is +not the client itself that is <big><kbd>put</kbd></big>'ing +the pages but the management tools conducting the restore/migration. +This slightly complicates tmem by requiring +new API calls and new functions in the implementation, but the code is +structured so that duplication is minimized. +Once all tmem data structures for the client are reconstructed, all +persistent pages are recreated and, in the case of live-migration, all +invalidations have been processed and the client has been thawed, the restored +client can be resumed. +<P> +Finally, tmem's data structures must be cluttered a bit to +support save/restore/migration. Notably, +a per-pool list of persistent pages must be maintained and, during live +migration, a per-client list of invalidated pages must be logged. +A reader of the code will note that these +lists are overlaid into space-sensitive data structures as a union, which may +be more error-prone but eliminates significant space waste. + +<h2>Miscellaneous Tmem Topics</h2> + +<P> +<i><b>Duplicate <big><kbd>puts</kbd></big></b></i>. +One interesting corner case that +significantly complicates the tmem source code is the possibility +of a <i>duplicate</i> +<big><kbd>put</kbd></big>, +which occurs when two +<big><kbd>put</kbd></big>s +are requested with the same handle but with possibly different data. +The tmem API addresses +<i> +<big><kbd>put</kbd></big>-<big><kbd>put</kbd></big>-<big><kbd>get</kbd></big> +coherence</i> explicitly: When a duplicate +<big><kbd>put</kbd></big> occurs, tmem may react one of two ways: (1) The +<big><kbd>put</kbd></big> may succeed with the old +data overwritten by the new data, or (2) the +<big><kbd>put</kbd></big> may be failed with the original data flushed and +neither the old nor the new data accessible. +Tmem may <i>not</i> fail the +<big><kbd>put</kbd></big> and leave the old data accessible. +<P> +When tmem has been actively working for an extended period, +system memory may be in short supply and it is possible for a memory allocation +for a page (or even a data structure such as a <i>pgd_t</i>) to fail. Thus, +for a duplicate +<big><kbd>put</kbd></big>, it may be impossible for tmem to temporarily +simultaneously maintain data structures and data for both the original +<big><kbd>put</kbd></big> and the duplicate +<big><kbd>put</kbd></big>. +When the space required for the data is +identical, tmem may be able to overwrite <i>in place </i>the old data with +the new data (option 1). But in some circumstances, such as when data +is being compressed, overwriting is not always possible and option 2 must be +performed. +<P> +<i><b>Page deduplication and trailing-zero elimination.</b></i> +When page deduplication is enabled +("tmem_dedup" option to Xen), ephemeral pages for which the contents +are identical -- whether the pages belong +to the same client or different clients -- utilize the same pageframe of +memory. In Xen environments where +multiple domains have a highly similar workload, this can save a substantial +amount of memory, allowing a much larger number of ephemeral pages to be +used. Tmem page deduplication uses +methods similar to the KSM implementation in Linux [ref], but differences between +the two are sufficiently great that tmem does not directly leverage the +code. In particular, ephemeral pages in +tmem are never dirtied, so need never be <i>copied-on-write</i>. +Like KSM, however, tmem avoids hashing, +instead employing <i>red-black trees</i> +that use the entire page contents as the <i>lookup +key</i>. There may be better ways to implement this. +<P> +Dedup'ed pages may optionally be compressed +("tmem_compress" and "tmem_dedup" Xen options specified), +to save even more space, at the cost of more time. +Additionally, <i>trailing zero elimination (tze)</i> may be applied to dedup'ed +pages. With tze, pages that contain a +significant number of zeroes at the end of the page are saved without the trailing +zeroes; an all-zero page requires no data to be saved at all. +In certain workloads that utilize a large number +of small files (and for which the last partial page of a file is padded with +zeroes), a significant space savings can be realized without the high cost of +compression/decompression. +<P> +Both compression and tze significantly complicate memory +allocation. This will be discussed more below. +<P> +<b><i>Memory accounting</i>.</b> +Accounting is boring, but poor accounting may +result in some interesting problems. In +the implementation-independent code of tmem, most data structures, page frames, +and partial pages (e.g. for compresssion) are <i>billed</i> to a pool, +and thus to a client. Some <i>infrastructure</i> data structures, such as +pools and clients, are allocated with <i>tmh_alloc_infra()</i>, which does not +require a pool to be specified. Two other +exceptions are page content descriptors (<i>pcd_t</i>) +and sharelists (<i>sharelist_t</i>) which +are explicitly not associated with a pool/client by specifying NULL instead of +a <i>pool_t</i>. +(Note to self: +These should probably just use the <i>tmh_alloc_infra()</i> interface too.) +As we shall see, persistent pool pages and +data structures may need to be handled a bit differently, so the +implementation-independent layer calls a different allocation/free routine for +persistent pages (e.g. <i>tmh_alloc_page_thispool()</i>) +than for ephemeral pages (e.g. <i>tmh_alloc_page()</i>). +<P> +In the Xen-specific layer, we +disregard the <i>pool_t</i> for ephemeral +pages, as we use the generic Xen heap for all ephemeral pages and data +structures.(Denial-of-service attacks +can be handled in the implementation-independent layer because ephemeral pages +are kept in per-client queues each with a counted length. +See the discussion on weights and caps below.) +However we explicitly bill persistent pages +and data structures against the client/domain that is using them. +(See the calls to the Xen routine <i>alloc_domheap_pages() </i>in tmem_xen.h; of +the first argument is a domain, the pages allocated are billed by Xen to that +domain.)This means that a Xen domain +cannot allocate even a single tmem persistent page when it is currently utilizing +its maximum assigned memory allocation! +This is reasonable for persistent pages because, even though the data is +not directly accessible by the domain, the data is permanently saved until +either the domain flushes it or the domain dies. +<P> +Note that proper accounting requires (even for ephemeral pools) that the same +pool is referenced when memory is freed as when it was allocated, even if the +ownership of a pool has been moved from one client to another (c.f. <i +>shared_pool_reassign()</i>). +The underlying Xen-specific information may +not always enforce this for ephemeral pools, but incorrect alloc/free matching +can cause some difficult-to-find memory leaks and bent pointers. +<P> +Page deduplication is not possible for persistent pools for +accounting reasons: Imagine a page that is created by persistent pool A, which +belongs to a domain that is currently well under its maximum allocation. +Then the <i>pcd_t</i>is matched by persistent pool B, which is +currently at its maximum. +Then the domain owning pool A is destroyed. +Is B beyond its maximum? +(There may be a clever way around this +problem. Exercise for the reader!) +<P> +<b><i>Memory allocation.</i></b> The implementation-independent layer assumes +there is a good fast general-purpose dynamic memory allocator with bounded +response time and efficient use of memory for a very large number of sub-page +allocations. The old xmalloc memory +allocator in Xen was not a good match for this purpose, so was replaced by the +TLSF allocator. Note that the TLSF +allocator is used only for allocations smaller than a page (and, more +precisely, no larger than <i>tmem_subpage_maxsize()</i>); +full pages are allocated by Xen's normal heap allocator. +<P> +After the TLSF allocator was integrated into Xen, more work +was required so that each client could allocate memory from a separate +independent pool. (See the call to <i>xmem_pool_create()</i>in +<i>tmh_client_init()</i>.) +This allows the data structures allocated for the +purpose of supporting persistent pages to be billed to the same client as the +pages themselves. It also allows partial +(e.g. compressed) pages to be properly billed. +Further, when partial page allocations cause internal fragmentation, +this fragmentation can be isolated per-client. +And, when a domain dies, full pages can be freed, rather than only +partial pages. One other change was +required in the TLSF allocator: In the original version, when a TLSF memory +pool was allocated, the first page of memory was also allocated. +Since, for a persistent pool, this page would +be billed to the client, the allocation of the first page failed if the domain +was started at its maximum memory, and this resulted in a failure to create the +memory pool. To avoid this, the code was +changed to delay the allocation of the first page until first use of the memory +pool. +<P> +<b><i>Memory allocation interdependency.</i></b> +As previously described, +pages of memory must be moveable back and forth between the Xen heap and the +tmem ephemeral lists (and page lists). +When tmem needs a page but doesn't have one, it requests one from the +Xen heap (either indirectly via xmalloc, or directly via Xen's <i +>alloc_domheap_pages()</i>). +And when Xen needs a page but doesn't have +one, it requests one from tmem (via a call to <i +>tmem_relinquish_pages()</i> in Xen's <i +>alloc_heap_pages() </i>in page_alloc.c). +This leads to a potential infinite loop! +To break this loop, a new memory flag (<i>MEMF_tmem</i>) was added to Xen +to flag and disallow the loop. +See <i>tmh_called_from_tmem()</i> +in <i>tmem_relinquish_pages()</i>. +Note that the <i +>tmem_relinquish_pages()</i> interface allows for memory requests of +order > 0 (multiple contiguous pages), but the tmem implementation disallows +any requests larger than a single page. +<P> +<b><i>LRU page reclamation</i></b>. +Ephemeral pages generally <i>age </i>in +a queue, and the space associated with the oldest -- or <i +>least-recently-used -- </i>page is reclaimed when tmem needs more +memory. But there are a few exceptions +to strict LRU queuing. First is when +removal from a queue is constrained by locks, as previously described above. +Second, when an ephemeral pool is <i>shared,</i> unlike a private ephemeral +pool, a +<big><kbd>get</kbd></big> +does not imply a +<big><kbd>flush</kbd></big> +Instead, in a shared pool, a +results in the page being promoted to the front of the queue. +Third, when a page that is deduplicated (i.e. +is referenced by more than one <i>pgp_</i>t) +reaches the end of the LRU queue, it is marked as <i +>eviction attempted</i> and promoted to the front of the queue; if it +reaches the end of the queue a second time, eviction occurs. +Note that only the <i +>pgp_</i>t is evicted; the actual data is only reclaimed if there is no +other <i>pgp_t </i>pointing to the data. +<P> +All of these modified- LRU algorithms deserve to be studied +carefully against a broad range of workloads. +<P> +<b><i>Internal fragmentation</i>.</b> +When +compression or tze is enabled, allocations between a half-page and a full-page +in size are very common and this places a great deal of pressure on even the +best memory allocator. Additionally, +problems may be caused for memory reclamation: When one tmem ephemeral page is +evicted, only a fragment of a physical page of memory might be reclaimed. +As a result, when compression or tze is +enabled, it may take a very large number of eviction attempts to free up a full +contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction +must be assumed to be able to fail. +While all memory allocation paths in tmem are resilient to failure, very +complex corner cases may eventually occur. +As a result, compression and tze are disabled by default and should be +used with caution until they have been tested with a much broader set of +workloads.(Note to self: The +code needs work.) +<P> +<b><i>Weights and caps</i>.</b> +Because +of the just-discussed LRU-based eviction algorithms, a client that uses tmem at +a very high frequency can quickly swamp tmem so that it provides little benefit +to a client that uses it less frequently. +To reduce the possibility of this denial-of-service, limits can be +specified via management tools that are enforced internally by tmem. +On Xen, the "xm tmem-set" command +can specify "weight=<weight>" or "cap=<cap>" +for any client. If weight is non-zero +for a client and the current percentage of ephemeral pages in use by the client +exceeds its share (as measured by the sum of weights of all clients), the next +page chosen for eviction is selected from the requesting client's ephemeral +queue, instead of the global ephemeral queue that contains pages from all +clients.(See <i>client_over_quota().</i>) +Setting a cap for a client is currently a no-op. +<P> +<b><i>Shared pools and authentication.</i></b> +When tmem was first proposed to the linux kernel mailing list +(LKML), there was concern expressed about security of shared ephemeral +pools. The initial tmem implementation only +required a client to provide a 128-bit UUID to identify a shared pool, and the +linux-side tmem implementation obtained this UUID from the superblock of the +shared filesystem (in ocfs2). It was +pointed out on LKML that the UUID was essentially a security key and any +malicious domain that guessed it would have access to any data from the shared +filesystem that found its way into tmem. +Ocfs2 has only very limited security; it is assumed that anyone who can +access the filesystem bits on the shared disk can mount the filesystem and use +it. But in a virtualized data center, +higher isolation requirements may apply. +As a result, a Xen boot option -- "tmem_shared_auth" -- was +added. The option defaults to disabled, +but when it is enabled, management tools must explicitly authenticate (or may +explicitly deny) shared pool access to any client. +On Xen, this is done with the "xm +tmem-shared-auth" command. +<P> +<b><i>32-bit implementation</i>.</b> +There was some effort put into getting tmem working on a 32-bit Xen. +However, the Xen heap is limited in size on +32-bit Xen so tmem did not work very well. +There are still 32-bit ifdefs in some places in the code, but things may +have bit-rotted so using tmem on a 32-bit Xen is not recommended. +<P> +<b><i>IA-64 implementation. </i></b> +The vast majority of the tmem +implementation is architecture-independent. +For tmem to run on Xen/ia64, it is believed that only one or two +routines needs to be written.(See the +#ifdef __ia64__ at <i>cli_mfn_to_va()</i>.) + +<h2>Known Issues</h2> + +<p><b><i>Fragmentation.</i></b>When tmem +is active, all physically memory becomes <i>fragmented</i> +into individual pages. However, the Xen +memory allocator allows memory to be requested in multi-page contiguous +quantities, called order>0 allocations. +(e.g. 2<sup>order</sup> so +order==4 is sixteen contiguous pages.) +In some cases, a request for a larger order will fail gracefully if no +matching contiguous allocation is available from Xen. +As of Xen 4.0, however, there are several +critical order>0 allocation requests that do not fail gracefully. +Notably, when a domain is created, and +order==4 structure is required or the domain creation will fail. +And shadow paging requires many order==2 +allocations; if these fail, a PV live-migration may fail. +There are likely other such issues. +<P> +But, fragmentation can occur even without tmem if any domU does +any extensive ballooning; tmem just accelerates the fragmentation. +So the fragmentation problem must be solved +anyway. The best solution is to disallow +order>0 allocations altogether in Xen -- or at least ensure that any attempt +to allocate order>0 can fail gracefully, e.g. by falling back to a sequence +of single page allocations. However this restriction may require a major rewrite +in some of Xen's most sensitive code. +(Note that order>0 allocations during Xen boot and early in domain0 +launch are safe and, if dom0 does not enable tmem, any order>0 allocation by +dom0 is safe, until the first domU is created.) +<P> +Until Xen can be rewritten to be <i>fragmentation-safe</i>, a small hack +was added in the Xen page +allocator.(See the comment " +memory is scarce" in <i>alloc_heap_pages()</i>.) +Briefly, a portion of memory is pre-reserved +for allocations where order>0 and order<9. +(Domain creation uses 2MB pages, but fails +gracefully, and there are no other known order==9 allocations or order>9 +allocations currently in Xen.) +<P> +<b><i>NUMA</i></b>. Tmem assumes that +all memory pages are equal and any RAM page can store a page of data for any +client. This has potential performance +consequences in any NUMA machine where access to <i +>far memory</i> is significantly slower than access to <i +>near memory</i>. +On nearly all of today's servers, however, +access times to <i>far memory</i> is still +much faster than access to disk or network-based storage, and tmem's primary performance +advantage comes from the fact that paging and swapping are reduced. +So, the current tmem implementation ignores +NUMA-ness; future tmem design for NUMA machines is an exercise left for the +reader. + +<h2>Bibliography</h2> + +<P> +(needs work)<b style='mso-bidi-font-weight:> +<P><a href="http://oss.oracle.com/projects/tmem">http://oss.oracle.com/projects/tmem</a> _______________________________________________ Xen-changelog mailing list Xen-changelog@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-changelog
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |