[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [Xen-devel] [PATCH] tmem internals doc
Keir, please hg add the attached doc as docs/misc/tmem-internals.html The info applies for 4.0 as well, so the doc can also go into 4.0-testing. Thanks! Dan Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx> Transcendent Memory Internals in Xenby Dan Magenheimer, Oracle Corp. Draft 0.1 -- Updated: 20100324 OverviewThis document focuses on the internal implementation of Transcendent Memory (tmem) on Xen. It assumes that the reader has a basic knowledge of the terminology, objectives, and functionality of tmem and also has access to the Xen source code. It corresponds to the Xen 4.0 release, with patch added to support page deduplication (V2). The primary responsibilities of the tmem implementation are to:
Source Code OrganizationThe source code in Xen that provides the tmem functionality is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h. The files tmem.c and tmem.h are intended to be implementation- (and hypervisor-) independent and the other two files provide the Xen-specific code. This division is intended to make it easier to port tmem functionality to other hypervisors, though at this time porting to other hypervisors has not been attempted. Together, these four files total less than 4000 lines of C code. Even ignoring the implementation-specific functionality, the implementation-independent part of tmem has several dependencies on library functionality (Xen source filenames in parentheses):
More information about the specific functionality of these libraries can easily be found through a search engine, via wikipedia, or in the Xen or Linux source logs so we will not elaborate further here. Prefixes/Abbreviations/GlossaryThe tmem code uses several prefixes and abbreviations. Knowledge of these will improve code readability:
When used in prose, common tmem operations are indicated with a different font, such as put and get. Key Data StructuresTo manage a huge number of pages, efficient data structures must be carefully selected. Recall that a tmem-enabled guest OS may create one or more pools with different attributes. It then puts and gets pages to/from this pool, identifying the page with a handle that consists of a pool_id, an object_id, and a page_id (sometimes called an index). This suggests a few obvious core data structures:
The most apparent usage of this multi-layer web of data structures is "top-down" because, in normal operation, the vast majority of tmem operations invoked by a client are puts and gets, which require the various data structures to be walked starting with the client_t, then a pool_t, then an obj_t, then a pgd_t. However, there is another highly frequent tmem operation that is not visible from a client: memory reclamation. Since tmem attempts to use all spare memory in the system, it must frequently free up, or evict, pages. The eviction algorithm will be explained in more detail later but, in brief, to free memory, ephemeral pages are removed from the tail of one of the doubly-linked lists, which means that all of the data structures associated with that page-to-be-removed must be updated or eliminated and freed. As a result, each data structure also contains a back-pointer to its parent, for example every obj_t contains a pointer to its containing pool_t. This complex web of interconnected data structures is updated constantly and thus extremely sensitive to careless code changes which, for example, may result in unexpected hypervisor crashes or non-obvious memory leaks. On the other hand, the code is fairly well modularized so, once understood, it is possible to relatively easily switch out one kind of data structure for another. To catch problems as quickly as possible when debug is enabled, most of the data structures are equipped with sentinelsand many inter-function assumptions are documented and tested dynamically with assertions. While these clutter and lengthen the tmem code substantially, their presence has proven invaluable on many occasions. For completeness, we should also describe a key data structure in the Xen implementation-dependent code: the tmh_page_list. For security and performance reasons, pages that are freed due to tmem operations (such as get) are not immediately put back into Xen's pool of free memory (aka the Xen heap). Tmem pages may contain guest-private data that must be scrubbed before those memory pages are released for the use of other guests. But if a page is immediately re-used inside of tmem itself, the entire page is overwritten with new data, so need not be scrubbed. Since tmem is usually the most frequent customer of the Xen heap allocation code, it would be a waste of time to scrub a page, release it to the Xen heap, and then immediately re-allocate it again. So, instead, tmem maintains currently-unused pages of memory on its own free list, tmh_page_list, and returns the pages to Xen only when non-tmem Xen heap allocation requests would otherwise fail. Scalablility/ConcurrencyTmem has been designed to be highly scalable. Since tmem access is invoked similarly in many ways to asynchronous disk access, a "big SMP" tmem-aware guest OS can, and often will, invoke tmem hypercalls simultaneously on many different physical CPUs. And, of course, multiple tmem-aware guests may independently and simultaneously invoke tmem hypercalls. While the normal frequency of tmem invocations is rarely extremely high, some tmem operations such as data compression or lookups in a very large tree may take tens of thousands of cycles or more to complete. Measurements have shown that normal workloads spend no more than about 0.2% (2% with compression enabled) of CPU time executing tmem operations. But those familiar with OS scalability issues recognize that even this limited execution time can create concurrency problems in large systems and result in poorly-scalable performance. A good locking strategy is critical to concurrency, but also must be designed carefully to avoid deadlock and livelock problems. For debugging purposes, tmem supports a "big kernel lock" which disables concurrency altogether (enabled in Xen with "tmem_lock", but note that this functionality is rarely tested and likely has bit-rotted). Infrequent but invasive tmem hypercalls, such as pool creation or the control operations, are serialized on a single read-write lock, called tmem_rwlock, which must be held for writing. All other tmem operations must hold this lock for reading, so frequent operations such as put and get flush can execute simultaneously as long as no invasive operations are occurring. Once a pool has been selected, there is a per-pool read-write lock (pool_rwlock) which must be held for writing if any transformative operations might occur within that pool, such as when an obj_t is created or destroyed. For the highly frequent operation of finding an obj_t within a pool, pool_rwlock must be held for reading. Once an object has been selected, there is a per-object spinlock (obj_spinlock). This is a spinlock rather than a read-write lock because nearly all of the most frequent tmem operations (e.g. put and get flush) are transformative, in that they add or remove a page within the object. This lock is generally taken whenever an object lookup occurs and released when the tmem operation is complete. Next, the per-client and global ephemeral lists are protected by a single global spinlock (eph_lists_spinlock) and the per-client persistent lists are also protected by a single global spinlock (pers_list_spinlock). And to complete the description of implementation-independent locks, if page deduplication is enabled, all pages for which the first byte match are contained in one of 256 trees that are protected by one of 256 corresponding read-write locks (pcd_tree_rwlocks). In the Xen-specific code (tmem_xen.c), page frames (e.g. struct page_info) that have been released are kept in a list (tmh_page_list) that is protected by a spinlock (tmh_page_list_lock). There is also an "implied" lock associated with compression, which is likely the most time-consuming operation in all of tmem (of course, only when compression is enabled): A compression buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to this buffer is returned to implementation-independent code and used without a lock. The proper method to avoid deadlocks is to take and release locks in a very specific predetermined order. Unfortunately, since tmem data structures must simultaneously be accessed "top-down" ( put and get) and "bottoms-up" (memory reclamation), more complex methods must be employed: A trylockmechanism is used (c.f. tmem_try_to_evict_pgp()), which takes the lock if it is available but returns immediately (rather than spinning and waiting) if the lock is not available. When walking the ephemeral list to identify pages to free, any page that belongs to an object that is locked is simply skipped. Further, if the page is the last page belonging to an object, and the pool read-write lock for the pool the object belongs to is not available (for writing), that object is skipped. These constraints modify the LRU algorithm somewhat, but avoid the potential for deadlock. Unfortunately, a livelock was still discovered in this approach: When memory is scarce and each client is putting a large number of pages for exactly one object (and thus holding the object spinlock for that object), memory reclamation takes a very long time to determine that it is unable to free any pages, and so the time to do a put (which eventually fails) becomes linear to the number of pages in the object! To avoid this situation, a workaround was added to always ensure a minimum amount of memory (1MB) is available before any object lock is taken for the client invoking tmem (see tmem_ensure_avail_pages()). Other such livelocks (and perhaps deadlocks) may be lurking. A last issue related to concurrency is atomicity of counters. Tmem gathers a large number of statistics. Some of these counters are informational only, while some are critical to tmem operation and must be incremented and decremented atomically to ensure, for example, that the number of pages in a tree never goes negative if two concurrent tmem operations access the counter exactly simultaneously. Some of the atomic counters are used for debugging (in assertions) and perhaps need not be atomic; fixing these may increase performance slightly by reducing cache-coherency traffic. Similarly, some of the non-atomic counters may yield strange results to management tools, such as showing the total number of successful puts as being higher than the number of puts attempted. These are left as exercises for future tmem implementors. Control and ManageabilityTmem has a control interface to, for example, set various parameters and obtain statistics. All tmem control operations funnel through do_tmem_control() and other functions supporting tmem control operations are prefixed with tmemc_. During normal operation, even if only one tmem-aware guest is running, tmem may absorb nearly all free memory in the system for its own use. Then if a management tool wishes to create a new guest (or migrate a guest from another system to this one), it may notice that there is insufficient "free" memory and fail the creation (or migration). For this reason, tmem introduces a new tool-visible class of memory -- freeable memory -- and provides a control interface to access it. All ephemeral memory and all pages on the tmh_page_list are freeable. To properly access freeable memory, a management tool must follow a sequence of steps:
Extensive tmem statistics are available through tmem's control interface (see tmemc_list and the separate source for the "xm tmem-list" command and the xen-tmem-list-parse tool). To maximize forward/backward compatibility with future tmem and tools versions, statistical information is passed via an ASCII interface where each individual counter is identified by an easily parseable two-letter ASCII sequence. Save/Restore/MigrateAnother piece of functionality that has a major impact on the tmem code is support for save/restore of a tmem client and, highly related, live migration of a tmem client. Ephemeral pages, by definition, do not need to be saved or live-migrated, but persistent pages are part of the state of a running VM and so must be properly preserved. When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step is for the tmem client to be frozen (see the manageability section). Next, tmem API version information is recorded (to avoid possible incompatibility issues as the tmem spec evolves in the future). Then, certain high-level tmem structural information specific to the client is recorded, including information about the existing pools. Finally, the contents of all persistent pages are recorded. For live-migration, the process is somewhat more complicated. Ignoring tmem for a moment, recall that in live migration, the vast majority of the VM's memory is transferred while the VM is still fully operational. During each phase, memory pages belonging to the VM that are changed are marked and then retransmitted during a later phase. Eventually only a small amount of memory remains, the VM is paused, the remaining memory is transmitted, and the VM is unpaused on the target machine. The number of persistent tmem pages may be quite large, possibly even larger than all the other memory used by the VM; so it is unacceptable to transmit persistent tmem pages during the "paused" phase of live migration. But if the VM is still operational, it may be making calls to tmem: A frozen tmem client will reject any put operations, but tmem must still correctly process flushes (page and object), including implicit flushes due to duplicate puts. Fortunately, these operations can only invalidate tmem pages, not overwrite tmem pages or create new pages. So, when a live-migrate has been initiated, the client is frozen. Then during the "live" phase, tmem transmits all persistent pages, but also records the handle of all persistent pages that are invalidated. Then, during the "paused" phase, only the handles of invalidated persistent pages are transmitted, resulting in the invalidation on the target machine of any matching pages that were previously transmitted during the "live" phase. For restore (and on the target machine of a live migration), tmem must be capable of reconstructing the internal state of the client from the saved/migrated data. However, it is not the client itself that is put'ing the pages but the management tools conducting the restore/migration. This slightly complicates tmem by requiring new API calls and new functions in the implementation, but the code is structured so that duplication is minimized. Once all tmem data structures for the client are reconstructed, all persistent pages are recreated and, in the case of live-migration, all invalidations have been processed and the client has been thawed, the restored client can be resumed. Finally, tmem's data structures must be cluttered a bit to support save/restore/migration. Notably, a per-pool list of persistent pages must be maintained and, during live migration, a per-client list of invalidated pages must be logged. A reader of the code will note that these lists are overlaid into space-sensitive data structures as a union, which may be more error-prone but eliminates significant space waste. Miscellaneous Tmem TopicsDuplicate puts. One interesting corner case that significantly complicates the tmem source code is the possibility of a duplicate put, which occurs when two puts are requested with the same handle but with possibly different data. The tmem API addresses put-put-get coherence explicitly: When a duplicate put occurs, tmem may react one of two ways: (1) The put may succeed with the old data overwritten by the new data, or (2) the put may be failed with the original data flushed and neither the old nor the new data accessible. Tmem may not fail the put and leave the old data accessible. When tmem has been actively working for an extended period, system memory may be in short supply and it is possible for a memory allocation for a page (or even a data structure such as a pgd_t) to fail. Thus, for a duplicate put, it may be impossible for tmem to temporarily simultaneously maintain data structures and data for both the original put and the duplicate put. When the space required for the data is identical, tmem may be able to overwrite in place the old data with the new data (option 1). But in some circumstances, such as when data is being compressed, overwriting is not always possible and option 2 must be performed. Page deduplication and trailing-zero elimination. When page deduplication is enabled ("tmem_dedup" option to Xen), ephemeral pages for which the contents are identical -- whether the pages belong to the same client or different clients -- utilize the same pageframe of memory. In Xen environments where multiple domains have a highly similar workload, this can save a substantial amount of memory, allowing a much larger number of ephemeral pages to be used. Tmem page deduplication uses methods similar to the KSM implementation in Linux [ref], but differences between the two are sufficiently great that tmem does not directly leverage the code. In particular, ephemeral pages in tmem are never dirtied, so need never be copied-on-write. Like KSM, however, tmem avoids hashing, instead employing red-black trees that use the entire page contents as the lookup key. There may be better ways to implement this. Dedup'ed pages may optionally be compressed ("tmem_compress" and "tmem_dedup" Xen options specified), to save even more space, at the cost of more time. Additionally, trailing zero elimination (tze) may be applied to dedup'ed pages. With tze, pages that contain a significant number of zeroes at the end of the page are saved without the trailing zeroes; an all-zero page requires no data to be saved at all. In certain workloads that utilize a large number of small files (and for which the last partial page of a file is padded with zeroes), a significant space savings can be realized without the high cost of compression/decompression. Both compression and tze significantly complicate memory allocation. This will be discussed more below. Memory accounting. Accounting is boring, but poor accounting may result in some interesting problems. In the implementation-independent code of tmem, most data structures, page frames, and partial pages (e.g. for compresssion) are billed to a pool, and thus to a client. Some infrastructure data structures, such as pools and clients, are allocated with tmh_alloc_infra(), which does not require a pool to be specified. Two other exceptions are page content descriptors (pcd_t) and sharelists (sharelist_t) which are explicitly not associated with a pool/client by specifying NULL instead of a pool_t. (Note to self: These should probably just use the tmh_alloc_infra() interface too.) As we shall see, persistent pool pages and data structures may need to be handled a bit differently, so the implementation-independent layer calls a different allocation/free routine for persistent pages (e.g. tmh_alloc_page_thispool()) than for ephemeral pages (e.g. tmh_alloc_page()). In the Xen-specific layer, we disregard the pool_t for ephemeral pages, as we use the generic Xen heap for all ephemeral pages and data structures.(Denial-of-service attacks can be handled in the implementation-independent layer because ephemeral pages are kept in per-client queues each with a counted length. See the discussion on weights and caps below.) However we explicitly bill persistent pages and data structures against the client/domain that is using them. (See the calls to the Xen routine alloc_domheap_pages() in tmem_xen.h; of the first argument is a domain, the pages allocated are billed by Xen to that domain.)This means that a Xen domain cannot allocate even a single tmem persistent page when it is currently utilizing its maximum assigned memory allocation! This is reasonable for persistent pages because, even though the data is not directly accessible by the domain, the data is permanently saved until either the domain flushes it or the domain dies. Note that proper accounting requires (even for ephemeral pools) that the same pool is referenced when memory is freed as when it was allocated, even if the ownership of a pool has been moved from one client to another (c.f. shared_pool_reassign()). The underlying Xen-specific information may not always enforce this for ephemeral pools, but incorrect alloc/free matching can cause some difficult-to-find memory leaks and bent pointers. Page deduplication is not possible for persistent pools for accounting reasons: Imagine a page that is created by persistent pool A, which belongs to a domain that is currently well under its maximum allocation. Then the pcd_tis matched by persistent pool B, which is currently at its maximum. Then the domain owning pool A is destroyed. Is B beyond its maximum? (There may be a clever way around this problem. Exercise for the reader!) Memory allocation. The implementation-independent layer assumes there is a good fast general-purpose dynamic memory allocator with bounded response time and efficient use of memory for a very large number of sub-page allocations. The old xmalloc memory allocator in Xen was not a good match for this purpose, so was replaced by the TLSF allocator. Note that the TLSF allocator is used only for allocations smaller than a page (and, more precisely, no larger than tmem_subpage_maxsize()); full pages are allocated by Xen's normal heap allocator. After the TLSF allocator was integrated into Xen, more work was required so that each client could allocate memory from a separate independent pool. (See the call to xmem_pool_create()in tmh_client_init().) This allows the data structures allocated for the purpose of supporting persistent pages to be billed to the same client as the pages themselves. It also allows partial (e.g. compressed) pages to be properly billed. Further, when partial page allocations cause internal fragmentation, this fragmentation can be isolated per-client. And, when a domain dies, full pages can be freed, rather than only partial pages. One other change was required in the TLSF allocator: In the original version, when a TLSF memory pool was allocated, the first page of memory was also allocated. Since, for a persistent pool, this page would be billed to the client, the allocation of the first page failed if the domain was started at its maximum memory, and this resulted in a failure to create the memory pool. To avoid this, the code was changed to delay the allocation of the first page until first use of the memory pool. Memory allocation interdependency. As previously described, pages of memory must be moveable back and forth between the Xen heap and the tmem ephemeral lists (and page lists). When tmem needs a page but doesn't have one, it requests one from the Xen heap (either indirectly via xmalloc, or directly via Xen's alloc_domheap_pages()). And when Xen needs a page but doesn't have one, it requests one from tmem (via a call to tmem_relinquish_pages() in Xen's alloc_heap_pages() in page_alloc.c). This leads to a potential infinite loop! To break this loop, a new memory flag (MEMF_tmem) was added to Xen to flag and disallow the loop. See tmh_called_from_tmem() in tmem_relinquish_pages(). Note that the tmem_relinquish_pages() interface allows for memory requests of order > 0 (multiple contiguous pages), but the tmem implementation disallows any requests larger than a single page. LRU page reclamation. Ephemeral pages generally age in a queue, and the space associated with the oldest -- or least-recently-used -- page is reclaimed when tmem needs more memory. But there are a few exceptions to strict LRU queuing. First is when removal from a queue is constrained by locks, as previously described above. Second, when an ephemeral pool is shared, unlike a private ephemeral pool, a get does not imply a flush Instead, in a shared pool, a results in the page being promoted to the front of the queue. Third, when a page that is deduplicated (i.e. is referenced by more than one pgp_t) reaches the end of the LRU queue, it is marked as eviction attempted and promoted to the front of the queue; if it reaches the end of the queue a second time, eviction occurs. Note that only the pgp_t is evicted; the actual data is only reclaimed if there is no other pgp_t pointing to the data. All of these modified- LRU algorithms deserve to be studied carefully against a broad range of workloads. Internal fragmentation. When compression or tze is enabled, allocations between a half-page and a full-page in size are very common and this places a great deal of pressure on even the best memory allocator. Additionally, problems may be caused for memory reclamation: When one tmem ephemeral page is evicted, only a fragment of a physical page of memory might be reclaimed. As a result, when compression or tze is enabled, it may take a very large number of eviction attempts to free up a full contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction must be assumed to be able to fail. While all memory allocation paths in tmem are resilient to failure, very complex corner cases may eventually occur. As a result, compression and tze are disabled by default and should be used with caution until they have been tested with a much broader set of workloads.(Note to self: The code needs work.) Weights and caps. Because of the just-discussed LRU-based eviction algorithms, a client that uses tmem at a very high frequency can quickly swamp tmem so that it provides little benefit to a client that uses it less frequently. To reduce the possibility of this denial-of-service, limits can be specified via management tools that are enforced internally by tmem. On Xen, the "xm tmem-set" command can specify "weight=<weight>" or "cap=<cap>" for any client. If weight is non-zero for a client and the current percentage of ephemeral pages in use by the client exceeds its share (as measured by the sum of weights of all clients), the next page chosen for eviction is selected from the requesting client's ephemeral queue, instead of the global ephemeral queue that contains pages from all clients.(See client_over_quota().) Setting a cap for a client is currently a no-op. Shared pools and authentication. When tmem was first proposed to the linux kernel mailing list (LKML), there was concern expressed about security of shared ephemeral pools. The initial tmem implementation only required a client to provide a 128-bit UUID to identify a shared pool, and the linux-side tmem implementation obtained this UUID from the superblock of the shared filesystem (in ocfs2). It was pointed out on LKML that the UUID was essentially a security key and any malicious domain that guessed it would have access to any data from the shared filesystem that found its way into tmem. Ocfs2 has only very limited security; it is assumed that anyone who can access the filesystem bits on the shared disk can mount the filesystem and use it. But in a virtualized data center, higher isolation requirements may apply. As a result, a Xen boot option -- "tmem_shared_auth" -- was added. The option defaults to disabled, but when it is enabled, management tools must explicitly authenticate (or may explicitly deny) shared pool access to any client. On Xen, this is done with the "xm tmem-shared-auth" command. 32-bit implementation. There was some effort put into getting tmem working on a 32-bit Xen. However, the Xen heap is limited in size on 32-bit Xen so tmem did not work very well. There are still 32-bit ifdefs in some places in the code, but things may have bit-rotted so using tmem on a 32-bit Xen is not recommended. IA-64 implementation. The vast majority of the tmem implementation is architecture-independent. For tmem to run on Xen/ia64, it is believed that only one or two routines needs to be written.(See the #ifdef __ia64__ at cli_mfn_to_va().) Known IssuesFragmentation.When tmem is active, all physically memory becomes fragmented into individual pages. However, the Xen memory allocator allows memory to be requested in multi-page contiguous quantities, called order>0 allocations. (e.g. 2order so order==4 is sixteen contiguous pages.) In some cases, a request for a larger order will fail gracefully if no matching contiguous allocation is available from Xen. As of Xen 4.0, however, there are several critical order>0 allocation requests that do not fail gracefully. Notably, when a domain is created, and order==4 structure is required or the domain creation will fail. And shadow paging requires many order==2 allocations; if these fail, a PV live-migration may fail. There are likely other such issues. But, fragmentation can occur even without tmem if any domU does any extensive ballooning; tmem just accelerates the fragmentation. So the fragmentation problem must be solved anyway. The best solution is to disallow order>0 allocations altogether in Xen -- or at least ensure that any attempt to allocate order>0 can fail gracefully, e.g. by falling back to a sequence of single page allocations. However this restriction may require a major rewrite in some of Xen's most sensitive code. (Note that order>0 allocations during Xen boot and early in domain0 launch are safe and, if dom0 does not enable tmem, any order>0 allocation by dom0 is safe, until the first domU is created.) Until Xen can be rewritten to be fragmentation-safe, a small hack was added in the Xen page allocator.(See the comment " memory is scarce" in alloc_heap_pages().) Briefly, a portion of memory is pre-reserved for allocations where order>0 and order<9. (Domain creation uses 2MB pages, but fails gracefully, and there are no other known order==9 allocations or order>9 allocations currently in Xen.) NUMA. Tmem assumes that all memory pages are equal and any RAM page can store a page of data for any client. This has potential performance consequences in any NUMA machine where access to far memory is significantly slower than access to near memory. On nearly all of today's servers, however, access times to far memory is still much faster than access to disk or network-based storage, and tmem's primary performance advantage comes from the fact that paging and swapping are reduced. So, the current tmem implementation ignores NUMA-ness; future tmem design for NUMA machines is an exercise left for the reader. Bibliography(needs work) |