[Xen-devel] [RFC] New shadow paging code
We (Michael Fetterman, George Dunlap and I) have been working over the last while on a full replacement for Xen's shadow pagetable support. This mail contains some design notes, below; a patch against xen-unstable, giving a snapshot of the current state of the new shadow code, is at

  http://www.cl.cam.ac.uk/~tjd21/shadow2.patch

Comments on both are welcome, although the code is not finished -- in particular there are both some optimizations and some tidying-up that need to be done.

Cheers,

Tim.

----

The new shadow code (dubbed 'shadow2') is designed as a replacement for the current shadow code. It's been designed from the ground up to support the following capabilities:

* Work for both paravirtualized and HVM guests. Our focus is on Windows under HVM, since Linux guests can use paravirtual mechanisms for faster memory management.

* Xen may be running in 2-, 3-, or 4-level paging mode. While booting, guests may be in direct-access mode (no paging), or any paging level less than or equal to Xen's current paging level. This means that we must support 2-on-2, 2-on-3, 2-on-4, 3-on-3, 3-on-4, and 4-on-4 paging modes.

* While bringing up secondary vcpus in an SMP system, the vcpus may all be in different paging modes. We must support these simultaneously.

* Logdirty mode for live migration.

* We must work with paravirtualized drivers for HVM domains.

* We must work for guest superpages.

With this in mind, we have made several design choices:

* Do away with the "out-of-sync" mechanism to begin with. After a page is promoted, emulate all writes to it until it is demoted again. This makes the logic a lot simpler, and also reduces the overhead of demand paging, which is one of the most common Windows modes. (See below for more information on demand paging.)

* In the case of a size mismatch between guest pagetable entries and host pagetable entries (i.e., 2-on-3 or 2-on-4, where guest pagetable entries are 32 bits and host pagetable entries are 64 bits), a single guest page may need to be shadowed by multiple shadow pages. In this case, we always shadow the entire guest pagetable, rather than shadowing only part at a time. We also keep the multiple backing shadow pagetables physically contiguous in memory using a "buddy" allocator. This allows us to use only one mfn value to designate the entire group of mfns.

* We allocate a fixed amount of shadow memory at domain creation. This is shared by all vcpus. When we need more shadow pages, we begin to unshadow pages to free up more memory in approximately an LRU fashion.

* We keep the p2m maps for HVM domains in a pagetable format, so that we can use them as the pagetables for HVM guests in paging-disabled mode.

So far, we have had several successes. Demand-paging accesses have been sped up by doing emulated writes rather than using the out-of-sync mechanism. The out-of-sync mechanism requires three page faults, two of which entail relatively expensive shadow operations: marking a page out of sync, and bringing it back into sync. In the case of HVM guests, the faults also cause three expensive vmexit/vmenter cycles. Our emulated writes require only two page faults, and each fault is less expensive.

Also, the overhead of many individual shadow operations is less in the newer code than in the old code.

We have a number of potential optimizations in mind for the near future:

* Removing writable mappings. As with the old code, when a guest pfn is promoted to be a pagetable, we need to find and remove all writable mappings to it, so that we can detect changes. Following the "start simple, then optimize" principle, our current code does a brute-force search through the shadows.

  Our tests indicate that when a page is promoted to a pagetable, it generally has exactly one writable mapping outstanding. This is true both for Windows and for Linux. We plan to use this fact to keep a back-pointer to the last writable shadow pte of a page in the page_info struct of that page. The few exceptions to the rule can still be handled using brute-force search.

* Fast-pathing some faults. By storing the guest present / writable flags in some of the spare bits of the guest pagetable, we can fast-path certain operations, such as propagating a fault to the guest or updating guest dirty and accessed bits, without needing to map the guest pagetables. This should speed up some common faults, as well as reduce cache footprint.

* Batch updates. There are times when guests do batch updates to pagetables. At these times, it makes sense to give the guest write access to the pagetables. At first this can be done simply by unshadowing the page entirely. In the future, we can explore whether a "mark out of sync" mechanism would speed things up.

  We may be able to have a more extreme optimization for Linux fork(): when we detect Linux doing a fork(), we can unshadow the entire user portion of the guest address space, to save having to detect a "batch update" and unshadow each guest pagetable individually.

* Full emulation of shadow page accesses. Currently, we allow read-only access to guest pagetables. This requires us to emulate the dirty and accessed bits of the guest pagetables, in turn requiring us to take page faults. But how many of these dirty/accessed bits are actually read? It may be more efficient, in certain circumstances, to emulate reads to guest page tables as well as writes, taking the dirty and accessed bits from the shadow pagetables.

* Teardown heuristics. If we can determine when a guest is destroying a process, we can unshadow the whole address space at once.

  Failure to detect when a process is being torn down will cause unnecessary overhead: if the guest pagetables of the destroyed process are recycled as data pages, all writes to the pages will be emulated (in a rather expensive manner) until the page is unshadowed. Even if the guest pagetables are re-used for new process pagetables, constructing the address space will be faster if unshadowed.

**************
Code Structure
**************

Our code must deal differently with all the different combinations of shadow modes. However, we expect that once a guest reaches its target paging mode, it will stay in that mode for a long time; and the host will never change its paging mode.

Rather than having a whole string of ifs in the code based on the current guest and host paging modes, we compile different code to deal with each pair of modes (2-on-2, 2-on-3, 2-on-4, 3-on-3, 3-on-4, 4-on-4). (Direct mode is implemented as a special case of m-on-m, where m is the host's current paging level.) While increasing the size of the hypervisor overall, this should both greatly decrease the cache footprint of the shadow code and reduce pipeline flushes from mispredicted branches.

To keep from having to maintain duplicate logic across 6 different bits of code, we use a single source code file, and compiler directives to specify mode-specific code. This file is shadow2.c, and is built once per pair of modes, with GUEST_PAGING_LEVELS and SHADOW_PAGING_LEVELS set to the appropriate combination. The compiler is set to redefine the functions from sh2_[function_name]() to sh2_[function_name]__shadow_[m]_guest_[n] for n-on-m mode.

At the end of shadow2.c is a structure containing function pointers for each of the mode-specific functions; this is called shadow2_entry (and is expanded by preprocessor directives using the __shadow_[m]_guest_[n] naming convention). When a guest vcpu is put into a particular shadow mode, an element of the vcpu struct is pointed to the appropriate shadow2_entry struct.

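As a rough illustration of the naming scheme and per-mode entry table described above, here is a minimal, self-contained C sketch. It is not taken from the patch: a single snippet cannot be compiled several times with different GUEST_PAGING_LEVELS / SHADOW_PAGING_LEVELS, so a hypothetical DEFINE_SHADOW2_MODE() macro stands in for rebuilding shadow2.c once per mode pair. The page_fault member, the shadow2_modes[] table, shadow2_set_mode(), and the simplified struct vcpu (with a direct shadow2 pointer) are all assumptions made for the example, and the function body is a placeholder that only identifies which mode was selected.

```c
#include <assert.h>
#include <stddef.h>

struct vcpu;

/* Stand-in for the shadow2_entry structure: one instance per mode pair.
 * The page_fault member is a hypothetical example of a mode-specific hook. */
struct shadow2_entry {
    int guest_levels;
    int shadow_levels;
    int (*page_fault)(struct vcpu *v, unsigned long va);
};

/* Simplified vcpu: holds a pointer to the entry table for its current mode. */
struct vcpu {
    const struct shadow2_entry *shadow2;
};

/* Hypothetical macro standing in for one build of shadow2.c.  Token pasting
 * produces names of the form sh2_page_fault__shadow_[m]_guest_[n] and a
 * matching shadow2_entry__shadow_[m]_guest_[n] table for n-on-m mode. */
#define DEFINE_SHADOW2_MODE(s, g)                                            \
    static int sh2_page_fault__shadow_##s##_guest_##g(struct vcpu *v,        \
                                                      unsigned long va)      \
    {                                                                        \
        (void)v;                                                             \
        (void)va;                                                            \
        return (s) * 10 + (g);                                               \
    }                                                                        \
    static const struct shadow2_entry shadow2_entry__shadow_##s##_guest_##g  \
        = {                                                                  \
        .guest_levels = (g),                                                 \
        .shadow_levels = (s),                                                \
        .page_fault = sh2_page_fault__shadow_##s##_guest_##g                 \
    }

/* Modes available on a (hypothetical) 3-level host: 2-on-2, 2-on-3, 3-on-3. */
DEFINE_SHADOW2_MODE(2, 2);
DEFINE_SHADOW2_MODE(3, 2);
DEFINE_SHADOW2_MODE(3, 3);

static const struct shadow2_entry *const shadow2_modes[] = {
    &shadow2_entry__shadow_2_guest_2,
    &shadow2_entry__shadow_3_guest_2,
    &shadow2_entry__shadow_3_guest_3,
};

/* Point the vcpu at the entry table for the requested mode pair.
 * Returns 0 on success, or -1 (leaving the vcpu unchanged) if that
 * pair was not built. */
int shadow2_set_mode(struct vcpu *v, int guest_levels, int shadow_levels)
{
    size_t i;

    assert(v != NULL);
    for (i = 0; i < sizeof(shadow2_modes) / sizeof(shadow2_modes[0]); i++) {
        if (shadow2_modes[i]->guest_levels == guest_levels &&
            shadow2_modes[i]->shadow_levels == shadow_levels) {
            v->shadow2 = shadow2_modes[i];
            return 0;
        }
    }
    return -1;
}
```

In this sketch, switching a vcpu to 2-on-3 mode is a single pointer assignment, and every later call through v->shadow2 reaches code specialised for that pair without any per-call test of the paging levels.
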
To call the appropriate function, one generally calls shadow2_[function_name](v, [args]), which is generally implemented after the following template:

    [rettype] shadow2_[function_name](v, [args])
    {
        return v->arch.shadow2->[function_name](v, [args]);
    }

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel