Re: [Xen-devel] [PATCH] turn off writable page tables
Keir Fraser wrote:

> On 28 Jul 2006, at 16:51, Ian Pratt wrote:
>
>> So, in summary, we know writable page tables are not broken, they just
>> don't help on typical workloads because the PTEs/page are so low.
>> However, they do hurt SMP guest performance. If we are not seeing a
>> benefit today, should we turn it off? Should we make it a compile-time
>> option, with the default off?
>
> I wouldn't mind seeing wrpt removed altogether, or at least emulation
> made the compile-time default for the moment. There's bound to be some
> workload that bites us in the future, which is why batching updates on
> the fork path mightn't be a bad thing if it can be done without too much
> gratuitous hacking of Linux core code.
>
> My only fear is that batched wrpt has some guest-visible effects. For
> example, the guest has to be able to cope with seeing page directory
> entries with the present bit cleared. Also, on SMP, it has to be able to
> cope with spurious page faults anywhere in its address space (e.g.,
> faults on an unhooked page table which some other VCPU has rehooked by
> the time the Xen page fault handler runs, hence the fault is bounced
> back to the guest even though there is no work to be done). If we turn
> off batched wrpt then guests will not be tested against it and we are
> likely to hit problems if we ever want to turn it back on again -- we'll
> find that some guests are not able to correctly handle the weird side
> effects.
>
> On the other hand, perhaps we can find a neater, more explicit
> alternative to batched wrpt in future.

This is a very nice win for shadow page tables on SMP. Basically, we use
the lazy state information to defer all the MMU hypercalls into a single
flush, which happens when leaving lazy MMU mode. At the PT level, this can
be done without gratuitous hacking of Linux core code. However, this
cannot be extended safely to also encompass setting the parent page
directory entry for SMP.

It is a little unclear exactly how this would work under a direct page
table hypervisor - would you still take the faults, or would you re-type
and reprotect the pages first? In the fork case, there can be two page
tables being updated because of COW, but re-typing both pages changes the
crossover point for when the batching will be a win. But if the same hooks
can be used for direct mode, it makes sense to figure that out now so we
don't have to add four different sets of hooks to Linux (UP / SMP want
slightly different batching models, as might also shadow / direct). The
PDE P-bit going missing is still a problem, and Linux can be changed to
deal with that - but it is messy.

One remaining issue for deferring direct page table updates is the read
hazard potential. I believe there is only one read hazard in the Linux mm
code that has the potential to be exposed here - the explicit, rather than
implicit, batching makes it quite a bit easier to reason about that.

Zach
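As a rough illustration of the batching these hooks are meant to enable -
this is a sketch only, not part of the patch below, and the queue layout,
lazy_mmu_queue_update() and hv_mmu_update_batch() are made-up names
standing in for whatever batched-update interface a particular hypervisor
backend actually provides - an architecture implementation might look
something like:

/*
 * Illustrative sketch only -- not part of the patch below.  The queue
 * layout, lazy_mmu_queue_update() and hv_mmu_update_batch() are assumed
 * names, not an existing interface.
 */
#include <linux/percpu.h>

#define LAZY_MMU_QUEUE_SIZE	32

struct lazy_mmu_queue {
	int active;			/* between enter and leave? */
	unsigned int nr;		/* number of queued updates */
	struct {
		unsigned long ptr;	/* address of the PTE */
		unsigned long val;	/* new PTE value */
	} req[LAZY_MMU_QUEUE_SIZE];
};

static DEFINE_PER_CPU(struct lazy_mmu_queue, lazy_mmu_queue);

/* Assumed hypervisor entry point: apply 'nr' updates in one call. */
extern void hv_mmu_update_batch(void *reqs, unsigned int nr);

static void lazy_mmu_flush(struct lazy_mmu_queue *q)
{
	if (q->nr) {
		hv_mmu_update_batch(q->req, q->nr);
		q->nr = 0;
	}
}

void arch_enter_lazy_mmu_mode(void)
{
	/* Callers hold the page table lock, so we stay on this CPU. */
	__get_cpu_var(lazy_mmu_queue).active = 1;
}

/* Hook called from the backend's set_pte_at() implementation. */
void lazy_mmu_queue_update(unsigned long ptr, unsigned long val)
{
	struct lazy_mmu_queue *q = &__get_cpu_var(lazy_mmu_queue);

	if (q->nr == LAZY_MMU_QUEUE_SIZE)
		lazy_mmu_flush(q);	/* queue full: flush early */
	q->req[q->nr].ptr = ptr;
	q->req[q->nr].val = val;
	q->nr++;
	if (!q->active)
		lazy_mmu_flush(q);	/* not batching: issue immediately */
}

void arch_leave_lazy_mmu_mode(void)
{
	struct lazy_mmu_queue *q = &__get_cpu_var(lazy_mmu_queue);

	lazy_mmu_flush(q);		/* one hypercall for the whole batch */
	q->active = 0;
}

The callers below only enter lazy mode with the page table lock held, so
preemption is already disabled and a plain per-CPU queue suffices; the
flush in arch_leave_lazy_mmu_mode() is what collapses N PTE writes into a
single hypercall.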
Implement lazy MMU update hooks which are SMP safe for both direct and
shadow page tables. The idea is that PTE updates and page invalidations
while in lazy mode can be batched into a single hypercall. We use this in
VMI for shadow page table synchronization, and it is a win.

For SMP, the enter / leave must happen under protection of the page table
locks for page tables which are being modified. Otherwise, you end up with
stale state in the batched hypercall, which other CPUs can race ahead of.
Doing this under the protection of the locks guarantees the
synchronization is correct, and also means that spurious faults generated
during this window by remote CPUs are properly handled, as the page fault
handler must re-check the PTE under protection of the same lock.

Signed-off-by: Zachary Amsden <zach@xxxxxxxxxx>

Index: linux-2.6.18-rc2/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.18-rc2.orig/include/asm-generic/pgtable.h	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/include/asm-generic/pgtable.h	2006-07-28 14:18:03.000000000 -0700
@@ -163,6 +163,11 @@ static inline void ptep_set_wrprotect(st
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#define arch_enter_lazy_mmu_mode()	do {} while (0)
+#define arch_leave_lazy_mmu_mode()	do {} while (0)
+#endif
+
 /*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
Index: linux-2.6.18-rc2/mm/memory.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/memory.c	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/memory.c	2006-07-28 14:18:44.000000000 -0700
@@ -506,6 +506,7 @@ again:
 	src_ptl = pte_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
+	arch_enter_lazy_mmu_mode();
 	do {
 		/*
 		 * We are holding two locks at this point - either of them
@@ -525,6 +526,7 @@ again:
 		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 
 	spin_unlock(src_ptl);
 	pte_unmap_nested(src_pte - 1);
@@ -627,6 +629,7 @@ static unsigned long zap_pte_range(struc
 	int anon_rss = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
 		if (pte_none(ptent)) {
@@ -692,6 +695,7 @@ static unsigned long zap_pte_range(struc
 		pte_clear_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
+	arch_leave_lazy_mmu_mode();
 
 	add_mm_rss(mm, file_rss, anon_rss);
 	pte_unmap_unlock(pte - 1, ptl);
@@ -1108,6 +1112,7 @@ static int zeromap_pte_range(struct mm_s
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
+	arch_enter_lazy_mmu_mode();
 	do {
 		struct page *page = ZERO_PAGE(addr);
 		pte_t zero_pte = pte_wrprotect(mk_pte(page, prot));
@@ -1117,6 +1122,7 @@ static int zeromap_pte_range(struct mm_s
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, zero_pte);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 	return 0;
 }
@@ -1269,11 +1275,13 @@ static int remap_pte_range(struct mm_str
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
+	arch_enter_lazy_mmu_mode();
 	do {
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, pfn_pte(pfn, prot));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 	return 0;
 }
Index: linux-2.6.18-rc2/mm/mprotect.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/mprotect.c	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mprotect.c	2006-07-28 14:17:25.000000000 -0700
@@ -33,6 +33,7 @@ static void change_pte_range(struct mm_s
 	spinlock_t *ptl;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
 	do {
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
@@ -62,6 +63,7 @@ static void change_pte_range(struct mm_s
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 }
Index: linux-2.6.18-rc2/mm/mremap.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/mremap.c	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/mremap.c	2006-07-28 14:17:25.000000000 -0700
@@ -99,6 +99,7 @@ static void move_ptes(struct vm_area_str
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
 
+	arch_enter_lazy_mmu_mode();
 	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
@@ -108,6 +109,7 @@ static void move_ptes(struct vm_area_str
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
+	arch_leave_lazy_mmu_mode();
 
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
Index: linux-2.6.18-rc2/mm/msync.c
===================================================================
--- linux-2.6.18-rc2.orig/mm/msync.c	2006-07-28 14:15:01.000000000 -0700
+++ linux-2.6.18-rc2/mm/msync.c	2006-07-28 14:17:25.000000000 -0700
@@ -30,6 +30,7 @@ static unsigned long msync_pte_range(str
 
 again:
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
 	do {
 		struct page *page;
 
@@ -51,6 +52,7 @@ again:
 		ret += set_page_dirty(page);
 		progress += 3;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
 	if (addr != end)
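To make the spurious-fault argument in the description concrete, here is a
sketch of the re-check a paravirtualized fault handler would perform -
handle_spurious_fault() is an illustrative name only, not something added
by this patch. Because the faulting CPU has to take the same page table
lock the updater holds, it cannot examine the PTE until
arch_leave_lazy_mmu_mode() has flushed the batch, at which point the fault
is seen to be spurious and simply retried.

/*
 * Sketch only: handle_spurious_fault() is an illustrative name, not part
 * of this patch.  Taking the same page table lock orders the re-check
 * after arch_leave_lazy_mmu_mode() on the updating CPU.
 */
#include <linux/mm.h>

static int handle_spurious_fault(struct mm_struct *mm, pmd_t *pmd,
				 unsigned long address, int write)
{
	spinlock_t *ptl;
	pte_t *pte, entry;
	int spurious;

	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
	entry = *pte;
	/* If the PTE now permits the access, there is nothing to do. */
	spurious = pte_present(entry) && (!write || pte_write(entry));
	pte_unmap_unlock(pte, ptl);
	return spurious;
}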