
Re: [Xen-devel] [PATCH] x86/PV: fix unintended dependency of m2p-strict mode on migration-v2



>>> On 12.01.16 at 12:55, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 12/01/16 10:08, Jan Beulich wrote:
>> This went unnoticed until a backport of this to an older Xen got used,
>> causing migration of guests enabling this VM assist to fail, because
>> page table pinning there precedes vCPU context loading, and hence L4
>> tables get initialized for the wrong mode. Fix this by post-processing
>> L4 tables when setting the intended VM assist flags for the guest.
>>
>> Note that this leaves in place a dependency on vCPU 0 getting its guest
>> context restored first, but afaict the logic here is not the only thing
>> depending on that.
>>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>>
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1067,8 +1067,48 @@ int arch_set_info_guest(
>>          goto out;
>>  
>>      if ( v->vcpu_id == 0 )
>> +    {
>>          d->vm_assist = c(vm_assist);
>>  
>> +        /*
>> +         * In the restore case we need to deal with L4 pages which got
>> +         * initialized with m2p_strict still clear (and which hence lack the
>> +         * correct initial RO_MPT_VIRT_{START,END} L4 entry).
>> +         */
>> +        if ( d != current->domain && VM_ASSIST(d, m2p_strict) &&
>> +             is_pv_domain(d) && !is_pv_32bit_domain(d) &&
>> +             atomic_read(&d->arch.pv_domain.nr_l4_pages) )
>> +        {
>> +            bool_t done = 0;
>> +
>> +            spin_lock_recursive(&d->page_alloc_lock);
>> +
>> +            for ( i = 0; ; )
>> +            {
>> +                struct page_info *page = page_list_remove_head(&d->page_list);
>> +
>> +                if ( page_lock(page) )
>> +                {
>> +                    if ( (page->u.inuse.type_info & PGT_type_mask) ==
>> +                         PGT_l4_page_table )
>> +                        done = !fill_ro_mpt(page_to_mfn(page));
>> +
>> +                    page_unlock(page);
>> +                }
>> +
>> +                page_list_add_tail(page, &d->page_list);
>> +
>> +                if ( done || (!(++i & 0xff) && hypercall_preempt_check()) )
>> +                    break;
>> +            }
>> +
>> +            spin_unlock_recursive(&d->page_alloc_lock);
>> +
>> +            if ( !done )
>> +                return -ERESTART;
> 
> This is a long loop.  It is preemptible, but will incur a time delay
> proportional to the size of the domain during the VM downtime. 
> 
> Could you defer the loop until after %cr3 has been set up, and only
> enter the loop if the kernel l4 table is missing the RO mappings?  That
> way, domains migrated with migration v2 will skip the loop entirely.

Well, first of all this would only remain the result as long as
you or someone else doesn't re-think and possibly move pinning
ahead of context load again.

Deferring until after CR3 has been set up is - afaict - not an
option, as it would defeat the purpose of m2p-strict mode as much
as doing the fixup e.g. in the #PF handler would. This mode being
enabled needs to strictly mean "L4s start with the slot filled,
and uses in user mode clear it", as documented.

There's a much simpler way we could avoid the loop being
entered: checking the previous setting of the flag. However, I
intentionally did not go that route in this initial version, as I
didn't want to add more special casing than needed, and also
wanted to make sure the new code isn't effectively dead.
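
For reference, a minimal sketch of that alternative (not what the
patch does; "was_strict" is made up here, everything else matches
the patch context):

    if ( v->vcpu_id == 0 )
    {
        bool_t was_strict = VM_ASSIST(d, m2p_strict);

        d->vm_assist = c(vm_assist);

        /* Only walk the page list when restore actually turns the flag on. */
        if ( !was_strict && VM_ASSIST(d, m2p_strict) &&
             d != current->domain && is_pv_domain(d) &&
             !is_pv_32bit_domain(d) &&
             atomic_read(&d->arch.pv_domain.nr_l4_pages) )
        {
            /* ... same preemptible L4 fixup loop as in the patch ... */
        }
    }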

>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -1463,13 +1463,20 @@ void init_guest_l4_table(l4_pgentry_t l4
>>          l4tab[l4_table_offset(RO_MPT_VIRT_START)] = l4e_empty();
>>  }
>>  
>> -void fill_ro_mpt(unsigned long mfn)
>> +bool_t fill_ro_mpt(unsigned long mfn)
>>  {
>>      l4_pgentry_t *l4tab = map_domain_page(_mfn(mfn));
>> +    bool_t ret = 0;
>>  
>> -    l4tab[l4_table_offset(RO_MPT_VIRT_START)] =
>> -        idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)];
>> +    if ( !l4e_get_intpte(l4tab[l4_table_offset(RO_MPT_VIRT_START)]) )
>> +    {
>> +        l4tab[l4_table_offset(RO_MPT_VIRT_START)] =
>> +            idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)];
>> +        ret = 1;
> 
> This is a behavioural change.  Previously, the old value was clobbered.
> 
> It appears that you are now using this return value to indicate when the
> entire page list has been walked, but it relies on the slots being
> zero, which is a fragile assumption.

There are only two values possible in this slot - zero or the one
referring to the _shared across domains_ sub-tree for the r/o
MPT. I.e. the change of behavior is only an apparent one, and
I don't see this being fragile either.
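
Expressed as an invariant (illustration only, not part of the
patch):

    ASSERT(!l4e_get_intpte(l4tab[l4_table_offset(RO_MPT_VIRT_START)]) ||
           l4e_get_intpte(l4tab[l4_table_offset(RO_MPT_VIRT_START)]) ==
           l4e_get_intpte(idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)]));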

Jan
