
Re: [Xen-devel] Issue policing writes from Xen to PV domain memory



>I can only repeat what I said above: You first need to understand why the ring
>is (or appears to be) full. But even with that clarified you still need to 
>have a
>proper solution for the case where the ring might end up being full for valid
>reasons. 

I continued digging into whether the ring is actually getting full. It is 
indeed filling up with events for the runstate area. Running with my patch to 
wait_event(), the first time the listener looks for unconsumed ring requests 
it finds around 50-60 of them, all pointing to the runstate area page. This is 
shown further below.

I then moved on to figuring out why it is getting full. I added the following 
patch on top of the one I submitted. It adds a couple of prints to 
sh_page_fault(), allows Xen writes to be policed, and prints out runstate_guest 
in VCPUOP_register_runstate_memory_area.

diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index aee061c..569cea8 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2846,7 +2846,7 @@ static int sh_page_fault(struct vcpu *v,
     int fast_emul = 0;
 #endif
 
-    SHADOW_PRINTK("d:v=%u:%u va=%#lx err=%u, rip=%lx\n",
+    SHADOW_ERROR("d:v=%u:%u va=%#lx err=%u, rip=%lx\n",
                   v->domain->domain_id, v->vcpu_id, va, regs->error_code,
                   regs->eip);
 
@@ -3073,6 +3073,7 @@ static int sh_page_fault(struct vcpu *v,
              * scheduled during this time, the violation is never resolved and
              * will eventually end with the host crashing.
              */
+#if 0
             if ( (violation && access_w) &&
                  (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
             {
@@ -3080,6 +3081,7 @@ static int sh_page_fault(struct vcpu *v,
                 rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
                                     p2m_ram_rw, p2m_access_rw);
             }
+#endif
 
             if ( violation )
             {
@@ -3089,7 +3091,7 @@ static int sh_page_fault(struct vcpu *v,
                                            access_r, access_w, access_x,
                                            &req_ptr) )
                 {
-                    SHADOW_PRINTK("Page access %c%c%c for gmfn=%"PRI_mfn" "
+                    SHADOW_ERROR("Page access %c%c%c for gmfn=%"PRI_mfn" "
                                   "p2ma: %d\n",
                                   (access_r ? 'r' : '-'),
                                   (access_w ? 'w' : '-'),

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 4291e29..3281195 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1186,6 +1186,7 @@ long do_vcpu_op(int cmd, int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         rc = 0;
         runstate_guest(v) = area.addr.h;
+        gdprintk(XENLOG_DEBUG, "runstate_guest: %p\n", runstate_guest(v).p);
 
         if ( v == current )
         {
On attaching the xen-access listener to the PV domain, I see the following:

<Xen serial output>

(d2) mapping kernel into physical memory
(d2) about to get started...
(XEN) domain.c:1189:d2v0 runstate_guest: ffffffff81cefd80
(XEN) domain.c:1189:d2v0 runstate_guest: ffff88000fc0bd80
<PV guest is up and running now>
<Attaching xen-access listener>
(XEN) sh error: sh_page_fault__guest_4(): d:v=2:0 va=0xffff88000fc0bd80 err=2, rip=ffff82d08018e516
(XEN) sh error: sh_page_fault__guest_4(): Page access -w- for gmfn=134eb p2ma: 5
<the above prints are repeated 61 times>
(XEN) Assertion 'wqv->esp == 0' failed at wait.c:133
(XEN) ----[ Xen-4.5-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82d08012feb6>] prepare_to_wait+0x61/0x243
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: 0000000000000100   rbx: 0000000000000100   rcx: 0000000000000000
(XEN) rdx: ffff82d0802eff00   rsi: 0000000000000000   rdi: ffff830032f1a000
(XEN) rbp: ffff83003ffef4c8   rsp: ffff83003ffef498   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000020   r11: 000000000000000a
(XEN) r12: ffff830032f0cc70   r13: ffff830032f0ca30   r14: ffff83003ffeff18
(XEN) r15: ffff830032f1a000   cr0: 000000008005003b   cr4: 00000000000426b0
(XEN) cr3: 00000000130f3000   cr2: ffff88000fc0bd80
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83003ffef498:
(XEN)    ffff83003ffef4e4 ffff830032f0c9f0 ffff830032f0ca30 ffff830032f24000
(XEN)    ffff83003ffef7a8 0000000000000006 ffff83003ffef508 ffff82d0801f23f1
(XEN)    000001003ffef528 fffffff000000030 ffff83003ffef540 ffff830032f24000
(XEN)    ffff830032f15bc0 ffff830032f24000 ffff83003ffef528 ffff82d0801f7003
(XEN)    0000000000000000 ffff830032f1a000 ffff83003ffef758 ffff82d08021ab68
(XEN)    ffff830000000005 0000000000000002 000000000000000b 0000000000000058
(XEN)    ffff82d080319b78 00000000000003f0 0000000000000000 0000000000000880
(XEN)    ffff83003faf8430 0000000000000108 0000000004040008 ffff82d080319b68
(XEN)    000ffff88000fc0b fffeffff80166300 0000000000013400 ffff88000fc0bd80
(XEN)    0000000000000001 0000000000000000 00000000000134eb 00000000000134eb
(XEN)    ffff83003ffef5e8 0000000000000046 ffff83003ffef608 0000000000000040
(XEN)    ffff83003ffef658 0000000000000046 ffff830032f15bc0 0000000000000005
(XEN)    000000fc04040008 ffff83003ffef688 ffff83003ffef658 0000000000000040
(XEN)    ffff82d0802eff00 0000000000000001 ffff83003ffef700 0000002fbfcd8300
(XEN)    ffff83003ffef668 ffff82d080181fc2 ffff83003ffef678 ffff82d0801824e6
(XEN)    ffff83003ffef6c8 ffff82d080128a81 0000000000000001 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff83003fec9130 ffff83003faf8840
(XEN)    0000000000000001 ffff830032f0ccd0 ffff83003ffef778 ffff82d08011f191
(XEN)    0000000000000000 ffff88000fc0bd80 00000000333bc067 00000000333b8067
(XEN)    0000000039f8a067 80100000134eb067 0000000000014ec2 00000000000333bc
(XEN) Xen call trace:
(XEN)    [<ffff82d08012feb6>] prepare_to_wait+0x61/0x243
(XEN)    [<ffff82d0801f23f1>] __mem_event_claim_slot+0x56/0xb2
(XEN)    [<ffff82d0801f7003>] mem_access_send_req+0x2e/0x5a
(XEN)    [<ffff82d08021ab68>] sh_page_fault__guest_4+0x1d0a/0x24a2
(XEN)    [<ffff82d08018e2fd>] do_page_fault+0x386/0x532
(XEN)    [<ffff82d08022adfd>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82d08018e516>] __copy_to_user_ll+0x2a/0x37            <==== pagefault happens here
(XEN)    [<ffff82d08015fe7d>] update_runstate_area+0xfd/0x106
(XEN)    [<ffff82d08015fe97>] _update_runstate_area+0x11/0x39
(XEN)    [<ffff82d08015ff71>] context_switch+0xb2/0xf05
(XEN)    [<ffff82d080125bd3>] schedule+0x5c8/0x5d7
(XEN)    [<ffff82d0801283f1>] wait+0x9/0xb
(XEN)    [<ffff82d0801f2407>] __mem_event_claim_slot+0x6c/0xb2
(XEN)    [<ffff82d0801f7003>] mem_access_send_req+0x2e/0x5a
(XEN)    [<ffff82d08021ab68>] sh_page_fault__guest_4+0x1d0a/0x24a2
(XEN)    [<ffff82d08018e2fd>] do_page_fault+0x386/0x532
(XEN)    [<ffff82d08022adfd>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82d08018e516>] __copy_to_user_ll+0x2a/0x37              <==== pagefault happens here
(XEN)    [<ffff82d08015fe7d>] update_runstate_area+0xfd/0x106
(XEN)    [<ffff82d08015fe97>] _update_runstate_area+0x11/0x39
(XEN)    [<ffff82d080160db1>] context_switch+0xef2/0xf05
(XEN)    [<ffff82d080125bd3>] schedule+0x5c8/0x5d7
(XEN)    [<ffff82d080128959>] __do_softirq+0x81/0x8c
(XEN)    [<ffff82d0801289b2>] do_softirq+0x13/0x15
(XEN)    [<ffff82d08015d2ae>] idle_loop+0x62/0x72
(XEN) 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'wqv->esp == 0' failed at wait.c:133
(XEN) ****************************************

Xen-access output:

# ./xen-access 2 write
xenaccess init
max_pages = 10100
ring_mfn: 0x30f90
starting write 2
Got event from Xen

<host crashed>

In the above you can see that the RIP where the fault occurs matches that of 
__copy_to_user_ll() and that the faulting VA is the runstate area registered by 
the guest.

When a listener attaches to a running domain, shadow mode is enabled and 
shadow_blow_tables() is called. This means new shadow entries are created 
without write permission. The first and only RIP I see in the prints above 
comes from Xen; guest code never even gets a chance to execute. I believe this 
is because Xen is trying to schedule the guest and hence trying to update the 
guest runstate area, which causes the pagefault and an event to be sent to the 
ring. Now take the analogous scenario in the hypothetical HVM + EPT case where 
we are policing Xen writes to guest memory. The above sequence of events would 
cause neither an EPT violation (I realize this can happen only in guest 
context) nor a pagefault (the page is present and marked writable in the 
guest). All that would happen is that an event is sent to the listener from 
__hvm_copy(), and no cascading faults occur. In the PV case, however, the 
pagefault occurs in Xen context, which I am guessing has to be handled right 
away as there is no pausing the CPU. I would venture to guess that if we 
implemented mem_access for HVM guests running with shadow pagetables, the PV 
situation would arise there too.
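To make the Xen-side write explicit, here is a simplified sketch of the path 
seen in the stack trace above; it is condensed, not the verbatim source of 
update_runstate_area(), and only shows the native 64-bit case:

/* Simplified sketch of the native path only; the real update_runstate_area()
 * also handles the compat case. */
static void update_runstate_area_sketch(struct vcpu *v)
{
    if ( guest_handle_is_null(runstate_guest(v)) )
        return;

    /*
     * __copy_to_guest() ends up in __copy_to_user_ll(), the RIP in the fault
     * above. The destination is the guest VA registered via
     * VCPUOP_register_runstate_memory_area (ffff88000fc0bd80 here). With the
     * shadow entry made read-only for mem_access, this write faults in Xen
     * context on every context switch, and each fault tries to put another
     * event on the ring.
     */
    __copy_to_guest(runstate_guest(v), &v->runstate, 1);
}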

The one thing I am not able to figure out is why the listener, i.e. Dom0's 
vCPU, does not get to run and process the events in the window between enabling 
access and entering the event-processing loop. I am not familiar enough with 
the Xen scheduler to know how it reacts to cascading pagefaults on a guest area 
taken in Xen context, as is happening above. I have even tried pinning Dom0 and 
the guest to different CPUs, but this still occurs. I would be grateful if you 
could provide some insight here.

As for the solution when the ring is full in the PV case, whether or not we are 
policing Xen writes, calling wait() will not work due to the scenario I 
mentioned a while back, which is shown above in the stack trace. I am repeating 
that flow here:
mem_event_claim_slot() -> 
        mem_event_wait_slot() ->
                 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)

The wait_event() macro, expanded with the condition above, looks like this:
do { 
    if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
        break; 
    for ( ; ; ) { 
        prepare_to_wait(&med->wq); 
        if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
            break; 
        wait(); 
    } 
    finish_wait(&med->wq); 
} while (0)

In the case where the ring is full, wait() gets called and the CPU gets 
scheduled away. But since we are in the middle of a pagefault, when the vCPU 
runs again it ends up in handle_exception_saved and the same pagefault is 
retried. Because finish_wait() never ends up being called, wqv->esp never 
becomes 0, and hence the assertion fires on the next go-around. So I think we 
should be calling process_pending_softirqs() instead of wait() for PV domains; 
a rough sketch of what I mean follows below. If there is a better solution, 
please let me know and I will look into implementing it.
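To be concrete, the change I have in mind would be something along these lines 
in mem_event_wait_slot(), where the wait_event() call lives. This is an 
untested sketch only, to show the shape of the idea; keying the distinction on 
is_pv_domain(current->domain) is just my assumption about where to make the 
check:

/* Untested sketch: keep the wait_event() path for HVM, but for PV poll for a
 * free slot with process_pending_softirqs() instead of wait(), since we may
 * be in the middle of handling a pagefault taken in Xen context and cannot
 * safely be descheduled from here. */
static int mem_event_wait_slot(struct mem_event_domain *med)
{
    int rc = -EBUSY;

    if ( !is_pv_domain(current->domain) )
    {
        wait_event(med->wq, mem_event_wait_try_grab(med, &rc) != -EBUSY);
        return rc;
    }

    while ( mem_event_wait_try_grab(med, &rc) == -EBUSY )
        process_pending_softirqs();

    return rc;
}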

Thanks,
Aravindh

