
Re: [Xen-devel] Issue policing writes from Xen to PV domain memory



>I can only repeat what I said above: You first need to understand why the ring
>is (or appears to be) full. But even with that clarified you still need to 
>have a
>proper solution for the case where the ring might end up being full for valid
>reasons. 

I continued digging into whether the ring is actually getting full. It is 
indeed filling up with events for the runstate area. Running with my patch to 
wait_event(), the first time the listener looks for unconsumed ring requests 
it finds around 50-60 of them, all pointing to the runstate area page. This is 
shown further below.

I then moved on to figuring out why it is getting full. I added the following 
patch on top of the one I submitted. It adds a couple of prints to 
sh_page_fault(), allows Xen writes to be policed, and prints out runstate_guest 
in VCPUOP_register_runstate_memory_area.

diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index aee061c..569cea8 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2846,7 +2846,7 @@ static int sh_page_fault(struct vcpu *v,
     int fast_emul = 0;
 #endif
 
-    SHADOW_PRINTK("d:v=%u:%u va=%#lx err=%u, rip=%lx\n",
+    SHADOW_ERROR("d:v=%u:%u va=%#lx err=%u, rip=%lx\n",
                   v->domain->domain_id, v->vcpu_id, va, regs->error_code,
                   regs->eip);
 
@@ -3073,6 +3073,7 @@ static int sh_page_fault(struct vcpu *v,
              * scheduled during this time, the violation is never resolved and
              * will eventually end with the host crashing.
              */
+#if 0
             if ( (violation && access_w) &&
                  (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
             {
@@ -3080,6 +3081,7 @@ static int sh_page_fault(struct vcpu *v,
                 rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
                                     p2m_ram_rw, p2m_access_rw);
             }
+#endif
 
             if ( violation )
             {
@@ -3089,7 +3091,7 @@ static int sh_page_fault(struct vcpu *v,
                                            access_r, access_w, access_x,
                                            &req_ptr) )
                 {
-                    SHADOW_PRINTK("Page access %c%c%c for gmfn=%"PRI_mfn" "
+                    SHADOW_ERROR("Page access %c%c%c for gmfn=%"PRI_mfn" "
                                   "p2ma: %d\n",
                                   (access_r ? 'r' : '-'),
                                   (access_w ? 'w' : '-'),

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 4291e29..3281195 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1186,6 +1186,7 @@ long do_vcpu_op(int cmd, int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         rc = 0;
         runstate_guest(v) = area.addr.h;
+        gdprintk(XENLOG_DEBUG, "runstate_guest: %p\n", runstate_guest(v).p);
 
         if ( v == current )
         {
On attaching the xen-access listener to the PV domain, I see the following:

<Xen serial output>

(d2) mapping kernel into physical memory
(d2) about to get started...
(XEN) domain.c:1189:d2v0 runstate_guest: ffffffff81cefd80
(XEN) domain.c:1189:d2v0 runstate_guest: ffff88000fc0bd80
<PV guest is up and running now>
<Attaching xen-access listener>
(XEN) sh error: sh_page_fault__guest_4(): d:v=2:0 va=0xffff88000fc0bd80 err=2, rip=ffff82d08018e516
(XEN) sh error: sh_page_fault__guest_4(): Page access -w- for gmfn=134eb p2ma: 5
<the above prints are repeated 61 times>
(XEN) Assertion 'wqv->esp == 0' failed at wait.c:133
(XEN) ----[ Xen-4.5-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82d08012feb6>] prepare_to_wait+0x61/0x243
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: 0000000000000100   rbx: 0000000000000100   rcx: 0000000000000000
(XEN) rdx: ffff82d0802eff00   rsi: 0000000000000000   rdi: ffff830032f1a000
(XEN) rbp: ffff83003ffef4c8   rsp: ffff83003ffef498   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000020   r11: 000000000000000a
(XEN) r12: ffff830032f0cc70   r13: ffff830032f0ca30   r14: ffff83003ffeff18
(XEN) r15: ffff830032f1a000   cr0: 000000008005003b   cr4: 00000000000426b0
(XEN) cr3: 00000000130f3000   cr2: ffff88000fc0bd80
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83003ffef498:
(XEN)    ffff83003ffef4e4 ffff830032f0c9f0 ffff830032f0ca30 ffff830032f24000
(XEN)    ffff83003ffef7a8 0000000000000006 ffff83003ffef508 ffff82d0801f23f1
(XEN)    000001003ffef528 fffffff000000030 ffff83003ffef540 ffff830032f24000
(XEN)    ffff830032f15bc0 ffff830032f24000 ffff83003ffef528 ffff82d0801f7003
(XEN)    0000000000000000 ffff830032f1a000 ffff83003ffef758 ffff82d08021ab68
(XEN)    ffff830000000005 0000000000000002 000000000000000b 0000000000000058
(XEN)    ffff82d080319b78 00000000000003f0 0000000000000000 0000000000000880
(XEN)    ffff83003faf8430 0000000000000108 0000000004040008 ffff82d080319b68
(XEN)    000ffff88000fc0b fffeffff80166300 0000000000013400 ffff88000fc0bd80
(XEN)    0000000000000001 0000000000000000 00000000000134eb 00000000000134eb
(XEN)    ffff83003ffef5e8 0000000000000046 ffff83003ffef608 0000000000000040
(XEN)    ffff83003ffef658 0000000000000046 ffff830032f15bc0 0000000000000005
(XEN)    000000fc04040008 ffff83003ffef688 ffff83003ffef658 0000000000000040
(XEN)    ffff82d0802eff00 0000000000000001 ffff83003ffef700 0000002fbfcd8300
(XEN)    ffff83003ffef668 ffff82d080181fc2 ffff83003ffef678 ffff82d0801824e6
(XEN)    ffff83003ffef6c8 ffff82d080128a81 0000000000000001 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff83003fec9130 ffff83003faf8840
(XEN)    0000000000000001 ffff830032f0ccd0 ffff83003ffef778 ffff82d08011f191
(XEN)    0000000000000000 ffff88000fc0bd80 00000000333bc067 00000000333b8067
(XEN)    0000000039f8a067 80100000134eb067 0000000000014ec2 00000000000333bc
(XEN) Xen call trace:
(XEN)    [<ffff82d08012feb6>] prepare_to_wait+0x61/0x243
(XEN)    [<ffff82d0801f23f1>] __mem_event_claim_slot+0x56/0xb2
(XEN)    [<ffff82d0801f7003>] mem_access_send_req+0x2e/0x5a
(XEN)    [<ffff82d08021ab68>] sh_page_fault__guest_4+0x1d0a/0x24a2
(XEN)    [<ffff82d08018e2fd>] do_page_fault+0x386/0x532
(XEN)    [<ffff82d08022adfd>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82d08018e516>] __copy_to_user_ll+0x2a/0x37            <==== pagefault happens here
(XEN)    [<ffff82d08015fe7d>] update_runstate_area+0xfd/0x106
(XEN)    [<ffff82d08015fe97>] _update_runstate_area+0x11/0x39
(XEN)    [<ffff82d08015ff71>] context_switch+0xb2/0xf05
(XEN)    [<ffff82d080125bd3>] schedule+0x5c8/0x5d7
(XEN)    [<ffff82d0801283f1>] wait+0x9/0xb
(XEN)    [<ffff82d0801f2407>] __mem_event_claim_slot+0x6c/0xb2
(XEN)    [<ffff82d0801f7003>] mem_access_send_req+0x2e/0x5a
(XEN)    [<ffff82d08021ab68>] sh_page_fault__guest_4+0x1d0a/0x24a2
(XEN)    [<ffff82d08018e2fd>] do_page_fault+0x386/0x532
(XEN)    [<ffff82d08022adfd>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82d08018e516>] __copy_to_user_ll+0x2a/0x37              <==== pagefault happens here
(XEN)    [<ffff82d08015fe7d>] update_runstate_area+0xfd/0x106
(XEN)    [<ffff82d08015fe97>] _update_runstate_area+0x11/0x39
(XEN)    [<ffff82d080160db1>] context_switch+0xef2/0xf05
(XEN)    [<ffff82d080125bd3>] schedule+0x5c8/0x5d7
(XEN)    [<ffff82d080128959>] __do_softirq+0x81/0x8c
(XEN)    [<ffff82d0801289b2>] do_softirq+0x13/0x15
(XEN)    [<ffff82d08015d2ae>] idle_loop+0x62/0x72
(XEN) 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'wqv->esp == 0' failed at wait.c:133
(XEN) ****************************************

Xen-access output:

# ./xen-access 2 write
xenaccess init
max_pages = 10100
ring_mfn: 0x30f90
starting write 2
Got event from Xen

<host crashed>

In the above you can see that the RIP where the fault occurs matches that of 
__copy_to_user_ll() and that the faulting VA is the runstate area registered by 
the guest.

When a listener attaches to a running domain, shadow mode is enabled and 
shadow_blow_tables() is called. This means new shadow entries are created 
without write permission. The first and only RIP I see in the prints above 
comes from Xen; guest code never even gets a chance to execute. I believe this 
is because Xen is trying to schedule the guest and hence trying to update the 
guest runstate area, which causes the pagefault and an event to be sent to the 
ring. Now take the analogous scenario in the hypothetical HVM + EPT case where 
we are policing Xen writes to guest memory. The above sequence of events would 
cause neither an EPT violation (I realize this can happen only in guest 
context) nor a pagefault (the page is present and marked writable in the 
guest). All that would happen is that an event is sent to the listener from 
__hvm_copy(), and no cascading faults occur. In the PV case, however, the 
pagefault occurs in Xen context, which I am guessing has to be handled right 
away as there is no pausing the CPU. I would venture to guess that if we 
implemented mem_access for HVM guests running with shadow pagetables, the PV 
situation would arise there too.
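To make the Xen-side write explicit, here is a simplified sketch of the path 
seen in the stack trace above; it is condensed, not the verbatim source of 
update_runstate_area(), and only shows the native 64-bit case:

/* Simplified sketch of the native path only; the real update_runstate_area()
 * also handles the compat case. */
static void update_runstate_area_sketch(struct vcpu *v)
{
    if ( guest_handle_is_null(runstate_guest(v)) )
        return;

    /*
     * __copy_to_guest() ends up in __copy_to_user_ll(), the RIP in the fault
     * above. The destination is the guest VA registered via
     * VCPUOP_register_runstate_memory_area (ffff88000fc0bd80 here). With the
     * shadow entry made read-only for mem_access, this write faults in Xen
     * context on every context switch, and each fault tries to put another
     * event on the ring.
     */
    __copy_to_guest(runstate_guest(v), &v->runstate, 1);
}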

The one thing I am not able to figure out is why the listener, i.e. Dom0's 
vCPU, does not get to run and process the events in the window between enabling 
access and entering the event-processing loop. I am not familiar enough with 
the Xen scheduler to know how it reacts to cascading pagefaults on a guest area 
taken in Xen context, as is happening above. I have even tried pinning Dom0 and 
the guest to different CPUs, but this still occurs. I would be grateful if you 
could provide some insight here.

As for the solution when the ring is full in the PV case, whether or not we are 
policing Xen writes, calling wait() will not work due to the scenario I 
mentioned a while back, which is shown above in the stack trace. I am repeating 
that flow here:
mem_event_claim_slot() -> 
        mem_event_wait_slot() ->
                 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)

The wait_event() macro, expanded with the condition above, looks like this:
do { 
    if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
        break; 
    for ( ; ; ) { 
        prepare_to_wait(&med->wq); 
        if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
            break; 
        wait(); 
    } 
    finish_wait(&med->wq); 
} while (0)

In the case where the ring is full, wait() gets called and the CPU gets 
scheduled away. But since we are in the middle of a pagefault, when the vCPU 
runs again it ends up in handle_exception_saved and the same pagefault is 
retried. Because finish_wait() never ends up being called, wqv->esp never 
becomes 0, and hence the assertion fires on the next go-around. So I think we 
should be calling process_pending_softirqs() instead of wait() for PV domains; 
a rough sketch of what I mean follows below. If there is a better solution, 
please let me know and I will look into implementing it.
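To be concrete, the change I have in mind would be something along these lines 
in mem_event_wait_slot(), where the wait_event() call lives. This is an 
untested sketch only, to show the shape of the idea; keying the distinction on 
is_pv_domain(current->domain) is just my assumption about where to make the 
check:

/* Untested sketch: keep the wait_event() path for HVM, but for PV poll for a
 * free slot with process_pending_softirqs() instead of wait(), since we may
 * be in the middle of handling a pagefault taken in Xen context and cannot
 * safely be descheduled from here. */
static int mem_event_wait_slot(struct mem_event_domain *med)
{
    int rc = -EBUSY;

    if ( !is_pv_domain(current->domain) )
    {
        wait_event(med->wq, mem_event_wait_try_grab(med, &rc) != -EBUSY);
        return rc;
    }

    while ( mem_event_wait_try_grab(med, &rc) == -EBUSY )
        process_pending_softirqs();

    return rc;
}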

Thanks,
Aravindh

