
Re: [Xen-devel] [xen-unstable test] 94442: regressions - FAIL



>>> On 16.05.16 at 11:24, <wei.liu2@xxxxxxxxxx> wrote:
> On Mon, May 16, 2016 at 02:57:13AM +0000, osstest service owner wrote:
>> flight 94442 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/94442/ 
> [...]
>> 
>>  test-amd64-i386-qemuu-rhel6hvm-intel  9 redhat-install    fail REGR. vs. 94368
> 
> The changes in this flight shouldn't cause failure like this. See below.
> 
> It is more likely to be caused by SMEP/SMAP fix, which are now in
> master. It seems that previous run didn't discover this.
> 
> Log file at:
> 
> http://logs.test-lab.xenproject.org/osstest/logs/94442/test-amd64-i386-qemuu-rhel6hvm-intel/serial-italia0.log
> 
> May 15 22:07:44.023500 (XEN) Xen BUG at entry.S:221
> May 15 22:07:47.455549 (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=y  Not tainted ]----
> May 15 22:07:47.463500 (XEN) CPU:    0
> May 15 22:07:47.463531 (XEN) RIP:    e008:[<ffff82d0802411c7>] cr4_pv32_restore+0x37/0x40
> May 15 22:07:47.463567 (XEN) RFLAGS: 0000000000010287   CONTEXT: hypervisor (d0v3)
> May 15 22:07:47.471503 (XEN) rax: 0000000000000000   rbx: 00000000cf195e50   rcx: 0000000000000001
> May 15 22:07:47.479496 (XEN) rdx: ffff8300be907ff8   rsi: 0000000000007ff0   rdi: 000000000022287e
> May 15 22:07:47.487498 (XEN) rbp: 00007cff416f80c7   rsp: ffff8300be907f08   r8:  ffff83023df8a000
> May 15 22:07:47.495498 (XEN) r9:  ffff83023df8a000   r10: 00000000deadbeef   r11: 0000000000800000
> May 15 22:07:47.503510 (XEN) r12: ffff8300bed32000   r13: ffff83023df8a000   r14: 0000000000000000
> May 15 22:07:47.503549 (XEN) r15: ffff83023df72000   cr0: 0000000080050033   cr4: 00000000001526e0
> May 15 22:07:47.511501 (XEN) cr3: 00000002383d7000   cr2: 00000000b71ff000
> May 15 22:07:47.519493 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0000   cs: e008
> May 15 22:07:47.527520 (XEN) Xen code around <ffff82d0802411c7> (cr4_pv32_restore+0x37/0x40):
> May 15 22:07:47.535491 (XEN)  3b 05 03 87 0a 00 74 02 <0f> 0b 5a 31 c0 c3 0f 1f 00 f6 42 04 01 0f 84 26
> May 15 22:07:47.535531 (XEN) Xen stack trace from rsp=ffff8300be907f08:
> May 15 22:07:47.543502 (XEN)    0000000000000000 ffff82d080240f22 ffff83023df72000 0000000000000000
> May 15 22:07:47.551559 (XEN)    ffff83023df8a000 ffff8300bed32000 00000000cf195e6c 00000000cf195e50
> May 15 22:07:47.559494 (XEN)    0000000000800000 00000000deadbeef ffff83023df8a000 0000000000000206
> May 15 22:07:47.567496 (XEN)    0000000000000001 0000000000000001 0000000000000000 0000000000007ff0
> May 15 22:07:47.575503 (XEN)    000000000022287e 0000010000000000 00000000c1001027 0000000000000061
> May 15 22:07:47.575543 (XEN)    0000000000000246 00000000cf195e44 0000000000000069 000000000000beef
> May 15 22:07:47.583508 (XEN)    000000000000beef 000000000000beef 000000000000beef 0000000000000000
> May 15 22:07:47.591503 (XEN)    ffff8300bed30000 0000000000000000 00000000001526e0
> May 15 22:07:47.599493 (XEN) Xen call trace:
> May 15 22:07:47.599522 (XEN)    [<ffff82d0802411c7>] cr4_pv32_restore+0x37/0x40

I think I see the problem that the caching added in v3 introduced.
In compat_restore_all_guest we have (patched in via altinsn
patching):

.Lcr4_alt:
        testb $3,UREGS_cs(%rsp)
        jpe   .Lcr4_alt_end
        mov   CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp), %rax
        and   $~XEN_CR4_PV32_BITS, %rax
        mov   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
        mov   %rax, %cr4
.Lcr4_alt_end:

If an NMI occurs between the update of the cached value and the
actual CR4 write, the NMI handling will re-enable SMEP+SMAP (in
both the cache and CR4), and once we get back here, we will clear
them in CR4 only. The cache then claims the bits are set while CR4
has them clear.

We don't want to undo the caching, as that gave us performance back
at least for 64-bit PV guests.

We also can't simply swap the two instructions: If we did, an NMI
between the two would itself trigger the BUG in cr4_pv32_restore
(as the check there assumes that CR4 always has no less of the
bits of interest set than the cached value).
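
To make the two orderings easier to reason about, here is a small
Python model. It is only a sketch: the Cpu class, the Bug exception,
exit_to_guest() and the fixed PV32_BITS mask are illustrative
assumptions, and cr4_pv32_restore() is reduced to the behaviour
described in this mail, not copied from entry.S.

```python
SMEP = 1 << 20  # CR4.SMEP
SMAP = 1 << 21  # CR4.SMAP
PV32_BITS = SMEP | SMAP  # assumed mask; the real one depends on CPU features


class Bug(Exception):
    """Stands in for the BUG seen at entry.S:221."""


class Cpu:
    def __init__(self):
        # Hypervisor context: the bits are set in CR4 and in the cache.
        self.cr4 = PV32_BITS
        self.cached = PV32_BITS

    def cr4_pv32_restore(self):
        """Simplified model of the hook run on entry (e.g. from an NMI)."""
        if not (self.cached & PV32_BITS):
            # Cache says "off": switch the bits back on in both places.
            self.cached |= PV32_BITS
            self.cr4 = self.cached
        elif (self.cr4 & PV32_BITS) != PV32_BITS:
            # Debug check: cache says "on" but CR4 disagrees.
            raise Bug("CR4 lacks bits that the cached value has")


def exit_to_guest(cpu, cache_first, nmi_between):
    """Do the two stores in the given order, optionally with an NMI between."""
    rax = cpu.cached & ~PV32_BITS
    if cache_first:
        cpu.cached = rax          # current ordering: cache, then CR4
        if nmi_between:
            cpu.cr4_pv32_restore()
        cpu.cr4 = rax
    else:
        cpu.cr4 = rax             # swapped ordering: CR4, then cache
        if nmi_between:
            cpu.cr4_pv32_restore()
        cpu.cached = rax
```

With cache_first=True and an NMI in the window, the model ends up with
the bits set in the cache but clear in CR4, so the next call to
cr4_pv32_restore() raises Bug. With cache_first=False the NMI itself
raises Bug, which is the reason a plain swap does not help.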

The options I see are:

1) Ditch the debug check altogether, for being a false positive in
exactly one corner case.

2) Make the NMI handler recognize the single critical pair of
instructions.

3) Change the code sequence above to

.Lcr4_alt:
        testb $3,UREGS_cs(%rsp)
        jpe   .Lcr4_alt_end
        mov   CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp), %rax
        and   $~XEN_CR4_PV32_BITS, %rax
1:
        mov   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
        mov   %rax, %cr4
        /* (suitable comment goes here) */
        cmp   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
        jne   1b
.Lcr4_alt_end:

(assuming that an insane flood of NMIs not allowing this loop to
be exited would be sufficiently problematic in other ways).
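
As a sanity check of the retry idea in option 3, the loop can be
modelled in Python. Again this is a sketch under assumptions: Cpu,
nmi() and exit_with_retry() are made-up names, the NMI is reduced to
"bits go back on in both the cache and CR4", and the mask is fixed to
SMEP|SMAP.

```python
PV32_BITS = (1 << 20) | (1 << 21)  # CR4.SMEP | CR4.SMAP (assumed mask)


class Cpu:
    def __init__(self):
        # Hypervisor context: the bits are set in CR4 and in the cache.
        self.cr4 = PV32_BITS
        self.cached = PV32_BITS

    def nmi(self):
        # Modelled effect of NMI handling: bits on again in cache and CR4.
        self.cached |= PV32_BITS
        self.cr4 = self.cached


def exit_with_retry(cpu, nmi_schedule):
    """Model of the option 3 loop.

    nmi_schedule yields one boolean per loop iteration; True means an
    NMI hits between the cache store and the CR4 write of that iteration.
    Returns the number of iterations the loop needed.
    """
    rax = cpu.cached & ~PV32_BITS
    schedule = iter(nmi_schedule)
    iterations = 0
    while True:
        iterations += 1
        cpu.cached = rax              # mov %rax, CPUINFO_cr4...
        if next(schedule, False):
            cpu.nmi()
        cpu.cr4 = rax                 # mov %rax, %cr4
        if cpu.cached == rax:         # cmp / jne 1b
            return iterations
```

In this model a single NMI in the window costs one extra iteration,
and the loop always exits with the cache and CR4 in agreement. It only
fails to exit if every iteration is hit by an NMI, which is the
"insane flood" case.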

I dislike 1, and between 2 and 3 I think I'd prefer the latter, unless
someone else sees something wrong with such an approach.

> May 15 22:07:47.607524 (XEN) Xen BUG at entry.S:221

A fix for this recursive occurrence was already sent.

Jan
