[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [xen-unstable test] 94442: regressions - FAIL



On 17/05/16 11:57, Jan Beulich wrote:
>>>> On 16.05.16 at 11:24, <wei.liu2@xxxxxxxxxx> wrote:
>> On Mon, May 16, 2016 at 02:57:13AM +0000, osstest service owner wrote:
>>> flight 94442 xen-unstable real [real]
>>> http://logs.test-lab.xenproject.org/osstest/logs/94442/ 
>> [...]
>>>  test-amd64-i386-qemuu-rhel6hvm-intel  9 redhat-install    fail REGR. vs. 
>>> 94368
>> The changes in this flight shouldn't cause failure like this. See below.
>>
>> It is more likely to be caused by SMEP/SMAP fix, which are now in
>> master. It seems that previous run didn't discover this.
>>
>> Log file at:
>>
>> http://logs.test-lab.xenproject.org/osstest/logs/94442/test-amd64-i386-qemuu-rhel
>>  
>> 6hvm-intel/serial-italia0.log
>>
>> May 15 22:07:44.023500 (XEN) Xen BUG at entry.S:221
>> May 15 22:07:47.455549 (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=y  Not 
>> tainted ]----
>> May 15 22:07:47.463500 (XEN) CPU:    0
>> May 15 22:07:47.463531 (XEN) RIP:    e008:[<ffff82d0802411c7>] 
>> cr4_pv32_restore+0x37/0x40
>> May 15 22:07:47.463567 (XEN) RFLAGS: 0000000000010287   CONTEXT: hypervisor 
>> (d0v3)
>> May 15 22:07:47.471503 (XEN) rax: 0000000000000000   rbx: 00000000cf195e50   
>> rcx: 0000000000000001
>> May 15 22:07:47.479496 (XEN) rdx: ffff8300be907ff8   rsi: 0000000000007ff0   
>> rdi: 000000000022287e
>> May 15 22:07:47.487498 (XEN) rbp: 00007cff416f80c7   rsp: ffff8300be907f08   
>> r8:  ffff83023df8a000
>> May 15 22:07:47.495498 (XEN) r9:  ffff83023df8a000   r10: 00000000deadbeef   
>> r11: 0000000000800000
>> May 15 22:07:47.503510 (XEN) r12: ffff8300bed32000   r13: ffff83023df8a000   
>> r14: 0000000000000000
>> May 15 22:07:47.503549 (XEN) r15: ffff83023df72000   cr0: 0000000080050033   
>> cr4: 00000000001526e0
>> May 15 22:07:47.511501 (XEN) cr3: 00000002383d7000   cr2: 00000000b71ff000
>> May 15 22:07:47.519493 (XEN) ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 
>> 0000   cs: e008
>> May 15 22:07:47.527520 (XEN) Xen code around <ffff82d0802411c7> 
>> (cr4_pv32_restore+0x37/0x40):
>> May 15 22:07:47.535491 (XEN)  3b 05 03 87 0a 00 74 02 <0f> 0b 5a 31 c0 c3 0f 
>> 1f 00 f6 42 04 01 0f 84 26
>> May 15 22:07:47.535531 (XEN) Xen stack trace from rsp=ffff8300be907f08:
>> May 15 22:07:47.543502 (XEN)    0000000000000000 ffff82d080240f22 
>> ffff83023df72000 0000000000000000
>> May 15 22:07:47.551559 (XEN)    ffff83023df8a000 ffff8300bed32000 
>> 00000000cf195e6c 00000000cf195e50
>> May 15 22:07:47.559494 (XEN)    0000000000800000 00000000deadbeef 
>> ffff83023df8a000 0000000000000206
>> May 15 22:07:47.567496 (XEN)    0000000000000001 0000000000000001 
>> 0000000000000000 0000000000007ff0
>> May 15 22:07:47.575503 (XEN)    000000000022287e 0000010000000000 
>> 00000000c1001027 0000000000000061
>> May 15 22:07:47.575543 (XEN)    0000000000000246 00000000cf195e44 
>> 0000000000000069 000000000000beef
>> May 15 22:07:47.583508 (XEN)    000000000000beef 000000000000beef 
>> 000000000000beef 0000000000000000
>> May 15 22:07:47.591503 (XEN)    ffff8300bed30000 0000000000000000 
>> 00000000001526e0
>> May 15 22:07:47.599493 (XEN) Xen call trace:
>> May 15 22:07:47.599522 (XEN)    [<ffff82d0802411c7>] 
>> cr4_pv32_restore+0x37/0x40
> I think I see the problem the introduction of caching in v3 introduced:
> In compat_restore_all_guest we have (getting patched in by altinsn
> patching):
>
> .Lcr4_alt:
>         testb $3,UREGS_cs(%rsp)
>         jpe   .Lcr4_alt_end
>         mov   CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp), %rax
>         and   $~XEN_CR4_PV32_BITS, %rax
>         mov   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
>         mov   %rax, %cr4
> .Lcr4_alt_end:
>
> If an NMI occurs between the updating og the cached value and the
> actual CR4 write, the NMI handling will cause the cached value to get
> SMEP+SMAP enabled again (in both cache and CR4), and once we
> get back here, we will clear it in just CR4.
>
> We don't want to undo the caching, as that gave us performance back
> at least for 64-bit PV guests.
>
> We also can't simply swap the two instructions: If we did, an NMI
> between the two would itself trigger the BUG in cr4_pv32_restore
> (as the check there assumes that CR4 always has no less of the
> bits of interest set than the cached value).
>
> The options I see are:
>
> 1) Ditch the debug check altogether, for being false positive in
> exactly one corner case.
>
> 2) Make the NMI handler recognize the single critical pair of
> instructions.
>
> 3) Change the code sequence above to
>
> .Lcr4_alt:
>         testb $3,UREGS_cs(%rsp)
>         jpe   .Lcr4_alt_end
>         mov   CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp), %rax
>         and   $~XEN_CR4_PV32_BITS, %rax
> 1:
>         mov   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
>         mov   %rax, %cr4
>         /* (suitable comment goes here) */
>         cmp   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
>         jne   1b
> .Lcr4_alt_end:
>
> (assuming that an insane flood of NMIs not allowing this loop to
> be exited would be sufficiently problematic in other ways).
>
> I dislike 1, and between 2 and 3 I think I'd prefer the latter, unless
> someone else sees something wrong with such an approach.

+1 for option 3.  If we have a flood of NMIs, we have larger problems
than this loop.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.