Xen project Mailing List

Re: [Xen-devel] [xen-unstable test] 94442: regressions - FAIL

To: "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx>, "Wei Liu" <wei.liu2@xxxxxxxxxx>

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Tue, 17 May 2016 04:57:16 -0600

Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, osstest service owner <osstest-admin@xxxxxxxxxxxxxx>

Delivery-date: Tue, 17 May 2016 10:57:34 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

>>> On 16.05.16 at 11:24, <wei.liu2@xxxxxxxxxx> wrote:
> On Mon, May 16, 2016 at 02:57:13AM +0000, osstest service owner wrote:
>> flight 94442 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/94442/
> [...]
>>
>> test-amd64-i386-qemuu-rhel6hvm-intel 9 redhat-install fail REGR. vs. 94368
>
> The changes in this flight shouldn't cause failure like this. See below.
>
> It is more likely to be caused by SMEP/SMAP fix, which are now in
> master. It seems that previous run didn't discover this.
>
> Log file at:
>
> http://logs.test-lab.xenproject.org/osstest/logs/94442/test-amd64-i386-qemuu-rhel6hvm-intel/serial-italia0.log
>
> May 15 22:07:44.023500 (XEN) Xen BUG at entry.S:221
> May 15 22:07:47.455549 (XEN) ----[ Xen-4.7.0-rc x86_64 debug=y Not tainted ]----
> May 15 22:07:47.463500 (XEN) CPU: 0
> May 15 22:07:47.463531 (XEN) RIP: e008:[<ffff82d0802411c7>] cr4_pv32_restore+0x37/0x40
> May 15 22:07:47.463567 (XEN) RFLAGS: 0000000000010287 CONTEXT: hypervisor (d0v3)
> May 15 22:07:47.471503 (XEN) rax: 0000000000000000 rbx: 00000000cf195e50 rcx: 0000000000000001
> May 15 22:07:47.479496 (XEN) rdx: ffff8300be907ff8 rsi: 0000000000007ff0 rdi: 000000000022287e
> May 15 22:07:47.487498 (XEN) rbp: 00007cff416f80c7 rsp: ffff8300be907f08 r8: ffff83023df8a000
> May 15 22:07:47.495498 (XEN) r9: ffff83023df8a000 r10: 00000000deadbeef r11: 0000000000800000
> May 15 22:07:47.503510 (XEN) r12: ffff8300bed32000 r13: ffff83023df8a000 r14: 0000000000000000
> May 15 22:07:47.503549 (XEN) r15: ffff83023df72000 cr0: 0000000080050033 cr4: 00000000001526e0
> May 15 22:07:47.511501 (XEN) cr3: 00000002383d7000 cr2: 00000000b71ff000
> May 15 22:07:47.519493 (XEN) ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0000 cs: e008
> May 15 22:07:47.527520 (XEN) Xen code around <ffff82d0802411c7> (cr4_pv32_restore+0x37/0x40):
> May 15 22:07:47.535491 (XEN) 3b 05 03 87 0a 00 74 02 <0f> 0b 5a 31 c0 c3 0f 1f 00 f6 42 04 01 0f 84 26
> May 15 22:07:47.535531 (XEN) Xen stack trace from rsp=ffff8300be907f08:
> May 15 22:07:47.543502 (XEN) 0000000000000000 ffff82d080240f22 ffff83023df72000 0000000000000000
> May 15 22:07:47.551559 (XEN) ffff83023df8a000 ffff8300bed32000 00000000cf195e6c 00000000cf195e50
> May 15 22:07:47.559494 (XEN) 0000000000800000 00000000deadbeef ffff83023df8a000 0000000000000206
> May 15 22:07:47.567496 (XEN) 0000000000000001 0000000000000001 0000000000000000 0000000000007ff0
> May 15 22:07:47.575503 (XEN) 000000000022287e 0000010000000000 00000000c1001027 0000000000000061
> May 15 22:07:47.575543 (XEN) 0000000000000246 00000000cf195e44 0000000000000069 000000000000beef
> May 15 22:07:47.583508 (XEN) 000000000000beef 000000000000beef 000000000000beef 0000000000000000
> May 15 22:07:47.591503 (XEN) ffff8300bed30000 0000000000000000 00000000001526e0
> May 15 22:07:47.599493 (XEN) Xen call trace:
> May 15 22:07:47.599522 (XEN) [<ffff82d0802411c7>] cr4_pv32_restore+0x37/0x40

I think I see the problem that the caching added in v3 introduced:
In compat_restore_all_guest we have (getting patched in by altinsn
patching):

.Lcr4_alt:
        testb $3,UREGS_cs(%rsp)
        jpe   .Lcr4_alt_end
        mov   CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp), %rax
        and   $~XEN_CR4_PV32_BITS, %rax
        mov   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
        mov   %rax, %cr4
.Lcr4_alt_end:

If an NMI occurs between the updating of the cached value and the
actual CR4 write, the NMI handling will cause the cached value to
get SMEP+SMAP enabled again (in both cache and CR4), and once we
get back here, we will clear them in just CR4, leaving the cached
value claiming bits that CR4 no longer has set.

We don't want to undo the caching, as that gave us performance back
at least for 64-bit PV guests. We also can't simply swap the two
instructions: If we did, an NMI between the two would itself trigger
the BUG in cr4_pv32_restore (as the check there assumes that CR4
always has at least as many of the bits of interest set as the
cached value).
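
To make the two orderings discussed above easier to follow, here is a small user-space C model of the race. It is only a sketch: struct cpu_model, model_pv32_restore(), exit_cache_first() and exit_cr4_first() are hypothetical names, and the restore behaviour is an assumption reconstructed from the description in this mail, not a copy of entry.S. A return value of false marks the point where the debug check would hit BUG.

```c
#include <assert.h>
#include <stdbool.h>

/* SMEP is CR4 bit 20, SMAP is CR4 bit 21 (stand-in for XEN_CR4_PV32_BITS). */
#define PV32_BITS ((1UL << 20) | (1UL << 21))

/* Hypothetical model of one CPU: the cached copy and the real register. */
struct cpu_model {
    unsigned long cached_cr4;
    unsigned long hw_cr4;
};

/*
 * Assumed behaviour of cr4_pv32_restore: if the cache has the bits clear,
 * set them in both cache and CR4; otherwise run the debug check.
 * Returns false where Xen would hit BUG.
 */
bool model_pv32_restore(struct cpu_model *c)
{
    if (!(c->cached_cr4 & PV32_BITS)) {
        c->cached_cr4 |= PV32_BITS;
        c->hw_cr4 = c->cached_cr4;
        return true;
    }
    /* Debug check: every cached bit of interest must also be set in CR4. */
    return (c->cached_cr4 & PV32_BITS & ~c->hw_cr4) == 0;
}

/* Current order: update the cache first, then CR4. */
bool exit_cache_first(struct cpu_model *c, bool nmi_between)
{
    assert(c->cached_cr4 == c->hw_cr4);      /* in sync on entry */
    unsigned long rax = c->cached_cr4 & ~PV32_BITS;
    bool ok = true;

    c->cached_cr4 = rax;                     /* mov %rax, <cached cr4> */
    if (nmi_between)
        ok = model_pv32_restore(c);          /* NMI lands in the window */
    c->hw_cr4 = rax;                         /* mov %rax, %cr4 */
    return ok;
}

/* Swapped order: update CR4 first, then the cache. */
bool exit_cr4_first(struct cpu_model *c, bool nmi_between)
{
    assert(c->cached_cr4 == c->hw_cr4);      /* in sync on entry */
    unsigned long rax = c->cached_cr4 & ~PV32_BITS;
    bool ok = true;

    c->hw_cr4 = rax;                         /* mov %rax, %cr4 */
    if (nmi_between)
        ok = model_pv32_restore(c);          /* NMI lands in the window */
    c->cached_cr4 = rax;                     /* mov %rax, <cached cr4> */
    return ok;
}
```

In this model, exit_cache_first() with an NMI in the window returns true but leaves the cache with the bits set and CR4 without them, so the next model_pv32_restore() call returns false. exit_cr4_first() with an NMI in the window returns false right away.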

The options I see are:

1) Ditch the debug check altogether, for being a false positive in
   exactly one corner case.
2) Make the NMI handler recognize the single critical pair of
   instructions.
3) Change the code sequence above to

.Lcr4_alt:
        testb $3,UREGS_cs(%rsp)
        jpe   .Lcr4_alt_end
        mov   CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp), %rax
        and   $~XEN_CR4_PV32_BITS, %rax
1:
        mov   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
        mov   %rax, %cr4
        /* (suitable comment goes here) */
        cmp   %rax, CPUINFO_cr4-CPUINFO_guest_cpu_user_regs(%rsp)
        jne   1b
.Lcr4_alt_end:

   (assuming that an insane flood of NMIs not allowing this loop to
   be exited would be sufficiently problematic in other ways).

I dislike 1, and between 2 and 3 I think I'd prefer the latter,
unless someone else sees something wrong with such an approach.

> May 15 22:07:47.607524 (XEN) Xen BUG at entry.S:221

A fix for this recursive occurrence was already sent.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
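
The retry loop proposed as option 3 can be checked with a small user-space C model. This is a sketch under stated assumptions: struct cr4_state, nmi_reenable() and exit_with_retry() are hypothetical names, the NMI-side behaviour is taken from the description in this mail rather than from entry.S, and only NMIs that land between the cache write and the CR4 write are modelled.

```c
#include <assert.h>

/* SMEP is CR4 bit 20, SMAP is CR4 bit 21 (stand-in for XEN_CR4_PV32_BITS). */
#define CR4_PV32_BITS ((1UL << 20) | (1UL << 21))

/* Hypothetical per-CPU state: the cached copy and the real register. */
struct cr4_state {
    unsigned long cached_cr4;
    unsigned long hw_cr4;
};

/* Assumed NMI-side effect: re-enable the bits when the cache has them clear. */
void nmi_reenable(struct cr4_state *s)
{
    if (!(s->cached_cr4 & CR4_PV32_BITS)) {
        s->cached_cr4 |= CR4_PV32_BITS;
        s->hw_cr4 = s->cached_cr4;
    }
}

/*
 * Model of the option 3 sequence. nmis_in_window is how many consecutive
 * passes get interrupted between the two writes. Returns the number of
 * passes through the loop body.
 */
unsigned int exit_with_retry(struct cr4_state *s, unsigned int nmis_in_window)
{
    assert(s->cached_cr4 == s->hw_cr4);        /* in sync on entry */
    unsigned long rax = s->cached_cr4 & ~CR4_PV32_BITS;
    unsigned int passes = 0;

    do {
        s->cached_cr4 = rax;                   /* 1: mov %rax, <cached cr4> */
        if (nmis_in_window > 0) {
            nmis_in_window--;
            nmi_reenable(s);                   /* NMI lands in the window */
        }
        s->hw_cr4 = rax;                       /* mov %rax, %cr4 */
        passes++;
    } while (s->cached_cr4 != rax);            /* cmp ...; jne 1b */

    return passes;
}
```

In this model every interrupted pass costs one extra iteration, and the loop exits with cache and CR4 in agreement and the SMEP/SMAP bits clear in both.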

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.