[Xen-devel] [Patch] Fix for x86_64 boot failures due to bad segment setup for protected mode.
Hi,

We have been debugging a nasty VMX issue recently, where certain kernel builds would fail to boot under VMX from syslinux, whereas others booted fine, and the same kernels booted OK under grub. In practice, this means we have only ever seen it on bootable ISO images.

The underlying problem turned out to be vmxassist mishandling the setup of segment registers on entry into VM86_PROTECTED mode.

The bug occurs when the guest is entering protected mode and reaches the far jump that sets up the 32-bit segment registers for the virtual VMENTER we are about to perform.  vmxassist, in protected_mode(), looks at the segment registers, which at this stage may contain either 16-bit real-mode segments or 32-bit segment selectors (depending on whether the OS has reloaded them since entering protected mode), and tries to load the segments from the GDT.  Unconditionally.  Even if the segment register still actually contains a real-mode 16-bit segment.  Whoops.

Now enter the *second* bug, this time in the main Linux kernel itself.  On x86_64, the kernel boot sequence sets up a bogus GDT: arch/x86_64/boot/setup.S has

gdt_48:
	.word	0x8000			# gdt limit=2048,
					#  256 GDT entries
	.word	0, 0			# gdt base (filled in later)

which shows that we think we have a 2048-byte GDT with 256 entries (that all adds up fine); but 2048 is 0x800, not 0x8000.  So when we do an lgdt to set this GDT up, we make it 16 times too big.

Unfortunately, when we enter this code from syslinux, SS still holds a 16-bit value, 0x3000.  That should be way off the end of the GDT and hence illegal as a descriptor even if we mistakenly tried to load it; but because the GDT has been loaded with the wrong size, vmxassist thinks that 0x3000 *IS* a valid segment descriptor, and loads it into the VMCS for the guest's protected-mode VMENTER.

So if, by chance, the 8 bytes at (GDT+0x3000) in the kernel image pass the VMENTER instruction's simple sanity tests, we ignore the problem and shortly afterwards the kernel loads up a valid SS; but if they fail those sanity tests, the VMENTER fails with "bad guest state".  It is pure luck whether a given vmlinuz works or not.

The reason some kernels show this problem and others do not under Xen is the 0x8000 GDT-size kernel bug.  But the blame lies squarely with Xen, because the kernel never loaded any segment from the undefined GDT area above 0x800, and yet vmxassist tried to set up a VMCS segment from it anyway.  So, while we would still like to fix the kernel GDT, the *real* problem here is vmxassist's mishandling of segments during the move to protected mode.

Now, vmxassist already has code to detect 16-bit segments that survived unmodified across a transition into and out of protected mode, and to save and restore those appropriately.  It does this using "saved_rm_regs", which are cleared on entry to protected mode and then set to the old segment value if we fail to set a given 32-bit segment correctly.

The fix is to save the 16-bit segments *always*, on entry to protected mode when %CR0(PE) is first set; and to clear the saved 16-bit segment and set the 32-bit variant in oldctx whenever a 32-bit segment descriptor is set during the transition to 32-bit CS.  Then, when we finally do the VMENTER, we set up the VMCS from only the 32-bit segments, clearing the VMCS entries for any segment that has not yet been assigned a valid 32-bit selector.

Tested on various RHEL-5 boot.isos, including ones which worked before and ones which triggered the bug; all now boot correctly.
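To make the selector arithmetic above concrete, here is a minimal standalone C sketch (illustrative only, not vmxassist or kernel code; the helper name and the self-contained bounds check are hypothetical) of the test a GDT descriptor lookup amounts to.  It shows why the stale real-mode SS of 0x3000 would be rejected against the intended 0x800-byte table but slips through once the limit word is mis-set to 0x8000:

/* Illustrative sketch only: why a stale SS selector of 0x3000 looks
 * "valid" once the GDT limit is 16 times too large. */
#include <stdint.h>
#include <stdio.h>

/* A selector indexes 8-byte descriptors; the descriptor lies inside the
 * table only if its last byte is at or below the GDT limit. */
static int selector_within_gdt(uint16_t sel, uint32_t gdt_limit)
{
	uint32_t offset = sel & ~7u;	/* strip RPL and TI bits */
	return offset + 7 <= gdt_limit;
}

int main(void)
{
	uint16_t stale_ss = 0x3000;	/* real-mode SS left over from syslinux */

	/* 256 entries * 8 bytes = 2048 = 0x800: the size setup.S meant to set. */
	printf("limit 0x800:  SS=0x3000 %s\n",
	       selector_within_gdt(stale_ss, 0x800) ? "accepted" : "rejected");

	/* The buggy .word 0x8000 makes the table 16 times too big, so the
	 * bytes at GDT+0x3000 suddenly look like a real descriptor. */
	printf("limit 0x8000: SS=0x3000 %s\n",
	       selector_within_gdt(stale_ss, 0x8000) ? "accepted" : "rejected");
	return 0;
}

With the correct limit the stale selector would simply make load_seg() fail; the patch below handles exactly that case by clearing the corresponding VMCS segment instead of trusting whatever bytes happen to sit at that offset.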
Signed-off-by: Stephen Tweedie <sct@xxxxxxxxxx>

diff -r 52ae8dd4bc75 tools/firmware/vmxassist/vm86.c
--- a/tools/firmware/vmxassist/vm86.c	Tue Oct 17 22:09:52 2006 +0100
+++ b/tools/firmware/vmxassist/vm86.c	Thu Nov 09 01:24:57 2006 +0000
@@ -847,6 +847,18 @@ load_seg(unsigned long sel, uint32_t *ba
 }
 
 /*
+ * Emulate a protected mode segment load, falling back to clearing it if
+ * the descriptor was invalid.
+ */
+static void
+load_or_clear_seg(unsigned long sel, uint32_t *base, uint32_t *limit, union vmcs_arbytes *arbytes)
+{
+	if (!load_seg(sel, base, limit, arbytes))
+		load_seg(0, base, limit, arbytes);
+}
+
+
+/*
  * Transition to protected mode
  */
 static void
@@ -857,8 +869,6 @@ protected_mode(struct regs *regs)
 	oldctx.eip = regs->eip;
 	oldctx.esp = regs->uesp;
 	oldctx.eflags = regs->eflags;
-
-	memset(&saved_rm_regs, 0, sizeof(struct regs));
 
 	/* reload all segment registers */
 	if (!load_seg(regs->cs, &oldctx.cs_base,
@@ -866,55 +876,16 @@ protected_mode(struct regs *regs)
 		panic("Invalid %%cs=0x%x for protected mode\n", regs->cs);
 	oldctx.cs_sel = regs->cs;
 
-	if (load_seg(regs->ves, &oldctx.es_base,
-				&oldctx.es_limit, &oldctx.es_arbytes))
-		oldctx.es_sel = regs->ves;
-	else {
-		load_seg(0, &oldctx.es_base,
-			&oldctx.es_limit, &oldctx.es_arbytes);
-		oldctx.es_sel = 0;
-		saved_rm_regs.ves = regs->ves;
-	}
-
-	if (load_seg(regs->uss, &oldctx.ss_base,
-				&oldctx.ss_limit, &oldctx.ss_arbytes))
-		oldctx.ss_sel = regs->uss;
-	else {
-		load_seg(0, &oldctx.ss_base,
-			&oldctx.ss_limit, &oldctx.ss_arbytes);
-		oldctx.ss_sel = 0;
-		saved_rm_regs.uss = regs->uss;
-	}
-
-	if (load_seg(regs->vds, &oldctx.ds_base,
-				&oldctx.ds_limit, &oldctx.ds_arbytes))
-		oldctx.ds_sel = regs->vds;
-	else {
-		load_seg(0, &oldctx.ds_base,
-			&oldctx.ds_limit, &oldctx.ds_arbytes);
-		oldctx.ds_sel = 0;
-		saved_rm_regs.vds = regs->vds;
-	}
-
-	if (load_seg(regs->vfs, &oldctx.fs_base,
-				&oldctx.fs_limit, &oldctx.fs_arbytes))
-		oldctx.fs_sel = regs->vfs;
-	else {
-		load_seg(0, &oldctx.fs_base,
-			&oldctx.fs_limit, &oldctx.fs_arbytes);
-		oldctx.fs_sel = 0;
-		saved_rm_regs.vfs = regs->vfs;
-	}
-
-	if (load_seg(regs->vgs, &oldctx.gs_base,
-				&oldctx.gs_limit, &oldctx.gs_arbytes))
-		oldctx.gs_sel = regs->vgs;
-	else {
-		load_seg(0, &oldctx.gs_base,
-			&oldctx.gs_limit, &oldctx.gs_arbytes);
-		oldctx.gs_sel = 0;
-		saved_rm_regs.vgs = regs->vgs;
-	}
+	load_or_clear_seg(oldctx.es_sel, &oldctx.es_base,
+				&oldctx.es_limit, &oldctx.es_arbytes);
+	load_or_clear_seg(oldctx.ss_sel, &oldctx.ss_base,
+				&oldctx.ss_limit, &oldctx.ss_arbytes);
+	load_or_clear_seg(oldctx.ds_sel, &oldctx.ds_base,
+				&oldctx.ds_limit, &oldctx.ds_arbytes);
+	load_or_clear_seg(oldctx.fs_sel, &oldctx.fs_base,
+				&oldctx.fs_limit, &oldctx.fs_arbytes);
+	load_or_clear_seg(oldctx.gs_sel, &oldctx.gs_base,
+				&oldctx.gs_limit, &oldctx.gs_arbytes);
 
 	/* initialize jump environment to warp back to protected mode */
 	regs->cs = CODE_SELECTOR;
@@ -1002,6 +973,16 @@ set_mode(struct regs *regs, enum vm86_mo
 	case VM86_REAL_TO_PROTECTED:
 		if (mode == VM86_REAL) {
 			regs->eflags |= EFLAGS_TF;
+			saved_rm_regs.vds = regs->vds;
+			saved_rm_regs.ves = regs->ves;
+			saved_rm_regs.vfs = regs->vfs;
+			saved_rm_regs.vgs = regs->vgs;
+			saved_rm_regs.uss = regs->uss;
+			oldctx.ds_sel = 0;
+			oldctx.es_sel = 0;
+			oldctx.fs_sel = 0;
+			oldctx.gs_sel = 0;
+			oldctx.ss_sel = 0;
 			break;
 		} else if (mode == VM86_REAL_TO_PROTECTED) {
 			break;
@@ -1262,6 +1243,10 @@ opcode(struct regs *regs)
 		else
 			regs->ves = pop16(regs);
 		TRACE((regs, regs->eip - eip, "pop %%es"));
+		if (mode == VM86_REAL_TO_PROTECTED) {
+			saved_rm_regs.ves = 0;
+			oldctx.es_sel = regs->ves;
+		}
 		return OPC_EMULATED;
 
 	case 0x0F: /* two byte opcode */

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel