[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Xen-devel] [PATCH V2] x86, amd_ucode: Safeguard against #GP
On 5/30/2014 11:21 AM, Andrew Cooper wrote:
On 30/05/2014 17:01, Aravind Gopalakrishnan wrote:
On 5/28/2014 12:56 PM, boris ostrovsky wrote:
On 5/28/2014 11:16 AM, Aravind Gopalakrishnan wrote:
On 5/27/2014 6:47 PM, Andrew Cooper wrote:
On 27/05/2014 19:24, Aravind Gopalakrishnan wrote:
When HW tries to load a corrupted patch, it generates #GP
and hangs the system. Use wrmsr_safe instead so that we
fail to load microcode gracefully.
Also, massage error handling around apply_microcode to keep
in tune with error handling style of other parts of the code.
Example on a Fam15h system-
(XEN) microcode: CPU0 collect_cpu_info: patch_id=0x6000626
(XEN) microcode: CPU0 size 7870, block size 2586 offset 76 equivID
0x6012 rev 0x6000637
(XEN) microcode: CPU0 found a matching microcode update with version
0x6000637 (current=0x6000626)
(XEN) traps.c:3073: GPF (0000): ffff82d08016f682 -> ffff82d08022d9f8
(XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626
failed
^^^^^^^^^^^^^^^^^^^^^^
As shown, the log message above has the two revisions reversed.
Fix this
Changes in V2:
- Do not ignore return value from wrmsr_safe
- Flip revision numbers as shown above
Signed-off-by: Aravind Gopalakrishnan
<aravind.gopalakrishnan@xxxxxxx>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
I thought we had identified that the hangs were to do with your
use of
'noreboot' on the Xen command line.
Hmm. Yeah.. I figured using wrmsr_safe allows user to just boot
into dom0 without
having to run through reboot loops. (lazy alternative I guess)
Nevermind then. Thanks for the comments (Jan and Andrew). Will keep
in mind for the future.
I don't understand --- the fact that you had noreboot option meant
that your system wouldn't reboot (duh!) when a patch is corrupted
(aka "it will hang"). But I'd think we don't want a reboot neither
--- we want to safely skip the patch (and possibly backlist it).
So, allowing the reboot to happen in turn allows entry to grub.
Where we can simply remove 'ucode=' option and this is same as
'skipping' the patch using wrmsr_safe right?
Except this is now explicitly done by the sysadmin
while wrmsr_safe just works without anyone doing some extra work;
and printing a log message informs user that update went wrong
for whatever reason.
My understanding from earlier comments is that Andrew (and Jan)
would rather not see a change from wrmsrl to wrmsr_safe if it
is needless because there is already a way someone can circumvent the
corrupt patch: just don't provide it on the grub menu.
Thanks,
-Aravind.
Sorry - I think my comment confused the issue. Let me retry.
Originally, the bug was described approximately as "A corrupt ucode
will cause a GP fault, causing the server to hang".
The unhandled #GP fault certainly should be wrapped with wrmsr_safe(),
and an error/warning presented to the user. In the case that a bad
ucode is discovered, it should be discarded and the server allowed to
boot. It is substantially more useful for the server to come up and
say "I couldn't load that bit of microcode you wanted me to", than to
sit in a reboot loop because you made a typo in the bootloader config,
and have to get someone in the datacenter to poke the physical server.
My objection was to the wording of the comment alone. Unhandled #GP
faults do not "hang" Xen unless you ask for them to behave in that
way, given "noreboot" on the command line.
Ah. Okay, let me look into this again then and fix issues with V2.
-Aravind.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|