
Re: [Xen-devel] [PATCH] Xen vMCE bugfix: inject vMCE# to all vcpus



>>> On 13.06.12 at 10:05, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx> wrote:
> Xen vMCE bugfix: inject vMCE# to all vcpus
> 
> In our testing of Win8 guest MCE, we found a bug: no matter what
> SRAO/SRAR error Xen injects into the Win8 guest, it always reboots.
> 
> The root cause is that the current Xen vMCE logic injects vMCE# only
> to vcpu0, which is not correct for Intel MCE (under the Intel
> architecture, hardware generates MCE# on all CPUs).
> 
> This patch fixes the vMCE injection bug by injecting vMCE# to all
> vcpus.

I see no correlation between the fix (and its description) and the
problem at hand: Why would Win8 reboot if it doesn't receive a
particular MCE on all CPUs? Isn't that model-specific behavior?

Furthermore, I doubt that an MCE on one socket indeed causes MCEs
on all other sockets, not to speak of distinct NUMA nodes (it would
already surprise me if MCEs got broadcast across cores within a
socket, unless they are caused by a resource shared across cores).

> --- a/xen/arch/x86/cpu/mcheck/mce_intel.c     Tue Jun 05 03:18:00 2012 +0800
> +++ b/xen/arch/x86/cpu/mcheck/mce_intel.c     Wed Jun 13 23:40:45 2012 +0800
> @@ -638,6 +638,32 @@
>      return rec;
>  }
>  
> +static int inject_vmce(struct domain *d)

Is it really necessary to move this vendor-independent function
into a vendor-specific source file?

> +{
> +    struct vcpu *v;
> +
> +    /* inject vMCE to all vcpus */
> +    for_each_vcpu(d, v)
> +    {
> +        if ( !test_and_set_bool(v->mce_pending) &&
> +            ((d->is_hvm) ? 1 :
> +            guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )

Quite strange a way to say

            (d->is_hvm ||
             guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check))

> +        {
> +            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
> +                       d->domain_id, v->vcpu_id);
> +            vcpu_kick(v);
> +        }
> +        else
> +        {
> +            mce_printk(MCE_QUIET, "Fail to inject vMCE to dom%d vcpu%d\n",
> +                       d->domain_id, v->vcpu_id);
> +            return -1;

Why do you bail out here? This is particularly bad if v->mce_pending
was already set on some vCPU, as that could simply mean the guest
just didn't get around to handling the vMCE yet.
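
If the function is to stay at all, I'd rather expect something along
these lines (an untested sketch only, reusing the helpers the patch
itself uses, folding in the simplified condition above, and not
treating an already pending vMCE as a reason to stop):

    static int inject_vmce(struct domain *d)
    {
        struct vcpu *v;
        int ret = 0;

        /* Inject vMCE to all vCPUs. */
        for_each_vcpu ( d, v )
        {
            if ( test_and_set_bool(v->mce_pending) )
                /* Already pending - guest simply hasn't handled it yet. */
                continue;

            if ( d->is_hvm ||
                 guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check) )
            {
                mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
                           d->domain_id, v->vcpu_id);
                vcpu_kick(v);
            }
            else
            {
                mce_printk(MCE_QUIET, "Fail to inject vMCE to dom%d vcpu%d\n",
                           d->domain_id, v->vcpu_id);
                /* Note the failure, but don't abort delivery to the rest. */
                ret = -1;
            }
        }

        return ret;
    }

That way a vCPU that merely hasn't consumed an earlier vMCE doesn't
prevent delivery to the remaining ones.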

> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  static void intel_memerr_dhandler(
>               struct mca_binfo *binfo,
>               enum mce_result *result,

Also, how does this whole change interact with vmce_{rd,wr}msr()?
The struct bank_entry instances live on a per-domain list, so the
vMCE being delivered to all vCPUs means they will all race for the
single entry (and might erroneously access others, particularly in
vmce_wrmsr()'s MCG_STATUS handling).
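
To illustrate the kind of race I mean with a stand-alone toy (all
names below are made up for the illustration, not the actual Xen
code): with one pending entry on a per-domain list and several vCPUs
entering their handlers at once, whoever gets there first consumes
the entry, and the rest read empty or stale state:

    /* Toy model: N "vCPU" threads all consume from one per-domain list. */
    #include <pthread.h>
    #include <stdio.h>

    struct bank_entry { struct bank_entry *next; int bank; };

    static struct bank_entry entry = { NULL, 5 };      /* one pending error */
    static struct bank_entry *impact_header = &entry;  /* per-domain list head */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *vcpu_mce_handler(void *arg)
    {
        long vcpu_id = (long)arg;
        struct bank_entry *e = NULL;

        pthread_mutex_lock(&lock);
        if ( impact_header )
        {
            e = impact_header;          /* first vCPU wins ... */
            impact_header = e->next;
        }
        pthread_mutex_unlock(&lock);

        if ( e )
            printf("vcpu%ld: consumed entry for bank %d\n", vcpu_id, e->bank);
        else
            printf("vcpu%ld: nothing left - sees empty/stale state\n", vcpu_id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for ( long i = 0; i < 4; i++ )
            pthread_create(&t[i], NULL, vcpu_mce_handler, (void *)i);
        for ( int i = 0; i < 4; i++ )
            pthread_join(t[i], NULL);
        return 0;
    }

Even with locking, only one of the four handlers finds the entry;
that is exactly what all vCPUs would contend for here.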

Jan

