[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v1] psr: fix bug which may cause crash



On 27.11.2019 07:24, Yi Sun wrote:
> During test, we found a crash on Xen with below trace.
> (XEN) Xen call trace:
> (XEN)    [<ffff82d0802a065a>] R psr.c#l3_cdp_write_msr+0x1e/0x22
> (XEN)    [<ffff82d0802a0858>] F psr.c#do_write_psr_msrs+0x6d/0x109
> (XEN)    [<ffff82d08023e000>] F smp_call_function_interrupt+0x5a/0xac
> (XEN)    [<ffff82d0802a2b89>] F call_function_interrupt+0x20/0x34
> (XEN)    [<ffff82d080282c64>] F do_IRQ+0x175/0x6ae
> (XEN)    [<ffff82d08038b8ba>] F common_interrupt+0x10a/0x120
> (XEN)    [<ffff82d0802ec616>] F cpu_idle.c#acpi_idle_do_entry+0x9d/0xb1
> (XEN)    [<ffff82d0802ecc01>] F cpu_idle.c#acpi_processor_idle+0x41d/0x626
> (XEN)    [<ffff82d08027353b>] F domain.c#idle_loop+0xa5/0xa7
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 20:
> (XEN) GENERAL PROTECTION FAULT
> (XEN) [error_code=0000]
> (XEN) ****************************************
> 
> Root cause is that the cache of COS registers are not initialized
> for CAT/CDP which have non-zero default value. That causes invalid
> write to MSR when COS id has exceeded the max number.. So fix it by
> initializing the cache.

I'm struggling with this description, first and foremost because
there's no (recognizable to me) connection between the supposed
root cause and the crash. Exceeding the maximum number is a bug in
some loop's bounds I would say, not an omission of cached value
initialization. In particular I see in do_write_psr_msrs()

        for ( j = 0; j < cos_num; j++, index++ )
        {
            if ( feat->cos_reg_val[cos * cos_num + j] != info->val[index] )
            {
                feat->cos_reg_val[cos * cos_num + j] = info->val[index];
                props->write_msr(cos, info->val[index], props->type[j]);
            }
        }

Afaict the makes clear that values found in ->cos_reg_val[] would
never get written out (which fits it being just a cache). If
anything, a "random" match of the cache value and the two be
written value would _prevent_ an MSR write despite potentially
the MSR in fact currently holding a different value.

Nevertheless a few remarks on the patch itself, just in case
it's just the description that has misguided me.

> --- a/xen/arch/x86/psr.c
> +++ b/xen/arch/x86/psr.c
> @@ -316,6 +316,7 @@ static bool cat_init_feature(const struct cpuid_leaf 
> *regs,
>          [FEAT_TYPE_L3_CDP] = "L3 CDP",
>          [FEAT_TYPE_L2_CAT] = "L2 CAT",
>      };
> +    unsigned int i = 0;

Unnecessary initializer and too wide a scope.

> @@ -332,7 +333,8 @@ static bool cat_init_feature(const struct cpuid_leaf 
> *regs,
>              return false;
>  
>          /* We reserve cos=0 as default cbm (all bits within cbm_len are 1). 
> */
> -        feat->cos_reg_val[0] = cat_default_val(feat->cat.cbm_len);
> +        for(i = 0; i < MAX_COS_REG_CNT; i++)

There are number of blanks missing here (and even more ones in
the other instance below). It also seems to me that the comment
ends up misplaced now. If ...

> +            feat->cos_reg_val[i] = cat_default_val(feat->cat.cbm_len);
>  
>          wrmsrl((type == FEAT_TYPE_L3_CAT ?
>                  MSR_IA32_PSR_L3_MASK(0) :

... this indeed is to remain a single write, it may want to move
here. But as per above keeping cached and actual values in sync
may make it necessary to move this write into the loop as well.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.