[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH for-4.16] x86/cpuid: do not shrink number of leaves in max policies



On 24/11/2021 16:24, Ian Jackson wrote:
> Roger Pau Monne writes ("[PATCH for-4.16] x86/cpuid: do not shrink number of 
> leaves in max policies"):
>> Shrinking max policies can lead to failures in migration as previous
>> versions of Xen didn't shrink the number of leaves in any case, so
>> it's possible for a guest created on previous versions of Xen that
>> pass CPUID data on the migration stream to contain a max leaf number
>> greatest than the one present on the max policies in versions of Xen
>> containing 540d911c28.
>>
>> Such failure was seen by osstest when doing a migration from Xen
>> 4.15 to Xen 4.16-rc on a pair of equal boxes, the noceras.
>>
>> Fix this by preventing any shrinking of the max CPUID policies, so
>> that previously built guest CPUID policies are compatible.
>>
>> Fixes: 540d911c28 ('x86/CPUID: shrink max_{,sub}leaf fields according to 
>> actual leaf contents')
>> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> ...
>> This is a regression from Xen 4.15, so should be considered for Xen
>> 4.16. The main risks would be to mess up with the CPUID policy in a
>> different way, that would also lead to brokenness. Strictly speaking
>> the change here removes the shrinking of max leaves and restores the
>> previous behavior, but it's obviously not completely risk free.
>>
>> It has proven to fix the regression seen on the noceras.
> Ouch.
>
> Questions from my RM hat:
>
> Is there a workaround ?

No.

The safety check being tripped is intended to prevent the VM crashing on
resume, and is functioning correctly.

> What proportion of machines do we think this might affect ?

Any pre-xsave machines (~2012 and older), and any newer machines booted
with no-xsave.

All AMD machines are actually broken by this, except that failure is
being masked by other changes in 4.16.  Future AMD machines will break
in the same way.

> Jan, Andy, do you have an opinion ?

The reversion doesn't go far enough.

While the shrinking of the max policies manifests as a concrete breakage
here, there is further breakage caused by shrinking the default
policies, because it renders some cpuid= settings in VM config files broken.

There is still no feedback or error checking from individual cpuid=
settings, so this will manifest as the VM admin settings silently no
longer taking effect.


I recommend a full and complete reversion of 540d911c28.  The
justification for it in the first place is especially weak because it is
explicitly contrary to how real hardware behaves, and this is the 3rd
ABI breakage it has caused, with more expected in the future based on
the analysis of what has gone wrong so far.

~Andrew



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.