
Re: x86/vmx: Don't spuriously crash the domain when INIT is received


  • To: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Fri, 25 Feb 2022 14:19:39 +0100
  • Cc: Roger Pau Monne <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Thiner Logoer <logoerthiner1@xxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Fri, 25 Feb 2022 13:19:53 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 25.02.2022 13:28, Andrew Cooper wrote:
> On 25/02/2022 08:44, Jan Beulich wrote:
>> On 24.02.2022 20:48, Andrew Cooper wrote:
>>> In VMX operation, the handling of INIT IPIs is changed.  EXIT_REASON_INIT has
>>> nothing to do with the guest in question, simply signals that an INIT was
>>> received.
>>>
>>> Ignoring the INIT is probably the wrong thing to do, but is helpful for
>>> debugging.  Crashing the domain which happens to be in context is definitely
>>> wrong.  Print an error message and continue.
>>>
>>> Discovered as collateral damage from when an AP triple faults on S3 resume on
>>> Intel TigerLake platforms.
>> I'm afraid I don't follow the scenario, which was (only) outlined in
>> patch 1: Why would the BSP receive INIT in this case?
> 
> SHUTDOWN is a signal emitted by a core when it can't continue.  Triple
> fault is one cause, but other sources include a double #MC, etc.
> 
> Some external component, in the PCH I expect, needs to turn this into a
> platform reset, because one malfunctioning core can't.  It is why a
> triple fault on any logical processor brings the whole system down.

I'm afraid this doesn't answer my question. Clearly the system didn't
shut down. Hence I still don't see why the BSP would see INIT in the
first place.

>> And it also cannot be that the INIT was received by the vCPU while running on
>> another CPU:
> 
> It's nothing (really) to do with the vCPU.  INIT is an external signal to
> the (real) APIC, just like NMI/etc.
> 
> It is the next VMEntry on a CPU which received INIT that suffers a
> VMEntry failure, and the VMEntry failure has nothing to do with the
> contents of the VMCS.
> 
> Importantly for Xen however, this isn't applicable for scheduling PV
> vCPUs, which is why dom0 wasn't the one that crashed.  This actually
> meant that dom0 was alive and usable, albeit sharing all vCPUs on a
> single CPU.
> 
> 
> The change in INIT behaviour exists for TXT, where it is critical that
> software can clear secrets from RAM before resetting.  I'm not wanting
> to get into any of that because it's far more complicated than I have
> time to fix.

I guess there's something hidden behind what you say here, like INIT
only being latched, but this latched state then causing the VM entry
failure. Which would mean that really the INIT was a signal for the
system to shut down / shutting down. In which case arranging to
continue by ignoring the event in VMX looks wrong. Simply crashing
the guest would then be wrong as well, of course. We should shut
down instead.
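
For illustration only, the three possible reactions to an INIT-induced
VM exit discussed above can be laid out side by side. This is a minimal
sketch, not Xen code: every identifier in it (init_policy, init_outcome,
apply_init_policy) is hypothetical, and it only models the consequences
as they are described in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical; none are Xen identifiers. */

/* The three reactions to an INIT-induced VM exit discussed in the thread. */
enum init_policy {
    INIT_CRASH_DOMAIN,      /* behaviour the patch removes */
    INIT_LOG_AND_CONTINUE,  /* behaviour the patch introduces */
    INIT_SHUTDOWN_HOST,     /* alternative suggested in this reply */
};

/* Consequences of applying one policy, as described in the thread. */
struct init_outcome {
    bool guest_survives;   /* the domain in context keeps running */
    bool host_survives;    /* the hypervisor keeps running */
    bool init_acted_upon;  /* the INIT is treated as a real request */
};

static struct init_outcome apply_init_policy(enum init_policy policy)
{
    struct init_outcome out = { false, false, false };

    switch (policy) {
    case INIT_CRASH_DOMAIN:
        /* Blames whichever guest happened to be in context. */
        out.host_survives = true;
        break;
    case INIT_LOG_AND_CONTINUE:
        /* Prints an error and re-enters the guest; the INIT is dropped. */
        out.guest_survives = true;
        out.host_survives = true;
        break;
    case INIT_SHUTDOWN_HOST:
        /* Treats the INIT as a platform-level request to go down. */
        out.init_acted_upon = true;
        break;
    default:
        assert(!"unknown policy");
    }
    return out;
}
```

Only the third policy treats the INIT as a real request, and that is the
distinction the paragraph above draws between ignoring the event and
shutting down.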

But I don't think I see the full picture here yet, unless your
mentioning of TXT was actually implying that TXT was active at the
point of the crash (which I don't think was said anywhere).

Jan




 

