
Re: x86/vmx: Don't spuriously crash the domain when INIT is received


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>
  • Date: Fri, 25 Feb 2022 17:11:45 +0000
  • Accept-language: en-GB, en-US
  • Cc: Roger Pau Monne <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Thiner Logoer <logoerthiner1@xxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Fri, 25 Feb 2022 17:12:27 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHYKben+W9NHlmgjUe5tEPatIDj2Kyj9FgAgAA+noCAAA5JgIAAQNgA
  • Thread-topic: x86/vmx: Don't spuriously crash the domain when INIT is received

On 25/02/2022 13:19, Jan Beulich wrote:
> On 25.02.2022 13:28, Andrew Cooper wrote:
>> On 25/02/2022 08:44, Jan Beulich wrote:
>>> On 24.02.2022 20:48, Andrew Cooper wrote:
>>>> In VMX operation, the handling of INIT IPIs is changed.  EXIT_REASON_INIT
>>>> has nothing to do with the guest in question; it simply signals that an
>>>> INIT was received.
>>>>
>>>> Ignoring the INIT is probably the wrong thing to do, but is helpful for
>>>> debugging.  Crashing the domain which happens to be in context is
>>>> definitely wrong.  Print an error message and continue.
>>>>
>>>> Discovered as collateral damage from an AP triple faulting on S3 resume
>>>> on Intel TigerLake platforms.
>>> I'm afraid I don't follow the scenario, which was (only) outlined in
>>> patch 1: Why would the BSP receive INIT in this case?
>> SHUTDOWN is a signal emitted by a core when it can't continue.  Triple
>> fault is one cause, but other sources include a double #MC, etc.
>>
>> Some external component, in the PCH I expect, needs to turn this into a
>> platform reset, because one malfunctioning core can't.  That is why a
>> triple fault on any logical processor brings the whole system down.
> I'm afraid this doesn't answer my question. Clearly the system didn't
> shut down.

Indeed, *because* Xen caught and ignored the INIT which was otherwise
supposed to do it.

>  Hence I still don't see why the BSP would see INIT in the
> first place.
>
>>> And it also cannot be that the INIT was received by the vCPU while
>>> running on another CPU:
>> It's nothing (really) to do with the vCPU.  INIT is an external signal to
>> the (real) APIC, just like NMI/etc.
>>
>> It is the next VMEntry on a CPU which received INIT that suffers a
>> VMEntry failure, and the VMEntry failure has nothing to do with the
>> contents of the VMCS.
>>
>> Importantly for Xen, however, this isn't applicable when scheduling PV
>> vCPUs, which is why dom0 wasn't the one that crashed.  This actually
>> meant that dom0 was alive and usable, albeit with all of its vCPUs
>> sharing a single CPU.
>>
>>
>> The change in INIT behaviour exists for TXT, where it is critical that
>> software can clear secrets from RAM before resetting.  I'm not wanting
>> to get into any of that because it's far more complicated than I have
>> time to fix.
> I guess there's something hidden behind what you say here, like INIT
> only being latched, but this latched state then causing the VM entry
> failure. Which would mean that really the INIT was a signal for the
> system to shut down / shutting down.

Yes.

> In which case arranging to
> continue by ignoring the event in VMX looks wrong. Simply crashing
> the guest would then be wrong as well, of course. We should shut
> down instead.

It is at software's discretion what to do when an INIT is caught, even if
the expectation is to honour it fairly promptly.
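
For clarity, the shape of what I'm proposing is roughly the below, inside
the exit-reason switch in vmx_vmexit_handler().  This is a sketch rather
than the literal hunk, and the message wording is illustrative:

    case EXIT_REASON_INIT:
        /*
         * The INIT is a latched, platform-wide signal and says nothing
         * about the VM which happens to be in context.  Log it and carry
         * on, rather than falling into the unexpected-exit path which
         * crashes the domain.
         */
        printk(XENLOG_ERR "Error: INIT received - ignoring\n");
        break;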

> But I don't think I see the full picture here yet, unless your
> mentioning of TXT was actually implying that TXT was active at the
> point of the crash (which I don't think was said anywhere).

This did cause confusion during debugging.  As far as we can tell, TXT
is not active, but the observed behaviour certainly looks like TXT is
active.

Then again, reset is a platform behaviour, not architectural.  Also,
it's my understanding that Intel does not support S3 on TigerLake
(opting to only support S0ix instead), so I'm guessing that "Linux S3"
as it's called in the menu is something retrofitted by the OEM.

But overall, the point isn't really about what triggered the INIT.  We
also shouldn't nuke an innocent VM if an INIT IPI slips through
interrupt remapping.

~Andrew

 

