
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load


  • To: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Tue, 20 Sep 2022 10:01:09 +0200
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, Tamas K Lengyel <tamas.lengyel@xxxxxxxxx>, Tamas K Lengyel <tamas.k.lengyel@xxxxxxxxx>
  • Delivery-date: Tue, 20 Sep 2022 08:01:20 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 20.09.2022 00:42, Boris Ostrovsky wrote:
> 
> 
> On 9/19/22 10:56 AM, Jan Beulich wrote:
>> On 19.09.2022 16:11, Tamas K Lengyel wrote:
>>> On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>
>>>> On 19.09.2022 15:24, Tamas K Lengyel wrote:
>>>>> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>>>>
>>>>>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>>>>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>>>>>
>>>>>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>>>>>
>>>>>>>>>>> The root cause of the bug appears to be the fact that the vPMU
>>>>>>>>>>> subsystem doesn't save its state on context_switch.
>>>>>>
>>>>>> For the further reply below - is this actually true? What is the
>>>>>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>>>>>> in context_switch()?
>>>>>>
>>>>>>>>>>> The vpmu_load function will attempt
>>>>>>>>>>> to gather the PMU state if it's still loaded two different ways:
>>>>>>>>>>>       1. if the current pcpu is not where the vcpu ran before,
>>>>>>>>>>>          doing a remote save
>>>>>>>>>>>       2. if the current pcpu had another vcpu active before,
>>>>>>>>>>>          doing a local save
>>>>>>>>>>>
>>>>>>>>>>> However, in case the prev vcpu is being rescheduled on another
>>>>>>>>>>> pcpu, its state has already changed and vcpu_runnable is returning
>>>>>>>>>>> true, thus #2 will trip the ASSERT. The only way to avoid this
>>>>>>>>>>> race condition is to make sure the prev vcpu is paused while being
>>>>>>>>>>> checked and its context saved. Once the prev vcpu is resumed and
>>>>>>>>>>> does #1 it will find its state already saved.
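>>>>>>>>>>>
>>>>>>>>>>> Schematically (illustration only, not part of the fix):
>>>>>>>>>>>
>>>>>>>>>>>   pCPU0: vpmu_load(v)
>>>>>>>>>>>            prev = per_cpu(last_vcpu, pcpu);  /* prev != v   */
>>>>>>>>>>>            vpmu_save_force(prev);            /* case #2     */
>>>>>>>>>>>              -> vmx_find_msr(prev)           /* ASSERT here */
>>>>>>>>>>>   pCPU1: has meanwhile woken prev, so vcpu_runnable(prev)
>>>>>>>>>>>          is already true and the ASSERT trips.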
>>>>>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>>>>>
>>>>>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>>>>>>>>>>>          vpmu = vcpu_vpmu(prev);
>>>>>>>>>>>
>>>>>>>>>>>          /* Someone ran here before us */
>>>>>>>>>>> +        vcpu_pause(prev);
>>>>>>>>>>>          vpmu_save_force(prev);
>>>>>>>>>>>          vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>>>>>> +        vcpu_unpause(prev);
>>>>>>>>>>>
>>>>>>>>>>>          vpmu = vcpu_vpmu(v);
>>>>>>>>>>>      }
>>>>>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the
>>>>>>>>>> vcpu to actually be de-scheduled. Even with IRQs on this is already
>>>>>>>>>> a relatively heavy operation (also including its impact on the
>>>>>>>>>> remote side). Additionally the function is called from
>>>>>>>>>> context_switch(), and I'm unsure of the usability of vcpu_pause()
>>>>>>>>>> on such a path. In
>>>>>>>>>> particular: Is there a risk of two CPUs doing this mutually to one
>>>>>>>>>> another? If so, is deadlocking excluded?
>>>>>>>>>>
>>>>>>>>>> Hence at the very least I think the description wants extending, to
>>>>>>>>>> discuss the safety of the change.
>>>>>>>>>>
>>>>>>>>>> Boris - any chance you could comment here? Iirc that's code you did
>>>>>>>>>> introduce.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does the assertion in vmx_find_msr() really need to be for a
>>>>>>>>> runnable vcpu, or can it be a check on whether the vcpu is actually
>>>>>>>>> running (e.g. RUNSTATE_running)?
>>>>>>>>
>>>>>>>> You cannot safely check for "running", as "runnable" may transition
>>>>>>>> to/from "running" behind your back.
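>>>>>>>>
>>>>>>>> To illustrate (do_save() being a made-up stand-in for the actual
>>>>>>>> saving logic):
>>>>>>>>
>>>>>>>>     /* Racy: nothing holds the runstate stable across the check. */
>>>>>>>>     if ( v->runstate.state != RUNSTATE_running )
>>>>>>>>         do_save(v);   /* v may have become "running" again here */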
>>>>>>>
>>>>>>> The more I look at this the more I think the only sensible solution is
>>>>>>> to have the vPMU state be saved on vmexit for all vCPUs.
>>>>>>
>>>>>> Do you really mean vmexit? It would suffice if state was reliably
>>>>>> saved during context-switch-out, wouldn't it? At that point the
>>>>>> vCPU can't be resumed on another pCPU, yet.
>>>>>>
>>>>>>> That way all
>>>>>>> this having to figure out where and when a context needs saving during
>>>>>>> scheduling goes away. Yes, it adds a bit of overhead for cases where
>>>>>>> the vCPU will resume on the same pCPU and the context save could
>>>>>>> have been skipped,
>>>>>>
>>>>>> If you really mean vmexit, then I'm inclined to say that's more
>>>>>> than just "a bit of overhead". I'd agree if you really meant
>>>>>> context-switch-out, but as said further up it looks to me as if
>>>>>> that was already occurring. Apparently I'm overlooking something
>>>>>> crucial ...
>>>>>
>>>>> Yes, the current setup is doing exactly that, saving the vPMU context
>>>>> on context-switch-out, and that's where the ASSERT failure occurs
>>>>> because the vCPU it needs to save the context for is already runnable
>>>>> on another pCPU.
>>>>
>>>> Well, if that's the scenario (sorry for not understanding it that
>>>> way earlier on), then the assertion is too strict: While in context
>>>> switch, the vCPU may be runnable, but certainly won't actually run
>>>> anywhere. Therefore I'd be inclined to suggest relaxing the
>>>> condition just enough to cover this case, without going as far as
>>>> checking for "running".
>>>>
>>>
>>> What ensures the vCPU won't actually run anywhere if it's in the runnable
>>> state?
>>
>> The fact that the vCPU is the subject of context_switch().
>>
>>> And how do I relax the condition just enough to cover just this case?
>>
>> That's the more difficult question. The immediate solution, passing a
>> boolean or flag to vpmu_switch_from(), may not be practical, as it
>> would likely mean passing this through many layers.
>>
>> Taking advantage of Jürgen sitting next to me, I've discussed this
>> with him, and he suggested simply checking for v == current. And
>> indeed - set_current() in context_switch() happens a few lines after
>> vpmu_switch_from().
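>>
>> I.e., as a sketch only (untested), the assertion could be relaxed to
>>
>>     ASSERT(v == current || !vcpu_runnable(v));
>>
>> covering the context-switch-out case without permitting "running".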
> 
> 
> 
> It is saving vpmu data from the current pcpu's MSRs for a remote vcpu,
> so @v in vmx_find_msr() is not @current:
> 
>       vpmu_load()
>           ...
>           prev = per_cpu(last_vcpu, pcpu);
>           vpmu_save_force(prev)
>               core2_vpmu_save()
>                   __core2_vpmu_save()
>                       vmx_read_guest_msr()
>                           vmx_find_msr()
> 
> 
> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder,
> though, whether this call is needed when the code path above is executed
> (i.e. when we are saving a remote vcpu).

How could it not be needed? We need to obtain the guest value. The
thing I don't understand is why this forced saving is necessary,
when context_switch() unconditionally calls vpmu_switch_from().
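
For reference, the ordering under discussion, as a heavily abridged
sketch of context_switch() (not verbatim; names as in the thread above):

    /* prev is still "current" here. */
    vpmu_switch_from(prev);      /* resolves to vpmu_save(prev) */
    ...
    set_current(next);           /* a few lines later */
    ...
    vpmu_switch_to(next);        /* resolves to vpmu_load(next) */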

Jan
