
Re: [Xen-devel] Enabling vm_event for a guest with more VCPUs than available ring buffer slots freezes the virtual machine



On 02/07/2017 08:39 PM, Andrew Cooper wrote:
> On 07/02/17 18:31, Razvan Cojocaru wrote:
>> On 02/07/2017 08:15 PM, Tamas K Lengyel wrote:
>>>
>>> On Tue, Feb 7, 2017 at 9:53 AM, Razvan Cojocaru
>>> <rcojocaru@xxxxxxxxxxxxxxx <mailto:rcojocaru@xxxxxxxxxxxxxxx>> wrote:
>>>
>>>     Hello,
>>>
>>>     Setting e.g. 16 VCPUs for an HVM guest ends up blocking the guest
>>>     completely when subscribing to vm_events, apparently because of this
>>>     code in xen/common/vm_event.c:
>>>
>>>     /* Give this vCPU a black eye if necessary, on the way out.
>>>      * See the comments above wake_blocked() for more information
>>>      * on how this mechanism works to avoid waiting. */
>>>     avail_req = vm_event_ring_available(ved);
>>>     if ( current->domain == d && avail_req < d->max_vcpus )
>>>         vm_event_mark_and_pause(current, ved);
>>>
>>>     It would appear that even if the guest only has 2 online VCPUs, the
>>>     "avail_req < d->max_vcpus" condition will pause current, and we
>>>     eventually end up with all the VCPUs paused.
>>>
>>>     An ugly hack ("avail_req < 2") has allowed booting a guest with a
>>>     large max_vcpus (the guest only brings 2 VCPUs online); however,
>>>     that was just to prove that this check was the culprit - a real
>>>     solution needs a more in-depth understanding of the issue. This is
>>>     basically very old code (pre-2012 at least) that got moved around
>>>     into the current shape of Xen today - please CC anyone you're
>>>     aware of who is relevant to the discussion.
>>>
>>>     Thoughts?
>>>
>>>
>>> I think this is a side-effect of the growth of the vm_event structure
>>> and the fact that we have a single-page ring. The check effectively
>>> requires enough space for each vCPU to place at least one more event
>>> on the ring, and if that's not the case the current vCPU gets paused.
>>> OTOH I think this would only have an effect on asynchronous events,
>>> since for all other events the vCPU is already paused. Is that the
>>> case you have?
>> No, on the contrary, all my events are synchronous (the VCPU is paused
>> waiting for the vm_event reply).
>>
>> I've debugged this a bit, and the problem seems to be that
>> vm_event_wake_blocked() breaks here:
>>
>> /* We remember which vcpu last woke up to avoid scanning always linearly
>>  * from zero and starving higher-numbered vcpus under high load */
>> if ( d->vcpu )
>> {
>>     int i, j, k;
>>
>>     for (i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++)
>>     {
>>         k = i % d->max_vcpus;
>>         v = d->vcpu[k];
>>         if ( !v )
>>             continue;
>>
>>         if ( !(ved->blocked) || online >= avail_req )
>>             break;
>>
>>         if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
>>         {
>>             vcpu_unpause(v);
>>             online++;
>>             ved->blocked--;
>>             ved->last_vcpu_wake_up = k;
>>         }
>>     }
>> }
>>
>> at "if ( !(ved->blocked) || online >= avail_req )". At this point,
>> nothing ever gets unblocked. It's hard to believe that this is desired
>> behaviour, as I don't know what could possibly happen for that condition
>> to become false once all the online VCPUs are stuck (especially when the
>> guest has just started booting).
> 
> I wouldn't bet that this logic has ever been tested.  If you recall, the
> addition of register state into the vm_event ring made each entry far
> larger, which in turn makes it more likely to hit this condition.
> 
> However, simply fixing the logic to re-online the cpus isn't a good
> solution either, as having $N vcpus paused at any one time because of
> ring contention is not conducive to good system performance.
> 
> Realistically, the ring size needs to be max_vcpus * sizeof(largest
> vm_event) at an absolute minimum, and I guess this is now beyond 1 page?

Yes, of course the reason this now triggers earlier is the growth of the
request's size. And yes, setting e.g. 20 VCPUs in the guest's config will
exceed a single page's worth of slots (rough numbers sketched below).
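
To make the arithmetic concrete, here is a rough back-of-the-envelope
sketch. It assumes a 4096-byte shared page, a request size of roughly 256
bytes (the exact sizeof(vm_event_request_t) depends on the interface
version and architecture) and a small allowance for ring bookkeeping, and
it mimics the power-of-two rounding that the standard ring macros apply -
so the numbers are illustrative, not exact:

/* Illustrative only: the sizes below are assumptions, not taken from
 * the Xen headers. */
#include <stdio.h>

#define PAGE_SIZE    4096u   /* one shared ring page                     */
#define RING_HEADER    64u   /* rough allowance for ring bookkeeping     */
#define REQ_SIZE      256u   /* assumed sizeof(vm_event_request_t)       */

/* Round down to a power of two, as the standard ring macros do. */
static unsigned int pow2_floor(unsigned int n)
{
    unsigned int p = 1;
    while ( p * 2 <= n )
        p *= 2;
    return p;
}

int main(void)
{
    unsigned int slots = pow2_floor((PAGE_SIZE - RING_HEADER) / REQ_SIZE);
    unsigned int max_vcpus;

    printf("ring slots per page: %u\n", slots);

    /* Once max_vcpus exceeds the total slot count, the
     * "avail_req < d->max_vcpus" check can never be satisfied, so every
     * vCPU that posts an event ends up marked and paused. */
    for ( max_vcpus = 2; max_vcpus <= 20; max_vcpus += 2 )
        printf("max_vcpus=%2u -> %s\n", max_vcpus,
               max_vcpus > slots ? "can never have a free slot per vCPU"
                                 : "ok while the ring stays drained");

    return 0;
}

With these assumed sizes the page holds on the order of 8 slots, which is
consistent with a 16-VCPU guest freezing as soon as vm_event is enabled.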

And yes, ideally we should have multi-page ring buffers - however that
is a long-term project that requires design changes in other parts of
Xen as well (Andrew, CCd here, was recently talking about one).

However, even with a one-page ring buffer, surely it's not good to end
up in this situation, especially for guests such as mine, which never
actually bring more than 2 VCPUs online. But even if they were to use
more, blocking the guest on vm_event init is completely pointless - we
might as well return some kind of error at subscription time if
max_vcpus > available slots - something along the lines of the sketch below.
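
Purely a sketch of what I mean: vm_event_ring_available() is the only real
function used here, and failing with -ENOSPC is my suggestion rather than
established Xen behaviour:

/* Sketch only: an early capacity check at vm_event enable time.
 * Rejecting the subscription up front (rather than marking vCPUs
 * blocked later) is the proposal, not current Xen behaviour. */
static int vm_event_check_capacity(struct domain *d,
                                   struct vm_event_domain *ved)
{
    unsigned int avail_req = vm_event_ring_available(ved);

    /* With fewer slots than vCPUs, the "one free slot per vCPU"
     * threshold in vm_event_put_request() can never be met, so the
     * guest would only end up fully paused.  Fail the enable early
     * instead. */
    if ( d->max_vcpus > avail_req )
        return -ENOSPC;

    return 0;
}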

I don't follow the system performance argument. Surely completely
blocking the guest is worse.


Thanks,
Razvan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

