
Re: [Xen-devel] Enabling vm_event for a guest with more VCPUs than available ring buffer slots freezes the virtual machine

On Tue, Feb 7, 2017 at 11:57 AM, Razvan Cojocaru <rcojocaru@xxxxxxxxxxxxxxx> wrote:
On 02/07/2017 08:39 PM, Andrew Cooper wrote:
> On 07/02/17 18:31, Razvan Cojocaru wrote:
>> On 02/07/2017 08:15 PM, Tamas K Lengyel wrote:
>>> On Tue, Feb 7, 2017 at 9:53 AM, Razvan Cojocaru
>>> <rcojocaru@xxxxxxxxxxxxxxx> wrote:
>>>     Hello,
>>>     Setting e.g. 16 VCPUs for an HVM guest ends up blocking the guest
>>>     completely when subscribing to vm_events, apparently because of this
>>>     code in xen/common/vm_event.c:
>>>     315     /* Give this vCPU a black eye if necessary, on the way out.
>>>     316      * See the comments above wake_blocked() for more information
>>>     317      * on how this mechanism works to avoid waiting. */
>>>     318     avail_req = vm_event_ring_available(ved);
>>>     319     if( current->domain == d && avail_req < d->max_vcpus )
>>>     320         vm_event_mark_and_pause(current, ved);
>>>     It would appear that even if the guest only has 2 online VCPUs, the
>>>     "avail_req < d->max_vcpus" condition will pause current, and we
>>>     eventually end up with all the VCPUs paused.
>>>     An ugly hack ("avail_req < 2") has allowed booting a guest with a large
>>>     max_vcpus (the guest only brings 2 VCPUs online), but that only proves
>>>     that this check is the culprit - a real solution needs a more in-depth
>>>     understanding of the issue. This is very old code (pre-2012 at least)
>>>     that has been moved around into its current place in Xen - please CC
>>>     anyone relevant to this discussion that you're aware of.
>>>     Thoughts?
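
(For reference, the workaround described above amounts to something like the
following against the check quoted earlier; this is only a sketch of the
hack, not a proposed fix:)

    /* Sketch of the described workaround: the original check compares
     * against d->max_vcpus; the hack hard-codes a small constant so that a
     * guest with only a few online vCPUs is not paused at boot. */
    avail_req = vm_event_ring_available(ved);
    if ( current->domain == d && avail_req < 2 /* was: d->max_vcpus */ )
        vm_event_mark_and_pause(current, ved);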
>>> I think this is a side-effect of the growth of the vm_event structure and
>>> the fact that we have a single-page ring. The check effectively sets a
>>> threshold of having enough space for each vCPU to place at least one
>>> more event on the ring, and if that's not the case the current vCPU gets
>>> paused. OTOH I think this should only affect asynchronous events, since
>>> for all other events the vCPU is already paused. Is that the case you have?
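
(To put rough numbers on this, the sketch below works out how many slots a
single 4 KB ring page can hold. The ring-header overhead and per-event size
used here are assumptions for illustration, not the real values from
xen/include/public/vm_event.h or io/ring.h:)

    #include <stdio.h>

    int main(void)
    {
        /* Assumed sizes, for illustration only; the real numbers come from
         * sizeof(vm_event_request_t) and the ring header laid out by
         * DEFINE_RING_TYPES() in xen/include/public/io/ring.h. */
        const unsigned int page_size   = 4096;
        const unsigned int ring_header = 64;   /* assumed bookkeeping overhead */
        const unsigned int slot_size   = 384;  /* assumed per-event union size */

        /* The standard ring macros round the usable slot count down to a
         * power of two. */
        unsigned int slots = (page_size - ring_header) / slot_size;
        while ( slots & (slots - 1) )
            slots &= slots - 1;

        printf("usable slots per ring page: %u\n", slots);

        /* With "avail_req < d->max_vcpus", a guest whose max_vcpus exceeds
         * this slot count marks the current vCPU blocked on every event. */
        for ( unsigned int vcpus = 2; vcpus <= 32; vcpus *= 2 )
            printf("max_vcpus = %2u -> %s\n", vcpus,
                   vcpus > slots ? "always below threshold" : "has headroom");

        return 0;
    }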
>> No, on the contrary, all my events are synchronous (the VCPU is paused
>> waiting for the vm_event reply).
>> I've debugged this a bit, and the problem seems to be that
>> vm_event_wake_blocked() breaks here:
>> 150     /* We remember which vcpu last woke up to avoid scanning always linearly
>> 151      * from zero and starving higher-numbered vcpus under high load */
>> 152     if ( d->vcpu )
>> 153     {
>> 154         int i, j, k;
>> 155
>> 156         for (i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++)
>> 157         {
>> 158             k = i % d->max_vcpus;
>> 159             v = d->vcpu[k];
>> 160             if ( !v )
>> 161                 continue;
>> 162
>> 163             if ( !(ved->blocked) || online >= avail_req )
>> 164                break;
>> 165
>> 166             if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
>> 167             {
>> 168                 vcpu_unpause(v);
>> 169                 online++;
>> 170                 ved->blocked--;
>> 171                 ved->last_vcpu_wake_up = k;
>> 172             }
>> 173         }
>> 174     }
>> at "if ( !(ved->blocked) || online >= avail_req )". At this point,
>> nothing ever gets unblocked. It's hard to believe that this is desired
>> behaviour, as I don't know what could possibly happen for that condition
>> to become false once all the online VCPUs are stuck (especially when the
>> guest has just started booting).

Ah, I see what happens. During boot vCPU 0 generates an event and gets marked blocked because the number of vCPUs is so high. The other vCPUs are still unblocked since they are idle, but this test will still be true (online >= avail_req), so we can never unblock vCPU 0, and the boot process hangs because vCPU 0 never resumes. I would argue that this test should be changed to check that there is at least one free slot on the ring and only break when that is no longer the case (i.e. instead of incrementing online we should be decrementing avail_req).
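
(A rough, untested sketch of that suggestion against the loop from
vm_event_wake_blocked() quoted above - the idea being to consume avail_req
as vCPUs are woken rather than counting woken vCPUs in "online":)

    for ( i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++ )
    {
        k = i % d->max_vcpus;
        v = d->vcpu[k];
        if ( !v )
            continue;

        /* Stop only when nothing is blocked any more or no slots are left. */
        if ( !(ved->blocked) || avail_req == 0 )
            break;

        if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
        {
            vcpu_unpause(v);
            avail_req--;            /* one slot reserved for the woken vCPU */
            ved->blocked--;
            ved->last_vcpu_wake_up = k;
        }
    }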
> I wouldn't bet that this logic has ever been tested.  If you recall, the
> addition of register state into the vm_event ring made each entry far
> larger, which in turn makes it more likely to hit this condition.
> However, simply fixing the logic to re-online the cpus isn't a good
> solution either, as having $N vcpus paused at any one time because of
> ring contention is not conducive to good system performance.
> Realistically, the ring size needs to be max_vcpus * sizeof(largest
> vm_event) at an absolute minimum, and I guess this is now beyond 1 page?

Yes, of course the reason this triggers earlier now is the growth of the
request's size - configuring e.g. 20 VCPUs for the guest will exceed the
number of slots that fit in a single ring page.

And yes, ideally we should have multi-page ring buffers - however that
is a long-term project that requires design changes in other parts of
Xen as well (Andrew, CCd here, was recently talking about one).

However, even with a one-page ring buffer, surely it's not good to end
up in this situation, especially for guests such as mine, which never
actually bring more than 2 VCPUs online. But even if they were to use
more, blocking the guest on vm_event init is completely pointless - we
might as well return some kind of error if max_vcpus > available slots.
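
(A minimal sketch of such an up-front check; where exactly it would live in
xen/common/vm_event.c and which error code to use are assumptions made here
for illustration:)

    /* Hypothetical check at vm_event ring setup time: refuse to enable the
     * ring if it can never hold one slot per configurable vCPU. */
    if ( d->max_vcpus > vm_event_ring_available(ved) )
    {
        gdprintk(XENLOG_WARNING,
                 "vm_event ring too small for %u vCPUs\n", d->max_vcpus);
        return -ENOSPC;
    }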

I don't follow the system performance argument. Surely completely
blocking the guest is worse.

I also don't see the point in marking a vCPU blocked if it is already paused. I think this behavior of blocking vCPUs only makes sense for asynchronous events. Razvan, could you test what happens if vm_event_mark_and_pause is only called if the vCPU is unpaused?
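
(An untested sketch of that experiment against the check quoted at the top of
the thread; using the vCPU's vm_event pause count as the "already paused"
test is an assumption:)

    /* Untested sketch: only mark-and-pause the current vCPU for ring
     * contention if it is not already paused for a synchronous event. */
    avail_req = vm_event_ring_available(ved);
    if ( current->domain == d && avail_req < d->max_vcpus &&
         !atomic_read(&current->vm_event_pause_count) )
        vm_event_mark_and_pause(current, ved);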
