
Re: [Xen-devel] Enabling vm_event for a guest with more VCPUs than available ring buffer slots freezes the virtual machine





On Tue, Feb 7, 2017 at 1:41 PM, Razvan Cojocaru <rcojocaru@xxxxxxxxxxxxxxx> wrote:
On 02/07/2017 10:20 PM, Tamas K Lengyel wrote:
>
>
> On Tue, Feb 7, 2017 at 11:57 AM, Razvan Cojocaru
> <rcojocaru@xxxxxxxxxxxxxxx> wrote:
>
>     On 02/07/2017 08:39 PM, Andrew Cooper wrote:
>     > On 07/02/17 18:31, Razvan Cojocaru wrote:
>     >> On 02/07/2017 08:15 PM, Tamas K Lengyel wrote:
>     >>>
>     >>> On Tue, Feb 7, 2017 at 9:53 AM, Razvan Cojocaru
>     >>> <rcojocaru@xxxxxxxxxxxxxxx> wrote:
>     >>>
>     >>>     Hello,
>     >>>
>     >>>     Setting, e.g. 16 VCPUs for a HVM guest, ends up blocking the
>     guest
>     >>>     completely when subscribing to vm_events, apparently because
>     of this
>     >>>     code in xen/common/vm_event.c:
>     >>>
>     >>>     315     /* Give this vCPU a black eye if necessary, on the
>     way out.
>     >>>     316      * See the comments above wake_blocked() for more
>     information
>     >>>     317      * on how this mechanism works to avoid waiting. */
>     >>>     318     avail_req = vm_event_ring_available(ved);
>     >>>     319     if( current->domain == d && avail_req < d->max_vcpus )
>     >>>     320         vm_event_mark_and_pause(current, ved);
>     >>>
>     >>>     It would appear that even if the guest only has 2 online
>     VCPUs, the
>     >>>     "avail_req < d->max_vcpus" condition will pause current, and we
>     >>>     eventually end up with all the VCPUs paused.
>     >>>
>     >>>     An ugly hack ("avail_req < 2") has allowed booting a guest
>     with many
>     >>>     VCPUs (max_vcpus, the guest only brings 2 VCPUs online),
>     however that's
>     >>>     just to prove that that was the culprit - a real solution to
>     this needs
>     >>>     more in-depth understanding of the issue and potential
>     solution. That's
>     >>>     basically very old code (pre-2012 at least) that got moved
>     around into
>     >>>     the current shape of Xen today - please CC anyone relevant
>     to the
>     >>>     discussion that you're aware of.
>     >>>
>     >>>     Thoughts?
>     >>>
>     >>>
>     >>> I think this is a side-effect of the growth of the vm_event structure
>     and the
>     >>> fact that we have a single page ring. The check effectively sets a
>     >>> threshold of having enough space for each vCPU to place at least one
>     >>> more event on the ring, and if that's not the case it gets
>     paused. OTOH
>     >>> I think this would only have an effect on asynchronous events,
>     for all
>     >>> other events the vCPU is already paused. Is that the case you have?
>     >> No, on the contrary, all my events are synchronous (the VCPU is
>     paused
>     >> waiting for the vm_event reply).
>     >>
>     >> I've debugged this a bit, and the problem seems to be that
>     >> vm_event_wake_blocked() breaks here:
>     >>
>     >> 150     /* We remember which vcpu last woke up to avoid scanning
>     always
>     >> linearly
>     >> 151      * from zero and starving higher-numbered vcpus under
>     high load */
>     >> 152     if ( d->vcpu )
>     >> 153     {
>     >> 154         int i, j, k;
>     >> 155
>     >> 156         for (i = ved->last_vcpu_wake_up + 1, j = 0; j <
>     >> d->max_vcpus; i++, j++)
>     >> 157         {
>     >> 158             k = i % d->max_vcpus;
>     >> 159             v = d->vcpu[k];
>     >> 160             if ( !v )
>     >> 161                 continue;
>     >> 162
>     >> 163             if ( !(ved->blocked) || online >= avail_req )
>     >> 164                break;
>     >> 165
>     >> 166             if ( test_and_clear_bit(ved->pause_flag,
>     &v->pause_flags) )
>     >> 167             {
>     >> 168                 vcpu_unpause(v);
>     >> 169                 online++;
>     >> 170                 ved->blocked--;
>     >> 171                 ved->last_vcpu_wake_up = k;
>     >> 172             }
>     >> 173         }
>     >> 174     }
>     >>
>     >> at "if ( !(ved->blocked) || online >= avail_req )". At this point,
>     >> nothing ever gets unblocked. It's hard to believe that this is
>     desired
>     >> behaviour, as I don't know what could possibly happen for that
>     condition
>     >> to become false once all the online VCPUs are stuck (especially
>     when the
>     >> guest has just started booting).
>
>
> Ah I see what happens. During boot vCPU 0 generates an event and gets
> marked blocked because the number of vCPUs is so high. The other vCPUs
> are still unblocked since they are idle, but this test here will still
> be true (online >= avail_req) and thus we can never unblock vCPU 0. And
> then the boot process hangs because vCPU 0 never resumes. I would
> argue that this test should be changed to check that there is at least 1
> spot on the ring and only break if that is not the case anymore (i.e.
> instead of incrementing online we should be decrementing avail_req).

That is exactly what happens. And it can't really be fixed just by
increasing the ring buffer (although that definitely helps a lot and
would be a smart move): no matter how large it is, we can always ask the
domain to use more VCPUs than there are slots in the buffer.

No, not increasing the ring buffer but fixing the check to unblock when there is at least 1 spot on the ring. So look at this comment...

/*
 * vm_event_wake_blocked() will wakeup vcpus waiting for room in the
 * ring. These vCPUs were paused on their way out after placing an event,
 * but need to be resumed where the ring is capable of processing at least
 * one event from them.
 */

... it seems to me that the check "online >= avail_req" was just wrong from the start.
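
For illustration, the wake-up loop with that change would look roughly like this, consuming a ring slot per woken vCPU instead of comparing against the number of already-online vCPUs (just a sketch on top of the Xen 4.7 code quoted above, not a tested patch):

    for ( i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++ )
    {
        k = i % d->max_vcpus;
        v = d->vcpu[k];
        if ( !v )
            continue;

        /* Stop once nobody is blocked anymore or no ring slot is left. */
        if ( !(ved->blocked) || avail_req == 0 )
            break;

        if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
        {
            vcpu_unpause(v);
            avail_req--;       /* one slot is now reserved for this vCPU */
            ved->blocked--;
            ved->last_vcpu_wake_up = k;
        }
    }

(The online counter is no longer needed for the break condition in this variant.)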
 

>     >
>     > I wouldn't bet that this logic has ever been tested.  If you
>     recall, the
>     > addition of register state into the vm_event ring made each entry far
>     > larger, which in turn makes it more likely to hit this condition.
>     >
>     > However, simply fixing the logic to re-online the cpus isn't a good
>     > solution either, as having $N vcpus paused at any one time because of
>     > ring contention is not conducive to good system performance.
>     >
>     > Realistically, the ring size needs to be max_cpus * sizeof(largest
>     > vm_event) at an absolute minimum, and I guess this is now beyond 1
>     page?
>
>     Yes, of course the reason this triggers earlier now is the growth of the
>     request's size. Yes, using e.g. 20 VCPUs in the guest's setup will
>     exceed a page's number of slots.
>
>     And yes, ideally we should have multi-page ring buffers - however that
>     is a long-term project that requires design changes in other parts of
>     Xen as well (Andrew, CCd here, was recently talking about one).
>
>     However, even with a one-page ring buffer, surely it's not good to end
>     up in this situation, especially for guests such as mine, which never
>     actually bring more than 2 VCPUs online. But even if they were to use
>     more, blocking the guest on vm_event init is completely pointless - we
>     might as well return some kind of error if max_vcpus > available slots.
>
>     I don't follow the system performance argument. Surely completely
>     blocking the guest is worse.
>
>
> I also don't see the point in marking a vCPU blocked if it is already
> paused. I think this behavior of blocking vCPUs only makes sense for
> asynchronous events. Razvan, could you test what happens if
> vm_event_mark_and_pause is only called if the vCPU is unpaused?

It works for me with this change (using Xen 4.7 sources here):

@@ -318,7 +329,11 @@ void vm_event_put_request(struct domain *d,
      * on how this mechanism works to avoid waiting. */
     avail_req = vm_event_ring_available(ved);
     if( current->domain == d && avail_req < d->max_vcpus )
-        vm_event_mark_and_pause(current, ved);
+    {
+        if ( !atomic_read( &current->vm_event_pause_count ) )
+            vm_event_mark_and_pause(current, ved);
+    }


Good, that's what I expected. There really is no point in blocking a vCPU that is already paused. But I would just add that check to the if above as another && clause.
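
Something along these lines, that is (again only a sketch against the Xen 4.7 snippet above):

    avail_req = vm_event_ring_available(ved);
    if ( current->domain == d && avail_req < d->max_vcpus &&
         !atomic_read(&current->vm_event_pause_count) )
        vm_event_mark_and_pause(current, ved);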

Tamas

 

