Re: [Xen-devel] Enabling vm_event for a guest with more VCPUs than available ring buffer slots freezes the virtual machine
On 02/07/2017 08:39 PM, Andrew Cooper wrote:
> On 07/02/17 18:31, Razvan Cojocaru wrote:
>> On 02/07/2017 08:15 PM, Tamas K Lengyel wrote:
>>>
>>> On Tue, Feb 7, 2017 at 9:53 AM, Razvan Cojocaru
>>> <rcojocaru@xxxxxxxxxxxxxxx <mailto:rcojocaru@xxxxxxxxxxxxxxx>> wrote:
>>>
>>>     Hello,
>>>
>>>     Setting, e.g., 16 VCPUs for an HVM guest ends up blocking the guest
>>>     completely when subscribing to vm_events, apparently because of this
>>>     code in xen/common/vm_event.c:
>>>
>>>     315     /* Give this vCPU a black eye if necessary, on the way out.
>>>     316      * See the comments above wake_blocked() for more information
>>>     317      * on how this mechanism works to avoid waiting. */
>>>     318     avail_req = vm_event_ring_available(ved);
>>>     319     if( current->domain == d && avail_req < d->max_vcpus )
>>>     320         vm_event_mark_and_pause(current, ved);
>>>
>>>     It would appear that even if the guest only has 2 online VCPUs, the
>>>     "avail_req < d->max_vcpus" condition will pause current, and we
>>>     eventually end up with all the VCPUs paused.
>>>
>>>     An ugly hack ("avail_req < 2") has allowed booting a guest with many
>>>     VCPUs (max_vcpus; the guest only brings 2 VCPUs online), however that's
>>>     just to prove that that was the culprit - a real solution to this needs
>>>     a more in-depth understanding of the issue. That's basically very old
>>>     code (pre-2012 at least) that got moved around into the current shape
>>>     of Xen today - please CC anyone relevant to the discussion that you're
>>>     aware of.
>>>
>>>     Thoughts?
>>>
>>>
>>> I think this is a side-effect of the growth of the vm_event structure and
>>> the fact that we have a single-page ring. The check effectively sets a
>>> threshold of having enough space for each vCPU to place at least one
>>> more event on the ring, and if that's not the case it gets paused. OTOH
>>> I think this would only have an effect on asynchronous events; for all
>>> other events the vCPU is already paused. Is that the case you have?
>>
>> No, on the contrary, all my events are synchronous (the VCPU is paused
>> waiting for the vm_event reply).
>>
>> I've debugged this a bit, and the problem seems to be that
>> vm_event_wake_blocked() breaks here:
>>
>> 150     /* We remember which vcpu last woke up to avoid scanning always linearly
>> 151      * from zero and starving higher-numbered vcpus under high load */
>> 152     if ( d->vcpu )
>> 153     {
>> 154         int i, j, k;
>> 155
>> 156         for (i = ved->last_vcpu_wake_up + 1, j = 0; j < d->max_vcpus; i++, j++)
>> 157         {
>> 158             k = i % d->max_vcpus;
>> 159             v = d->vcpu[k];
>> 160             if ( !v )
>> 161                 continue;
>> 162
>> 163             if ( !(ved->blocked) || online >= avail_req )
>> 164                 break;
>> 165
>> 166             if ( test_and_clear_bit(ved->pause_flag, &v->pause_flags) )
>> 167             {
>> 168                 vcpu_unpause(v);
>> 169                 online++;
>> 170                 ved->blocked--;
>> 171                 ved->last_vcpu_wake_up = k;
>> 172             }
>> 173         }
>> 174     }
>>
>> at "if ( !(ved->blocked) || online >= avail_req )". At this point,
>> nothing ever gets unblocked. It's hard to believe that this is desired
>> behaviour, as I don't see what could possibly happen for that condition
>> to become false once all the online VCPUs are stuck (especially when the
>> guest has just started booting).
>
> I wouldn't bet that this logic has ever been tested. If you recall, the
> addition of register state into the vm_event ring made each entry far
> larger, which in turn makes it more likely to hit this condition.
>
> However, simply fixing the logic to re-online the cpus isn't a good
> solution either, as having $N vcpus paused at any one time because of
> ring contention is not conducive to good system performance.
>
> Realistically, the ring size needs to be max_cpus * sizeof(largest
> vm_event) at an absolute minimum, and I guess this is now beyond 1 page?

Yes, of course the reason this triggers earlier now is the growth of the
request's size. Yes, using e.g. 20 VCPUs in the guest's setup will exceed
the number of slots available in a one-page ring. And yes, ideally we
should have multi-page ring buffers - however that is a long-term project
that requires design changes in other parts of Xen as well (Andrew, CCd
here, was recently talking about one).

However, even with a one-page ring buffer, surely it's not good to end up
in this situation, especially for guests such as mine, which never
actually bring more than 2 VCPUs online. But even if they were to use
more, blocking the guest on vm_event init is completely pointless - we
might as well return some kind of error if max_vcpus > available slots
(a rough sketch of such a check follows below).

I don't follow the system performance argument. Surely completely
blocking the guest is worse.

Thanks,
Razvan
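
For illustration, a minimal standalone sketch (not Xen code) of the
capacity arithmetic Andrew describes. The page size, ring header overhead
and request size below are assumptions picked for the example, not the
real Xen figures; the point is only that once sizeof(vm_event_request_t)
grew, a single 4 KiB page no longer holds one slot per possible vCPU, so
the "avail_req < d->max_vcpus" check fires even on an empty ring, and the
"online >= avail_req" test in vm_event_wake_blocked() can then never
unpause everyone.

    #include <stdio.h>

    /* Assumed figures for illustration only. */
    #define RING_PAGE_SIZE   4096u   /* one-page ring                       */
    #define RING_HEADER_SIZE   64u   /* assumed ring bookkeeping overhead   */
    #define REQ_SIZE          384u   /* assumed sizeof(vm_event_request_t)  */

    int main(void)
    {
        /* Free slots on a completely empty ring. */
        unsigned int slots = (RING_PAGE_SIZE - RING_HEADER_SIZE) / REQ_SIZE;
        unsigned int max_vcpus;

        for ( max_vcpus = 2; max_vcpus <= 32; max_vcpus *= 2 )
            printf("max_vcpus=%2u, empty-ring avail_req=%2u -> %s\n",
                   max_vcpus, slots,
                   slots < max_vcpus ? "every event marks-and-pauses the vCPU"
                                     : "no pausing on an empty ring");
        return 0;
    }

With these assumed numbers an empty ring has 10 free slots, so any guest
configured with more than 10 possible vCPUs trips the check on every
event delivery, which matches the 16-VCPU freeze reported at the top of
the thread.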
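
And a hedged sketch of the "return an error instead of wedging the guest"
idea from the last paragraph. The types and helpers here (struct
fake_domain, ring_capacity(), vm_event_enable_checked()) are simplified
stand-ins invented for the example, not existing Xen interfaces; it only
shows the shape of the check: refuse to enable vm_event when the single
ring page cannot hold at least one outstanding request per possible vCPU.

    #include <errno.h>
    #include <stdio.h>

    /* Simplified stand-in; the real code would use struct domain and the
     * actual ring bookkeeping in xen/common/vm_event.c. */
    struct fake_domain {
        unsigned int max_vcpus;
    };

    /* Assumed number of request slots a single ring page provides. */
    static unsigned int ring_capacity(void)
    {
        return 10;
    }

    /* Refuse to enable vm_event when the ring cannot hold one outstanding
     * request per possible vCPU, instead of silently pausing them all. */
    static int vm_event_enable_checked(const struct fake_domain *d)
    {
        if ( ring_capacity() < d->max_vcpus )
            return -ENOSPC;

        return 0; /* the real path would go on to map and set up the ring */
    }

    int main(void)
    {
        struct fake_domain d = { .max_vcpus = 16 }; /* assumed guest config */

        printf("vm_event enable for a %u-vCPU guest: %d\n",
               d.max_vcpus, vm_event_enable_checked(&d));
        return 0;
    }

This merely surfaces the limitation to the toolstack at subscription time
rather than leaving the guest blocked at boot.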