Re: [Xen-devel] [PATCH] Avoid race when moving cpu between cpupools
Juergen Gross wrote:
> On 02/25/11 15:25, Andre Przywara wrote:
>> George Dunlap wrote:
>>> Looks good -- thanks Juergen.
>>>
>>> Acked-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
>>>  -George
>>>
>>> On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara <andre.przywara@xxxxxxx> wrote:
>>>> Juergen Gross wrote:
>>>>> Moving cpus between cpupools is done under the schedule lock of the
>>>>> moved cpu. When checking whether a cpu is a member of a cpupool, this
>>>>> must be done with the lock of that cpu held.
>>>> I have reviewed and tested the patch. It fixes my problem. My script
>>>> has been running for several hundred iterations without any Xen crash,
>>>> whereas without the patch the hypervisor crashed mostly at the second
>>>> iteration.
>> Juergen, can you rule out that this code will be triggered on two CPUs
>> trying to switch to each other? As Stephan pointed out, the code looks
>> as if this could trigger a possible dead-lock condition, where:
>> 1) CPU A grabs lock (a) while CPU B grabs lock (b)
>> 2) CPU A tries to grab (b) and CPU B tries to grab (a)
>> 3) both fail and loop to 1)
> Good point. Not quite a dead-lock, but a possible live-lock :-)

Yeah, sorry. That was the wrong wording. Kudos go to Stephan for pointing
this out.

>> A possible fix would be to introduce some ordering for the locks (just
>> the pointer address) and let the "bigger" pointer yield to the "smaller"
>> one.
> Done this and sent a patch.

Thanks, it looks good at first glance. Not yet tested, though. (A sketch of
such pointer-ordered locking follows after this message.)

>> I am not sure if this is really necessary, but I now see strange hangs
>> after running the script for a while (30min to 1hr). Sometimes Dom0
>> hangs for a while, losing interrupts (sda or eth0) or getting spurious
>> ones; on two occasions the machine totally locked up. I am not 100% sure
>> whether this is CPUpools related, but I put some load on Dom0 (without
>> messing with CPUpools) for the whole night and it ran fine.
> Did you try to do this with all Dom0-vcpus pinned to 6 physical cpus? I
> had the same problems when using only a few physical cpus for many vcpus.
> And I'm pretty sure this was NOT the possible live-lock, as it happened
> already without this change when I tried to reproduce your problem.

That is my current theory, too. I inserted counters in the try-loop for the
locks to detect possible lock-ups, but they didn't go over 99, so this is
not the reason. The high overcommit (48 vCPUs on 6 pCPUs) is probably
responsible. The new reduction of Dom0 vCPUs should avoid this situation in
the future. Sorry for this :-(

I will try to further isolate this. Anyway, it works much better with the
fix than without, and I will try to trigger this with the "reduce number of
Dom0 vCPUs" patch.

Unfortunately I got a Dom0 crash with the new patch. Reverting 22934 worked
fine. I will investigate this now.

root@dosorca:/data/images# xl cpupool-numa-split
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
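[Editor's sketch] The pointer-ordering idea discussed in the thread can be
illustrated with a short C sketch. This is not Juergen's actual patch: the
lock type, the function names (lock_pair_ordered, unlock_pair_ordered) and
the use of pthread mutexes as stand-ins for Xen's per-cpu schedule
spinlocks are all assumptions made for illustration only. It merely shows
how acquiring two locks in a global order derived from their addresses
removes the "A holds (a) and spins on (b) while B holds (b) and spins on
(a)" cycle.

/* Minimal sketch, assuming pthread mutexes as stand-ins for Xen's per-cpu
 * schedule spinlocks; all names are illustrative, not the real Xen code. */
#include <stdint.h>
#include <pthread.h>

typedef pthread_mutex_t sched_lock_t;

/* Acquire two per-cpu schedule locks in a global order (lower address
 * first).  Two CPUs moving towards each other then agree on which lock is
 * taken first, so neither can hold its own lock while waiting forever for
 * the other one. */
static void lock_pair_ordered(sched_lock_t *a, sched_lock_t *b)
{
    sched_lock_t *first  = ((uintptr_t)a <= (uintptr_t)b) ? a : b;
    sched_lock_t *second = ((uintptr_t)a <= (uintptr_t)b) ? b : a;

    pthread_mutex_lock(first);
    if (second != first)              /* both cpus may share one lock */
        pthread_mutex_lock(second);
}

static void unlock_pair_ordered(sched_lock_t *a, sched_lock_t *b)
{
    sched_lock_t *first  = ((uintptr_t)a <= (uintptr_t)b) ? a : b;
    sched_lock_t *second = ((uintptr_t)a <= (uintptr_t)b) ? b : a;

    if (second != first)
        pthread_mutex_unlock(second); /* release in reverse order */
    pthread_mutex_unlock(first);
}

int main(void)
{
    sched_lock_t l1 = PTHREAD_MUTEX_INITIALIZER;
    sched_lock_t l2 = PTHREAD_MUTEX_INITIALIZER;

    /* Either argument order acquires the locks in the same global order. */
    lock_pair_ordered(&l1, &l2);
    unlock_pair_ordered(&l1, &l2);
    lock_pair_ordered(&l2, &l1);
    unlock_pair_ordered(&l2, &l1);
    return 0;
}

Ordering by address gives every CPU the same total order over the locks, so
no cycle of waiters can form. The same effect can also be reached with a
trylock loop in which the holder of the higher-addressed lock backs off and
retries, i.e. the "bigger" pointer yielding to the "smaller" one as
suggested in the thread.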