
Re: [Xen-devel] DomU crash during migration when suspending source domain


  • To: "Graham, Simon" <Simon.Graham@xxxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: Keir Fraser <Keir.Fraser@xxxxxxxxxxxx>
  • Date: Wed, 14 Feb 2007 10:36:09 +0000
  • Delivery-date: Wed, 14 Feb 2007 02:35:30 -0800
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: AcdP6h4+HveIAzruQ3+gt7NQNapEGwANqzaeAADJUVA=
  • Thread-topic: [Xen-devel] DomU crash during migration when suspending source domain

Your theory that the cpu_down() is happening too early sounds plausible
except that cpu_up/cpu_down are both entirely protected by the hotplug lock.
See their definitions in kernel/cpu.c.

The notifier calls of interest are CPU_ONLINE and CPU_DEAD. These are the
events that the cacheinfo code cares about. You can see that both
notifications are broadcast under the cpu_hotplug_lock, so there should be
no race possible in which a CPU starts to be taken down before all
notification work associated with it coming online has completed.
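To make the argument concrete, here is a minimal userspace sketch of the serialization described above. It is not the kernel code: the pthread mutex stands in for the hotplug lock, and sketch_cpu_up(), sketch_cpu_down(), cacheinfo_notify(), cache_info[] and cpu_is_online[] are hypothetical names invented for illustration. It only shows that when both paths run their notifier work under one lock, the CPU_DEAD work for a CPU cannot begin until that CPU's CPU_ONLINE work has finished.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NR_CPUS 4

enum cpu_event { CPU_ONLINE, CPU_DEAD };

/* Hypothetical stand-in for the per-CPU cache info pointers. */
static int *cache_info[NR_CPUS];
static int cpu_is_online[NR_CPUS];

/* Hypothetical stand-in for the hotplug lock. */
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the cacheinfo notifier: allocate on ONLINE, free on DEAD. */
static void cacheinfo_notify(enum cpu_event ev, int cpu)
{
    if (ev == CPU_ONLINE) {
        cache_info[cpu] = calloc(1, sizeof(int)); /* may be NULL on failure */
    } else {
        free(cache_info[cpu]); /* free(NULL) is a no-op */
        cache_info[cpu] = NULL;
    }
}

/* Bring a CPU up; the notifier runs before the lock is released. */
int sketch_cpu_up(int cpu)
{
    int ret = -1;

    assert(cpu >= 0 && cpu < NR_CPUS);
    pthread_mutex_lock(&hotplug_lock);
    if (!cpu_is_online[cpu]) {
        cpu_is_online[cpu] = 1;
        cacheinfo_notify(CPU_ONLINE, cpu);
        ret = 0;
    }
    pthread_mutex_unlock(&hotplug_lock);
    return ret;
}

/* Take a CPU down; it cannot start while sketch_cpu_up() holds the lock. */
int sketch_cpu_down(int cpu)
{
    int ret = -1;

    assert(cpu >= 0 && cpu < NR_CPUS);
    pthread_mutex_lock(&hotplug_lock);
    if (cpu_is_online[cpu]) {
        cpu_is_online[cpu] = 0;
        cacheinfo_notify(CPU_DEAD, cpu);
        ret = 0;
    }
    pthread_mutex_unlock(&hotplug_lock);
    return ret;
}
```

In this model a down request for a CPU that never came online is rejected before any cleanup runs, so a NULL entry would only be seen by the cleanup path if something outside the lock were involved.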

 -- Keir

On 14/2/07 10:13, "Keir Fraser" <Keir.Fraser@xxxxxxxxxxxx> wrote:

> Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
> hotplug issue in Linux, but we did do lots of testing of that...
> 
>  -- Keir
> 
> On 14/2/07 03:42, "Graham, Simon" <Simon.Graham@xxxxxxxxxxx> wrote:
> 
>> Just ran into an odd DomU crash doing live migration of a 4-VCPU domain
>> (with 3.0.4, but the code looks the same in 2.6.18/unstable to me). The
>> actual panic is attached at the end of this, but the bottom line is that
>> the code in cache_remove_shared_cpu_map (in
>> arch/i386/kernel/cpu/intel_cacheinfo.c) is attempting to clean up the cache
>> info for a processor that does not yet have this info set up: the code is
>> dereferencing a pointer in the cpuid4_info[] array, and looking at the dump
>> I can see that this is NULL.
>> 
>> My working theory here is that we attempted the migration waaay early and
>> the array of cache info pointers had not yet been set up for all
>> processors. It would be relatively easy to protect against this by checking
>> for NULL, but I'm not sure whether this is the correct solution; if anyone
>> is familiar with this code and can comment on an appropriate fix I'd be
>> grateful.
>> 
>> One thing I'm really not sure about is the timing of marking the CPUs up
>> with respect to the trace re initializing CPUs (see console output below).
>> I can see that the four VCPUs are set up in the cpu_sys_devices array
>> (which is set up by the code that outputs the 'Initializing CPU#n' trace)
>> but the array of cache info structures only has an entry for VCPU 0:
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel



