Re: [Xen-devel] DomU crash during migration when suspending source domain



Are you migrating between unlike boxes? My guess is that the original box
has processors supporting the cacheinfo CPUID leaves and the target box does
not. Migrating to older, less-capable CPUs is definitely hit-and-miss, I'm
afraid. It really is best not to do it!

 -- Keir

On 14/2/07 10:36, "Keir Fraser" <Keir.Fraser@xxxxxxxxxxxx> wrote:

> Your theory that the cpu_down() is happening too early sounds plausible
> except that cpu_up/cpu_down are both entirely protected by the hotplug lock.
> See their definitions in kernel/cpu.c.
> 
> The notifier calls of interest are CPU_ONLINE and CPU_DEAD. These are the
> events that the cacheinfo code cares about. You can see that both
> notifications are broadcast under the cpu_hotplug_lock, so there should be
> no race possible in which a CPU starts to be taken down before all
> notification work associated with it coming online has completed.
> 
>  -- Keir
> 
> On 14/2/07 10:13, "Keir Fraser" <Keir.Fraser@xxxxxxxxxxxx> wrote:
> 
>> Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
>> hotplug issue in Linux, but we did do lots of testing of that...
>> 
>>  -- Keir
>> 
>> On 14/2/07 03:42, "Graham, Simon" <Simon.Graham@xxxxxxxxxxx> wrote:
>> 
>>> Just ran into an odd DomU crash doing live migration of a 4-VCPU domain
>>> (with 3.0.4, but the code looks the same in 2.6.18/unstable to me) -- the
>>> actual panic is attached at the end of this, but the bottom line is that
>>> the code in cache_remove_shared_cpu_map (in
>>> arch/i386/kernel/cpu/intel_cacheinfo.c) is attempting to clean up the
>>> cache info for a processor that does not yet have this info set up -- the
>>> code is dereferencing a pointer in the cpuid4_info[] array, and looking
>>> at the dump I can see that this pointer is NULL.
>>> 
>>> My working theory here is that we attempted the migration way too early,
>>> before the array of cache info pointers had been initialized for all
>>> processors; it would be relatively easy to protect against this by
>>> checking for NULL, but I'm not sure whether that is the correct solution
>>> -- if anyone is familiar with this code and can comment on an appropriate
>>> fix, I'd be grateful.
>>> 
>>> One thing I'm really not sure about is the timing of marking the CPUs up
>>> with respect to the 'Initializing CPU#n' trace (see console output below)
>>> -- I can see that all four VCPUs are set up in the cpu_sys_devices array
>>> (which is populated by the code that outputs the 'Initializing CPU#n'
>>> trace), but the array of cache info structures only has an entry for
>>> VCPU 0:
>> 
>> 
>> 
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-devel
> 
> 
> 


