
Re: [Xen-API] [ACS4.4, XenServer] Problem starting system VMs



Hi,

I think I’ve tracked this down. I believe it’s a bug in XenServer’s event
mechanism: some shared state causes parallel calls to event.from to
interfere with each other. From CloudStack’s point of view this manifests as:

* spurious SESSION_INVALID exceptions in waitForTask, which triggers cleanup 
(Task.destroy), which prevents the VM.start from completing, leaving the VM 
paused
* empty lists of events being returned in non-timeout cases
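
To illustrate the client-side failure mode in the two points above, here is a minimal Java sketch of a wait loop that tolerates both symptoms. It is not the xapi fix in the pull request and not CloudStack's actual waitForTask: TaskWaiter, EventSource, SessionInvalidException, poll and relogin are all hypothetical stand-ins invented for this example, and no real XenAPI binding calls are used. The assumption is that a caller would rather re-authenticate and keep polling than destroy a task that is still running.

```java
import java.util.List;

/** Sketch only: every type here is a hypothetical stand-in, not the XenAPI bindings. */
final class TaskWaiter {

    /** Hypothetical stand-in for a SESSION_INVALID failure. */
    public static final class SessionInvalidException extends Exception {
        public SessionInvalidException() {
            super("SESSION_INVALID");
        }
    }

    /** Hypothetical event source: returns the task states seen since the last call. */
    public interface EventSource {
        List<String> poll() throws SessionInvalidException;

        void relogin();
    }

    /**
     * Polls until the task reports "success" or "failure".
     * Returns that state, "session-lost" if re-login was exhausted,
     * or "timeout" after maxPolls attempts. It never cancels or
     * destroys the task itself; that decision is left to the caller.
     */
    public static String waitForTask(EventSource source, int maxPolls, int maxRelogins) {
        int relogins = 0;
        for (int i = 0; i < maxPolls; i++) {
            List<String> states;
            try {
                states = source.poll();
            } catch (SessionInvalidException e) {
                // A (possibly spurious) invalid session: re-authenticate and retry
                // instead of running cleanup against a task that may still be live.
                if (relogins++ >= maxRelogins) {
                    return "session-lost";
                }
                source.relogin();
                continue;
            }
            // An empty batch means "nothing new yet", not "task finished".
            for (String state : states) {
                if (state.equals("success") || state.equals("failure")) {
                    return state;
                }
            }
        }
        return "timeout";
    }
}
```

The point of the design is only that neither an empty event list nor a single invalid-session error is treated as a reason to clean up the task; whether that is appropriate for a real deployment depends on the server-side fix.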

I’ve prototyped a fix together with a test case (which fails before and passes 
after) and made a pull request containing both:

https://github.com/xapi-project/xen-api/pull/1719

I’d appreciate review from xapi experts, particularly Jon Ludlam (cc:d). I’ve 
also cc:d the main xapi development list.

Cheers,
Dave

On 29 Apr 2014, at 05:15, Mike Tutkowski <mike.tutkowski@xxxxxxxxxxxxx> wrote:

> Actually, the only issue I'm noticing now is the SSVM being automatically
> paused shortly after being created (while creating a new cloud).
> 
> If I go to XenCenter and forcefully shut the VM down, CloudStack restarts
> it OK.
> 
> 
> On Mon, Apr 28, 2014 at 7:34 PM, Mike Tutkowski <
> mike.tutkowski@xxxxxxxxxxxxx> wrote:
> 
>> Figured I'd CC Anthony and Edison to see if they have any input on this
>> (it looks like most of the changes on the relevant file
>> (Xenserver625StorageProcessor.java) were performed by one or the other).
>> 
>> 
>> On Mon, Apr 28, 2014 at 12:40 PM, Mike Tutkowski <
>> mike.tutkowski@xxxxxxxxxxxxx> wrote:
>> 
>>> Thanks for the reply, guys.
>>> 
>>> Just wanted to point out that this is on 4.4 for me (although the issue
>>> may also be present on master).
>>> 
>>> I have a sufficient number of IP addresses for both system and user VMs,
>>> so that should be OK (but good thought, Punith).
>>> 
>>> I plan to continue debugging this later this afternoon, but have been in
>>> meetings all morning.
>>> 
>>> Thanks!
>>> 
>>> 
>>> On Mon, Apr 28, 2014 at 10:41 AM, Dave Scott <Dave.Scott@xxxxxxxxxx> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> (sorry to reply to my own email!)
>>>> 
>>>> On 28 Apr 2014, at 11:42, Dave Scott <Dave.Scott@xxxxxxxxxx> wrote:
>>>> 
>>>>> 
>>>>> Hi Mike,
>>>>> 
>>>>> On 28 Apr 2014, at 04:44, Mike Tutkowski <mike.tutkowski@xxxxxxxxxxxxx>
>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I recently installed 6.2 with XS62ESP1 and XS62ESP1004 (so that
>>>>>> Xenserver625StorageProcessor would be utilized).
>>>>>> 
>>>>>> When I create a cloud from scratch, my SSVM starts up fine, but CPVM
>>>> ends
>>>>>> up in the Paused state. I have to force a shutdown of that VM and then
>>>>>> CloudStack restarts it and it works. This consistently happens. The
>>>> system
>>>>>> VMs are being deployed to the local storage of the one XS host I have
>>>> in my
>>>>>> one and only cluster.
>>>>>> 
>>>>>> Any thoughts on that?
>>>>> 
>>>>> I'm seeing the same symptom on my test cloud with 6.2 and XS62ESP1004.
>>>> I think there's a problem with XenAPI session and task handling in the
>>>> cloudstack master branch, although I've not tracked it down yet. In my
>>>> management server log I see:
>>>>> 
>>>>> WARN  [c.c.h.x.r.CitrixResourceBase] (DirectAgent-5:ctx-47dccee1) Unable to start VM(v-2-VM) on host(1c4a31e9-469e-45c3-a0ad-9792ac7b20f6) due to You gave an invalid session reference.  It may have been invalidated by a server restart, or timed out.  You should get a new session handle, using one of the session.login_ calls.  This error does not invalidate the current connection.  The handle parameter echoes the bad value given.
>>>>> You gave an invalid session reference.  It may have been invalidated by a server restart, or timed out.  You should get a new session handle, using one of the session.login_ calls.  This error does not invalidate the current connection.  The handle parameter echoes the bad value given.
>>>>>       at com.xensource.xenapi.Types.checkResponse(Types.java:218)
>>>>>       at com.xensource.xenapi.Connection.dispatch(Connection.java:395)
>>>>>       at com.cloud.hypervisor.xen.resource.XenServerConnectionPool$XenServerConnection.dispatch(XenServerConnectionPool.java:463)
>>>>>       at com.xensource.xenapi.Event.from(Event.java:270)
>>>>>       at org.apache.cloudstack.hypervisor.xenserver.XenServerResourceNewBase.waitForTask(XenServerResourceNewBase.java:113)
>>>>>       at com.cloud.hypervisor.xen.resource.CitrixResourceBase.startVM(CitrixResourceBase.java:3455)
>>>>> 
>>>>> Somehow the XenAPI session used by the Event.from call in
>>>>> XenServerResourceNewBase.waitForTask (used for recent 6.2 XenServers only)
>>>>> is being logged out somewhere. When this happens, the CloudStack cleanup
>>>>> code calls Task.cancel and Task.destroy, and then the XenServer
>>>>> Async.VM.start fails trying to update Task.progress before it internally
>>>>> calls VM.unpause.
>>>>> 
>>>>> I made a hack to disable caching of Connection/sessions:
>>>>> 
>>>>> 
>>>> https://github.com/djs55/cloudstack/commit/a388b71279086e42710e26340df0632d0d8135e4
>>>> 
>>>> For reference / experimentation, I've made a slightly more plausible
>>>> patch:
>>>> 
>>>> 
>>>> https://github.com/djs55/cloudstack/commit/9d40f56c6384d04a5f0fb22e5b97530c0164e0b2
>>>> 
>>>> It catches the SESSION_INVALID in the XenServerConnection and
>>>> transparently logs back in. This would prevent the higher level bits of the
>>>> XenServer plugin from having to deal with sessions being expired beneath
>>>> them.
>>>> 
>>>> Cheers,
>>>> Dave
>>>> 
>>>>> 
>>>>> I suspect this now leaks Connections/sessions, but the symptom goes
>>>> away.
>>>>> 
>>>>> So far my thoughts are:
>>>>> 
>>>>> 1. We need to find who's calling session.logout and why -- this will
>>>>> help fix the problem in the short term.
>>>>> 
>>>>> 2. The XenServer XenAPI bindings are harder to use than they should be
>>>>> (IMHO). In particular I think the bindings should take care of handling
>>>>> SESSION_INVALID exceptions and re-authenticating transparently, to avoid
>>>>> polluting the CloudStack code with rarely-used exception handlers.
>>>>> 
>>>>> 3. The semantics of XenAPI task.destroy could be improved: instead of
>>>>> immediately removing the task (which then seems to cause cleanup code to
>>>>> fail randomly), it should be more like Unix waitpid with WNOHANG, i.e.
>>>>> set a bit which says, "I'm done with this. Destroy it when you are
>>>>> finished with it."
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Also, if I try to kick off a user VM to local storage, I get the
>>>>>> general-purpose InsufficientCapacityException and the virtual router
>>>> does
>>>>>> not even start up.
>>>>> 
>>>>> No idea about this one :)
>>>>> 
>>>>> Cheers,
>>>>> Dave
>>>>> 
>>>>>> 
>>>>>> Can anyone create a similar cloud to what I've described here with XS
>>>> 6.2,
>>>>>> XS62ESP1, and XS62ESP1004? I re-ran this test using a XS 6.1 host and
>>>> it
>>>>>> works just fine.
>>>>>> 
>>>>>> At the moment, this is blocking a test case I'm trying to execute to
>>>> verify
>>>>>> code I had to write in Xenserver625StorageProcessor.
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> --
>>>>>> *Mike Tutkowski*
>>>>>> *Senior CloudStack Developer, SolidFire Inc.*
>>>>>> e: mike.tutkowski@xxxxxxxxxxxxx
>>>>>> o: 303.746.7302
>>>>>> Advancing the way the world uses the
>>>>>> cloud<http://solidfire.com/solution/overview/?video=play>
>>>>>> *(tm)*
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 

