
Re: [Xen-devel] xenbus and the message of doom



On 15.12.2011 20:39, Konrad Rzeszutek Wilk wrote:
> On Thu, Dec 15, 2011 at 08:20:23PM +0100, Stefan Bader wrote:
>> I was investigating a bug report[1] about newer kernels (>3.1) not booting as
>> HVM guests on Amazon EC2. For some reason git bisect gave me some pain, but
>> it led me at least close, and with some crash dump data I think I figured out
>> the problem.
> 
> Stefan, thanks for finding this.
> 

I realize I wanted to add the reference to our bug report but completely forgot
to do so. So just for completeness:

http://bugs.launchpad.net/bugs/901305


> Olaf, what are your thoughts? Should I prep a patch to revert the patch
> below and then we can work on 3.3 and rethink this in 3.3? The clock is
> ticking for 3.2 and there is not much runway to fix stuff.
> 
>>
>> commit ddacf5ef684a655abe2bb50c4b2a5b72ae0d5e05
>> Author: Olaf Hering <olaf@xxxxxxxxx>
>> Date:   Thu Sep 22 16:14:49 2011 +0200
>>
>>     xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old
>>     kernel
>>
>> This change introduced a xs_reset_watches() call. The problem seems to be that
>> there is at least some version of Xen (I was able to reproduce with a 3.4.3
>> version which I admit to deliberately not having updated) for which xenstore
>> will not return any reply.
> 
> And oxenstore too, but Ian prepped a patch for this. Perhaps that is
> what Amazon is running.
>>
>> At least the backtraces in crash showed that xs_init had been calling
>> xs_reset_watches() and that was happily idling in read_reply(). Effectively
>> nothing was going on and the boot just hung.
> 
> So at least we should have a timeout in read_reply. But I don't see
> anything in the code that we could immediately use.
> 
>> By just not doing that xs_reset_watches() call, I was able to boot under the
>> same host. And for what it is worth there has not been an issue with Xen 4.1.1
>> and a 3.0 dom0 kernel. Just this "older" release is trouble.
>>
>> Now the big question is: should this never happen, meaning the host needs
>> urgent updating? Or should xs_talkv() set up a time limit and assume failure
>> when not receiving a message after that? I could imagine the latter might lead
>> at least to a more helpful "there is something wrong here, dude" than just
>> hanging around without any response. ;)
>>
>> -Stefan
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-devel

