
RE: [Xen-devel] Domain 0 stop response on frequently reboot VMS



Well, thanks Keir.
Fortunately we caught the bug; it turned out to be a tapdisk problem.
Here is a brief explanation for anyone else who might confront this issue.
 
Clearing BLKTAP_DEFERRED on line 19 allows blktap_defer() to run between
lines 24 and 37 and re-add a tap whose deferred_queue node is still linked
into the local queue. This concurrent access eventually corrupts the
deferred_queue pointers, which shows up as an infinite loop in the while
loop on line 22. Taking deferred_work_lock around line 24 is a simple fix;
a sketch of it follows the listing below.
 
/linux-2.6-pvops.git/drivers/xen/blktap/wait_queue.c
  9 void
 10 blktap_run_deferred(void)
 11 {
 12     LIST_HEAD(queue);
 13     struct blktap *tap;
 14     unsigned long flags;
 15    
 16     spin_lock_irqsave(&deferred_work_lock, flags);
 17     list_splice_init(&deferred_work_queue, &queue);
 18     list_for_each_entry(tap, &queue, deferred_queue)
 19         clear_bit(BLKTAP_DEFERRED, &tap->dev_inuse);
 20     spin_unlock_irqrestore(&deferred_work_lock, flags);
 21    
 22     while (!list_empty(&queue)) {
 23         tap = list_entry(queue.next, struct blktap, deferred_queue);
 24         list_del_init(&tap->deferred_queue);
 25         blktap_device_restart(tap);
 26     }  
 27 }  
 28
 29 void
 30 blktap_defer(struct blktap *tap)
 31 {
 32     unsigned long flags;
 33    
 34     spin_lock_irqsave(&deferred_work_lock, flags);
 35     if (!test_bit(BLKTAP_DEFERRED, &tap->dev_inuse)) {
 36         set_bit(BLKTAP_DEFERRED, &tap->dev_inuse);
 37         list_add_tail(&tap->deferred_queue, &deferred_work_queue);
 38     }  
 39     spin_unlock_irqrestore(&deferred_work_lock, flags);
 40 }
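
For reference, here is a minimal sketch of the fixed loop. It is not the
exact patch we applied: it takes deferred_work_lock around list_del_init()
(the line 24 fix above) and, as my own additional reordering, clears
BLKTAP_DEFERRED only after the node is unlinked, so a concurrent
blktap_defer() either sees the bit still set and does nothing, or sees a
node that is already off the local list.

void
blktap_run_deferred(void)
{
    LIST_HEAD(queue);
    struct blktap *tap;
    unsigned long flags;

    /* Move pending entries to a private list. BLKTAP_DEFERRED stays
     * set, so blktap_defer() cannot re-link a node that is still on
     * our local queue. */
    spin_lock_irqsave(&deferred_work_lock, flags);
    list_splice_init(&deferred_work_queue, &queue);
    spin_unlock_irqrestore(&deferred_work_lock, flags);

    while (!list_empty(&queue)) {
        /* Unlink under the lock, and only then allow the tap to be
         * deferred again. */
        spin_lock_irqsave(&deferred_work_lock, flags);
        tap = list_entry(queue.next, struct blktap, deferred_queue);
        list_del_init(&tap->deferred_queue);
        clear_bit(BLKTAP_DEFERRED, &tap->dev_inuse);
        spin_unlock_irqrestore(&deferred_work_lock, flags);

        blktap_device_restart(tap);
    }
}

With this ordering, a tap that is still on the private list simply skips
the redundant blktap_defer() call, which is harmless since it is about to
be restarted anyway.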

 
> Date: Fri, 15 Oct 2010 13:57:09 +0100
> Subject: Re: [Xen-devel] Domain 0 stop response on frequently reboot VMS
> From: keir@xxxxxxx
> To: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>
> You'll probably want to see if you can get SysRq output from dom0 via serial
> line. It's likely you can if it is alive enough to respond to ping. This
> might tell you things like what all processes are getting blocked on, and
> thus indicate what is stopping dom0 from making progress.
>
> -- Keir
>
> On 15/10/2010 13:43, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
>
> >
> > Hi Keir:
> >
> > First, I'd like to express my appreciation for the help you offered
> > before.
> > Well, recently we have run into a rather nasty domain 0 no-response
> > problem.
> >
> > We are still running a reboot test with 12 HVMs, almost continuously
> > and concurrently, on a physical server.
> > A few hours later, the server appears dead. We can only ping the
> > server and get a correct response;
> > Xen itself works fine, since we can get debug info from the serial
> > port. Attached is the full debug output.
> > After decoding the domain 0 CPU stack, I find the CPU still does work
> > for domain 0, since the stack info changed every time I dumped it.
> >
> > Could you help take a look at the attachment to see whether there are
> > some hints for debugging this problem? Thanks in advance.
> >
>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

