
[Xen-devel] Re: DOM0 Hang on a large box....



>>> On 01.09.11 at 21:20, Mukesh Rathor <mukesh.rathor@xxxxxxxxxx> wrote:
> I'm looking at a system hang on a large box: 160 cpus, 2TB. Dom0 is
> booted with 160 vcpus (don't ask me why :)), and an HVM guest is started
> with over 1.5T RAM and 128 vcpus. The system hangs without much activity
> after a couple of hours. Xen 4.0.2 and a 2.6.32-based 64-bit dom0.
> 
> During hang I discovered:
> 
> Most of dom0's vcpus are in double_lock_balance, spinning on one of the locks:
> 
> @ ffffffff800083aa: 0:hypercall_page+3aa           pop %r11                
> @ ffffffff802405eb: 0:xen_spin_wait+19b            test %eax, %eax        
> @ ffffffff8035969b: 0:_spin_lock+10b               test %al, %al          
> @ ffffffff800342f5: 0:double_lock_balance+65       mov %rbx, %rdi          
> @ ffffffff80356fc0: 0:thread_return+37e            mov 0x880(%r12), %edi   
> 
> static int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
>         __releases(this_rq->lock)
>         __acquires(busiest->lock)
>         __acquires(this_rq->lock)
> {
>         int ret = 0;
> 
>         if (unlikely(!spin_trylock(&busiest->lock))) {
>                 if (busiest < this_rq) {
>                         spin_unlock(&this_rq->lock);
>                         spin_lock(&busiest->lock);
>                         spin_lock_nested(&this_rq->lock,
>                                          SINGLE_DEPTH_NESTING);
>                         ret = 1;
>                 } else
>                         spin_lock_nested(&busiest->lock,
>                                          SINGLE_DEPTH_NESTING);
>         }
>         return ret;
> }
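
The unlock/relock dance above is the usual address-ordered double-lock
pattern: when the trylock fails, the lower-addressed runqueue lock is always
taken first, so two CPUs pulling tasks from each other's runqueues cannot
deadlock AB-BA. A self-contained illustration of the same ordering rule,
using plain pthread mutexes rather than runqueue locks:

/* Illustration of the ordering rule used by _double_lock_balance():
 * when two locks must be held at once, always acquire the lower-addressed
 * one first so concurrent lockers cannot form an AB-BA deadlock.
 * Plain pthreads, nothing Xen- or scheduler-specific. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

static void lock_both(pthread_mutex_t *x, pthread_mutex_t *y)
{
        if (x > y) {                    /* same idea as "if (busiest < this_rq)" */
                pthread_mutex_t *t = x;
                x = y;
                y = t;
        }
        pthread_mutex_lock(x);          /* lower address first ... */
        pthread_mutex_lock(y);          /* ... higher address second */
}

static void unlock_both(pthread_mutex_t *x, pthread_mutex_t *y)
{
        pthread_mutex_unlock(x);
        pthread_mutex_unlock(y);
}

int main(void)
{
        lock_both(&a, &b);
        puts("both locks held, in address order");
        unlock_both(&a, &b);
        return 0;
}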
> 
> 
> The lock is taken, but I'm not sure who the owner is. The lock struct:
> 
> @ ffff8800020e2480:  2f102e70 0000000c 00000002 00000000 
> 
> so slock is: 2f102e70
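
Just to make that number concrete: a minimal, illustrative decode of the raw
word, assuming the stock 2.6.32 x86 ticket-lock layout for NR_CPUS >= 256
builds (16-bit owner ticket in the low half, 16-bit next ticket in the high
half). The pv-ops/Xen spinlock changes in this kernel may lay the word out
differently, so treat the result as a hint rather than fact:

/* Hypothetical decode of the slock value from the dump above, assuming
 * the generic 16-bit ticket layout; not authoritative for this kernel. */
#include <stdio.h>

int main(void)
{
        unsigned int slock = 0x2f102e70;        /* word from the dump above */
        unsigned int owner = slock & 0xffff;    /* ticket currently served */
        unsigned int next  = slock >> 16;       /* next ticket to hand out */

        printf("owner=%#x next=%#x waiters=%u\n", owner, next, next - owner);
        return 0;
}

Under that assumed layout this prints owner=0x2e70, next=0x2f10, i.e. 0xa0
(160) tickets outstanding - suspiciously close to the dom0 vcpu count, but
only meaningful if the layout assumption actually holds for this kernel.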
> 
> The remaining vcpus are idling:
> 
> ffffffff800083aa: 0:hypercall_page+3aa           pop %r11                
> ffffffff8000f0c7: 0:xen_safe_halt+f7             addq $0x18, %rsp        
> ffffffff8000a5c5: 0:cpu_idle+65                  jmp  0:cpu_idle+4e
> ffffffff803558fe: 0:cpu_bringup_and_idle+e       leave                   
> 
> But the baffling thing is that the vcpu upcall mask is set. The block
> schedop call does local_event_delivery_enable() first thing, so the mask
> should be clear!
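
For context on why the set mask is surprising: SCHEDOP_block is specified to
re-enable event delivery (clear evtchn_upcall_mask in the shared vcpu_info)
before the vcpu is descheduled, which is what the local_event_delivery_enable()
call referred to above does. A tiny stand-alone model of that invariant - this
is not Xen code, the struct and function names are simplified stand-ins:

/* Model of the invariant relied on above: blocking via SCHEDOP_block
 * re-enables event delivery first, so a vcpu found blocked should never
 * still have its upcall mask set.  Stand-in types/functions, not Xen code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct vcpu_info_model {
        uint8_t evtchn_upcall_pending;
        uint8_t evtchn_upcall_mask;             /* 1 = event upcalls masked */
};

/* Stand-in for the hypervisor's local_event_delivery_enable(). */
static void model_event_delivery_enable(struct vcpu_info_model *vi)
{
        vi->evtchn_upcall_mask = 0;
}

/* Stand-in for the SCHEDOP_block handler: enable delivery, then block. */
static void model_schedop_block(struct vcpu_info_model *vi)
{
        model_event_delivery_enable(vi);
        /* ... vcpu would be descheduled here unless an event is pending ... */
}

int main(void)
{
        struct vcpu_info_model vi = { .evtchn_upcall_pending = 0,
                                      .evtchn_upcall_mask    = 1 };

        model_schedop_block(&vi);
        assert(vi.evtchn_upcall_mask == 0);     /* what the dump contradicts */
        printf("mask after block: %u\n", (unsigned)vi.evtchn_upcall_mask);
        return 0;
}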
> 
> 
> Another baffling thing is that the dom0 upcall mask looks fishy:
> @ ffff83007f4dba00:  4924924924924929 2492492492492492
> @ ffff83007f4dba10:  9249249249249249 4924924924924924
> @ ffff83007f4dba20:  2492492492492492 9249249249249249
> @ ffff83007f4dba30:  4924924924924924 0000000092492492
> @ ffff83007f4dba40:  0000000000000000 0000000000000000
> @ ffff83007f4dba50:  0000000000000000 ffffffffc0000000
> @ ffff83007f4dba60:  ffffffffffffffff ffffffffffffffff
> @ ffff83007f4dba70:  ffffffffffffffff ffffffffffffffff 
> 
> 
> Finally, ticketing is used for spin locks. Hi Jan, what is the largest 
> system this was tested on? Have you seen this before?

From the observation of most CPUs sitting in _double_lock_balance()
I would have answered yes, but I don't recall having seen the odd upcall
mask. In any case - is your Dom0 kernel (presumably derived from ours)
up-to-date? The problem I recall was fixed months ago.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

