
[Xen-devel] Re: DOM0 Hang on a large box....



>>> On 01.09.11 at 21:20, Mukesh Rathor <mukesh.rathor@xxxxxxxxxx> wrote:
> I'm looking at a system hang on a large box: 160 cpus, 2TB. Dom0 is
> booted with 160 vcpus (don't ask me why :)), and an HVM guest is started
> with over 1.5T RAM and 128 vcpus. The system hangs without much activity
> after a couple of hours. Xen 4.0.2 and a 2.6.32-based 64-bit dom0.
> 
> During hang I discovered:
> 
> Most of dom0's vcpus are in double_lock_balance, spinning on one of the locks:
> 
> @ ffffffff800083aa: 0:hypercall_page+3aa           pop %r11                
> @ ffffffff802405eb: 0:xen_spin_wait+19b            test %eax, %eax        
> @ ffffffff8035969b: 0:_spin_lock+10b               test %al, %al          
> @ ffffffff800342f5: 0:double_lock_balance+65       mov %rbx, %rdi          
> @ ffffffff80356fc0: 0:thread_return+37e            mov 0x880(%r12), %edi   
> 
> static int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
>         __releases(this_rq->lock)
>         __acquires(busiest->lock)
>         __acquires(this_rq->lock)
> {
>         int ret = 0;
> 
>         if (unlikely(!spin_trylock(&busiest->lock))) {
>                 if (busiest < this_rq) {
>                         spin_unlock(&this_rq->lock);
>                         spin_lock(&busiest->lock);
>                         spin_lock_nested(&this_rq->lock,
>                                          SINGLE_DEPTH_NESTING);
>                         ret = 1;
>                 } else
>                         spin_lock_nested(&busiest->lock,
>                                          SINGLE_DEPTH_NESTING);
>         }
>         return ret;
> }
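
The unlock/relock dance above is the usual address-ordered double-lock
pattern: when the trylock fails, the lower-addressed runqueue lock is always
taken first, so two CPUs pulling tasks from each other's runqueues cannot
deadlock AB-BA. A self-contained illustration of the same ordering rule,
using plain pthread mutexes rather than runqueue locks:

/* Illustration of the ordering rule used by _double_lock_balance():
 * when two locks must be held at once, always acquire the lower-addressed
 * one first so concurrent lockers cannot form an AB-BA deadlock.
 * Plain pthreads, nothing Xen- or scheduler-specific. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

static void lock_both(pthread_mutex_t *x, pthread_mutex_t *y)
{
        if (x > y) {                    /* same idea as "if (busiest < this_rq)" */
                pthread_mutex_t *t = x;
                x = y;
                y = t;
        }
        pthread_mutex_lock(x);          /* lower address first ... */
        pthread_mutex_lock(y);          /* ... higher address second */
}

static void unlock_both(pthread_mutex_t *x, pthread_mutex_t *y)
{
        pthread_mutex_unlock(x);
        pthread_mutex_unlock(y);
}

int main(void)
{
        lock_both(&a, &b);
        puts("both locks held, in address order");
        unlock_both(&a, &b);
        return 0;
}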
> 
> 
> The lock is taken, but I'm not sure who the owner is. The lock struct:
> 
> @ ffff8800020e2480:  2f102e70 0000000c 00000002 00000000 
> 
> so slock is: 2f102e70
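
Just to make that number concrete: a minimal, illustrative decode of the raw
word, assuming the stock 2.6.32 x86 ticket-lock layout for NR_CPUS >= 256
builds (16-bit owner ticket in the low half, 16-bit next ticket in the high
half). The pv-ops/Xen spinlock changes in this kernel may lay the word out
differently, so treat the result as a hint rather than fact:

/* Hypothetical decode of the slock value from the dump above, assuming
 * the generic 16-bit ticket layout; not authoritative for this kernel. */
#include <stdio.h>

int main(void)
{
        unsigned int slock = 0x2f102e70;        /* word from the dump above */
        unsigned int owner = slock & 0xffff;    /* ticket currently served */
        unsigned int next  = slock >> 16;       /* next ticket to hand out */

        printf("owner=%#x next=%#x waiters=%u\n", owner, next, next - owner);
        return 0;
}

Under that assumed layout this prints owner=0x2e70, next=0x2f10, i.e. 0xa0
(160) tickets outstanding - suspiciously close to the dom0 vcpu count, but
only meaningful if the layout assumption actually holds for this kernel.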
> 
> The remaining vcpus are idling:
> 
> ffffffff800083aa: 0:hypercall_page+3aa           pop %r11                
> ffffffff8000f0c7: 0:xen_safe_halt+f7             addq $0x18, %rsp        
> ffffffff8000a5c5: 0:cpu_idle+65                  jmp  0:cpu_idle+4e
> ffffffff803558fe: 0:cpu_bringup_and_idle+e       leave                   
> 
> But the baffling thing is that the vcpu upcall mask is set. The block
> schedop call does local_event_delivery_enable() first thing, so the mask
> should be clear!
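
For context on why the set mask is surprising: SCHEDOP_block is specified to
re-enable event delivery (clear evtchn_upcall_mask in the shared vcpu_info)
before the vcpu is descheduled, which is what the local_event_delivery_enable()
call referred to above does. A tiny stand-alone model of that invariant - this
is not Xen code, the struct and function names are simplified stand-ins:

/* Model of the invariant relied on above: blocking via SCHEDOP_block
 * re-enables event delivery first, so a vcpu found blocked should never
 * still have its upcall mask set.  Stand-in types/functions, not Xen code. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct vcpu_info_model {
        uint8_t evtchn_upcall_pending;
        uint8_t evtchn_upcall_mask;             /* 1 = event upcalls masked */
};

/* Stand-in for the hypervisor's local_event_delivery_enable(). */
static void model_event_delivery_enable(struct vcpu_info_model *vi)
{
        vi->evtchn_upcall_mask = 0;
}

/* Stand-in for the SCHEDOP_block handler: enable delivery, then block. */
static void model_schedop_block(struct vcpu_info_model *vi)
{
        model_event_delivery_enable(vi);
        /* ... vcpu would be descheduled here unless an event is pending ... */
}

int main(void)
{
        struct vcpu_info_model vi = { .evtchn_upcall_pending = 0,
                                      .evtchn_upcall_mask    = 1 };

        model_schedop_block(&vi);
        assert(vi.evtchn_upcall_mask == 0);     /* what the dump contradicts */
        printf("mask after block: %u\n", (unsigned)vi.evtchn_upcall_mask);
        return 0;
}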
> 
> 
> Another baffling thing is that the dom0 upcall mask looks fishy:
> @ ffff83007f4dba00:  4924924924924929 2492492492492492
> @ ffff83007f4dba10:  9249249249249249 4924924924924924
> @ ffff83007f4dba20:  2492492492492492 9249249249249249
> @ ffff83007f4dba30:  4924924924924924 0000000092492492
> @ ffff83007f4dba40:  0000000000000000 0000000000000000
> @ ffff83007f4dba50:  0000000000000000 ffffffffc0000000
> @ ffff83007f4dba60:  ffffffffffffffff ffffffffffffffff
> @ ffff83007f4dba70:  ffffffffffffffff ffffffffffffffff 
> 
> 
> Finally, ticketing is used for spin locks. Hi Jan, what is the largest 
> system this was tested on? Have you seen this before?

From the observation of most CPUs sitting in _double_lock_balance()
I would have answered yes, but I don't recall having seen the odd upcall
mask. In any case - is your Dom0 kernel (presumably derived from ours)
up-to-date? The problem I recall was fixed months ago.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

