
Re: [Xen-devel] Race condition with scheduler runqueues



On 19/02/13 09:28, Jan Beulich wrote:
>>>> On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>> Hello,
>>
>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>> which I have tracked down to a bad __runq_remove() in csched_vcpu_sleep()
>> in sched_credit.c (because a static function of the same name also
>> exists in sched_credit2.c, which confused matters to start with).
>>
>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>> domain.  The test case itself has passed many times in the past on the
>> same Xen codebase (Xen-4.1.3), indicating that the failure is very rare.
>> There do not appear to be any relevant changes between the version of Xen
>> in the test and xen-4.1-testing.
>>
>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>> from dom0, targeting the VCPU of the migrating domain.
>>
>> (XEN) Xen call trace:
>> (XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>> (XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>> (XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>> (XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>> (XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>> (XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>
>> The relevant part of csched_vcpu_sleep() is
>>
>>     else if ( __vcpu_on_runq(svc) )
>>         __runq_remove(svc);
>>
>> which disassembles to
>>
>> ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
>> ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
>> ffff82c480116a07:       75 07                   jne    ffff82c480116a10 <csched_vcpu_sleep+0x40>
>> ffff82c480116a09:       f3 c3                   repz retq
>> ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>> ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
>> ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx)   # <- Pagefault here
>> ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
>> ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
>> ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)
>>
>> The relevant crash registers from the pagefault are:
>> rax: 0000000000000000
>> rdx: 0000000000000000
>>  r8: ffff83080c89ed90
>>
>> If I am reading the code correctly, this means that runq->next is NULL,
>> so list_empty() returns false and we erroneously pass the __vcpu_on_runq()
>> check.  We then fault trying to write the prev pointer (also NULL) through
>> that NULL next pointer.
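(For reference: __vcpu_on_runq() is essentially !list_empty() on the vcpu's
runq element, and __runq_remove() is essentially list_del_init() on it.  A
rough sketch of what those Linux-style helpers expand to, which is what the
disassembly above shows inlined:

    /* Illustrative sketch only; Linux-style list.h semantics, not copied
     * verbatim from sched_credit.c. */
    struct list_head { struct list_head *next, *prev; };

    static inline int list_empty(const struct list_head *head)
    {
        return head->next == head;    /* a NULL next compares unequal, so
                                       * the element looks "on the runq"  */
    }

    static inline void list_del_init(struct list_head *entry)
    {
        entry->next->prev = entry->prev; /* write at offset 8 through a NULL
                                          * next => pagefault at ...0008   */
        entry->prev->next = entry->next;
        entry->next = entry;             /* re-point the element at itself */
        entry->prev = entry;
    }

With %r8 holding the element, %rdx = next and %rax = prev, the faulting
instruction above is the entry->next->prev store.)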
>>
>> The only place I can spot in the code where the runq->{next,prev} could
>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>> INIT_LIST_HEAD().  This is logically sensible in combination with the
>> localhost migrate loop, and I can't immediately see anything to prevent
>> this race from happening.
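(The window I have in mind would look roughly like this - paraphrased from
memory rather than quoted from sched_credit.c, so treat the field names as
illustrative:

    static void *csched_alloc_vdata(const struct scheduler *ops,
                                    struct vcpu *vc, void *dd)
    {
        struct csched_vcpu *svc = xmalloc(struct csched_vcpu);

        if ( svc == NULL )
            return NULL;

        memset(svc, 0, sizeof(*svc));    /* runq element next/prev are NULL */
        INIT_LIST_HEAD(&svc->runq_elem); /* ...until here, where they point
                                          * back at the element itself      */
        /* ... remaining initialisation ... */
        return svc;
    }

i.e. a remote observer of the half-initialised structure would see exactly
the all-NULL pointers in the register dump above.)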
> But that doesn't make sense: csched_alloc_vdata() doesn't store
> svc into vc->sched_priv; that's being done by the generic
> scheduler code once the actor returns.

D'oh yes - I overlooked that.

>
> So I'd rather suspect a stale pointer being used, which is easily
> possible when racing with sched_move_domain() (as opposed to
> schedule_cpu_switch(), where the new pointer gets stored
> _before_ de-allocating the old one).
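Schematically - illustrative pseudocode rather than the literal schedule.c
source - the ordering difference Jan describes is:

    /* schedule_cpu_switch(): the new private data is installed before the
     * old allocation is freed, so a racing reader of v->sched_priv never
     * sees freed memory. */
    new_priv = SCHED_OP(new_ops, alloc_vdata, v, domdata);
    v->sched_priv = new_priv;                 /* install first   */
    SCHED_OP(old_ops, free_vdata, old_priv);  /* free afterwards */

    /* sched_move_domain(): the implication is that it does not provide the
     * same install-before-free ordering, so a CPU which fetched
     * v->sched_priv just before the move can still be dereferencing the
     * old csched_vcpu when it is freed, e.g. from csched_vcpu_sleep(). */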

>
> However, sched_move_domain() (as well as schedule_cpu_switch())
> get called only from CPU pools code, and I would guess CPU pools
> aren't involved here, and you don't soft offline/online pCPUs in
> parallel (as I'm sure you would otherwise have mentioned it).
>
> But wait - libxl__domain_make() _unconditionally_ calls
> xc_cpupool_movedomain(), as does XendDomainInfo's
> _constructDomain(). The reason for this escapes me - Jürgen? Yet
> I'd expect the pool ID matching check to short-cut the resulting
> sysctl, i.e. sched_move_domain() ought not to be reached anyway
> (worth verifying of course).
>
> The race there nevertheless ought to be fixed.
>
> Jan

Our toolstack hooks directly into libxc and is completely ignorant of
cpupools.  Looking at the crash more closely, it might be a race elsewhere.

Another dom0 vcpu is in an HVMOP_track_dirty_vram hypercall, and the
associated Xen stack trace is:

[ffff82c4801777b2] time_calibration_std_rendezvous+0xb2/0xc0
 ffff82c480172d12  __smp_call_function_interrupt+0x62/0xb0
 ffff82c48017339e  smp_call_function_interrupt+0x4e/0x90
 ffff82c48014a65a  call_function_interrupt+0x2a/0x30
 ffff82c4801223b2  _spin_lock+0x12/0x20
 ffff82c4801734ab  flush_area_mask+0xcb/0x1c0
 ffff82c4801c862a  paging_log_dirty_range+0x3a/0x780
 ffff82c480121ea8  cpumask_raise_softirq+0x78/0x80
 ffff82c480117ed3  csched_vcpu_wake+0x193/0x420
 ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
 ffff82c4801f21c7  hap_track_dirty_vram+0x137/0x1c0
 ffff82c4801ad3fd  do_hvm_op+0x16dd/0x1f50
 ffff82c480106251  evtchn_send+0xa1/0x160
 ffff82c480106d36  do_event_channel_op+0x876/0xfd0
 ffff82c4801f9027  compat_update_descriptor+0x27/0x30
 ffff82c4801354f8  compat_multicall+0x198/0x380
 ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
 ffff82c4802013c4  compat_hypercall+0x74/0x80

The hap_track_dirty_vram() and paging_log_dirty_range() frames are part of
the same logical call trace, but it appears that we have taken an
event_check_interrupt() in the middle and called schedule() off the bottom
of it, which in turn called csched_vcpu_wake().

I am currently trying to reason about whether a race between
csched_vcpu_{sleep,wake}() could result in the observed crash, but it
certainly looks like a smoking gun.
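
For reference, the interleaving I am trying to rule in or out looks roughly
like this (purely illustrative):

    /* Hypothetical interleaving under investigation:
     *
     *   pCPU A (sleep path)                pCPU B (wake path)
     *   --------------------------------   --------------------------------
     *   csched_vcpu_sleep(svc)             csched_vcpu_wake(svc)
     *     __vcpu_on_runq(svc) -> true        __runq_insert(cpu, svc)
     *     __runq_remove(svc)                   list_add_tail() rewrites
     *       walks runq_elem.next/prev            runq_elem.{next,prev}
     *       while they are being updated         non-atomically
     *
     * For this alone to produce the all-NULL pointers seen above, though,
     * the sleep path would presumably have to be looking at a stale or
     * freshly-zeroed csched_vcpu rather than a merely half-updated list
     * element. */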

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

