
Re: [Xen-devel] Race condition with scheduler runqueues



On 19/02/13 09:28, Jan Beulich wrote:
>>>> On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>> Hello,
>>
>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>> which I have tracked down to a bad __runq_remove() in csched_vcpu_sleep()
>> in sched_credit.c (because a static function of the same name also
>> exists in sched_credit2.c, which confused matters to start with).
>>
>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>> domain.  The test case itself has passed many times in the past on the
>> same Xen codebase (Xen-4.1.3), indicating that the failure is very rare.
>> There do not appear to be any relevant changes between the version of Xen
>> in the test and xen-4.1-testing.
>>
>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>> from dom0, targeting the VCPU of the migrating domain.
>>
>> (XEN) Xen call trace:
>> (XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>> (XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>> (XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>> (XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>> (XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>> (XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>
>> The relevant part of csched_vcpu_sleep() is
>>
>>     else if ( __vcpu_on_runq(svc) )
>>         __runq_remove(svc);
>>
>> which disassembles to
>>
>> ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
>> ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
>> ffff82c480116a07:       75 07                   jne    ffff82c480116a10 <csched_vcpu_sleep+0x40>
>> ffff82c480116a09:       f3 c3                   repz retq
>> ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>> ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
>> ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx)   # <- Pagefault here
>> ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
>> ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
>> ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)
>>
>> The relevant crash registers from the pagefault are:
>> rax: 0000000000000000
>> rdx: 0000000000000000
>>  r8: ffff83080c89ed90
>>
>> If I am reading the code correctly, this means that runq->next is NULL,
>> so list_empty() returns false and we erroneously pass the __vcpu_on_runq()
>> check.  We then fault trying to write the prev pointer (also NULL) through
>> that NULL next pointer.
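(For reference: __vcpu_on_runq() is essentially !list_empty() on the vcpu's
runq element, and __runq_remove() is essentially list_del_init() on it.  A
rough sketch of what those Linux-style helpers expand to, which is what the
disassembly above shows inlined:

    /* Illustrative sketch only; Linux-style list.h semantics, not copied
     * verbatim from sched_credit.c. */
    struct list_head { struct list_head *next, *prev; };

    static inline int list_empty(const struct list_head *head)
    {
        return head->next == head;    /* a NULL next compares unequal, so
                                       * the element looks "on the runq"  */
    }

    static inline void list_del_init(struct list_head *entry)
    {
        entry->next->prev = entry->prev; /* write at offset 8 through a NULL
                                          * next => pagefault at ...0008   */
        entry->prev->next = entry->next;
        entry->next = entry;             /* re-point the element at itself */
        entry->prev = entry;
    }

With %r8 holding the element, %rdx = next and %rax = prev, the faulting
instruction above is the entry->next->prev store.)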
>>
>> The only place I can spot in the code where the runq->{next,prev} could
>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>> INIT_LIST_HEAD().  This is logically sensible in combination with the
>> localhost migrate loop, and I can't immediately see anything to prevent
>> this race from happening.
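(The window I have in mind would look roughly like this - paraphrased from
memory rather than quoted from sched_credit.c, so treat the field names as
illustrative:

    static void *csched_alloc_vdata(const struct scheduler *ops,
                                    struct vcpu *vc, void *dd)
    {
        struct csched_vcpu *svc = xmalloc(struct csched_vcpu);

        if ( svc == NULL )
            return NULL;

        memset(svc, 0, sizeof(*svc));    /* runq element next/prev are NULL */
        INIT_LIST_HEAD(&svc->runq_elem); /* ...until here, where they point
                                          * back at the element itself      */
        /* ... remaining initialisation ... */
        return svc;
    }

i.e. a remote observer of the half-initialised structure would see exactly
the all-NULL pointers in the register dump above.)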
> But that doesn't make sense: csched_alloc_vdata() doesn't store
> svc into vc->sched_priv; that's being done by the generic
> scheduler code once the actor returns.

D'oh yes - I overlooked that.

>
> So I'd rather suspect a stale pointer being used, which is easily
> possible when racing with sched_move_domain() (as opposed to
> schedule_cpu_switch(), where the new pointer gets stored
> _before_ de-allocating the old one).
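Schematically - illustrative pseudocode rather than the literal schedule.c
source - the ordering difference Jan describes is:

    /* schedule_cpu_switch(): the new private data is installed before the
     * old allocation is freed, so a racing reader of v->sched_priv never
     * sees freed memory. */
    new_priv = SCHED_OP(new_ops, alloc_vdata, v, domdata);
    v->sched_priv = new_priv;                 /* install first   */
    SCHED_OP(old_ops, free_vdata, old_priv);  /* free afterwards */

    /* sched_move_domain(): the implication is that it does not provide the
     * same install-before-free ordering, so a CPU which fetched
     * v->sched_priv just before the move can still be dereferencing the
     * old csched_vcpu when it is freed, e.g. from csched_vcpu_sleep(). */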

>
> However, sched_move_domain() (as well as schedule_cpu_switch())
> get called only from CPU pools code, and I would guess CPU pools
> aren't involved here, and you don't soft offline/online pCPUs in
> parallel (as I'm sure you would otherwise have mentioned it).
>
> But wait - libxl__domain_make() _unconditionally_ calls
> xc_cpupool_movedomain(), as does XendDomainInfo's
> _constructDomain(). The reason for this escapes me - Jürgen? Yet
> I'd expect the pool ID matching check to short-cut the resulting
> sysctl, i.e. sched_move_domain() ought not to be reached anyway
> (worth verifying of course).
>
> The race there nevertheless ought to be fixed.
>
> Jan

Our toolstack hooks directly into libxc and is completely ignorant of
cpupools.  Looking at the crash more closely, it might be a race elsewhere.

Another dom0 vcpu is in an HVMOP_track_dirty_vram hypercall, and the
associated Xen stack trace is:

[ffff82c4801777b2] time_calibration_std_rendezvous+0xb2/0xc0
 ffff82c480172d12  __smp_call_function_interrupt+0x62/0xb0
 ffff82c48017339e  smp_call_function_interrupt+0x4e/0x90
 ffff82c48014a65a  call_function_interrupt+0x2a/0x30
 ffff82c4801223b2  _spin_lock+0x12/0x20
 ffff82c4801734ab  flush_area_mask+0xcb/0x1c0
 ffff82c4801c862a  paging_log_dirty_range+0x3a/0x780
 ffff82c480121ea8  cpumask_raise_softirq+0x78/0x80
 ffff82c480117ed3  csched_vcpu_wake+0x193/0x420
 ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
 ffff82c4801f21c7  hap_track_dirty_vram+0x137/0x1c0
 ffff82c4801ad3fd  do_hvm_op+0x16dd/0x1f50
 ffff82c480106251  evtchn_send+0xa1/0x160
 ffff82c480106d36  do_event_channel_op+0x876/0xfd0
 ffff82c4801f9027  compat_update_descriptor+0x27/0x30
 ffff82c4801354f8  compat_multicall+0x198/0x380
 ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
 ffff82c4802013c4  compat_hypercall+0x74/0x80

The hap_track_dirty_vram() and paging_log_dirty_range() frames are part of
the same logical call trace, but it appears that we have taken an
event_check_interrupt() in the middle and called schedule() off the bottom
of it, which in turn called csched_vcpu_wake().

I am currently trying to reason about whether a race between
csched_vcpu_{sleep,wake}() could result in the observed crash, but it
certainly looks like a smoking gun.
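
For reference, the interleaving I am trying to rule in or out looks roughly
like this (purely illustrative):

    /* Hypothetical interleaving under investigation:
     *
     *   pCPU A (sleep path)                pCPU B (wake path)
     *   --------------------------------   --------------------------------
     *   csched_vcpu_sleep(svc)             csched_vcpu_wake(svc)
     *     __vcpu_on_runq(svc) -> true        __runq_insert(cpu, svc)
     *     __runq_remove(svc)                   list_add_tail() rewrites
     *       walks runq_elem.next/prev            runq_elem.{next,prev}
     *       while they are being updated         non-atomically
     *
     * For this alone to produce the all-NULL pointers seen above, though,
     * the sleep path would presumably have to be looking at a stale or
     * freshly-zeroed csched_vcpu rather than a merely half-updated list
     * element. */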

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

