Re: [Xen-devel] Race condition with scheduler runqueues
On 02/19/2013 11:47 AM, Andrew Cooper wrote:
On 19/02/13 09:28, Jan Beulich wrote:
On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
Hello,
Our testing has discovered a crash (pagefault at 0x0000000000000008)
which I have tracked down to a bad __runq_remove() in csched_vcpu_sleep()
in sched_credit.c (a static function of the same name also exists in
sched_credit2.c, which confused matters to start with).
The test case was a loop of localhost migrations of a 1-vcpu HVM Win8
domain. The test case itself has passed many times in the past on the
same Xen codebase (Xen-4.1.3), indicating that the race is very rare. There
do not appear to be any relevant changes between the version of Xen under
test and xen-4.1-testing.
The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
from dom0, targeting the VCPU of the migrating domain.
(XEN) Xen call trace:
(XEN)    [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
(XEN)    [<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
(XEN)    [<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
(XEN)    [<ffff82c48011fd4c>] sched_adjust+0xac/0x230
(XEN)    [<ffff82c480102bc1>] do_domctl+0x731/0x1130
(XEN)    [<ffff82c4802013c4>] compat_hypercall+0x74/0x80
The relevant part of csched_vcpu_sleep() is:
    else if ( __vcpu_on_runq(svc) )
        __runq_remove(svc);
which disassembles to:
    ffff82c480116a01:  49 8b 10              mov    (%r8),%rdx
    ffff82c480116a04:  4c 39 c2              cmp    %r8,%rdx
    ffff82c480116a07:  75 07                 jne    ffff82c480116a10 <csched_vcpu_sleep+0x40>
    ffff82c480116a09:  f3 c3                 repz retq
    ffff82c480116a0b:  0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
    ffff82c480116a10:  49 8b 40 08           mov    0x8(%r8),%rax
    ffff82c480116a14:  48 89 42 08           mov    %rax,0x8(%rdx)    # <- Pagefault here
    ffff82c480116a18:  48 89 10              mov    %rdx,(%rax)
    ffff82c480116a1b:  4d 89 40 08           mov    %r8,0x8(%r8)
    ffff82c480116a1f:  4d 89 00              mov    %r8,(%r8)
The relevant crash registers from the pagefault are:
rax: 0000000000000000
rdx: 0000000000000000
r8: ffff83080c89ed90
If I am reading the code correctly, this means that runq_elem.next is NULL,
so list_empty() returns false and __vcpu_on_runq() erroneously passes. We
then fault when the unlink writes prev into next->prev: with next == NULL,
the store lands at address 0x8, which matches the pagefault.
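For reference, the helpers involved look roughly like this (a sketch based
on Xen 4.1's sched_credit.c and the common list helpers, not quoted
verbatim):

    /* Sketch, based on Xen 4.1's sched_credit.c; not verbatim. */
    static inline int
    __vcpu_on_runq(struct csched_vcpu *svc)
    {
        /* list_empty(): an element that is not queued points at itself. */
        return !list_empty(&svc->runq_elem);
    }

    static inline void
    __runq_remove(struct csched_vcpu *svc)
    {
        BUG_ON( !__vcpu_on_runq(svc) );
        list_del_init(&svc->runq_elem);
    }

The disassembly above is exactly list_del_init(): unlink the element from
both neighbours, then re-point it at itself.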
The only place I can spot in the code where the runq_elem {next,prev}
pointers could conceivably be NULL is in csched_alloc_vdata(), between the
memset() and the INIT_LIST_HEAD(). This is logically plausible in
combination with the localhost migrate loop, and I can't immediately see
anything that would prevent this race.
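The window I mean looks roughly like this (a sketch from memory of Xen
4.1's csched_alloc_vdata(), not verbatim, with the tail of the function
elided):

    static void *
    csched_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
    {
        struct csched_vcpu *svc;

        /* Allocate per-VCPU info. */
        svc = xmalloc(struct csched_vcpu);
        if ( svc == NULL )
            return NULL;
        memset(svc, 0, sizeof(*svc));
        /* <-- in this window runq_elem.{next,prev} are both NULL ... */

        INIT_LIST_HEAD(&svc->runq_elem);
        /* ... and only from here on do they point back at the element. */

        svc->sdom = dd;
        svc->vcpu = vc;
        /* (remaining initialisation elided) */

        return svc;
    }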
But that doesn't make sense: csched_alloc_vdata() doesn't store
svc into vc->sched_priv; that's being done by the generic
scheduler code once the actor returns.
D'oh yes - I overlooked that.
So I'd rather suspect a stale pointer being used, which is easily
possible when racing with sched_move_domain() (as opposed to
schedule_cpu_switch(), where the new pointer gets stored
_before_ de-allocating the old one).
However, sched_move_domain() (as well as schedule_cpu_switch())
gets called only from the CPU pools code, and I would guess CPU pools
aren't involved here, and you don't soft offline/online pCPU-s in
parallel (as I'm sure you would otherwise have mentioned it).
But wait - libxl__domain_make() _unconditionally_ calls
xc_cpupool_movedomain(), as does XendDomainInfo's
_constructDomain(). The reason for this escapes me - Jürgen? Yet
I'd expect the pool ID matching check to short cut the resulting
sysctl, i.e. sched_move_domain() ought to not be reached anyway
(worth verifying of course).
The race there nevertheless ought to be fixed.
Jan
Our toolstack hooks directly into libxc and is completely ignorant of
cpupools. Looking at the crash more closely, it might be a race elsewhere.
Another dom0 vcpu is in an HVMOP_track_dirty_vram hypercall, and the
associated Xen stack trace is:
[ffff82c4801777b2] time_calibration_std_rendezvous+0xb2/0xc0
[ffff82c480172d12] __smp_call_function_interrupt+0x62/0xb0
[ffff82c48017339e] smp_call_function_interrupt+0x4e/0x90
[ffff82c48014a65a] call_function_interrupt+0x2a/0x30
[ffff82c4801223b2] _spin_lock+0x12/0x20
[ffff82c4801734ab] flush_area_mask+0xcb/0x1c0
[ffff82c4801c862a] paging_log_dirty_range+0x3a/0x780
[ffff82c480121ea8] cpumask_raise_softirq+0x78/0x80
[ffff82c480117ed3] csched_vcpu_wake+0x193/0x420
[ffff82c48014a5fa] event_check_interrupt+0x2a/0x30
[ffff82c4801f21c7] hap_track_dirty_vram+0x137/0x1c0
[ffff82c4801ad3fd] do_hvm_op+0x16dd/0x1f50
[ffff82c480106251] evtchn_send+0xa1/0x160
[ffff82c480106d36] do_event_channel_op+0x876/0xfd0
[ffff82c4801f9027] compat_update_descriptor+0x27/0x30
[ffff82c4801354f8] compat_multicall+0x198/0x380
[ffff82c48014a5fa] event_check_interrupt+0x2a/0x30
[ffff82c4802013c4] compat_hypercall+0x74/0x80
The hap_track_dirty_vram() and paging_log_dirty_range() frames are part of
the same logical call trace, but it appears that we took an
event_check_interrupt() in the middle and called schedule() off the
bottom of it, which in turn called csched_vcpu_wake().
I am currently trying to work out whether a race between
csched_vcpu_{sleep,wake}() could result in the observed crash, but
it certainly looks like a smoking gun.
Any more progress on this one?
In theory all of those should be made mutually exclusive by holding the
lock of the runqueue on which the VCPU is running.
Any chance there's a race with the assignment of the vcpu -- that is, a
race in vcpu_schedule_lock() such that someone ends up grabbing the
wrong lock?
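For context, the per-runqueue locking mentioned here follows a
check-and-retry pattern along these lines (a sketch of vcpu_schedule_lock()
as in Xen 4.1's sched-if.h, not verbatim):

    /* Sketch of the "lock the runqueue this vcpu is on" pattern. */
    static inline void vcpu_schedule_lock(struct vcpu *v)
    {
        unsigned int cpu;

        for ( ; ; )
        {
            cpu = v->processor;
            spin_lock(&per_cpu(schedule_data, cpu).schedule_lock);
            /* If the vcpu moved to another pCPU while we were spinning,
             * we now hold the wrong runqueue lock: drop it and retry. */
            if ( likely(v->processor == cpu) )
                break;
            spin_unlock(&per_cpu(schedule_data, cpu).schedule_lock);
        }
    }

The race being asked about would be a caller that reads v->processor,
takes that runqueue's lock, and then proceeds even though the vcpu (or its
scheduler data) has meanwhile moved, so the lock no longer protects the
structures being modified.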
I think that in theory once you call INIT_LIST_HEAD, none of those
pointers should ever be set to zero; if one ever were it might get
passed around a bit before actually being followed. Any chance there's
something uninitialized somewhere?
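For what it's worth, the list initialiser itself never produces NULL
pointers; a sketch of the standard helper from the Linux-derived list.h:

    /* A freshly initialised element points at itself, never at NULL. */
    static inline void INIT_LIST_HEAD(struct list_head *list)
    {
        list->next = list;
        list->prev = list;
    }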
And of course, if all else fails, there's good old-fashioned memory
clobbering as a possibility...
-George
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel