
Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin



On Tue, 2018-04-10 at 09:34 +0000, George Dunlap wrote:
> Assuming the bug is this one:
> 
> BUG_ON( cpu != snext->vcpu->processor );
> 
Yes, it is that one.

Another stack trace, this time from a debug=y build of the hypervisor,
of what we think is the same bug (although reproduced in a slightly
different way), is this:

(XEN) ----[ Xen-4.7.2_02-36.1.12847.11.PTF  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    45
(XEN) RIP:    e008:[<ffff82d08012508f>] sched_credit.c#csched_schedule+0x361/0xaa9
...
(XEN) Xen call trace:
(XEN)    [<ffff82d08012508f>] sched_credit.c#csched_schedule+0x361/0xaa9
(XEN)    [<ffff82d08012c233>] schedule.c#schedule+0x109/0x5d6
(XEN)    [<ffff82d08012fb5f>] softirq.c#__do_softirq+0x7f/0x8a
(XEN)    [<ffff82d08012fbb4>] do_softirq+0x13/0x15
(XEN)    [<ffff82d0801fd5c5>] vmx_asm_do_vmentry+0x25/0x2a

(I can provide it all, if necessary.)

I've done some analysis, although at a time when we were still not
entirely sure that changing the affinities was the actual cause (or,
at least, the trigger of the whole thing).

In the specific case of this stack trace, the vcpu currently running
on CPU 45 is d3v11. It is not in the runqueue: it has been removed and
not added back, because it is not runnable (it has VPF_migrating set
in pause_flags).
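
For reference, this is the check that keeps a vcpu with a pending
migration off the runqueue. Quoting from memory of the Xen tree
(xen/include/xen/sched.h), so treat it as a sketch rather than the
exact 4.7.2 source:

static inline int vcpu_runnable(struct vcpu *v)
{
    /* Any bit set in pause_flags (VPF_migrating included) makes the
     * vcpu non-runnable, so it gets dequeued and is not re-queued
     * until the flag is cleared. */
    return !(v->pause_flags |
             atomic_read(&v->pause_count) |
             atomic_read(&v->domain->pause_count));
}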

The runqueue of pcpu 45 looks fine (i.e., it is not corrupt or anything
like that); it contains d3v10, d9v1, d32767v45, in this order.

d3v11->processor is 45, so that is also fine.

Basically, d3v11 wants to move away from pcpu 45, and this might
(though that's not certain) be the reason why we're rescheduling. The
fact that there are vcpus wanting to migrate can very well be a
consequence of the affinity being changed.

Now, the problem is that, looking into the runqueue, I found out that
d3v10->processor=32. I.e., d3v10 is queued in pcpu 45's runqueue, with
processor=32, which really shouldn't happen.
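
To make the broken invariant concrete: every vcpu queued on pcpu N's
runqueue should have vcpu->processor == N. A hypothetical debugging
helper, written in the style of sched_credit.c (RUNQ() and
__runq_elem() are the existing accessors there), would be something
like:

/* Hypothetical helper: verify that every vcpu queued on 'cpu'
 * actually believes it is on 'cpu'. In the dump above, d3v10 sits
 * on pcpu 45's runqueue with processor == 32, so this would fire. */
static void runq_check(unsigned int cpu)
{
    struct list_head *iter, *runq = RUNQ(cpu);

    list_for_each( iter, runq )
    {
        struct csched_vcpu *svc = __runq_elem(iter);

        if ( svc->vcpu->processor != cpu )
            printk("d%dv%d queued on pcpu %u but processor=%u\n",
                   svc->vcpu->domain->domain_id, svc->vcpu->vcpu_id,
                   cpu, svc->vcpu->processor);
    }
}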

This leads to the bug triggering, as, in csched_schedule(), we read the
head of the runqueue with:

snext = __runq_elem(runq->next);

and then we pass snext to csched_load_balance(), where the BUG_ON is.
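
Condensing the two call sites (a sketch from Xen 4.7's sched_credit.c,
not the verbatim source), the failing path is:

/* in csched_schedule(): */
snext = __runq_elem(runq->next);   /* head of pcpu 45's runq: d3v10 */
snext = csched_load_balance(prv, cpu, snext, &ret.migrated);

/* at the top of csched_load_balance(): */
BUG_ON( cpu != snext->vcpu->processor );   /* 45 != 32 --> crash */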

Another thing that I've found out is that all "misplaced" vcpus (in
this and also in other manifestations of this bug) have
csched_vcpu.flags=4, i.e., CSCHED_FLAG_VCPU_MIGRATING is set.
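
(For reference, these flags are bit numbers in sched_credit.c, used
with set_bit()/test_bit() on svc->flags; quoting from memory, so
double-check against the actual tree:

#define CSCHED_FLAG_VCPU_PARKED    0x0   /* VCPU over capped credits */
#define CSCHED_FLAG_VCPU_YIELD     0x1   /* VCPU yielding */
#define CSCHED_FLAG_VCPU_MIGRATING 0x2   /* VCPU may have moved pcpu */

so flags=4 is 1 << CSCHED_FLAG_VCPU_MIGRATING, i.e., bit 2 set.)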

This, basically, is again a sign that vcpu_migrate() has been called
on d3v10 as well, which in turn has called csched_vcpu_pick().
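
Just to illustrate the kind of interleaving I have in mind (this is my
speculation, not actual Xen code):

/*
 * CPU A: vcpu_migrate(d3v10)          CPU B: csched_schedule() on pcpu 45
 * --------------------------          -----------------------------------
 * pick_cpu() returns 32
 * d3v10->processor = 32
 * (d3v10 still on pcpu 45's runq)     snext = __runq_elem(runq->next)  [d3v10]
 *                                     csched_load_balance(..., snext, ...)
 *                                     BUG_ON( 45 != 32 )  -->  crash
 */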

> a nasty race condition… a vcpu has just been taken off the runqueue
> of the current pcpu, but it’s apparently been assigned to a different
> cpu.
> 
Nasty indeed. I've been looking into this on and off, but so far I
haven't found the root cause.

Now that we know for sure that it is changing affinity that triggers
it, the field of investigation can be narrowed a little bit... But I
still find it hard to spot where the race happens.

I'll look more into this later in the afternoon. I'll let you know if
something comes to mind.

> Let me take a look.
> 
Thanks! :-)
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

