Re: Hit ASSERT in credit2 code with NR_CPUS=1 build
On Tue, 2021-03-09 at 17:24 +0100, Roger Pau Monné wrote:
> Hello,
>
Hey,

> While looking at the NR_CPUS == 1 build I realized I could reliably
> trigger the following ASSERT by creating a guest (note that dom0 seems
> to be fine):
>
Yes, I'm (somewhat, not sure if exactly though) able to reproduce.

> (XEN) Assertion 'i != cpu' failed at credit2.c:1725
> (XEN) ----[ Xen-4.15.0-rc  x86_64  debug=y  Tainted: C ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82d040249399>] common/sched/credit2.c#runq_tickle+0x469/0x571
> (XEN) RFLAGS: 0000000000010046   CONTEXT: hypervisor (d4v0)
> (XEN) rax: ffffffffffffffff   rbx: 0000000000000000   rcx: 0000000000000000
> (XEN) rdx: ffff83086c62feb0   rsi: 0000012774fba66c   rdi: ffff8307e11d5d40
> (XEN) rbp: ffff83008c8c7cf8   rsp: ffff83008c8c7c68   r8:  ffff83086c66d6c0
> (XEN) r9:  ffff82d0405d1218   r10: 0000000000000000   r11: ffff83086c631000
> (XEN) r12: ffff83086c6437c0   r13: 0000000000000000   r14: ffff83086c62fe20
> (XEN) r15: ffff82d0405d0320   cr0: 0000000080050033   cr4: 00000000003526e0
> (XEN) cr3: 00000007e130d000   cr2: ffff88826910cb38
> (XEN) fsb: 00007efee038b780   gsb: ffff888273400000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen code around <ffff82d040249399> (common/sched/credit2.c#runq_tickle+0x469/0x571):
> (XEN)  ac ff 75 3d 0f 0b 0f 0b <0f> 0b c7 45 ac 00 00 00 00 48 8d 05 6f 7e 38 00
> (XEN) Xen stack trace from rsp=ffff83008c8c7c68:
> [...]
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) Assertion 'i != cpu' failed at credit2.c:1725
> (XEN) ****************************************
>
Interesting... So, what do cpumasks look like, and how do they work,
with NR_CPUS=1 (sorry, I couldn't follow all the aspects of it too
closely)?

I'm asking because what we're doing here is the following.
First of all, we put together a cpumask (in `mask`) out of the
intersection of the CPUs that are in the vcpu's hard/soft affinity, are
part of this runqueue, are idle and have not been tickled (where
tickled == they've been poked and will go through schedule() soon):

    cpumask_andnot(&mask, &rqd->active, &rqd->idle);
    cpumask_andnot(&mask, &mask, &rqd->tickled);
    cpumask_and(&mask, &mask, cpumask_scratch_cpu(cpu));

Now, I would very much expect `mask` to have at most one bit set (i.e.,
the one of our only CPU). Actually, considering how unlikely it would
be that our only CPU is both idle and not-tickled, I expect mask to be
empty most of the time.

Anyway, let's say the cpumask has 1 bit set (in which case, it must be
the one associated with CPU 0, I presume?). What we do now is this:

    if ( __cpumask_test_and_clear_cpu(cpu, &mask) )
    {
        ...
    }

Which I think means that, whether or not we enter the body of the
`if`, we clear the bit. Of course, which bit depends on the value of
`cpu`... But with NR_CPUS=1, I don't see how `cpu` can have a value
different from the ID of the one and only CPU we have.

So, in my mind, `mask` is now empty. Therefore, I'm currently clueless
about why we enter this loop...

> for_each_cpu(i, &mask)
> {
>     s_time_t score;
>
>     /* Already looked at this one above */
>     ASSERT(i != cpu); <====
>
... and we reach this point.

I tried to build staging here (with NR_CPUS=1), and I think the code
for this ASSERT(), for me, is:

    test %ebx,%ebx
    je ffff82d040245ac5 <runq_tickle+0x48a>

(and ffff82d040245ac5 is of course ud2.)

And this kind of makes sense. Or, at least, I'm not surprised that, if
we are inside this loop, `i` is actually equal to `cpu`. What I'm
surprised about is that we are inside the loop in the first place...

I guess I need to think more about it. Any bright ideas that explain
what is going on would be more than appreciated.

> In runq_tickle. I'm afraid I have no clue of what's going on.
> FTR using a non-debug build with NR_CPUS == 1 does seem to work fine
> and I don't see any ill effects.
>
Well, yes, special casing `cpu` and dealing with it outside of the loop
is just an optimization, for when soft-affinity is defined for the
vcpu. So it makes sense that things work without the ASSERT().

However, the ASSERT() was there as a consistency check, and it looks to
me to be a valid one, even with NR_CPUS=1, so I really don't know why
it triggers...

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>>
(Raistlin Majere)
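
For reference, the mask reasoning in this message can be modelled with a small, self-contained C sketch. This is not Xen code: a plain 64-bit integer stands in for cpumask_t, and `toy_mask_t`, `toy_test_and_clear`, `toy_candidates` and `toy_walk` are hypothetical names made up for illustration. The sketch mirrors the three mask operations quoted above literally (note that `cpumask_andnot(dst, a, b)` computes `a & ~b`), and it assumes that the loop only visits bits that are actually set in the mask. Under that assumption, once `cpu` has been cleared from the mask the loop can never see `i == cpu`, which is the expectation the ASSERT() encodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef uint64_t toy_mask_t;

/* Models __cpumask_test_and_clear_cpu(): report whether the bit for
 * 'cpu' was set, and clear it in either case. */
bool toy_test_and_clear(unsigned int cpu, toy_mask_t *mask)
{
    bool was_set = ((*mask >> cpu) & 1u) != 0;

    *mask &= ~((toy_mask_t)1 << cpu);
    return was_set;
}

/* Builds the mask with the same three steps as the quoted code, then
 * removes 'cpu' from it. Returns the mask the loop would walk. */
toy_mask_t toy_candidates(toy_mask_t active, toy_mask_t idle,
                          toy_mask_t tickled, toy_mask_t affinity,
                          unsigned int cpu)
{
    toy_mask_t mask = active & ~idle;  /* cpumask_andnot(active, idle)  */

    mask &= ~tickled;                  /* cpumask_andnot(mask, tickled) */
    mask &= affinity;                  /* cpumask_and(mask, scratch)    */
    (void)toy_test_and_clear(cpu, &mask);
    return mask;
}

/* Walks only the bits set in 'mask' (the assumption of this sketch),
 * checks i != cpu for each of them and returns how many were visited. */
unsigned int toy_walk(toy_mask_t mask, unsigned int cpu)
{
    unsigned int visited = 0;

    for (unsigned int i = 0; i < 64; i++) {
        if (((mask >> i) & 1u) == 0)
            continue;
        assert(i != cpu);  /* models ASSERT(i != cpu) */
        visited++;
    }
    return visited;
}
```

With a single CPU (bit 0 only), `toy_candidates()` always returns an empty mask, so `toy_walk()` performs no iterations and the assertion cannot fire. If the real loop is entered anyway, then something in the NR_CPUS=1 build must differ from this model, for instance in how the mask is iterated.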