Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
Hi all,

On 08/01/2020 23:14, Julien Grall wrote:
> On Wed, 8 Jan 2020 at 21:40, osstest service owner
> <osstest-admin@xxxxxxxxxxxxxx> wrote:
>> flight 145796 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/145796/
>>
>> Failures :-/ but no regressions.
>>
>> Tests which are failing intermittently (not blocking):
>>  test-amd64-amd64-xl-rtds 15 guest-saverestore fail in 145773 pass in 145796
>>  test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 16 guest-start/debianhvm.repeat fail in 145773 pass in 145796
>>  test-armhf-armhf-xl-rtds 12 guest-start fail in 145773 pass in 145796
>
> It looks like this test has been failing for a while (although not
> reliably). I looked at a few flights; the cause seems to be the same:
>
> Jan 8 15:02:14.700784 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
> Jan 8 15:02:26.715030 (XEN) ----[ Xen-4.14-unstable arm32 debug=y Not tainted ]----
> Jan 8 15:02:26.720756 (XEN) CPU: 1
> Jan 8 15:02:26.722158 (XEN) PC: 0023a750 common/sched_rt.c#replq_insert+0x7c/0xcc
> Jan 8 15:02:26.727851 (XEN) CPSR: 200300da MODE:Hypervisor
> Jan 8 15:02:26.731334 (XEN) R0: 002a51a4 R1: 400614a0 R2: 3d64b900 R3: 40061338
> Jan 8 15:02:26.736830 (XEN) R4: 400614a0 R5: 002a51a4 R6: 3cf1cbf0 R7: 000001cb
> Jan 8 15:02:26.742600 (XEN) R8: 4003d1b0 R9: 400614a8 R10:4003d1b0 R11:400ffe54 R12:400ffde4
> Jan 8 15:02:26.749119 (XEN) HYP: SP: 400ffe2c LR: 0023b6e8
> Jan 8 15:02:26.752296 (XEN)
> Jan 8 15:02:26.753036 (XEN) VTCR_EL2: 80003558
> Jan 8 15:02:26.755479 (XEN) VTTBR_EL2: 00020000bbff4000
> Jan 8 15:02:26.758757 (XEN)
> Jan 8 15:02:26.759366 (XEN) SCTLR_EL2: 30cd187f
> Jan 8 15:02:26.761755 (XEN) HCR_EL2: 0078663f
> Jan 8 15:02:26.764250 (XEN) TTBR0_EL2: 00000000bc029000
> Jan 8 15:02:26.767364 (XEN)
> Jan 8 15:02:26.767980 (XEN) ESR_EL2: 00000000
> Jan 8 15:02:26.770485 (XEN) HPFAR_EL2: 00030010
> Jan 8 15:02:26.772795 (XEN) HDFAR: e0800f00
> Jan 8 15:02:26.775272 (XEN) HIFAR: c0605744
> Jan 8 15:02:26.777748 (XEN)
> Jan 8 15:02:26.778505 (XEN) Xen stack trace from sp=400ffe2c:
> Jan 8 15:02:26.781910 (XEN)    00000000 3cf1cbf0 400614a0 002a51a4 3cf1cbf0 000001cb 4003d1b0 6003005a
> Jan 8 15:02:26.788991 (XEN)    400613f8 400ffe7c 0023b6e8 002f9300 4004c000 400613f8 3cf1cbf0 000001cb
> Jan 8 15:02:26.796093 (XEN)    4003d1b0 6003005a 400613f8 400ffeac 00242988 4004c000 002425ac 40058000
> Jan 8 15:02:26.803237 (XEN)    4004c000 4004f000 10f45000 10f45008 4004b080 40058000 60030013 400ffebc
> Jan 8 15:02:26.810360 (XEN)    00209984 00000002 4004f000 400ffedc 0020eddc 0020caf8 db097cd4 00000020
> Jan 8 15:02:26.817504 (XEN)    c13afbec 00000000 db15fd68 400ffee4 0020c9dc 400fff34 0020d5e8 4004e000
> Jan 8 15:02:26.824615 (XEN)    00000000 400fff44 400fff44 00000002 00000000 4004e8fa 4004e8f4 400fff1c
> Jan 8 15:02:26.831737 (XEN)    400fff1c 6003005a 0020caf8 400fff58 00000020 c13afbec 00000000 db15fd68
> Jan 8 15:02:26.838798 (XEN)    60030013 400fff54 0026c150 c1204d08 c13afbec 00000000 00000000 00000000
> Jan 8 15:02:26.845877 (XEN)    00000002 400fff58 002753b0 00000009 db097cd4 db173008 00000002 c1204d08
> Jan 8 15:02:26.852986 (XEN)    00000000 00000002 c13afbec 00000000 db15fd68 60030013 db15fd3c 00000020
> Jan 8 15:02:26.860044 (XEN)    ffffffff b6cdccb3 c0107ed0 a0030093 4a000ea1 be951568 c136edc0 c010d3a0
> Jan 8 15:02:26.867171 (XEN)    db097cd0 c056c7f8 c136edcc c010d720 c136edd8 c010d7e0 00000000 00000000
> Jan 8 15:02:26.874526 (XEN)    00000000 00000000 00000000 c136ede4 c136ede4 00030030 60070193 80030093
> Jan 8 15:02:26.881450 (XEN)    60030193 00000000 00000000 00000000 00000001
> Jan 8 15:02:26.886519 (XEN) Xen call trace:
> Jan 8 15:02:26.888168 (XEN)    [<0023a750>] common/sched_rt.c#replq_insert+0x7c/0xcc (PC)
> Jan 8 15:02:26.894240 (XEN)    [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274 (LR)
> Jan 8 15:02:26.900246 (XEN)    [<0023b6e8>] common/sched_rt.c#rt_unit_wake+0xf4/0x274
> Jan 8 15:02:26.905775 (XEN)    [<00242988>] vcpu_wake+0x1e4/0x688
> Jan 8 15:02:26.909743 (XEN)    [<00209984>] domain_unpause+0x64/0x84
> Jan 8 15:02:26.913956 (XEN)    [<0020eddc>] common/event_fifo.c#evtchn_fifo_unmask+0xd8/0xf0
> Jan 8 15:02:26.920167 (XEN)    [<0020c9dc>] evtchn_unmask+0x7c/0xc0
> Jan 8 15:02:26.924173 (XEN)    [<0020d5e8>] do_event_channel_op+0xaf0/0xdac
> Jan 8 15:02:26.928922 (XEN)    [<0026c150>] do_trap_guest_sync+0x350/0x4d0
> Jan 8 15:02:26.933647 (XEN)    [<002753b0>] entry.o#return_from_trap+0/0x4
> Jan 8 15:02:26.938299 (XEN)
> Jan 8 15:02:26.939039 (XEN)
> Jan 8 15:02:26.939668 (XEN) ****************************************
> Jan 8 15:02:26.943794 (XEN) Panic on CPU 1:
> Jan 8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
> Jan 8 15:02:26.951492 (XEN) ****************************************
>
> I believe the domain_unpause() is coming from guest_clear_bit(). This
> would mean the atomics didn't succeed without pausing the domain. This
> makes sense as, per the log:
>
>   CPU1: Guest atomics will try 1 times before pausing the domain
>
> I am under the impression that the crash could be reproduced with just:
>
>   domain_pause_nosync(current);
>   domain_unpause(current);
>
> Any insight into what's wrong? I am happy to try to reproduce it
> tomorrow morning.

So I managed to reproduce it on Arm by hacking the hypercall path to call:

  domain_pause_nosync(current->domain);
  domain_unpause(current->domain);

With a debug build and a 2-vCPU dom0, the crash happens within a few
seconds.
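In case anyone wants to reproduce it, a minimal sketch of the hack,
assuming it is invoked from do_trap_guest_sync() in arch/arm/traps.c
(the helper name is mine, and the exact placement should not matter as
long as it runs on every hypercall):

  /*
   * Debug hack, not for committing: exercise the pause/unpause race
   * on every hypercall. Both helpers are declared in xen/sched.h.
   */
  static void pause_unpause_race(void)
  {
      /*
       * domain_pause_nosync() increments d->pause_count and calls
       * vcpu_sleep_nosync() on every vCPU *without* waiting for them
       * to be descheduled, so the immediate domain_unpause() can race
       * with rt_schedule() running on another pCPU.
       */
      domain_pause_nosync(current->domain);
      domain_unpause(current->domain);
  }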
When the unit is not scheduled, rt_unit_wake() expects the unit to be on
none of the queues. The interaction is as follows:

  CPU0                              | CPU1
  do_domain_pause()                 |
   -> atomic_inc(&d->pause_count)   |
   -> vcpu_sleep_nosync(vCPU A)     | schedule()
                                    |  -> Lock
                                    |  -> rt_schedule()
                                    |     -> snext = runq_pick(...)
                                    |        /* returns unit A (aka vCPU A) */
                                    |     /* Unit is not runnable */
                                    |     -> Remove from the q
   [....]                           |
   -> Lock                          |  -> Lock
   -> rt_unit_sleep()               |
      /* Unit not scheduled */      |
      /* Nothing to do */           |

Note that, on Arm, each vCPU has its own scheduling unit.

When schedule() grabs the lock first (as shown above), the unit will
only be removed from the runq. However, when vcpu_sleep_nosync() grabs
the lock first and the unit was not scheduled, rt_unit_sleep() will
remove the unit from both queues (runq/depletedq and replq).

So I think we want schedule() to remove the unit from both queues if it
is not runnable. Any opinions?
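Untested, but roughly what I have in mind, against the q_remove() path
in rt_schedule() in common/sched_rt.c. The guard and placement are my
guesses; replq_remove() is the existing helper that takes a unit off
the replenishment queue, and it asserts the unit is actually on it,
hence the unit_on_replq() check:

  /*
   * After rt_schedule() has picked snext and taken it off the
   * runq/depletedq with q_remove(snext):
   */
  if ( !unit_runnable(snext->unit) )
  {
      /*
       * The unit went to sleep while we were picking it: mirror what
       * rt_unit_sleep() does for a non-scheduled unit and drop it
       * from the replenishment queue too, so that a later
       * rt_unit_wake() finds it on no queue at all, as it asserts.
       */
      if ( unit_on_replq(snext) )
          replq_remove(ops, snext);
  }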
Cheers,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel