
Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin




> On Apr 10, 2018, at 9:57 AM, Olaf Hering <olaf@xxxxxxxxx> wrote:
> 
> While hunting another bug we ran into the single BUG() in
> sched_credit.c:csched_load_balance(). This happens with all versions
> since 4.7; staging is also affected. The test system is a Haswell
> (model 63) machine with 4 NUMA nodes and 144 threads.
> 
> (XEN) Xen BUG at sched_credit.c:1694
> (XEN) ----[ Xen-4.11.20180407T144959.e62e140daa-2.bug1087289_411  x86_64  
> debug=n   Not tainted ]----
> (XEN) CPU:    30
> (XEN) RIP:    e008:[<ffff82d08022879d>] 
> sched_credit.c#csched_schedule+0xaad/0xba0
> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
> (XEN) rax: ffff83077ffe76d0   rbx: ffff83077fe571d0   rcx: 000000000000001e
> (XEN) rdx: ffff83005d082000   rsi: 0000000000000000   rdi: ffff83077fe575b0
> (XEN) rbp: ffff82d08094a480   rsp: ffff83077fe4fd00   r8:  ffff83077fe581a0
> (XEN) r9:  ffff82d080227cf0   r10: 0000000000000000   r11: ffff830060b62060
> (XEN) r12: 000014f4e864c2d4   r13: ffff83077fe575b0   r14: ffff83077fe58180
> (XEN) r15: ffff82d08094a480   cr0: 000000008005003b   cr4: 00000000001526e0
> (XEN) cr3: 0000000049416000   cr2: 00007fb24e1b7277
> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d08022879d> 
> (sched_credit.c#csched_schedule+0xaad/0xba0):
> (XEN)  18 01 00 e9 73 f7 ff ff <0f> 0b 48 8b 43 28 be 01 00 00 00 bf 0a 20 02 
> 00
> (XEN) Xen stack trace from rsp=ffff83077fe4fd00:
> (XEN)    ffff82d0803577ef 0000001e00000000 80000000803577ef ffff830f9d5b2aa0
> (XEN)    ffff82d0803577ef ffff83077a6c59e0 ffff83077fe4fe38 ffff82d0803577fb
> (XEN)    0000000000000000 0000000000000000 0000000001c9c380 0000000000000000
> (XEN)    ffff83077fe4ffff 000000000000001e 000014f4e86c885e ffff83077fe4ffff
> (XEN)    ffff82d08094a480 000014f4e86c73be 0000000080230c80 ffff830060b38000
> (XEN)    ffff83077fe58300 0000000000000046 ffff830f9d4f6018 0000000000000082
> (XEN)    000000000000001e ffff83077fe581c8 0000000000000001 000000000000001e
> (XEN)    ffff83005d1f0000 ffff83077fe58188 000014f4e86c885e ffff83077fe58180
> (XEN)    ffff82d08094a480 ffff82d08023153d ffff830700000000 ffff83077fe581a0
> (XEN)    0000000000000206 ffff82d080268705 ffff83077fe58300 ffff830060b38060
> (XEN)    ffff830845d83010 ffff82d080238578 ffff83077fe4ffff 00000000ffffffff
> (XEN)    ffffffffffffffff ffff83077fe4ffff ffff82d080933c00 ffff82d08094a480
> (XEN)    ffff83077fe4ffff ffff82d080234cb2 ffff82d08095f1f0 ffff82d080934b00
> (XEN)    ffff82d08095f1f0 000000000000001e 000000000000001e ffff82d08026daf5
> (XEN)    ffff83005d1f0000 ffff83005d1f0000 ffff83005d1f0000 ffff83077fe58188
> (XEN)    000014f4e86a43ab ffff83077fe58180 ffff82d08094a480 ffff88011dd88000
> (XEN)    ffff88011dd88000 ffff88011dd88000 0000000000000000 000000000000002b
> (XEN)    ffffffff81d4c180 0000000000000000 00000013fe969894 0000000000000001
> (XEN)    0000000000000000 ffffffff81020e50 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 000000fc00000000 ffffffff81060182
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08022879d>] sched_credit.c#csched_schedule+0xaad/0xba0
> (XEN)    [<ffff82d0803577ef>] common_interrupt+0x8f/0x110
> (XEN)    [<ffff82d0803577ef>] common_interrupt+0x8f/0x110
> (XEN)    [<ffff82d0803577fb>] common_interrupt+0x9b/0x110
> (XEN)    [<ffff82d08023153d>] schedule.c#schedule+0xdd/0x5d0
> (XEN)    [<ffff82d080268705>] reprogram_timer+0x75/0xe0
> (XEN)    [<ffff82d080238578>] timer.c#timer_softirq_action+0x138/0x210
> (XEN)    [<ffff82d080234cb2>] softirq.c#__do_softirq+0x62/0x90
> (XEN)    [<ffff82d08026daf5>] domain.c#idle_loop+0x45/0xb0
> (XEN) ****************************************
> (XEN) Panic on CPU 30:
> (XEN) Xen BUG at sched_credit.c:1694
> (XEN) ****************************************
> (XEN) Reboot in five seconds...
> 
> After that the system hangs hard; one has to pull the plug.
> Running the debug version of xen.efi did not trigger any ASSERT.
> 
> 
> This happens if there are many busy backend/frontend pairs in a number
> of domUs. More domUs seem to trigger it sooner, and overcommitting
> helps as well. It was not seen with a single domU.
> 
> The test case is as follows:
> - boot dom0 with "dom0_max_vcpus=30 dom0_mem=32G dom0_vcpus_pin"
> - create a tmpfs in dom0
> - create files in that tmpfs to be exported to domUs via file://path,xvdtN,w
> - assign these files to HVM domUs
> - inside the domUs, create a filesystem on the xvdtN devices
> - mount the filesystem
> - run fio(1) on the filesystem
> - in dom0, run 'xl vcpu-pin domU $node1-3 $nodeN' in a loop to move the domU 
> between nodes 1 and 3.
> 
> After a low number of iterations Xen crashes in csched_load_balance.
> 
> In my setup I had 16 HVM domUs with 64 vcpus, each one had 3 vbd devices.
> It was also reported with fewer and smaller domUs.
> Scripts exist to recreate the setup easily.
> 
> 
[snip]
> 
> Any idea what might be causing this crash?

Assuming the bug is this one:

BUG_ON( cpu != snext->vcpu->processor );

then it looks like a nasty race condition: a vcpu has just been taken off the
runqueue of the current pcpu, but it has apparently been assigned to a
different cpu.

Let me take a look.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

