Re: Recent upgrade of 4.13 -> 4.14 issue
On Sat, Oct 31, 2020 at 04:27:58AM +0100, Dario Faggioli wrote:
> On Sat, 2020-10-31 at 03:54 +0100, marmarek@xxxxxxxxxxxxxxxxxxxxxx wrote:
> > On Sat, Oct 31, 2020 at 02:34:32AM +0000, Dario Faggioli wrote:
> > (XEN) *** Dumping CPU7 host state: ***
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82d040223625>] R _spin_lock+0x35/0x40
> > (XEN)    [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
> > (XEN)    [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
> > (XEN)    [<ffff82d0402df160>] S context_switch+0x110/0xa60
> > (XEN)    [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
> > (XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
> > (XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30
> >
> > And so on, for (almost?) all CPUs.
>
> Right. So, it seems like a live (I would say) lock. It might happen on
> some resource which is shared among domains. And introduced (the
> livelock, not the resource or the sharing) in 4.14.
>
> Just giving a quick look, I see that vmx_do_resume() calls
> vmx_clear_vmcs() which calls on_selected_cpus() which takes the
> call_lock spinlock.
>
> And none of these seems to have received much attention recently.
>
> But this is just a really basic analysis!

I've looked at on_selected_cpus() and my understanding is this:
1. take the call_lock spinlock
2. set function+args+which CPUs are to be called in a global "call_data"
   variable
3. ask those CPUs to execute that function
   (smp_send_call_function_mask() call)
4. wait for all requested CPUs to execute the function, still holding
   the spinlock
5. only then - release the spinlock

So, if any CPU does not execute the requested function for any reason,
the caller will keep call_lock locked forever.

I don't see any CPU waiting on step 4, but also I don't see call traces
from CPU3 and CPU8 in the log - that's because they are in guest (dom0
here) context, right? I do see "guest state" dumps from them.
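To make the five steps above concrete, here is a minimal single-threaded
model of that pattern. It is only a sketch of my reading of
on_selected_cpus(), not the Xen code: every model_* name, the
"responsive" mask and the max_polls bound are made up for illustration
(the real code spins without a bound). It shows that a single CPU which
never runs the function leaves the lock held, so every later caller is
stuck at step 1.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Hypothetical stand-in for the global "call_data" variable. */
struct model_call_data {
    void (*func)(void *info);
    void *info;
    unsigned int pending;   /* bit n set: CPU n has not run func yet */
};

static bool model_call_lock;    /* true while "call_lock" is held */
static struct model_call_data model_call_data;

/* Model of one CPU handling the request: run func, clear own bit. */
static void model_cpu_handle_ipi(unsigned int cpu)
{
    /* The initiator holds the lock for the whole call. */
    assert(model_call_lock);

    if (model_call_data.pending & (1u << cpu)) {
        model_call_data.func(model_call_data.info);
        model_call_data.pending &= ~(1u << cpu);
    }
}

/*
 * Returns true if all selected CPUs ran func and the lock was released.
 * Returns false if the caller could not take the lock, or would still
 * be waiting in step 4 after max_polls rounds (lock stays held).
 */
static bool model_on_selected_cpus(unsigned int selected,
                                   void (*func)(void *info), void *info,
                                   unsigned int responsive,
                                   unsigned int max_polls)
{
    /* Step 1: take the lock. A real CPU would spin here instead. */
    if (model_call_lock)
        return false;
    model_call_lock = true;

    /* Step 2: publish function, argument and target CPUs. */
    model_call_data.func = func;
    model_call_data.info = info;
    model_call_data.pending = selected;

    /* Steps 3 and 4: "send" the request, then wait for completion.
     * Only CPUs in the responsive mask ever handle it. */
    for (unsigned int poll = 0;
         poll < max_polls && model_call_data.pending; poll++)
        for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
            if (responsive & (1u << cpu))
                model_cpu_handle_ipi(cpu);

    if (model_call_data.pending)
        return false;           /* still waiting, lock never released */

    /* Step 5: only now release the lock. */
    model_call_lock = false;
    return true;
}

/* Example callback: count how many CPUs ran the function. */
static void model_count_call(void *info)
{
    ++*(unsigned int *)info;
}
```

With selected = 0x0f and responsive = 0x07 (CPU3 never answers), the
first call never completes and any following call fails at step 1,
which is the shape of the _spin_lock traces above.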
The only three CPUs that did log Xen call traces and are not waiting on
that spinlock are:

CPU0:
(XEN) Xen call trace:
(XEN)    [<ffff82d040240f89>] R vcpu_unblock+0x9/0x50
(XEN)    [<ffff82d0402e0171>] S vcpu_kick+0x11/0x60
(XEN)    [<ffff82d0402259c8>] S tasklet.c#do_tasklet_work+0x68/0xc0
(XEN)    [<ffff82d040225a59>] S tasklet.c#tasklet_softirq_action+0x39/0x60
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

CPU4:
(XEN) Xen call trace:
(XEN)    [<ffff82d040227043>] R set_timer+0x133/0x220
(XEN)    [<ffff82d040234e90>] S credit.c#csched_tick+0/0x3a0
(XEN)    [<ffff82d04022660f>] S timer.c#timer_softirq_action+0x9f/0x300
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

CPU14:
(XEN) Xen call trace:
(XEN)    [<ffff82d040222dc0>] R do_softirq+0/0x10
(XEN)    [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

I'm not sure if any of those is related to that spinlock, the
on_selected_cpus() call, or anything like that...

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

Attachment:
signature.asc