Re: [Xen-devel] Live-Patch application failure in core-scheduling mode
On 07.02.2020 09:04, Jürgen Groß wrote:
> On 06.02.20 15:02, Sergey Dyasli wrote:
>> On 06/02/2020 11:05, Sergey Dyasli wrote:
>>> On 06/02/2020 09:57, Jürgen Groß wrote:
>>>> On 05.02.20 17:03, Sergey Dyasli wrote:
>>>>> Hello,
>>>>>
>>>>> I'm currently investigating a Live-Patch application failure in core-
>>>>> scheduling mode and this is an example of what I usually get:
>>>>> (it's easily reproducible)
>>>>>
>>>>> (XEN) [ 342.528305] livepatch: lp: CPU8 - IPIing the other 15 CPUs
>>>>> (XEN) [ 342.558340] livepatch: lp: Timed out on semaphore in CPU quiesce phase 13/15
>>>>> (XEN) [ 342.558343] bad cpus: 6 9
>>>>>
>>>>> (XEN) [ 342.559293] CPU: 6
>>>>> (XEN) [ 342.559562] Xen call trace:
>>>>> (XEN) [ 342.559565] [<ffff82d08023f304>] R common/schedule.c#sched_wait_rendezvous_in+0xa4/0x270
>>>>> (XEN) [ 342.559568] [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
>>>>> (XEN) [ 342.559571] [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>>> (XEN) [ 342.559574] [<ffff82d080278ec5>] F arch/x86/domain.c#guest_idle_loop+0x35/0x60
>>>>>
>>>>> (XEN) [ 342.559761] CPU: 9
>>>>> (XEN) [ 342.560026] Xen call trace:
>>>>> (XEN) [ 342.560029] [<ffff82d080241661>] R _spin_lock_irq+0x11/0x40
>>>>> (XEN) [ 342.560032] [<ffff82d08023f323>] F common/schedule.c#sched_wait_rendezvous_in+0xc3/0x270
>>>>> (XEN) [ 342.560036] [<ffff82d08023f8aa>] F common/schedule.c#schedule+0x17a/0x260
>>>>> (XEN) [ 342.560039] [<ffff82d080240d5a>] F common/softirq.c#__do_softirq+0x5a/0x90
>>>>> (XEN) [ 342.560042] [<ffff82d080279db5>] F arch/x86/domain.c#idle_loop+0x55/0xb0
>>>>>
>>>>> The first HT sibling is waiting for the second in the LP-application
>>>>> context while the second waits for the first in the scheduler context.
>>>>>
>>>>> Any suggestions on how to improve this situation are welcome.
>>>>
>>>> Can you test the attached patch, please? It is only tested to boot, so
>>>> I did no livepatch tests with it.
>>>
>>> Thank you for the patch! It seems to fix the issue in my manual testing.
>>> I'm going to submit automatic LP testing for both thread/core modes.
>>
>> Andrew suggested to test late ucode loading as well and so I did.
>> It uses stop_machine() to rendezvous cpus and it failed with a similar
>> backtrace for a problematic CPU. But in this case the system crashed
>> since there is no timeout involved:
>>
>> (XEN) [ 155.025168] Xen call trace:
>> (XEN) [ 155.040095] [<ffff82d0802417f2>] R _spin_unlock_irq+0x22/0x30
>> (XEN) [ 155.069549] [<ffff82d08023f3c2>] S common/schedule.c#sched_wait_rendezvous_in+0xa2/0x270
>> (XEN) [ 155.109696] [<ffff82d08023f728>] F common/schedule.c#sched_slave+0x198/0x260
>> (XEN) [ 155.145521] [<ffff82d080240e1a>] F common/softirq.c#__do_softirq+0x5a/0x90
>> (XEN) [ 155.180223] [<ffff82d0803716f6>] F x86_64/entry.S#process_softirqs+0x6/0x20
>>
>> It looks like your patch provides a workaround for LP case, but other
>> cases like stop_machine() remain broken since the underlying issue with
>> the scheduler is still there.
>
> And here is the fix for ucode loading (that was in fact the only case
> where stop_machine_run() wasn't already called in a tasklet).

This is a rather odd restriction, and hence will need explaining. Without
it being entirely clear that there's no alternative to it, I don't think
I'd be fine with re-introduction of continue_hypercall_on_cpu(0, ...) into
ucode loading.
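For reference, the shape being discussed is roughly the following; this is
only a sketch of the continue_hypercall_on_cpu() / stop_machine_run()
pattern, not Jürgen's actual patch, and ucode_update_fn() /
ucode_update_cb() are made-up names:

/* Placeholder for the real per-CPU update function from the patch. */
static int ucode_update_fn(void *data)
{
    return 0;
}

/*
 * continue_hypercall_on_cpu() defers the hypercall into a tasklet on the
 * chosen CPU, so stop_machine_run() is no longer entered directly from a
 * vCPU's hypercall context.  NR_CPUS asks stop_machine to invoke the
 * callback on every CPU, as a microcode update needs.
 */
static long ucode_update_cb(void *data)
{
    return stop_machine_run(ucode_update_fn, data, NR_CPUS);
}

/* In the hypercall handler, instead of calling stop_machine_run() directly: */
return continue_hypercall_on_cpu(0, ucode_update_cb, buffer);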
Also two remarks on the patch itself: struct ucode_buf's len field can be
unsigned int, seeing the very first check done in microcode_update(). And
instead of xmalloc_bytes() please see whether you can make use of
xmalloc_flex_struct() there.

Jan
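For illustration, applying both remarks would result in something like the
below (a sketch only; the surrounding code is guessed rather than taken
from the patch, and error handling is abbreviated):

struct ucode_buf {
    unsigned int len;    /* unsigned int suffices, as microcode_update()
                            rejects overly large lengths up front */
    char buffer[];
};

/* In microcode_update(), a flexible-array allocation replacing the
 * xmalloc_bytes() call: */
buffer = xmalloc_flex_struct(struct ucode_buf, buffer, len);
if ( !buffer )
    return -ENOMEM;

if ( copy_from_guest(buffer->buffer, buf, len) )
{
    xfree(buffer);
    return -EFAULT;
}
buffer->len = len;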