
Re: Recent upgrade of 4.13 -> 4.14 issue



On Sat, Oct 31, 2020 at 02:34:32AM +0000, Dario Faggioli wrote:
> On Tue, 2020-10-27 at 17:06 +0100, Frédéric Pierret wrote:
> > 
> > Ok, the server froze just a few minutes after my mail, and I now
> > have:
> > 'r': https://gist.github.com/fepitre/78541f555902275d906d627de2420571
> >
> From the scheduler point of view, things seem fine:
> 
> (XEN) sched_smt_power_savings: disabled
> (XEN) NOW=770188952085
> (XEN) Online Cpus: 0-15
> (XEN) Cpupool 0:
> (XEN) Cpus: 0-15
> (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
> (XEN) Scheduler: SMP Credit Scheduler rev2 (credit2)
> (XEN) Active queues: 2
> (XEN)         default-weight     = 256
> (XEN) Runqueue 0:
> (XEN)         ncpus              = 8
> (XEN)         cpus               = 0-7
> (XEN)         max_weight         = 256
> (XEN)         pick_bias          = 1
> (XEN)         instload           = 7
> (XEN)         aveload            = 2021119 (~770%)
> (XEN)         idlers: 00000000,00000000
> (XEN)         tickled: 00000000,00000000
> (XEN)         fully idle cores: 00000000,00000000
> (XEN) Runqueue 1:
> (XEN)         ncpus              = 8
> (XEN)         cpus               = 8-15
> (XEN)         max_weight         = 256
> (XEN)         pick_bias          = 9
> (XEN)         instload           = 8
> (XEN)         aveload            = 2097259 (~800%)
> (XEN)         idlers: 00000000,00000000
> (XEN)         tickled: 00000000,00000200
> (XEN)         fully idle cores: 00000000,00000000
> 
> The system is pretty busy, but not overloaded.
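> 
> (The load values are fixed point: 262144 = 2^18 represents 100%, so
> the aveload = 2021119 above is 2021119 / 262144 ~= 7.7 busy CPUs'
> worth of load on an 8-CPU runqueue, i.e. the ~770% shown.)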
> 
> Below we see that CPU 3 is running the idle vCPU, but it's not marked
> as either idle or tickled.
> 
> It may be running a tasklet (the one that dumps the debug key output, I
> guess).
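> 
> (For reference, Xen's idle vCPU doubles as the tasklet worker, which
> is why a CPU running a tasklet is not counted as an idler. A
> simplified sketch of the x86 idle loop, trimmed down from
> xen/arch/x86/domain.c for illustration:)
> 
>     static void noreturn idle_loop(void)
>     {
>         unsigned int cpu = smp_processor_id();
> 
>         for ( ; ; )
>         {
>             /* A pending tasklet is run by the idle vCPU; the CPU is
>              * deliberately not marked as idle while doing so. */
>             if ( unlikely(tasklet_work_to_do(cpu)) )
>                 do_tasklet();
>             else if ( !softirq_pending(cpu) )
>                 pm_idle();  /* actually enter a sleep state */
> 
>             do_softirq();
>         }
>     }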
> 
> Credits are fine; I don't see any strange values that would indicate
> an anomaly.
> 
> All the CPUs are executing a vCPU, and there should be nothing that
> prevents them from making progress.
> 
> There is one vCPU that apparently wants to run at 100% in pretty much
> every guest, and more than one in dom0.
> 
> And I think I saw some spin_lock() calls in the call stacks, in the
> partial output of the '*' debug key?

Yes, I see:

(XEN) *** Dumping CPU1 host state: ***
(...)
(XEN) Xen call trace:
(XEN)    [<ffff82d040223625>] R _spin_lock+0x35/0x40
(XEN)    [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
(XEN)    [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
(XEN)    [<ffff82d0402df160>] S context_switch+0x110/0xa60
(XEN)    [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20
(...)
(XEN) *** Dumping CPU2 host state: ***
(XEN) Xen call trace:
(XEN)    [<ffff82d040223622>] R _spin_lock+0x32/0x40
(XEN)    [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
(XEN)    [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
(XEN)    [<ffff82d0402df160>] S context_switch+0x110/0xa60
(XEN)    [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

(XEN) *** Dumping CPU5 host state: ***
(XEN) Xen call trace:
(XEN)    [<ffff82d040223622>] R _spin_lock+0x32/0x40
(XEN)    [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
(XEN)    [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
(XEN)    [<ffff82d0402df160>] S context_switch+0x110/0xa60
(XEN)    [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

(XEN) *** Dumping CPU6 host state: ***
(XEN) Xen call trace:
(XEN)    [<ffff82d040223625>] R _spin_lock+0x35/0x40
(XEN)    [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
(XEN)    [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
(XEN)    [<ffff82d0402df160>] S context_switch+0x110/0xa60
(XEN)    [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

(XEN) *** Dumping CPU7 host state: ***
(XEN) Xen call trace:
(XEN)    [<ffff82d040223625>] R _spin_lock+0x35/0x40
(XEN)    [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
(XEN)    [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
(XEN)    [<ffff82d0402df160>] S context_switch+0x110/0xa60
(XEN)    [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
(XEN)    [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)    [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

And so on, for (almost?) all CPUs.

Note the '*' output is (I think) from a different instance of the
freeze, so it cannot be correlated with the other outputs...

> Maybe they're stuck in the kernel, not in Xen? Thoughts?

Given the above spin locks, I don't think so. But even if they are
stuck in the kernel, it clearly started happening after the 4.13 ->
4.14 upgrade...
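
FWIW, the lock everyone is spinning on in those traces is most likely
the global call_lock that serializes cross-CPU function calls. A
simplified sketch of the pattern, trimmed down from xen/common/smp.c
for illustration (the real code has extra assertions and bookkeeping):

    static DEFINE_SPINLOCK(call_lock);

    static struct {
        void (*func)(void *info);
        void *info;
        int wait;
        cpumask_t selected;
    } call_data;

    void on_selected_cpus(const cpumask_t *selected,
                          void (*func)(void *info), void *info, int wait)
    {
        /* Only one cross-CPU function call at a time, system-wide. */
        spin_lock(&call_lock);

        cpumask_copy(&call_data.selected, selected);
        if ( !cpumask_empty(&call_data.selected) )
        {
            call_data.func = func;
            call_data.info = info;
            call_data.wait = wait;

            smp_send_call_function_mask(&call_data.selected);

            /* Each target CPU clears itself from the mask in the IPI
             * handler; spin here until all of them have done so. */
            while ( !cpumask_empty(&call_data.selected) )
                cpu_relax();
        }

        spin_unlock(&call_lock);
    }

vmx_do_resume() ends up here (via vmx_clear_vmcs(), IIRC) when the
vCPU's VMCS is still loaded on another pCPU. So it would only take one
holder whose IPI never gets acknowledged for every other CPU
context-switching a VMX vCPU to pile up on call_lock, which is exactly
what the traces above look like.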

> (XEN) Domain info:
> (XEN)         Domain: 0 w 256 c 0 v 16
> (XEN)           1: [0.0] flags=0 cpu=5 credit=10553147 [w=256] load=17122 (~6%)
> (XEN)           2: [0.1] flags=0 cpu=4 credit=10570606 [w=256] load=13569 (~5%)
> (XEN)           3: [0.2] flags=0 cpu=7 credit=10605188 [w=256] load=13465 (~5%)
> (XEN)           4: [0.3] flags=2 cpu=11 credit=9998469 [w=256] load=262144 (~100%)
> (XEN)           5: [0.4] flags=0 cpu=0 credit=10533686 [w=256] load=13619 (~5%)
> (XEN)           6: [0.5] flags=a cpu=9 credit=1101 [w=256] load=0 (~0%)
> (XEN)           7: [0.6] flags=2 cpu=2 credit=10621802 [w=256] load=13526 (~5%)
> (XEN)           8: [0.7] flags=2 cpu=1 credit=10670607 [w=256] load=13453 (~5%)
> (XEN)           9: [0.8] flags=2 cpu=7 credit=10649858 [w=256] load=13502 (~5%)
> (XEN)          10: [0.9] flags=0 cpu=3 credit=10550566 [w=256] load=13477 (~5%)
> (XEN)          11: [0.10] flags=2 cpu=4 credit=10644321 [w=256] load=13539 (~5%)
> (XEN)          12: [0.11] flags=2 cpu=1 credit=10602374 [w=256] load=13471 (~5%)
> (XEN)          13: [0.12] flags=0 cpu=6 credit=10617262 [w=256] load=13801 (~5%)
> (XEN)          14: [0.13] flags=2 cpu=8 credit=9998664 [w=256] load=262144 (~100%)
> (XEN)          15: [0.14] flags=0 cpu=3 credit=10603305 [w=256] load=17020 (~6%)
> (XEN)          16: [0.15] flags=0 cpu=5 credit=10591312 [w=256] load=13523 (~5%)
> (XEN)         Domain: 1 w 256 c 0 v 2
> (XEN)          17: [1.0] flags=2 cpu=6 credit=4916769 [w=256] load=262144 (~100%)
> (XEN)          18: [1.1] flags=0 cpu=13 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 2 w 256 c 0 v 2
> (XEN)          19: [2.0] flags=2 cpu=5 credit=4982064 [w=256] load=262144 (~100%)
> (XEN)          20: [2.1] flags=0 cpu=14 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 3 w 256 c 0 v 2
> (XEN)          21: [3.0] flags=2 cpu=1 credit=5200781 [w=256] load=262144 (~100%)
> (XEN)          22: [3.1] flags=0 cpu=5 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 4 w 256 c 0 v 2
> (XEN)          23: [4.0] flags=12 cpu=0 credit=5395149 [w=256] load=262144 (~100%)
> (XEN)          24: [4.1] flags=0 cpu=8 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 5 w 256 c 0 v 2
> (XEN)          25: [5.0] flags=2 cpu=2 credit=5306461 [w=256] load=262144 (~100%)
> (XEN)          26: [5.1] flags=0 cpu=15 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 6 w 256 c 0 v 8
> (XEN)          27: [6.0] flags=12 cpu=10 credit=7915602 [w=256] load=262144 (~100%)
> (XEN)          28: [6.1] flags=0 cpu=10 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          29: [6.2] flags=0 cpu=15 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          30: [6.3] flags=0 cpu=8 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          31: [6.4] flags=0 cpu=0 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          32: [6.5] flags=0 cpu=6 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          33: [6.6] flags=0 cpu=11 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          34: [6.7] flags=0 cpu=9 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 7 w 256 c 0 v 2
> (XEN)          35: [7.0] flags=2 cpu=4 credit=5297013 [w=256] load=262144 (~100%)
> (XEN)          36: [7.1] flags=0 cpu=11 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 8 w 256 c 0 v 4
> (XEN)          37: [8.0] flags=0 cpu=14 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          38: [8.1] flags=2 cpu=7 credit=5240630 [w=256] load=262144 (~100%)
> (XEN)          39: [8.2] flags=0 cpu=13 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          40: [8.3] flags=0 cpu=8 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 9 w 256 c 0 v 2
> (XEN)          41: [9.0] flags=0 cpu=0 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          42: [9.1] flags=12 cpu=13 credit=7910266 [w=256] load=262144 (~100%)
> (XEN)         Domain: 10 w 256 c 0 v 2
> (XEN)          43: [10.0] flags=12 cpu=12 credit=8045458 [w=256] load=262144 (~100%)
> (XEN)          44: [10.1] flags=0 cpu=8 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 11 w 256 c 0 v 2
> (XEN)          45: [11.0] flags=12 cpu=14 credit=7575284 [w=256] load=262144 (~100%)
> (XEN)          46: [11.1] flags=0 cpu=12 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 12 w 256 c 0 v 2
> (XEN)          47: [12.0] flags=2 cpu=15 credit=8014099 [w=256] load=262144 (~100%)
> (XEN)          48: [12.1] flags=0 cpu=6 credit=10500000 [w=256] load=0 (~0%)
> (XEN)         Domain: 13 w 256 c 0 v 2
> (XEN)          49: [13.0] flags=0 cpu=7 credit=10500000 [w=256] load=0 (~0%)
> (XEN)          50: [13.1] flags=0 cpu=15 credit=10500000 [w=256] load=0 (~0%)
> (XEN) Runqueue 0:
> (XEN) CPU[00] runq=0, sibling={0}, core={0-7}
> (XEN)         run: [4.0] flags=2 cpu=0 credit=5255200 [w=256] load=262144 (~100%)
> (XEN) CPU[01] runq=0, sibling={1}, core={0-7}
> (XEN)         run: [3.0] flags=2 cpu=1 credit=5057668 [w=256] load=262144 (~100%)
> (XEN) CPU[02] runq=0, sibling={2}, core={0-7}
> (XEN)         run: [5.0] flags=12 cpu=2 credit=5180785 [w=256] load=262144 (~100%)
> (XEN) CPU[03] runq=0, sibling={3}, core={0-7}
> (XEN) CPU[04] runq=0, sibling={4}, core={0-7}
> (XEN)         run: [7.0] flags=2 cpu=4 credit=5215323 [w=256] load=262144 (~100%)
> (XEN) CPU[05] runq=0, sibling={5}, core={0-7}
> (XEN)         run: [2.0] flags=2 cpu=5 credit=4816142 [w=256] load=262144 (~100%)
> (XEN) CPU[06] runq=0, sibling={6}, core={0-7}
> (XEN)         run: [1.0] flags=2 cpu=6 credit=4755772 [w=256] load=262144 (~100%)
> (XEN) CPU[07] runq=0, sibling={7}, core={0-7}
> (XEN)         run: [8.1] flags=12 cpu=7 credit=5175342 [w=256] load=262144 (~100%)
> (XEN) RUNQ:
> (XEN) Runqueue 1:
> (XEN) CPU[08] runq=1, sibling={8}, core={8-15}
> (XEN)         run: [0.13] flags=2 cpu=8 credit=9998664 [w=256] load=262144 (~100%)
> (XEN) CPU[09] runq=1, sibling={9}, core={8-15}
> (XEN)         run: [0.5] flags=a cpu=9 credit=1101 [w=256] load=0 (~0%)
> (XEN) CPU[10] runq=1, sibling={10}, core={8-15}
> (XEN)         run: [6.0] flags=2 cpu=10 credit=7764532 [w=256] load=262144 (~100%)
> (XEN) CPU[11] runq=1, sibling={11}, core={8-15}
> (XEN)         run: [0.3] flags=2 cpu=11 credit=9998469 [w=256] load=262144 (~100%)
> (XEN) CPU[12] runq=1, sibling={12}, core={8-15}
> (XEN)         run: [10.0] flags=2 cpu=12 credit=7967846 [w=256] load=262144 (~100%)
> (XEN) CPU[13] runq=1, sibling={13}, core={8-15}
> (XEN)         run: [9.1] flags=12 cpu=13 credit=7832232 [w=256] load=262144 (~100%)
> (XEN) CPU[14] runq=1, sibling={14}, core={8-15}
> (XEN)         run: [11.0] flags=2 cpu=14 credit=7509378 [w=256] load=262144 (~100%)
> (XEN) CPU[15] runq=1, sibling={15}, core={8-15}
> (XEN)         run: [12.0] flags=2 cpu=15 credit=7971164 [w=256] load=262144 (~100%)
> (XEN) RUNQ:
> (XEN) CPUs info:
> (XEN) CPU[00] current=d4v0, curr=d4v0, prev=NULL
> (XEN) CPU[01] current=d3v0, curr=d3v0, prev=NULL
> (XEN) CPU[02] current=d5v0, curr=d5v0, prev=NULL
> (XEN) CPU[03] current=d[IDLE]v3, curr=d[IDLE]v3, prev=NULL
> (XEN) CPU[04] current=d7v0, curr=d7v0, prev=NULL
> (XEN) CPU[05] current=d2v0, curr=d2v0, prev=NULL
> (XEN) CPU[06] current=d1v0, curr=d1v0, prev=NULL
> (XEN) CPU[07] current=d8v1, curr=d8v1, prev=NULL
> (XEN) CPU[08] current=d0v13, curr=d0v13, prev=NULL
> (XEN) CPU[09] current=d0v5, curr=d0v5, prev=NULL
> (XEN) CPU[10] current=d6v0, curr=d6v0, prev=NULL
> (XEN) CPU[11] current=d0v3, curr=d0v3, prev=NULL
> (XEN) CPU[12] current=d10v0, curr=d10v0, prev=NULL
> (XEN) CPU[13] current=d9v1, curr=d9v1, prev=NULL
> (XEN) CPU[14] current=d11v0, curr=d11v0, prev=NULL
> (XEN) CPU[15] current=d12v0, curr=d12v0, prev=NULL
> 



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
