Re: Recent upgrade of 4.13 -> 4.14 issue
On 10/26/20 6:54 PM, Dario Faggioli wrote:
> On Mon, 2020-10-26 at 17:11 +0100, Frédéric Pierret wrote:
>> On 10/26/20 2:54 PM, Andrew Cooper wrote:
>>>> If anyone has any idea of what's going on, that would be much appreciated. Thank you.
>>> Does booting Xen with `sched=credit` make a difference?
>>>
>>> ~Andrew
>> Thank you Andrew. Since your mail I've been testing this in production, and it's clearly more stable than this morning. I won't say it's solved yet, because yesterday I also had a few hours of stability, but it's clearly encouraging: this morning it was just hell every 15/30 minutes.
> Ok, yes, let us know if the credit scheduler seems to not suffer from the issue.

Yes, unfortunately: I had a few hours of stability, but it just ended up with:

```
[15883.967829] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[15883.967868] rcu: 12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=14879
[15883.967884] (detected by 0, t=60002 jiffies, g=460221, q=89)
[15883.967901] Sending NMI from CPU 0 to CPUs 12:
[15893.970590] rcu: rcu_sched kthread starved for 9994 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=9
[15893.970622] rcu: RCU grace-period kthread stack dump:
[15893.970631] rcu_sched R running task 0 10 2 0x80004008
[15893.970645] Call Trace:
[15893.970658] ? xen_hypercall_xen_version+0xa/0x20
[15893.970670] ? xen_force_evtchn_callback+0x9/0x10
[15893.970679] ? check_events+0x12/0x20
[15893.970687] ? xen_restore_fl_direct+0x1f/0x20
[15893.970697] ? _raw_spin_unlock_irqrestore+0x14/0x20
[15893.970708] ? force_qs_rnp+0x6f/0x170
[15893.970715] ? rcu_nocb_unlock_irqrestore+0x30/0x30
[15893.970724] ? rcu_gp_fqs_loop+0x234/0x2a0
[15893.970732] ? rcu_gp_kthread+0xb5/0x140
[15893.970740] ? rcu_gp_init+0x470/0x470
[15893.970748] ? kthread+0x115/0x140
[15893.970756] ? __kthread_bind_mask+0x60/0x60
[15893.970764] ? ret_from_fork+0x35/0x40
[16063.972793] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16063.972825] rcu: 12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=57364
[16063.972840] (detected by 5, t=240007 jiffies, g=460221, q=6439)
[16063.972855] Sending NMI from CPU 5 to CPUs 12:
[16243.977769] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16243.977802] rcu: 12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=99504
[16243.977817] (detected by 11, t=420012 jiffies, g=460221, q=6710)
[16243.977830] Sending NMI from CPU 11 to CPUs 12:
[16253.980496] rcu: rcu_sched kthread starved for 10001 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=9
[16253.980528] rcu: RCU grace-period kthread stack dump:
[16253.980537] rcu_sched R running task 0 10 2 0x80004008
[16253.980550] Call Trace:
[16253.980563] ? xen_hypercall_xen_version+0xa/0x20
[16253.980575] ? xen_force_evtchn_callback+0x9/0x10
[16253.980584] ? check_events+0x12/0x20
[16253.980592] ? xen_restore_fl_direct+0x1f/0x20
[16253.980602] ? _raw_spin_unlock_irqrestore+0x14/0x20
[16253.980613] ? force_qs_rnp+0x6f/0x170
[16253.980620] ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16253.980629] ? rcu_gp_fqs_loop+0x234/0x2a0
[16253.980637] ? rcu_gp_kthread+0xb5/0x140
[16253.980645] ? rcu_gp_init+0x470/0x470
[16253.980653] ? kthread+0x115/0x140
[16253.980661] ? __kthread_bind_mask+0x60/0x60
[16253.980669] ? ret_from_fork+0x35/0x40
[16423.982735] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16423.982789] rcu: 12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=139435
[16423.982820] (detected by 10, t=600017 jiffies, g=460221, q=7354)
[16423.982842] Sending NMI from CPU 10 to CPUs 12:
[16433.984844] rcu: rcu_sched kthread starved for 10001 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=3
[16433.984875] rcu: RCU grace-period kthread stack dump:
[16433.984885] rcu_sched R running task 0 10 2 0x80004000
[16433.984897] Call Trace:
[16433.984910] ? xen_hypercall_xen_version+0xa/0x20
[16433.984922] ? xen_force_evtchn_callback+0x9/0x10
[16433.984931] ? check_events+0x12/0x20
[16433.984939] ? xen_restore_fl_direct+0x1f/0x20
[16433.984949] ? _raw_spin_unlock_irqrestore+0x14/0x20
[16433.984960] ? force_qs_rnp+0x6f/0x170
[16433.984967] ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16433.984976] ? rcu_gp_fqs_loop+0x234/0x2a0
[16433.984984] ? rcu_gp_kthread+0xb5/0x140
[16433.984992] ? rcu_gp_init+0x470/0x470
[16433.985000] ? kthread+0x115/0x140
[16433.985007] ? __kthread_bind_mask+0x60/0x60
[16433.985015] ? ret_from_fork+0x35/0x40
[16603.987677] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16603.987710] rcu: 12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=179313
[16603.987725] (detected by 0, t=780022 jiffies, g=460221, q=7869)
[16603.987740] Sending NMI from CPU 0 to CPUs 12:
[16783.992658] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16783.992710] rcu: 12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=219106
[16783.992741] (detected by 13, t=960027 jiffies, g=460221, q=8300)
[16783.992768] Sending NMI from CPU 13 to CPUs 12:
[16793.995873] rcu: rcu_sched kthread starved for 10000 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=4
[16793.995906] rcu: RCU grace-period kthread stack dump:
[16793.995915] rcu_sched R running task 0 10 2 0x80004000
[16793.995930] Call Trace:
[16793.995948] ? xen_hypercall_xen_version+0xa/0x20
[16793.995963] ? xen_force_evtchn_callback+0x9/0x10
[16793.995972] ? check_events+0x12/0x20
[16793.995979] ? xen_restore_fl_direct+0x1f/0x20
[16793.995992] ? _raw_spin_unlock_irqrestore+0x14/0x20
[16793.996004] ? force_qs_rnp+0x6f/0x170
[16793.996012] ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16793.996021] ? rcu_gp_fqs_loop+0x234/0x2a0
[16793.996029] ? rcu_gp_kthread+0xb5/0x140
[16793.996037] ? rcu_gp_init+0x470/0x470
[16793.996046] ? kthread+0x115/0x140
[16793.996054] ? __kthread_bind_mask+0x60/0x60
[16793.996062] ? ret_from_fork+0x35/0x40
```

> I'm curious about another thing, though. You mentioned, in your previous email (and in the subject :-)) that this is a 4.13 -> 4.14 issue for you?

This is indeed happening since I updated Xen from 4.13 to 4.14; 4.13 was totally stable for me. The server had been running for months without any issue.

> Does that mean that the problem was not there on 4.13? I'm asking because Credit2 was already the default scheduler in 4.13. So, unless you were configuring things differently, you were already using it there.

Note that there is a new custom patch for S3 resume from Marek (in CC), and he would be much better placed than me to point out the very specific changes with respect to 4.13.

> If this is the case, it would hint at the fact that something that changed between .13 and .14 could be the cause.
> Regards

Thank you again for your help.
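For anyone wanting to try Andrew's `sched=credit` suggestion, below is a minimal sketch of one way to apply and verify it on a GRUB2-based dom0. The exact paths (`/etc/default/grub`, `/boot/grub2/grub.cfg`) are assumptions and vary by distribution and EFI layout; adjust for your setup.

```
# Sketch only: assumes the Xen command line is taken from
# GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub (GRUB2's 20_linux_xen script);
# the generated grub.cfg location differs between distributions.

# 1) Append the scheduler option to the hypervisor command line:
#      GRUB_CMDLINE_XEN_DEFAULT="... sched=credit"

# 2) Regenerate the GRUB configuration and reboot:
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

# 3) After reboot, confirm which scheduler the hypervisor is running:
xl info | grep xen_scheduler   # should report "credit" rather than "credit2"
xl sched-credit                # lists per-domain weight/cap under the credit scheduler
```

If the box then stays stable under credit but the stalls return under the default Credit2, that would support the idea that the regression is scheduler-related rather than some other 4.13 -> 4.14 change.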