Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs
On 07/02/18 15:06, Jan Beulich wrote:
>>>> On 07.02.18 at 14:24, <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 07/02/18 13:08, Jan Beulich wrote:
>>>>>> On 07.02.18 at 14:01, <igor.druzhinin@xxxxxxxxxx> wrote:
>>>> So far the issue is confirmed on:
>>>> Dell PowerEdge R740, Huawei systems based on Xeon Gold 6152 (the one
>>>> that it was tested on), Intel S2600XX, etc.
>>>>
>>>> Also see:
>>>> https://bugs.xenserver.org/browse/XSO-774
>>>>
>>>> Well, no-watchdog is what we currently recommend in that case but we
>>>> hoped there is a general solution here from Xen side. You have your
>>>> point that they should fix this on their side because it's their fault
>>>> indeed. But the user experience is also important for us I think.
>>> Of course, hence the suggestion of possible alternative workarounds.
>>> Impacting everyone is, as said, not a desirable approach in a case
>>> like this one. I also continue to dislike the seemingly random division
>>> by 10.
>> Xen's usability is crap, which is in large part due to attitude like
>> this. It is not ok to expect the end user to know how to diagnose/debug
>> issues like this, and it is entirely unreasonable to expect the end
>> user to have to manually work around it.
> Excuse me? The watchdog is off by default. Anyone turning it on
> ought to know what they do. You (iirc) turning it on unilaterally in
> XenServer puts the burden of avoiding users to have to diagnose
> the issue on you.

And we have taken the burden of diagnosing the issue, as well as
proposing a fix.

>
>> This particular issue does want feeding back to Intel so they can try
>> and fix it, but whatever is wrong is present in a large number of
>> Skylake systems in the field. Xen needs to be able to cope.
> But in a reasonable way.
>
>> Finally, as to boot times, your argument is backwards seeing as you care
>> about elapsed boot time. Slowing the frequency will speed everything
>> up, as we aren't executing a large chunk of the BSP boot path with 100Hz
>> NMI constantly interrupting.
> How long does handling a single NMI take? Microseconds, I assume.
> Contrast this with waiting for two NMIs to arrive in wait_for_nmis(),
> which goes up from 20ms to 200ms with this change.

So your argument is to not change the frequency because an
off-by-default option will *in the best case* add a few hundred
milliseconds extra to the boot time? Times to boot computers are
measured in minutes, not milliseconds.

I don't know how long servicing an NMI takes, but at a minimum of a
rdmsr, a wrmsr and then a further MMIO write or wrmsr, I doubt it is
that quick.

> Also you completely ignore my argument against the seemingly
> random division by 10, including the resulting question of what you
> mean to do once 10Hz also turns out too high a frequency.

We've got to pick a frequency. The current 100Hz is just as arbitrary
as the proposed new 10Hz.

> I wouldn't, btw, mind an attempt to avoid the high rate NMIs
> during early boot (if those occur in the first place, which from
> two successive replies by Igor yesterday I wasn't sure anymore
> is an actual fact), but that's independent of the issue at hand.

The 100Hz NMI is active from BSP APIC init, IO-APIC, deadline timer
calibration, mwait idle, the entirety of HVM setup and full AP bringup.
On one of my fastest boxes, it is about 1 second of wallclock time.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
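
For reference, the two timing figures traded in this thread (20ms vs
200ms to see two NMIs, and roughly 1 second of boot spent with the
watchdog NMI active) follow from simple arithmetic. The sketch below is
not Xen code: both helper names are hypothetical, and it assumes only
that NMIs arrive at a fixed rate, so waiting for N of them costs about
N periods. The 1-second window is the figure quoted for one machine,
not a measured constant.

```python
def wait_ms_for_nmis(count, hz):
    """Approximate milliseconds needed to observe `count` NMIs at `hz` per second.

    Hypothetical helper for illustration; it models a fixed-rate NMI source only.
    """
    return count * 1000 / hz


def nmis_in_window(window_s, hz):
    """Approximate number of NMIs delivered during `window_s` seconds at `hz`."""
    return window_s * hz


if __name__ == "__main__":
    for hz in (100, 10):
        wait = wait_ms_for_nmis(2, hz)     # the two-NMI wait discussed above
        count = nmis_in_window(1, hz)      # assumed ~1 s early-boot window
        print(f"{hz:>3} Hz: ~{wait:.0f} ms to see 2 NMIs, ~{count} NMIs in 1 s")
```

This only restates both sides of the trade-off in numbers: the lower
rate lengthens the two-NMI wait by a fixed amount, while cutting the
number of interrupts taken during the same stretch of early boot by
the same factor of ten. It says nothing about how long each NMI takes
to service, which is the open question in the thread.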