
Re: [Xen-devel] Regression, host crash with 4.5rc1



Hi Len, thanks for chiming in. I am a Xen noob and generally clueless about the inner workings of this power management stuff, so apologies in advance if I don't understand what is being asked. I am, however, happy to try whatever you'd like me to in pursuing this issue.

On 03/02/2015 07:24 AM, Jan Beulich wrote:
On 27.02.15 at 18:50, <len.brown@xxxxxxxxx> wrote:
If this issue were to happen on Linux/bare-metal, this is how I'd debug it.
Hopefully some of this will translate to Xen in one way or another.
Sadly not really - the kernel plays only a minor role (forwarding ACPI
data to the hypervisor) in C-state handling under Xen.

dmesg | grep idle
will tell us what idle driver is running (on Dom0 kernel)
and if it is intel_idle, it will also tell us the supported sub-states
(CPUID.MWAIT.EDX value)

root@g2:~# dmesg | grep idle
[    0.000000]     RCU dyntick-idle grace-period acceleration is enabled.
[   11.391708] intel_idle: MWAIT substates: 0x1120
[   11.391711] intel_idle: v0.4 model 0x2C
[   11.391712] intel_idle: lapic_timer_reliable_states 0xffffffff
[   11.391780] intel_idle: intel_idle yielding to none

(This output is the same whether I've got max_cstate=2 set or not.)
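
As a side note on reading the "MWAIT substates" line above: to my understanding, that value is CPUID leaf 5 EDX, where each 4-bit field holds the number of sub-states for MWAIT C-states C0 through C7. A minimal Python sketch of that decoding follows. The function name is made up for illustration, and the MWAIT C-state numbers do not necessarily match the ACPI or hardware C-state names.

```python
def decode_mwait_substates(edx):
    """Split a CPUID.05H:EDX value into per-C-state sub-state counts.

    Assumption: bits [4*i+3 : 4*i] hold the number of sub-states
    for MWAIT C-state i, for i in 0..7.
    """
    return {"C%d" % i: (edx >> (4 * i)) & 0xF for i in range(8)}


if __name__ == "__main__":
    # Value taken from the intel_idle line in the dmesg output above.
    for state, count in decode_mwait_substates(0x1120).items():
        print(state, count)
```

Only the states with a non-zero count are advertised as usable MWAIT hints; which hardware C-state each one maps to depends on the CPU model.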

Yeah, we call the driver mwait-idle in the hypervisor, and the log
would be accessible via "xl dmesg", but yes, that information is
available there too.

(XEN)     C1:   type[C1] latency[003] usage[12219860] method[  FFH]
duration[1190961948551]
(XEN)     C2:   type[C1] latency[010] usage[10205554] method[  FFH]
duration[2015393965907]
(XEN)     C3:   type[C2] latency[020] usage[50926286] method[  FFH]
duration[30527997858148]
I'm hopeful that this information comes from the hardware's BIOS
and not some hypervisor tricking out Dom0 with a fake BIOS, yes?
In the case of mwait-idle (intel_idle on Linux) it would be built-in
knowledge of the driver. For acpi-cpuidle it would come from
actual firmware, not anything fake/virtual.
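
For anyone wanting to compare the per-state numbers quoted above, here is a rough Python sketch that parses those "(XEN) Cn:" lines and reports each state's share of the total idle duration. The regular expression is only fitted to the three lines pasted in this mail (including the mail client's line wrap before "duration"); the function names are hypothetical, and the duration units are whatever the hypervisor prints, which does not matter for the shares.

```python
import re

# Pattern fitted to the lines quoted in this thread; \s+ also spans
# the line break the mail client inserted before "duration[...]".
_STATE_RE = re.compile(
    r"C(?P<index>\d+):\s+type\[(?P<type>\w+)\]\s+latency\[(?P<latency>\d+)\]\s+"
    r"usage\[(?P<usage>\d+)\]\s+method\[\s*(?P<method>\w+)\]\s+"
    r"duration\[(?P<duration>\d+)\]"
)

SAMPLE = """\
(XEN)     C1:   type[C1] latency[003] usage[12219860] method[  FFH]
duration[1190961948551]
(XEN)     C2:   type[C1] latency[010] usage[10205554] method[  FFH]
duration[2015393965907]
(XEN)     C3:   type[C2] latency[020] usage[50926286] method[  FFH]
duration[30527997858148]
"""


def parse_cstate_dump(text):
    """Return one dict per C-state line found in the dump text."""
    states = []
    for m in _STATE_RE.finditer(text):
        states.append({
            "index": int(m.group("index")),
            "type": m.group("type"),
            "latency": int(m.group("latency")),
            "usage": int(m.group("usage")),
            "method": m.group("method"),
            "duration": int(m.group("duration")),
        })
    return states


def residency_shares(states):
    """Map each state's index to its fraction of the summed durations."""
    total = sum(s["duration"] for s in states)
    return {s["index"]: s["duration"] / total for s in states}


if __name__ == "__main__":
    for idx, share in residency_shares(parse_cstate_dump(SAMPLE)).items():
        print("C%d: %.1f%%" % (idx, share * 100))
```

The shares only cover the states listed in the dump, so time spent in C0 is not part of the total.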

Next, hopefully the attached turbostat utility can be invoked on Dom0
and it can read the MSRs on at least 1 processor via the /dev/cpu interface.
Yes, that would be possible, provided it's not important what specific
CPU it gets executed on.

I've run it (with the "max_cstate=2" intact from Xen's boot line) and the output is as follows, while running the problematic graphics benchmark on my Win 7 VM:

root@g2:~/turbostat-test# ./turbostat
./turbostat: APERF or MPERF went backwards *
* Frequency results do not cover entire interval *
* fix this by running Linux-2.6.30 or later *
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -   36804********    2736    2800
       0   64323********    2560    2800
       1    8244********    3398    2800
       2  125758********    2760    2800
       3   17811********    3032    2800
       4     735********    2977    2800
       5    3954********    2656    2800
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -   47728********    2804    2800
       0   18007********    3025    2800
       1   69086********    2634    2800
       2     522********    2713    2800
       3   77486********    2680    2800
       4   58487********    2932    2800
       5   62777********    3006    2800
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -   49031********    2728    2800
       0   78178********    2681    2800
       1   62045********    2561    2800
       2    9060********    3110    2800
       3   16619********    3255    2800
       4     720********    2661    2800
       5  127565********    2763    2800
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -   65471********    2700    2800
       0   70582********    2638    2800
       1    2173********    1954    2800
       2   49981********    2899    2800
       3   78668********    2682    2800
       4  128293********    2762    2800
       5   63131********    2566    2800

Not sure why the warning mentions the kernel version; this box is running Debian's Linux 3.16 kernel.

With "max_cstate=2" removed from Xen's boot line, this is the result:

root@g2:~/turbostat-test# ./turbostat
./turbostat: APERF or MPERF went backwards *
* Frequency results do not cover entire interval *
* fix this by running Linux-2.6.30 or later *
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -   23507********    2621    2800
       0   27631********    2552    2800
       1   35945********    2978    2800
       2   24417********    2472    2800
       3    1001********    2948    2800
       4   24417********    2472    2800
       5   27631********    2552    2800
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -   14114********    2687    2800
       0     529********    2738    2800
       1   60363********    2750    2800
       2   21290********    2497    2800
       3    1028********    2934    2800
       4     629********    2943    2800
       5     842********    2937    2800
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -   15048********    2714    2800
       0   25703********    2489    2800
       1   36024********    2975    2800
       2    5248********    2454    2800
       3    5248********    2454    2800
       4    9138********    2755    2800
       5    8925********    2751    2800
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -   32859********    2598    2800
       0   23089********    2526    2800
       1   61730********    2751    2800
       2   26138********    2492    2800
       3   26138********    2492    2800
       4   30029********    2574    2800
       5   30029********    2574    2800


It may tell us just the same thing I think we learned here:

(XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0]
(XEN) CC3[28794734145697] CC6[0] CC7[0]
which I'm assuming are a dump of the MSR residency counters.
If yes, it appears that this platform is not invoking c6 and pc6 at all,
and that the deepest state being used is actually cc3 and pc3.
I don't know if that is because you've booted the kernel with max_cstate=N
of some kind, or if this is default.
Sadly I haven't been able to tell which original mail the quotes
above are from, and since I had Steve experiment with disabling
the deepest C-state permitted to be used, it may well be that
this output was from one of those experiments. Remember, we
already know that with use of C6 alone disabled things work for
him (Steve - please correct me if I'm misremembering).

AIUI, that is correct. My Xen boot line (which eliminates the dom0/U hangs) includes "mwait-idle=1 max_cstate=2".

Guessing...
If no surprises in the debug stuff requested above, and
If the XEN debug stuff above is with c6 explicitly disabled...
Note that there are two kinds of c6 -- CC6 (core) and PC6 (package).
If this box supports both, the next thing to try will be to keep CC6
enabled, but to just disable PC6.  This is done via an MSR that turbostat
dumps out (MSR_NHM_SNB_PKG_CST_CFG_CTL) via the wrmsr(8) utility.
I don't think the wrmsr tool can be used (unmodified) to reliably do
this on all CPUs in the system - we'd likely have to cook up a patch
to the hypervisor instead, or I'd have to hand my patch to msr-tools
to Steve so he could use the tool under Xen (albeit that would also
require him to use one of our forward ported kernels, as the
upstream one doesn't have a pCPU sysfs interface yet afaik).

I'm game for whatever.

Though if that MSR is locked by the BIOS, then BIOS SETUP option
may be the only way to disable the package C-state limit without
also disabling the associated core C-state.
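
To make the "locked by the BIOS" check concrete, here is a small Python sketch that decodes a raw value of that MSR as turbostat would dump it. The layout is an assumption from my reading of the Intel SDM for this CPU generation (address 0xE2, package C-state limit in bits 2:0, configuration lock in bit 15) and should be verified against the SDM for the exact model. The function name is hypothetical, and the sketch does no MSR access itself.

```python
# Assumed layout of MSR_NHM_SNB_PKG_CST_CFG_CTL; verify against the SDM.
MSR_PKG_CST_CONFIG_CONTROL = 0xE2   # assumed MSR address
PKG_CSTATE_LIMIT_MASK = 0x7         # assumed: bits 2:0, package C-state limit
CFG_LOCK_BIT = 1 << 15              # assumed: bit 15, write-lock set by firmware


def decode_pkg_cst_config(value):
    """Extract the package C-state limit field and the lock flag."""
    return {
        "pkg_cstate_limit": value & PKG_CSTATE_LIMIT_MASK,
        "locked": bool(value & CFG_LOCK_BIT),
    }


if __name__ == "__main__":
    # Illustrative value only, not taken from the machine in this thread.
    print(decode_pkg_cst_config(0x8403))
```

If the lock flag is set, writes to the limit field are expected to be ignored until the next reset, which is why a BIOS setup option would then be the only route.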
Steve, could you check whether any such option exists (it's been
a while, so apologies if we had asked already)?

No problem. I've cruised through the BIOS options and this is what I see that may apply:





If you'd like me to make any changes to those settings, please let me know. For reference this is a Lenovo ThinkStation D20 running a Xeon X5660.

Thanks!

Steve

