
Re: [Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load



Hi Sarah!

On Thu, Feb 13, 2020 at 4:28 PM Sarah Newman <srn@xxxxxxxxx> wrote:
> I'm not one of the Xen developers,

Thank you so much for responding, and for the many clarifications and
pointers!  I am very grateful for your time.  Everything is very
well-taken and very much appreciated.

> I wouldn't necessarily assume the same root cause from the information you've 
> provided so far.

Okay, understood.  This has been plaguing me since I first upgraded my
first host to 4.12, and I totally concede that I could well be
"grasping at straws."  This is a production host for a large client,
and it's to the point where alarms wake me up every 4-5 nights now, so
I am... very interested in finding a solution here.  (I now have
nightmares about my phone ringing, even when it isn't.  Ugh.)

> > 2. I upgraded one physical host to Xen 4.12.1 (OpenSuse 15.1, Linux
> > 4.12.14, Xen 4.12.1).
> > 3. The guest(s) on that host started malfunctioning at that point.
> If you can, try Xen 4.9.4 with Linux 4.12.14 (or Xen 4.12.1 with Linux 
> 4.4.180.)
> That will help isolate the issue to either Xen or the Linux kernel.

Understood.  For tomorrow (Pacific time), I had already planned to
change the physical host (via a fresh reload) to OpenSuse 15.0 (in
between these two releases), which has Linux 4.12.14 and Xen 4.10.4
(so pre-4.12).  I'll do that as an interim step in your plan, and I
will report how that goes.  If it fails as well, I will take the
physical host back to 42.3 (Xen 4.9.4), install the 4.12 kernel
there, and report.

>  > 4. Tried to use host xl interface to unplug/replug network bridges.
>  > This appeared to work from host side, but guest was unaffected.
> Do you mean 'xl network-detach' and 'xl network-attach'? If not, please give 
> example commands.

I tried xl network-detach followed by xl network-attach (feeding
back in the vif parameters from my guest's configuration).

I also tried using brctl to remove the VIF from the bridge and re-add it, as in:
brctl delif br0 vif6.0 / brctl addif br0 vif6.0

Neither had any effect on the guest in trouble.
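For reference, the xl sequence I ran was roughly the following (the
domain name here is a placeholder; the MAC/bridge/rate values are the
ones from the guest's config):

xl network-list mydomu        # note the vif's Idx (devid) and MAC
xl network-detach mydomu 0
xl network-attach mydomu mac=00:16:3f:49:4a:41 bridge=br0 rate=100Mb/s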

> > 1. Get a server.  I'm using a Dell PowerEdge R720, but this has
> What CPUs are these? Can you dump the information from one of the cpus in 
> /proc/cpuinfo so we can see what microcode version you have, in the highly 
> unlikely case this information is pertinent?

My pleasure.  This problem has happened on several different physical
hosts; I'll dump the info from two of them:

My testing server:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
stepping        : 2
microcode       : 0x43
cpu MHz         : 2596.990
cache size      : 20480 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mca cmov pat
clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc
rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
lahf_lm abm ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms xsaveopt
md_clear
bugs            : null_seg cpu_meltdown spectre_v1 spectre_v2
spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips        : 5193.98
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

The production machine where this first occurred (and continues to occur):

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz
stepping        : 7
microcode       : 0x710
cpu MHz         : 2100.014
cache size      : 20480 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de tsc msr pae mce cx8 apic sep mca cmov pat
clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc
rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 cx16
sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm xsaveopt
bugs            : null_seg cpu_meltdown spectre_v1 spectre_v2
spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips        : 4200.02
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

> What about attaching the output of 'xl dmesg' - both the initial boot 
> messages and anything that comes from running the specific domU?

I'll post just from my test machine, skipping the pretty ASCII art -
let me know if that's not right, or if you want to see a second
machine as well.

(XEN) Xen version 4.12.1_06-lp151.2.9 (abuild@xxxxxxx) (gcc (SUSE
Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]) debug=n  Fri Dec
 6 16:56:43 UTC 2019
(XEN) Latest ChangeSet:
(XEN) Bootloader: GRUB2 2.02
(XEN) Command line: dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin
gnttab_max_frames=256 vga=gfx-1024x768x16
(XEN) Xen image load base address: 0
(XEN) Video information:
(XEN)  VGA is graphics mode 1024x768, 16 bpp
(XEN)  VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) Disc information:
(XEN)  Found 2 MBR signatures
(XEN)  Found 2 EDD information structures
(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009c000 (usable)
(XEN)  000000000009c000 - 00000000000a0000 (reserved)
(XEN)  00000000000e0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 000000007a289000 (usable)
(XEN)  000000007a289000 - 000000007af0b000 (reserved)
(XEN)  000000007af0b000 - 000000007b93b000 (ACPI NVS)
(XEN)  000000007b93b000 - 000000007bab8000 (ACPI data)
(XEN)  000000007bab8000 - 000000007bae9000 (usable)
(XEN)  000000007bae9000 - 000000007baff000 (ACPI data)
(XEN)  000000007baff000 - 000000007bb00000 (usable)
(XEN)  000000007bb00000 - 0000000090000000 (reserved)
(XEN)  00000000feda8000 - 00000000fedac000 (reserved)
(XEN)  00000000ff310000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000001880000000 (usable)
(XEN) New Xen image base address: 0x79c00000
(XEN) ACPI: RSDP 000FE320, 0024 (r2 DELL  )
(XEN) ACPI: XSDT 7BAB60E8, 00BC (r1 DELL   PE_SC3          0       1000013)
(XEN) ACPI: FACP 7BAB2000, 00F4 (r4 DELL   PE_SC3          0 DELL        1)
(XEN) ACPI: DSDT 7BA9C000, EACD (r2 DELL   PE_SC3          3 DELL        1)
(XEN) ACPI: FACS 7B8F3000, 0040
(XEN) ACPI: MCEJ 7BAB5000, 0130 (r1 INTEL                  1 INTL  100000D)
(XEN) ACPI: WD__ 7BAB4000, 0134 (r1 DELL   PE_SC3          1 DELL        1)
(XEN) ACPI: SLIC 7BAB3000, 0024 (r1 DELL   PE_SC3          1 DELL        1)
(XEN) ACPI: HPET 7BAB1000, 0038 (r1 DELL   PE_SC3          1 DELL        1)
(XEN) ACPI: APIC 7BAB0000, 0AFC (r2 DELL   PE_SC3          0 DELL        1)
(XEN) ACPI: MCFG 7BAAF000, 003C (r1 DELL   PE_SC3          1 DELL        1)
(XEN) ACPI: MSCT 7BAAE000, 0090 (r1 DELL   PE_SC3          1 DELL        1)
(XEN) ACPI: SLIT 7BAAD000, 006C (r1 DELL   PE_SC3          1 DELL        1)
(XEN) ACPI: SRAT 7BAAB000, 1130 (r3 DELL   PE_SC3          1 DELL        1)
(XEN) ACPI: SSDT 7B959000, 1424A9 (r2 DELL   PE_SC3       4000 INTL 20121114)
(XEN) ACPI: SSDT 7B956000, 217F (r2 DELL   PE_SC3          2 INTL 20121114)
(XEN) ACPI: SSDT 7B955000, 006E (r2 DELL   PE_SC3          2 INTL 20121114)
(XEN) ACPI: PRAD 7B954000, 0132 (r2   DELL PE_SC3          2 INTL 20121114)
(XEN) ACPI: DMAR 7BAFE000, 00F8 (r1 DELL   PE_SC3          1 DELL        1)
(XEN) ACPI: HEST 7BAFD000, 017C (r1 DELL   PE_SC3          2 DELL        1)
(XEN) ACPI: BERT 7BAFC000, 0030 (r1 DELL   PE_SC3          2 DELL        1)
(XEN) ACPI: ERST 7BAFB000, 0230 (r1 DELL   PE_SC3          2 DELL        1)
(XEN) ACPI: EINJ 7BAFA000, 0150 (r1 DELL   PE_SC3          2 DELL        1)
(XEN) System RAM: 98210MB (100567388kB)
(XEN) Domain heap initialised DMA width 32 bits
(XEN) ACPI: 32/64X FACS address mismatch in FADT -
7b8f3000/0000000000000000, using 32
(XEN) IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
(XEN) IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-47
(XEN) IOAPIC[2]: apic_id 10, version 32, address 0xfec40000, GSI 48-71
(XEN) Enabling APIC mode:  Phys.  Using 3 I/O APICs
(XEN) Not enabling x2APIC (upon firmware request)
(XEN) xstate: size: 0x340 and states: 0x7
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 17, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 18, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 19, using 0x1
(XEN) Speculative mitigation facilities:
(XEN)   Hardware features: IBRS/IBPB STIBP L1D_FLUSH SSBD MD_CLEAR
(XEN)   Compiled-in support: INDIRECT_THUNK SHADOW_PAGING
(XEN)   Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: IBRS- SSBD-,
Other: IBPB L1D_FLUSH VERW
(XEN)   L1TF: believed vulnerable, maxphysaddr L1D 46, CPUID 46, Safe
address 300000000000
(XEN)   Support for HVM VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   Support for PV VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled (with PCID)
(XEN)   PV L1TF shadowing: Dom0 disabled, DomU enabled
(XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Initializing Credit2 scheduler
(XEN) Platform timer is 14.318MHz HPET
(XEN) Detected 2596.991 MHz processor.
(XEN) Initing memory sharing.
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB.
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB.
(XEN) Intel VT-d Snoop Control enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN)  - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN) Allocated console ring of 64 KiB.
(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
(XEN)  - Extended Page Tables (EPT)
(XEN)  - Virtual-Processor Identifiers (VPID)
(XEN)  - Virtual NMI
(XEN)  - MSR direct-access bitmap
(XEN)  - Unrestricted Guest
(XEN)  - APIC Register Virtualization
(XEN)  - Virtual Interrupt Delivery
(XEN)  - Posted Interrupt Processing
(XEN)  - VMCS shadowing
(XEN)  - VM Functions
(XEN) HVM: ASIDs enabled.
(XEN) VMX: Disabling executable EPT superpages due to CVE-2018-12207
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 17, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 18, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 19, using 0x1
(XEN) Brought up 32 CPUs
(XEN) mtrr: your CPUs had inconsistent variable MTRR settings
(XEN) Dom0 has maximum 840 PIRQs
(XEN)  Xen  kernel: 64-bit, lsb, compat32
(XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x30b3000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000001840000000->0000001844000000 (1029560
pages to be allocated)
(XEN)  Init. ramdisk: 000000187f5b8000->000000187ffffe20
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff81000000->ffffffff830b3000
(XEN)  Init. ramdisk: 0000000000000000->0000000000000000
(XEN)  Phys-Mach map: 0000008000000000->0000008000800000
(XEN)  Start info:    ffffffff830b3000->ffffffff830b34b4
(XEN)  Xenstore ring: 0000000000000000->0000000000000000
(XEN)  Console ring:  0000000000000000->0000000000000000
(XEN)  Page tables:   ffffffff830b4000->ffffffff830d1000
(XEN)  Boot stack:    ffffffff830d1000->ffffffff830d2000
(XEN)  TOTAL:         ffffffff80000000->ffffffff83400000
(XEN)  ENTRY ADDRESS: ffffffff824f5180
(XEN) Dom0 has maximum 4 VCPUs
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) ***************************************************
(XEN) Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Please assess your configuration and choose an
(XEN) explicit 'smt=<bool>' setting.  See XSA-273.
(XEN) ***************************************************
(XEN) Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled.  Mitigations will not be fully effective.  Please
(XEN) choose an explicit smt=<bool> setting.  See XSA-297.
(XEN) ***************************************************
(XEN) 3... 2... 1...
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 500kB init memory
(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x914112976,khz=2596991,inc=1

> > nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
> How about trying iperf or iperf3 with either only transmit or receive? iperf 
> is specifically designed to use maximal bandwidth and doesn't use disk.
> http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/throughput-tool-comparision/

Noted, thank you!  I'll look at those tools now, and try them.
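If it helps to know exactly what I'll run, I'm thinking of something
along these lines with iperf3, using the same peer as my tar test
(-R reverses direction so I can exercise receive as well as transmit):

# on the peer (192.168.1.11):
iperf3 -s
# on the guest, transmit then receive, an hour each:
iperf3 -c 192.168.1.11 -t 3600
iperf3 -c 192.168.1.11 -t 3600 -R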

> For independently load-testing disk, you can try dd or fio, while being 
> cognizant of the disk cache. To avoid actual disk I/O I think you should be
> able to use a ram based disk in the dom0 instead of a physical disk. However, 
> I wouldn't bother if you can reproduce with network only, until the
> network issue has been fixed.

Acknowledged, thanks.
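If I do get to the disk side, my rough plan would be an fio run
against a tmpfs-backed file in the guest, per your note about
avoiding real disk I/O; the size and runtime here are just
placeholders:

fio --name=stress --filename=/dev/shm/fio.test --size=2G \
    --rw=randrw --bs=4k --numjobs=4 --time_based --runtime=600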

> > maxmem=90112
> > vcpus=26
> This is fairly large.
> Have you tried both fewer cpus and less memory? If you can reproduce with 
> iperf, which probably will reproduce more quickly, can you reproduce with
> memory=2048 and vcpus=1 or vcpus=2 for example? FYI the domU might not boot 
> at all with vcpus=1 with some kernel versions.

I... have not... and please pardon my ignorance here, but my guest
machine runs a lot of different things for our client, and definitely
needs the RAM (and I *think* it needs the CPUs, although I confess
I'm not sure how vcpus translate to available compute power).  I can
try the smaller numbers; I haven't so far because it felt off-point,
since the guest requires the larger amount of resources we've
traditionally allocated.
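If I do try it, though, I assume the change is just the obvious one
in the guest config, something like the following (and I'm guessing
maxmem should be dropped to match):

memory = 2048
maxmem = 2048
vcpus = 2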

> But I would try that only if none of the network changes show a difference.

Okay, understood.

> >          'rate=100Mb/s,mac=00:16:3f:49:4a:41,bridge=br0',
> You probably want try removing the vif rate limit. Using rate=... I got soft 
> lockups on the dom0 many kernel versions ago. I don't know what happens
> if the soft lockups in the dom0 have been fixed - perhaps another problem 
> remains in the domU.
> If removing "rate" fixes it, switch to rate limiting with another method - 
> possibly 'tc' but there might be something better available now using BPF.

Okay, will attempt that as a subsequent step.
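For my own notes, my understanding is that the vif rate limit could
then be replaced with a tbf qdisc on the guest's vif in the dom0,
roughly like this (the vif name is whichever one the guest gets, and
the burst/latency numbers are just an example):

tc qdisc add dev vif6.0 root tbf rate 100mbit burst 256kb latency 400ms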

> Also, have you tried at all looking at or changing the offload settings in 
> the dom0 and/or domU with "/sbin/ethtool -k/-K <device>" ? I don't think
> this is actually the issue. But it's been a source of problems historically 
> and it's easy to try.

In the past, I ran with, e.g.:

ethtool -K em1 rx off tx off sg off tso off ufo off gso off gro off lro off

on both the host and the guest.  The problems did occur after the
upgrade even with those settings.  I then stopped using them
(commented them out) on both host and guest - it made no material
difference that I could see - the guest still crashed, and had roughly
the same performance, either way.
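For completeness, before changing anything again I'll capture the
current settings on both sides with the read-only variant, e.g.:

ethtool -k em1     # on the dom0
ethtool -k eth0    # in the domU (or whatever the guest's interface is called)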

> > I am looking for a means on Xen to bug report this; so far, I haven't
> > found it, but I will keep looking.
> https://wiki.xen.org/wiki/Reporting_Bugs_against_Xen_Project

Thank you!

> But try some more data collection and debugging first, ideally by changing 
> one thing at a time.

Understood, and will do.

>  > thoughts, guidance, musings, etc., anything at all would be
>  > appreciated.
> x-ref https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
> I don't get the impression you've tried using sysrq already, since you did 
> not mention it by name. If you have tried sysrq, it would be helpful if you
> could go back through your original email and add examples of all of the 
> commands you've run.

I have not.  The closest I got to this was the xl trigger nmi command,
which just (sometimes) brings the guest back to life.  Thank you for
that pointer!
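For reference, that command was along the lines of:

xl trigger mydomu nmi

('mydomu' here stands in for my guest's actual domain name.)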

> For PV, to send the sysrq you can try 'xl sysrq <domU> <key>' or 'ctrl-o 
> <key>' on the virtual serial console. Neither will probably work for HVM. I

Right.  We prefer PV; I tried HVM as a test, it failed, so we've stayed with PV.

> When the domU locks up, you *might* get interesting information from the 'x' 
> and 'l' sysrq commands within the domU, You may need to enable that
> functionality with 'sysctl -w kernel.sysrq=1' .
> I'm not sure the 'l' commands works for PV at all. 'l' works for HVM.

Okay, done.  I've enabled it on my ailing production guest and my test
guest, and will try those two commands on the next stall.
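Concretely, on the next stall I plan to try both paths, roughly as
follows (domain name is again a placeholder):

# inside the domU, if it still responds at all:
echo x > /proc/sysrq-trigger
echo l > /proc/sysrq-trigger
# from the dom0, against the PV guest:
xl sysrq mydomu x
xl sysrq mydomu l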

> If you can send a sysrq when the domU is not locked up, but can't send one 
> when it's locked up, that's also potentially interesting.

Okay noted.  I'm going to try it on the stalled guest first, and then
I'll try again immediately after I reboot said guest, and report.

So, going back to your "ideally by changing one thing at a time"
comment, here's kind of how I'm proceeding:

1. Just prior to sending my original email, I had booted the guest
with tsc_mode="always_emulate", and I am currently stress-testing it
with a large number of those tar jobs.  Since I'd already started
them, I'm going to let them run overnight (I'm on Pacific time).
I'd be surprised if this solves it... and I'm not going to wait past
tomorrow morning, because I feel like downgrading Xen is a more
productive approach (see below), but who knows; it will be
interesting to see if the machine survives the night at least.

2. I've armed the sysrq on all machines, and will try that if either
guest crashes, and will capture and post any output I can get from
them.

3. The next thing I'm going to try - tomorrow morning - is taking Xen
down to 4.10.4 via an OpenSuse 15.0 install on my test Dom0.  I'm
then going to boot the guest (in its default config, without
tsc_mode overridden) and see if it runs reliably.  If it does, I'll
report that.

IF it does, I'm going to transfer my production guest to this host in
the hope that it becomes stable, but I can still continue testing
beyond that if it's desired here, by using the former production
machine as a test bed (since it has the same problems).

Next steps after that, as I understand what you've said:

4. Take the physical host back to Xen 4.9.4, with the old default
4.4.180 kernel, test and report. (I'd expect this to work, as this is
the "old, pre-trouble" setup, but who knows.)
5. Take that host up to the 4.12 kernel with the old Xen 4.9.4, test and report.
6. Remove the rate limit, test and report.

Let me know if that's not right, or you'd like to see anything done differently.

And of course, my challenge here is simply that these stalls don't
happen immediately.  Under heavy load, they usually take just hours,
but might take days.  Given what I've seen, I don't personally think I
could call anything "solved" unless the guest survived under elevated
load for at least seven days.  So I will be performing each of these
steps, but it may take several days or more to report on each one.
But like I said, this is for a large client, and there is of course a
sense of... wanting to get this solved quickly... so I will do the
steps you've suggested and report on each one.

In the meantime, THANK YOU for your response, and if you or anyone
else has any other thoughts, please do send them to me!

Glen

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users

 

