Xen project Mailing List

Re: [Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load

Date: Thu, 13 Feb 2020 18:26:56 -0800

Cc: Xen-users <xen-users@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Fri, 14 Feb 2020 02:28:36 +0000

List-id: Xen user discussion <xen-users.lists.xenproject.org>

Hi Sarah!

On Thu, Feb 13, 2020 at 4:28 PM Sarah Newman <srn@xxxxxxxxx> wrote:
> I'm not one of the Xen developers,

Thank you so much for responding, and for the many clarifications and pointers! I am very grateful for your time. Everything is very well-taken and very much appreciated.

> I wouldn't necessarily assume the same root cause from the information you've
> provided so far.

Okay, understood. This has been plaguing me since I first upgraded my first host to 4.12, and I totally concede that I could well be "grasping at straws." This is a production host for a large client, and it's to the point where alarms wake me up every 4-5 nights now, so I am... very interested in finding a solution here. (I now have nightmares about my phone ringing, even when it isn't. Ugh.)

> > 2. I upgraded one physical host to Xen 4.12.1 (OpenSuse 15.1, Linux
> > 4.12.14, Xen 4.12.1).
> > 3. The guest(s) on that host started malfunctioning at that point.
> If you can, try Xen 4.9.4 with Linux 4.12.14 (or Xen 4.12.1 with Linux
> 4.4.180.)
> That will help isolate the issue to either Xen or the Linux kernel.

Understood. Tomorrow (Pacific time) I had already planned to change the physical host (via fresh reload) to OpenSuse 15.0 (in between these two releases), which has Linux 4.12.14 and Xen 4.10.4 (so pre-4.12). I'll do that as an interim step on your plan, and I will report how that goes. If it fails as well, I will take the physical host back to OpenSuse 42.3 (Xen 4.9.4) and install the 4.12 kernel there, and report.

> > 4. Tried to use host xl interface to unplug/replug network bridges.
> > This appeared to work from host side, but guest was unaffected.
> Do you mean 'xl network-detach' and 'xl network-attach'? If not, please give
> example commands.

I tried xl network-detach followed by xl network-attach (feeding back in the parameters from my guest machine.)

I also tried using brctl to remove the VIF from the bridge and re-add it, as in:

brctl delif br0 vif6.0
brctl addif br0 vif6.0

Neither had any effect on the guest in trouble.

> > 1. Get a server. I'm using a Dell PowerEdge R720, but this has
> What CPUs are these? Can you dump the information from one of the cpus in
> /proc/cpuinfo so we can see what microcode version you have, in the highly
> unlikely case this information is pertinent?

My pleasure. This problem has happened on several different physical hosts; I will dump the information for two of them.

My testing server:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
stepping : 2
microcode : 0x43
cpu MHz : 2596.990
cache size : 20480 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms xsaveopt md_clear
bugs : null_seg cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5193.98
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

The production machine where this first occurred (and continues to occur):

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz
stepping : 7
microcode : 0x710
cpu MHz : 2100.014
cache size : 20480 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm xsaveopt
bugs : null_seg cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 4200.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

> What about attaching the output of 'xl dmesg' - both the initial boot
> messages and anything that comes from running the specific domU?

I'll post just from my test machine, skipping the pretty ASCII art - let me know if that's not right, or if you want to see a second machine as well.

(XEN) Xen version 4.12.1_06-lp151.2.9 (abuild@xxxxxxx) (gcc (SUSE Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]) debug=n Fri Dec 6 16:56:43 UTC 2019
(XEN) Latest ChangeSet:
(XEN) Bootloader: GRUB2 2.02
(XEN) Command line: dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256 vga=gfx-1024x768x16
(XEN) Xen image load base address: 0
(XEN) Video information:
(XEN) VGA is graphics mode 1024x768, 16 bpp
(XEN) VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) Disc information:
(XEN) Found 2 MBR signatures
(XEN) Found 2 EDD information structures
(XEN) Xen-e820 RAM map:
(XEN) 0000000000000000 - 000000000009c000 (usable)
(XEN) 000000000009c000 - 00000000000a0000 (reserved)
(XEN) 00000000000e0000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 000000007a289000 (usable)
(XEN) 000000007a289000 - 000000007af0b000 (reserved)
(XEN) 000000007af0b000 - 000000007b93b000 (ACPI NVS)
(XEN) 000000007b93b000 - 000000007bab8000 (ACPI data)
(XEN) 000000007bab8000 - 000000007bae9000 (usable)
(XEN) 000000007bae9000 - 000000007baff000 (ACPI data)
(XEN) 000000007baff000 - 000000007bb00000 (usable)
(XEN) 000000007bb00000 - 0000000090000000 (reserved)
(XEN) 00000000feda8000 - 00000000fedac000 (reserved)
(XEN) 00000000ff310000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000001880000000 (usable)
(XEN) New Xen image base address: 0x79c00000
(XEN) ACPI: RSDP 000FE320, 0024 (r2 DELL )
(XEN) ACPI: XSDT 7BAB60E8, 00BC (r1 DELL PE_SC3 0 1000013)
(XEN) ACPI: FACP 7BAB2000, 00F4 (r4 DELL PE_SC3 0 DELL 1)
(XEN) ACPI: DSDT 7BA9C000, EACD (r2 DELL PE_SC3 3 DELL 1)
(XEN) ACPI: FACS 7B8F3000, 0040
(XEN) ACPI: MCEJ 7BAB5000, 0130 (r1 INTEL 1 INTL 100000D)
(XEN) ACPI: WD__ 7BAB4000, 0134 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: SLIC 7BAB3000, 0024 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: HPET 7BAB1000, 0038 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: APIC 7BAB0000, 0AFC (r2 DELL PE_SC3 0 DELL 1)
(XEN) ACPI: MCFG 7BAAF000, 003C (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: MSCT 7BAAE000, 0090 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: SLIT 7BAAD000, 006C (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: SRAT 7BAAB000, 1130 (r3 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: SSDT 7B959000, 1424A9 (r2 DELL PE_SC3 4000 INTL 20121114)
(XEN) ACPI: SSDT 7B956000, 217F (r2 DELL PE_SC3 2 INTL 20121114)
(XEN) ACPI: SSDT 7B955000, 006E (r2 DELL PE_SC3 2 INTL 20121114)
(XEN) ACPI: PRAD 7B954000, 0132 (r2 DELL PE_SC3 2 INTL 20121114)
(XEN) ACPI: DMAR 7BAFE000, 00F8 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: HEST 7BAFD000, 017C (r1 DELL PE_SC3 2 DELL 1)
(XEN) ACPI: BERT 7BAFC000, 0030 (r1 DELL PE_SC3 2 DELL 1)
(XEN) ACPI: ERST 7BAFB000, 0230 (r1 DELL PE_SC3 2 DELL 1)
(XEN) ACPI: EINJ 7BAFA000, 0150 (r1 DELL PE_SC3 2 DELL 1)
(XEN) System RAM: 98210MB (100567388kB)
(XEN) Domain heap initialised DMA width 32 bits
(XEN) ACPI: 32/64X FACS address mismatch in FADT - 7b8f3000/0000000000000000, using 32
(XEN) IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
(XEN) IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-47
(XEN) IOAPIC[2]: apic_id 10, version 32, address 0xfec40000, GSI 48-71
(XEN) Enabling APIC mode: Phys. Using 3 I/O APICs
(XEN) Not enabling x2APIC (upon firmware request)
(XEN) xstate: size: 0x340 and states: 0x7
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 17, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 18, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 19, using 0x1
(XEN) Speculative mitigation facilities:
(XEN) Hardware features: IBRS/IBPB STIBP L1D_FLUSH SSBD MD_CLEAR
(XEN) Compiled-in support: INDIRECT_THUNK SHADOW_PAGING
(XEN) Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: IBRS- SSBD-, Other: IBPB L1D_FLUSH VERW
(XEN) L1TF: believed vulnerable, maxphysaddr L1D 46, CPUID 46, Safe address 300000000000
(XEN) Support for HVM VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN) Support for PV VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN) XPTI (64-bit PV only): Dom0 enabled, DomU enabled (with PCID)
(XEN) PV L1TF shadowing: Dom0 disabled, DomU enabled
(XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Initializing Credit2 scheduler
(XEN) Platform timer is 14.318MHz HPET
(XEN) Detected 2596.991 MHz processor.
(XEN) Initing memory sharing.
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB.
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB.
(XEN) Intel VT-d Snoop Control enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN) - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN) Allocated console ring of 64 KiB.
(XEN) VMX: Supported advanced features:
(XEN) - APIC MMIO access virtualisation
(XEN) - APIC TPR shadow
(XEN) - Extended Page Tables (EPT)
(XEN) - Virtual-Processor Identifiers (VPID)
(XEN) - Virtual NMI
(XEN) - MSR direct-access bitmap
(XEN) - Unrestricted Guest
(XEN) - APIC Register Virtualization
(XEN) - Virtual Interrupt Delivery
(XEN) - Posted Interrupt Processing
(XEN) - VMCS shadowing
(XEN) - VM Functions
(XEN) HVM: ASIDs enabled.
(XEN) VMX: Disabling executable EPT superpages due to CVE-2018-12207
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 17, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 18, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 19, using 0x1
(XEN) Brought up 32 CPUs
(XEN) mtrr: your CPUs had inconsistent variable MTRR settings
(XEN) Dom0 has maximum 840 PIRQs
(XEN) Xen kernel: 64-bit, lsb, compat32
(XEN) Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x30b3000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN) Dom0 alloc.: 0000001840000000->0000001844000000 (1029560 pages to be allocated)
(XEN) Init. ramdisk: 000000187f5b8000->000000187ffffe20
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN) Loaded kernel: ffffffff81000000->ffffffff830b3000
(XEN) Init. ramdisk: 0000000000000000->0000000000000000
(XEN) Phys-Mach map: 0000008000000000->0000008000800000
(XEN) Start info: ffffffff830b3000->ffffffff830b34b4
(XEN) Xenstore ring: 0000000000000000->0000000000000000
(XEN) Console ring: 0000000000000000->0000000000000000
(XEN) Page tables: ffffffff830b4000->ffffffff830d1000
(XEN) Boot stack: ffffffff830d1000->ffffffff830d2000
(XEN) TOTAL: ffffffff80000000->ffffffff83400000
(XEN) ENTRY ADDRESS: ffffffff824f5180
(XEN) Dom0 has maximum 4 VCPUs
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) ***************************************************
(XEN) Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled. Please assess your configuration and choose an
(XEN) explicit 'smt=<bool>' setting. See XSA-273.
(XEN) ***************************************************
(XEN) Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled. Mitigations will not be fully effective. Please
(XEN) choose an explicit smt=<bool> setting. See XSA-297.
(XEN) ***************************************************
(XEN) 3... 2... 1...
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 500kB init memory
(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x914112976,khz=2596991,inc=1

> > nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
> How about trying iperf or iperf3 with either only transmit or receive? iperf
> is specifically designed to use maximal bandwidth and doesn't use disk.
> http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/throughput-tool-comparision/

Noted, thank you! I'll look at those tools now, and try them.

> For independently load-testing disk, you can try dd or fio, while being
> cognizant of the disk cache. To avoid actual disk I/O I think you should be
> able to use a ram based disk in the dom0 instead of a physical disk. However,
> I wouldn't bother if you can reproduce with network only, until the
> network issue has been fixed.

Acknowledged, thanks.

> > maxmem=90112
> > vcpus=26
> This is fairly large.
> Have you tried both fewer cpus and less memory? If you can reproduce with
> iperf, which probably will reproduce more quickly, can you reproduce with
> memory=2048 and vcpus=1 or vcpus=2 for example? FYI the domU might not boot
> at all with vcpus=1 with some kernel versions.

I... have not.... and please pardon my ignorance here, but my guest machine runs a lot of different things for our client, and definitely needs the RAM (and I *think* needs the CPUs, although I confess that I'm not sure how vcpus translate to available compute power.) I can try the smaller numbers, but have not because to me it's off-point, since my guest requires the larger number of resources we've traditionally allocated.

> But I would try that only if none of the network changes show a difference.

Okay, understood.

> > 'rate=100Mb/s,mac=00:16:3f:49:4a:41,bridge=br0',
> You probably want to try removing the vif rate limit. Using rate=... I got soft
> lockups on the dom0 many kernel versions ago. I don't know what happens
> if the soft lockups in the dom0 have been fixed - perhaps another problem
> remains in the domU.
> If removing "rate" fixes it, switch to rate limiting with another method -
> possibly 'tc' but there might be something better available now using BPF.

Okay, will attempt that as a subsequent step.

> Also, have you tried at all looking at or changing the offload settings in
> the dom0 and/or domU with "/sbin/ethtool -k/-K <device>" ? I don't think
> this is actually the issue. But it's been a source of problems historically
> and it's easy to try.

In the past, I ran with, e.g.:

ethtool -K em1 rx off tx off sg off tso off ufo off gso off gro off lro off

on both the host and the guest. The problems did occur after the upgrade even with those settings. I then stopped using them (commented them out) on both host and guest - it made no material difference that I could see - the guest still crashed, and had roughly the same performance, either way.

> > I am looking for a means on Xen to bug report this; so far, I haven't
> > found it, but I will keep looking.
> https://wiki.xen.org/wiki/Reporting_Bugs_against_Xen_Project

Thank you!

> But try some more data collection and debugging first, ideally by changing
> one thing at a time.

Understood, and will do.
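
A minimal sketch of the two network suggestions above (a network-only load test with iperf3, and replacing the vif "rate=" limit with tc), for reference. It only prints the commands so they can be reviewed before anything is run. The function names are hypothetical, and so are the interface name (eth0 inside the guest), the burst and latency values, and the one-hour test duration; none of them come from the thread. The tc line uses a token bucket filter on the guest's own interface, on the assumption that the vif "rate=" option limits the guest's outgoing traffic.

```shell
#!/bin/sh
# Sketch only: print candidate commands rather than executing them.
# All device names and tuning values below are assumptions.

# iperf_cmds <server_ip> <seconds>: client-side commands for a
# transmit-only run and a receive-only run (the server side runs "iperf3 -s").
iperf_cmds() {
    echo "iperf3 -c $1 -t $2"       # client sends to the server
    echo "iperf3 -c $1 -t $2 -R"    # reverse mode: server sends to the client
}

# build_tc_cmd <dev> <rate>: token bucket filter on <dev> egress, as a
# possible stand-in for the vif "rate=" setting once that is removed.
build_tc_cmd() {
    echo "tc qdisc replace dev $1 root tbf rate $2 burst 256kb latency 50ms"
}

# show_offload_cmd <dev>: list the current offload settings for <dev>.
show_offload_cmd() {
    echo "ethtool -k $1"
}

iperf_cmds 192.168.1.11 3600
build_tc_cmd eth0 100mbit
show_offload_cmd eth0
```

Printing the commands keeps the "one change at a time" discipline visible: each line can be applied and tested on its own.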

> > thoughts, guidance, musings, etc., anything at all would be
> > appreciated.
> x-ref https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
> I don't get the impression you've tried using sysrq already, since you did
> not mention it by name. If you have tried sysrq, it would be helpful if you
> could go back through your original email and add examples of all of the
> commands you've run.

I have not. The closest I got to this was the xl trigger nmi command, which just (sometimes) brings the guest back to life. Thank you for that pointer!

> For PV, to send the sysrq you can try 'xl sysrq <domU> <key>' or 'ctrl-o
> <key>' on the virtual serial console. Neither will probably work for HVM. I

Right. We prefer PV; I tried HVM as a test, it failed, so we've stayed with PV.

> When the domU locks up, you *might* get interesting information from the 'x'
> and 'l' sysrq commands within the domU. You may need to enable that
> functionality with 'sysctl -w kernel.sysrq=1' .
> I'm not sure the 'l' command works for PV at all. 'l' works for HVM.

Okay, done. I've enabled it on my ailing production guest and my test guest, and will try those two commands on the next stall.

> If you can send a sysrq when the domU is not locked up, but can't send one
> when it's locked up, that's also potentially interesting.

Okay, noted. I'm going to try it on the stalled guest first, and then I'll try again immediately after I reboot said guest, and report.

So, going back to your "ideally by changing one thing at a time" comment, here's kind of how I'm proceeding:

1. Just prior to sending my original email, I had booted the guest with tsc_mode="always_emulate", and I am currently stress-testing it with a large number of those tar jobs. I'd already done this/started them, so I'm going to let them run overnight (I'm on Pacific time). I'd be surprised if this solves it... and I'm not going to wait past tomorrow morning because I feel like downgrading Xen is a more productive approach (see below), but who knows, it will be interesting to see if the machine survives the night at least.

2. I've armed the sysrq on all machines, and will try that if either guest crashes, and will capture and post any output I can get from them.

3. The next thing I'm going to try - tomorrow morning - is taking Xen down to 4.10.4 via an OpenSuse 15.0 install on my test Dom0. I'm then going to boot the guest (in its default config without tsc_mode overridden) and see if it runs reliably. If it does, I'll report that. If it does, I'm also going to transfer my production guest to this host in the hope that it becomes stable, but I can still continue testing beyond that if it's desired here, by using the former production machine as a test bed (since it has the same problems.)

Next steps after that, as I understand what you've said:

4. Take the physical host back to Xen 4.9.4, with the old default 4.4.180 kernel, test and report. (I'd expect this to work, as this is the "old, pre-trouble" setup, but who knows.)

5. Take that host up to the 4.12 kernel with the old Xen 4.9.4, test and report.

6. Remove the rate limit, test and report.

Let me know if that's not right, or you'd like to see anything done differently.

And of course, my challenge here is simply that these stalls don't happen immediately. Under heavy load, they usually take just hours, but might take days. Given what I've seen, I don't personally think I could call anything "solved" unless the guest survived under elevated load for at least seven days. So I will be performing each of these steps, but it may take several days or more to report on each one. But like I said, this is for a large client, and there is of course a sense of... wanting to get this solved quickly... so I will do the steps you've suggested and report on each one.
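
Step 2 of the plan and the seven-day criterion could be sketched roughly as below. This is an illustration under stated assumptions, not something from the thread: the function names, the guest name and IP passed to watch_guest, the ping thresholds, and the log path are all hypothetical. A failed ping is only a rough proxy for a stall. The sysrq output itself lands in the guest's kernel log and console rather than in xl's output, so only 'xl dmesg' is saved directly here.

```shell
#!/bin/sh
# Sketch only: watch a guest from dom0 and fire the sysrq keys when it
# stops answering pings. Names, thresholds and paths are placeholders.

# soak_passed <start_epoch> <now_epoch> <days>: succeeds once the guest has
# stayed up under load for at least <days> days.
soak_passed() {
    [ $(( $2 - $1 )) -ge $(( $3 * 86400 )) ]
}

# watch_guest <domU> <guest_ip>: poll once a minute; on the first ping
# failure send the 'l' and 'x' sysrq keys and save the hypervisor log.
watch_guest() {
    domu="$1"
    ip="$2"
    while ping -c 3 -W 5 "$ip" >/dev/null 2>&1; do
        sleep 60
    done
    date                        # record when the stall was noticed
    xl sysrq "$domu" l          # backtrace of active CPUs (may not work for PV)
    xl sysrq "$domu" x          # the 'x' key mentioned in the thread
    xl dmesg > "/var/tmp/xl-dmesg-$domu.log"
}

# Example (hypothetical guest name and address):
#   watch_guest guest1 192.0.2.10
```

Running the same sysrq commands once against a healthy guest gives the comparison point mentioned above: whether a sysrq can be delivered when the domU is not locked up.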

In the meantime, THANK YOU for your response, and if you or anyone else has any other thoughts, please do send them to me!

Glen

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users
