Xen project Mailing List

[Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load

Date: Thu, 13 Feb 2020 13:07:41 -0800

Delivery-date: Thu, 13 Feb 2020 21:09:23 +0000

List-id: Xen user discussion <xen-users.lists.xenproject.org>

Dear Xen Team:

Since upgrading to Xen 4.12, I'm experiencing an ongoing problem with stalled guests. I had previously thought I was the only one with this problem, and had reported it to my distro's virtualization team (see https://lists.opensuse.org/opensuse-virtual/2019-12/msg00000.html and https://lists.opensuse.org/opensuse-virtual/2019-12/msg00003.html for thread heads), but although they tried to help, we all kind of concluded (wrongly) that I must just have had a bad guest.

I finally recreated a new host and a new guest, clean, from scratch, thinking that would solve the problem, and it didn't. That led me to search again, and I now see that another individual (who doubtless thought HE was the only one) has reported this issue to your list (https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html et al).

So I'm sending this report here to let you all know that this problem is no longer limited to one person, that it is reproducible, and that it is, in my opinion, severe. I hope that someone here can help, or point us in the right direction. Here we go:

Problem: Xen DomU guests randomly stall under high network/disk loads. Dom0 is not affected. Randomly means anywhere between 1 hour and 14 days after guest boot - the time seems to shorten with (or perhaps the problem is triggered by) increased network (and possibly disk) activity.

Symptoms on the DomU Guest:

1. Guest machine performs normally until the moment of failure. No abnormal log/console entries exist.
2. At the moment of failure, the guest's network goes offline. No abnormal log/console entries are written at that moment.
3. Processes which were trying to connect to the network start to consume increasing amounts of CPU.
4. Load average of the guest starts to increase, continuing upward without apparent bound.
5. If a high-priority bash shell is left logged in on the guest hvc0 console, some commands might still be runnable; most are not.
6. If the guest console is not logged in, the console is frozen and doesn't even echo characters.
7. Some guests will output messages on the console like this:

       kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s!

8. On some others, I will also see output like:

       BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s!

9. Sometimes there is no output at all on the console.

Symptoms on the Dom0 Host:

1. The Dom0 host is unaffected. The only indications that anything is happening on the host are two log entries in /var/log/messages:

       vif vif-6-0 vif6.0: Guest Rx stalled
       br0: port 2(vif6.0) entered disabled state

2. Other guests are not affected. (Although other guests too may stall at other random times, stalls on one guest do not seem to affect other guests directly.)

Circumstances when the problem first occurred:

1. All hosts and guests were previously on Xen 4.9.4 (OpenSuse 42.3, Linux 4.4.180, Xen 4.9.4).
2. I upgraded one physical host to Xen 4.12.1 (OpenSuse 15.1, Linux 4.12.14, Xen 4.12.1).
3. The guest(s) on that host started malfunctioning at that point.

Immediate steps taken while the guest was stalled, which did not help:

1. Tried to use high-priority shell on guest console to kill high-CPU processes; they were unkillable.
2. Tried to use guest console to stop and restart network; commands were unresponsive.
3. Tried to use guest console to shutdown/init 0. This caused the console to be terminated, but the guest would not otherwise shut down.
4. Tried to use host xl interface to unplug/replug network bridges. This appeared to work from the host side, but the guest was unaffected.

One thing which I accidentally discovered that *did* help:

1. Tried sending xl trigger nmi from the host to the guest.

When I trigger the stalled guest with an NMI, I get its attention. The guest will print the following on the console:

    Uhhuh. NMI received for unknown reason 00 on CPU 0.
    Do you have a strange power saving mode enabled?
    Dazed and confused, but trying to continue

In some cases (pattern not yet known), the guest will then immediately come back online: the network will come back online, all processes will slowly stop consuming CPU, and things will return to normal. Existing network connections were obviously terminated, but new connections are accepted. In that case, it's like the guest just magically comes back to life.

When this works, the host log shows:

    vif vif-6-0 vif6.0: Guest Rx ready
    br0: port 2(vif6.0) entered blocking state
    br0: port 2(vif6.0) entered forwarding state

And all seems well... as if the guest had never stalled.

However, this is not reliable. In most cases, the guest will print those messages, but the processes will NOT recover, and the network will come back impaired, or not at all. When that happens, repeated NMIs do not help: if the guest doesn't recover the first time, it doesn't recover at all.

The *only* reliable way to recover is to destroy the guest completely, and recreate it. This is a hard destroy: the guest cannot shut itself down. The guest will then run fine... until the next stall. But of course a hard destroy can't be a healthy thing for a guest machine, and that's really not a solution.

Long-term mitigation steps which were tried, and which did not help:

1. Thought this was an SSH bug (since sshd processes were consuming high CPU), installed latest OpenSSH.
2. Thought maybe a PV problem, tried under HVM instead of PV.
3. Noted a problem with grant frames, applied the recommended fix for that. My config now looks like:

       # xen-diag gnttab_query_size 0    # Domain-0
       domid=0: nr_frames=1, max_nr_frames=64
       # xen-diag gnttab_query_size 1    # Xenstore
       domid=1: nr_frames=4, max_nr_frames=4
       # xen-diag gnttab_query_size 6    # My guest
       domid=6: nr_frames=17, max_nr_frames=256

4. Thought maybe a kernel module might be at issue, reviewed list with OpenSuse team, pruned modules.
5. Thought this might be a kernel mismatch, was referred to a new kernel by OpenSuse team (Linux 4.12.13 for OpenSuse 42.3). That changed some of the console output behavior and logging, but did not solve the problem.
6. Thought this might be a general OS mismatch, tried upgrading the guest to the same OS/Xen versions as the host (OpenSuse 15.1/Linux 4.12.14/Xen 4.12.1). In this configuration, no console or log output is generated on the guest at all, it just stalls.
7. Assumed (incorrectly, it now turns out) that something was just "wrong" with my guest, tried a fresh load of host, and a fresh guest. I thought that would solve it, but to my sadness, it did not.

Which means that this is now a reproducible bug.

Steps to reproduce:

1. Get a server. I'm using a Dell PowerEdge R720, but this has happened on several different Dell models. My current server has two 16-core CPUs, and 128GB of RAM.
2. Load Xen 4.12.1 (OpenSuse 15.1/Xen 4.12.1) on the server. Boot it up in Xen Dom0/host mode.
3. Create a new guest machine, also with 4.12.1.
4. Fire up the guest.
5. Put a lot of data on the guest (my guest has 3 TB of files and data).
6. Plug a crossover cable into your server, and plug the other end into some other Linux machine.
7. From that other machine, start pounding the guest. An rsync of the entire data partition is a great way to trigger this. If I run several outbound rsyncs together, I can crash my guest in under 48 hours. If I run 4 or 5, I can often crash the guest in just 2 hours.

If you don't want to damage your SSDs on your other machine, here's my current command. My host is 192.168.1.10, and my guest is 192.168.1.11, so I plug in some other machine, make it, say, 192.168.1.12, and then run:

    nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &

Where /a is my directory full of user data. 4-6 of these running simultaneously will bring the guest to its knees in short order.
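The same load can also be scripted instead of pasting the command several times. What follows is only a minimal Python sketch of that idea, not a tested reproducer: the guest address and the /a directory are taken from the command above, while the function names are hypothetical, and it assumes that key-based ssh login to the guest works without a password prompt.

```python
import subprocess

GUEST = "192.168.1.11"   # guest address from the command above
SRC_DIR = "/a"           # data directory from the command above


def build_stream_cmd(guest, src_dir):
    """Return the ssh argv that makes the guest tar src_dir to its stdout."""
    return ["ssh", guest, "tar", "cf", "-", "--one-file-system", src_dir]


def start_streams(count, guest=GUEST, src_dir=SRC_DIR):
    """Start `count` parallel streams and return the Popen handles.

    The archive data still crosses the network, but is discarded locally,
    so nothing is written to the disks of the machine running this.
    """
    procs = []
    for _ in range(count):
        procs.append(subprocess.Popen(
            build_stream_cmd(guest, src_dir),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,   # throw the tar stream away
            stderr=subprocess.DEVNULL,
        ))
    return procs
```

Calling start_streams(5) and then wait() on each returned handle would correspond to five copies of the command above. Unlike the nohup version, these streams end when the Python process is killed.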
On my most recent test, I did the NMI trigger thing, and found this in the guest's /var/log/messages after sending the trigger (I've removed tagging and timestamping for clarity):

    Uhhuh. NMI received for unknown reason 00 on CPU 0.
    Do you have a strange powersaving mode enabled?
    Dazed and confused, but trying to continue
    clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
    clocksource: 'xen' wd_now: 58842b687eb3c wd_last: 55aa97ff29565 mask: ffffffffffffffff
    clocksource: 'tsc' cs_now: 58d3355ea9a87e cs_last: 585cca21d4f074 mask: ffffffffffffffff
    tsc: Marking TSC unstable due to clocksource watchdog
    BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=-20 stuck for 50117s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x0
      pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
        pending: clocksource_watchdog_work, vmstat_shepherd, cache_reap
    workqueue mm_percpu_wq: flags=0x8
      pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
        pending: vmstat_update
    workqueue writeback: flags=0x4e
      pwq 52: cpus=0-25 flags=0x4 nice=0 active=1/256
        in-flight: 28593:wb_workfn
    workqueue kblockd: flags=0x18
      pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
        pending: blk_mq_run_work_fn, blk_mq_timeout_work
    pool 52: cpus=0-25 flags=0x4 nice=0 hung=0s workers=3 idle: 32044 18125

That led me to search around, and I tripped over this: https://wiki.debian.org/Xen/Clocksource , which describes a guest hanging with the message "clocksource/0: Time went backwards". Although I did not see this message, and this is not directly on point with OpenSuse (since our /proc structure doesn't include some of the switches mentioned), I did notice clocksource references in the logs (see above), and that led me back to https://doc.opensuse.org/documentation/leap/virtualization/html/book.virt/cha-xen-manage.html, and specifically the tsc_mode setting.
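A rough sanity check on the two clocksource lines in the log excerpt above: subtracting the "last" counter from the "now" counter gives the interval the watchdog was looking at. The sketch below rests on two assumptions that are not stated in the log itself: that the 'xen' clocksource counts nanoseconds, and that the 'tsc' values are raw CPU cycles.

```python
# Counter values copied from the clocksource watchdog lines above (hex).
WD_NOW, WD_LAST = 0x58842B687EB3C, 0x55AA97FF29565      # 'xen' clocksource
CS_NOW, CS_LAST = 0x58D3355EA9A87E, 0x585CCA21D4F074    # 'tsc' clocksource

# Assumption: the 'xen' clocksource counts nanoseconds.
wd_delta_ns = WD_NOW - WD_LAST
wd_delta_s = wd_delta_ns / 1e9

# Assumption: the 'tsc' values are raw CPU cycles.
tsc_delta_cycles = CS_NOW - CS_LAST
implied_tsc_ghz = tsc_delta_cycles / wd_delta_ns   # cycles per ns equals GHz

print(f"interval seen by the watchdog: {wd_delta_s:.0f} s ({wd_delta_s / 3600:.1f} h)")
print(f"implied TSC rate over that interval: {implied_tsc_ghz:.2f} GHz")
```

Under those assumptions the interval works out to roughly 50,136 seconds (about 13.9 hours), which is close to the "stuck for 50117s" in the workqueue message, and the implied TSC rate is about 2.6 GHz. If that reading is right, the watchdog simply had not run for the whole duration of the stall. Whether the TSC itself is actually at fault is not something this arithmetic can show.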
I have no idea whether tsc_mode is relevant, but since I'm out of ideas and have nothing better to try, I have now booted my guest with tsc_mode=1 and am stress testing it to see if it fares any better this way.

Administrivia:

OS: OpenSuse 15.1
Linux: 4.12.14-lp151.28.36
Xen: 4.12.1

Dom0 boot parameters:

    dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256

Xen guest config:

    name="guest1"
    description="guest1"
    memory=90112
    maxmem=90112
    vcpus=26
    cpus="4-31"
    on_poweroff="destroy"
    on_reboot="restart"
    on_crash="restart"
    on_watchdog="restart"
    localtime=0
    keymap="en-us"
    type="pv"
    kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
    extra="elevator=noop"
    disk=[
        '/xen/guest1/guest1.root,raw,xvda1,w',
        '/xen/guest1/guest1.swap,raw,xvda2,w',
        '/xen/guest1/guest1.xa,raw,xvda3,w',
    ]
    vif=[
        'rate=100Mb/s,mac=00:16:3f:49:4a:41,bridge=br0',
    ]
    vfb=['type=vnc,vncunused=1']

I had originally thought that I was the only person with this problem, and that's why I thought a fresh guest would fix it - the problem followed me around different servers, so that made sense. Over the past weeks I've set up a fresh guest on my fresh host, and, just on a whim, did the above stress testing on it... sadly, it only lasted for 36 hours.

On the older Xen 4.9, I *never* encountered problems, and nothing changed other than OS/Xen versions when I did the upgrades to the new versions.

Since I can now reproduce the problem on different hardware and setups, I thought I'd start my searches over again. To my relief, I found that, just in the past few weeks, another person has reported what seems to be the same problem, only he reported it to this list (whereas I had sent my report to OpenSuse). In his message, referenced above, he states that the problem is limited to Xen 4.12 and 4.13, and that rolling back to Xen 4.11 solves the problem.

If that's right, there seems to be a *significant* problem somewhere, and it's clearly no longer just one instance.
I am looking for a way to file a bug report for this with Xen; so far, I haven't found it, but I will keep looking. Meanwhile, I'm hoping that these details and history spark something for some of you here.

Do any of you have any ideas on this? Any thoughts, guidance, musings, etc., anything at all would be appreciated.

Again, thank you all for your patience and help, I am very grateful!

Glen

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users
