
[Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load

Dear Xen Team:

Since upgrading to Xen 4.12, I have been experiencing an ongoing problem
with stalled guests.  I previously thought I was the only one with this
problem, and reported it to my distro's virtualization team (
see:  https://lists.opensuse.org/opensuse-virtual/2019-12/msg00000.html
 and  https://lists.opensuse.org/opensuse-virtual/2019-12/msg00003.html
for the thread heads ).  Although they tried to help, we all more or
less concluded (wrongly, as it turns out) that I must simply have had a
bad guest.

I finally recreated a new host and a new guest, clean, from scratch,
thinking that would solve the problem.  It didn't.  That led me to
search again, and I now see that another individual (who doubtless
thought HE was the only one) has reported this issue to your list  (
et al).

So I'm sending this report here to let you all know that this problem
is no longer limited to one person, that it is reproducible, and that
it is, in my opinion, severe.  I hope someone here can help, or at
least point us in the right direction.  Here we go:

Problem:  Xen DomU guests randomly stall under high network/disk
load.  Dom0 is not affected.  "Randomly" means anywhere between 1 hour
and 14 days after guest boot; the time to failure seems to shorten
with (or perhaps the problem is triggered by) increased network (and
possibly disk) activity.

Symptoms on the DomU Guest:
1. Guest machine performs normally until the moment of failure.  No
abnormal log/console entries exist.
2. At the moment of failure, the guest's network goes offline.  No
abnormal log/console entries are written at that moment.
3. Processes which were trying to connect to the network start to
consume increasing amounts of CPU.
4. Load average of the guest starts to increase, continuing upward
without apparent bound.
5. If a high-priority bash shell is left logged in on the guest hvc0
console, some commands might still be runnable; most are not.
6. If the guest console is not logged in, the console is frozen and
doesn't even echo characters.
7. Some guests will output messages on the console like this:
kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s!
8. On some others, I also see output like:
BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s!
9. Sometimes there is no output at all on the console.

Symptoms on the Dom0 Host:
1. The Dom0 host is unaffected.  The only indication that anything is
happening on the host is a pair of log entries in /var/log/messages:
vif vif-6-0 vif6.0: Guest Rx stalled
br0: port 2(vif6.0) entered disabled state
2. Other guests are not affected.  (Other guests may stall at other
random times, but a stall on one guest does not seem to affect other
guests directly.)
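Since those two backend messages are the only Dom0 trace, they are easy
to watch for.  A minimal sketch (the log path and the vif6.0 name are
just the examples from my logs; yours will differ):

```shell
# watch_stalls: scan a Dom0 log for the netback stall/recovery signature.
watch_stalls() {
  grep -E 'Guest Rx (stalled|ready)|entered (disabled|forwarding) state' "$1"
}

# Demo against a sample built from the lines quoted above:
cat > /tmp/stall-sample.log <<'EOF'
vif vif-6-0 vif6.0: Guest Rx stalled
br0: port 2(vif6.0) entered disabled state
EOF
watch_stalls /tmp/stall-sample.log
```

Pointing it at the live /var/log/messages (or running it from cron)
gives a timestamped record of each stall and recovery.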

Circumstances when the problem first occurred:
1. All hosts and guests were previously on Xen 4.9.4 (OpenSuse 42.3,
Linux 4.4.180, Xen 4.9.4)
2. I upgraded one physical host to Xen 4.12.1 (OpenSuse 15.1, Linux
4.12.14, Xen 4.12.1).
3. The guest(s) on that host started malfunctioning at that point.

Immediate steps taken while the guest was stalled, which did not help:
1. Tried to use high-priority shell on guest console to kill high-CPU
processes; they were unkillable.
2. Tried to use guest console to stop and restart network; commands
were unresponsive.
3. Tried to use guest console to shutdown/init 0.  This caused the
console session to be terminated, but the guest would not otherwise
shut down.
4. Tried to use host xl interface to unplug/replug network bridges.
This appeared to work from host side, but guest was unaffected.
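For reference, the unplug/replug attempt in step 4 was along these
lines, using `xl network-detach` / `xl network-attach` (domid 6, devid
0, and bridge br0 are the values from my setup; substitute your own).
The demo stubs out xl so the sketch runs anywhere:

```shell
# replug_vif: detach and re-attach a guest's virtual NIC from Dom0.
replug_vif() {
  domid="$1"; devid="$2"; bridge="$3"
  ${XL:-xl} network-detach "$domid" "$devid"
  ${XL:-xl} network-attach "$domid" bridge="$bridge"
}

# Demo with a stubbed xl (drop the XL override to run on a real host):
XL="echo xl" replug_vif 6 0 br0
```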

One thing I discovered by accident that *did* help:
1. Tried sending xl trigger nmi from the host to the guest.

When I trigger the stalled guest with an NMI, I get its attention.
The guest will print the following on the console:

Uhhuh. NMI received for unknown reason 00 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
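For anyone who wants to try the same thing, the trigger is a one-liner,
wrapped here as a function (domid 6 is my guest's; find yours with
`xl list`).  The demo stubs xl so the sketch runs off-host:

```shell
# send_nmi: poke a stalled guest with an NMI via `xl trigger <domid> nmi`.
send_nmi() {
  ${XL:-xl} trigger "$1" nmi
}

# Demo with a stubbed xl (drop the XL override to send a real NMI):
XL="echo xl" send_nmi 6
```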

In some cases (pattern not yet known), the guest will then immediately
come back online:  the network comes back up, all processes slowly
stop consuming CPU, and things return to normal.  Existing network
connections are of course dropped, but new connections are accepted.
In those cases, it's as if the guest just magically comes back to
life.

When this works, the host log shows:
vif vif-6-0 vif6.0: Guest Rx ready
br0: port 2(vif6.0) entered blocking state
br0: port 2(vif6.0) entered forwarding state

And all seems well... as if the guest had never stalled.

However, this is not reliable.  In most cases, the guest will print
those messages, but the processes will NOT recover, and the network
will come back impaired, or not at all.  When that happens, repeated
NMIs do not help:  If the guest doesn't recover the first time, it
doesn't recover at all.

The *only* reliable way to recover is to destroy the guest completely,
and recreate it.  This is a hard destroy:  the guest cannot shut
itself down.  The guest will then run fine... until the next stall.
But of course a hard-destroy can't be a healthy thing for a guest
machine, and that's really not a solution.

Long-term mitigation steps which were tried and did not help:
1. Thought this was an SSH bug (since sshd processes were consuming
high CPU); installed the latest OpenSSH.
2. Thought it was maybe a PV problem; tried under HVM instead of PV.
3. Noted a problem with grant frames, applied the recommended fix for
that, my config now looks like:
# xen-diag gnttab_query_size 0 # Domain-0
domid=0: nr_frames=1, max_nr_frames=64
# xen-diag gnttab_query_size 1 # Xenstore
domid=1: nr_frames=4, max_nr_frames=4
# xen-diag gnttab_query_size 6 # My guest
domid=6: nr_frames=17, max_nr_frames=256
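For completeness, the grant-frame fix was the standard one: raising the
frame limits on the hypervisor command line and, optionally, per guest.
The option names below are from the Xen 4.12 documentation
(gnttab_max_frames on the Xen command line, max_grant_frames in
xl.cfg); the value 256 is simply what I chose:

```
# Xen hypervisor command line (e.g. appended to GRUB_CMDLINE_XEN_DEFAULT):
gnttab_max_frames=256

# Or per guest, in the xl domain config file:
max_grant_frames = 256
```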
4. Thought maybe a kernel module might be at issue, reviewed list with
OpenSuse team, pruned modules.
5. Thought this might be a kernel mismatch, was referred to a new
kernel by OpenSuse team (Linux 4.12.13 for OpenSuse 42.3).  That
changed some of the console output behavior and logging, but did not
solve the problem.
6. Thought this might be a general OS mismatch; tried upgrading the
guest to the same OS/Xen versions as the host (OpenSuse 15.1/Linux
4.12.14/Xen 4.12.1).  In this configuration, no console or log output
is generated on the guest at all; it just stalls.
7. Assumed (incorrectly, it now turns out) that something was just
"wrong" with my guest, tried a fresh load of host, and a fresh guest.
I thought that would solve it, but to my sadness, it did not.

Which means that this is now a reproducible bug.

Steps to reproduce:
1. Get a server.  I'm using a Dell PowerEdge R720, but this has
happened on several different Dell models.  My current server has two
16-core CPUs, and 128GB of RAM.
2. Load Xen 4.12.1 (OpenSuse 15.1/Xen 4.12.1) on the server.  Boot it
up in Xen Dom0/host mode.
3. Create a new guest machine, also with 4.12.1.
4. Fire up the guest.
5. Put a lot of data on the guest (my guest has 3 TB of files and data).
6. Plug a crossover cable into your server, and plug the other end
into some other Linux machine.
7. From that other machine, start pounding the guest.  An rsync of the
entire data partition is a great way to trigger this.  If I run
several outbound rsyncs together, I can crash my guest in under 48
hours.  If I run 4 or 5, I can often crash the guest in just 2 hours.
If you don't want to damage the SSDs on your other machine, here's my
current command (the host, guest, and third-machine addresses are
omitted here); from the other machine, run:

nohup ssh tar cf - --one-file-system /a | cat > /dev/null &

Where /a is my directory full of user data.  4-6 of these running
simultaneously will bring the guest to its knees in short order.
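The N-parallel version of that one-liner can be scripted.  In this
sketch, guest.example and /a are placeholders for the guest address
and data directory; the demo stubs ssh with `true` so the script runs
without a victim machine:

```shell
# stress_guest: start N parallel streams that read SRC from GUEST and
# discard the data, mirroring the nohup/tar one-liner above.
stress_guest() {
  n="$1"; guest="$2"; src="$3"
  i=0
  while [ "$i" -lt "$n" ]; do
    ${SSH:-ssh} "$guest" tar cf - --one-file-system "$src" > /dev/null &
    i=$((i+1))
  done
  wait   # block until every stream exits (or the guest stalls)
}

# Demo with a stubbed ssh (drop the SSH override to hammer a real guest):
SSH=true stress_guest 4 guest.example /a && echo "streams finished"
```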

On my most recent test, I did the NMI trigger thing, and found this in
the guest's /var/log/messages after sending the trigger (I've removed
tagging and timestamping for clarity):

Uhhuh. NMI received for unknown reason 00 on CPU 0.
Do you have a strange powersaving mode enabled?
Dazed and confused, but trying to continue
clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'xen' wd_now: 58842b687eb3c wd_last: 55aa97ff29565 mask:
clocksource: 'tsc' cs_now: 58d3355ea9a87e cs_last: 585cca21d4f074 mask: ffffffffffffffff
tsc: Marking TSC unstable due to clocksource watchdog
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=-20 stuck for 50117s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
     pending: clocksource_watchdog_work, vmstat_shepherd, cache_reap
workqueue mm_percpu_wq: flags=0x8
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
     pending: vmstat_update
workqueue writeback: flags=0x4e
   pwq 52: cpus=0-25 flags=0x4 nice=0 active=1/256
     in-flight: 28593:wb_workfn
workqueue kblockd: flags=0x18
   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
     pending: blk_mq_run_work_fn, blk_mq_timeout_work
pool 52: cpus=0-25 flags=0x4 nice=0 hung=0s workers=3 idle: 32044 18125

That led me to search around, and I tripped over this:
https://wiki.debian.org/Xen/Clocksource , which describes a guest
hanging with the message "clocksource/0: Time went backwards".
Although I did not see that message, and the page is not directly on
point for OpenSuse (since our /proc structure doesn't include some of
the switches mentioned), I did notice the clocksource references in
the logs (see above), and that led me to the tsc_mode setting.  I have
no idea if it's relevant, but since I'm out of ideas and have nothing
better to try, I have now booted my guest with tsc_mode=1 and am
stress testing it to see if it fares any better this way.
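For the record, the setting I'm now testing is the guest-config knob
below.  Per xl.cfg(5), numeric tsc_mode=1 corresponds to
"always_emulate" ("native" and "native_paravirt" are the other
non-default choices):

```
# Guest xl config (see xl.cfg(5)); tsc_mode = 1 is "always_emulate":
tsc_mode = "always_emulate"
```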

OS: OpenSuse 15.1
Linux: 4.12.14-lp151.28.36
Xen: 4.12.1
Dom0 boot parameters: dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin
Xen guest config:


I had originally thought that I was the only person with this problem,
and that's why I thought a fresh guest would fix it; the problem had
followed me around different servers, so that made sense.  Over the
past weeks I've set up a fresh guest on my fresh host and, just on a
whim, ran the above stress testing on it... sadly, it lasted only 36
hours.  On the older Xen 4.9 I *never* encountered problems, and
nothing changed other than the OS/Xen versions when I did the
upgrades.

Since I can now reproduce the problem on different hardware and
setups, I thought I'd start my searches over again.  To my relief, I
found that, just in the past few weeks, another person has now
reported what seems to be the same problem, only he reported it to
this list (whereas I had sent my report to OpenSuse).  In his message,
referenced above, he states that the problem is limited to Xen 4.12
and 4.13, and that rolling back to Xen 4.11 solves the problem.

If that's right, there seems to be a *significant* problem somewhere,
and it's clearly no longer just one instance.

I am looking for the proper way to file a formal bug report against
Xen; so far, I haven't found it, but I will keep looking.

Meanwhile, I'm hoping that these details and history spark something
for some of you here.  Do any of you have any ideas on this?  Any
thoughts, guidance, musings, etc. would be greatly appreciated.
Again, thank you all for your patience and help, I am very grateful!


Xen-users mailing list


