Re: [Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load

Hi Sarah -

On Fri, Feb 14, 2020 at 6:22 PM Sarah Newman <srn@xxxxxxxxx> wrote:
> I would personally guess it just means that something didn't get to run for a 
> long time. It might be worth using xl list / xl vcpu-list <domain> when
> it's hung to see if it's running or blocked and how many cpu times are going 
> up or not.

Okay, good.  I'm adding that to my list of other things to pull the
next time a guest freezes.  Thank you for this!

> Well, that gets you security support through December 2020.

With zero sarcasm intended, all I really want is for it to get me the
ability to sleep through the night.  :-)  I absolutely plan to keep
pushing on this as long as I can even if I get stability on 4.10 - so
I hope that by December 2020 we/they will have figured this out and
fixed it.  I'll take what I can get!  :-)

> I've gotten very useful data from debug builds of both Linux and Xen. It will 
> massively slow down your system and you don't want to run them in
> production.
> --Sarah

That also might be beyond my ability but I will try.

So here's where I am at right now.  Loosely speaking (and using Xen
version numbers since I'm on the Xen list) what I have is a 4.9
production guest, and a 4.9 hot backup guest, and a 4.12 test guest.
All are running on separate, Dell-based, 4.12 hosts.

I don't have the luxury of stress-testing the production guest, but I
don't need to:  It stalls every 3-5 days.  When it stalls, it's either
during the day, in which case my "test time" is limited, or it's during the
middle of the night, in which case my cognitive ability is limited.
:-)   The next time it stalls, I'm going to do the sysrq things and
the xl list/vcpu-list and other stuff similar to it, and try to
capture it all.  I'll then post it here.
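
To make the "are the cpu times going up or not" check less error-prone
at 3am, here is a rough sketch of how two `xl vcpu-list <domain>`
snapshots could be compared.  It assumes the usual column layout (Name,
ID, VCPU, CPU, State, Time(s), Affinity); the function names are my own
and purely illustrative, and I have not run this against a hung guest.

```python
import subprocess


def snapshot(domain):
    """Return the raw text of `xl vcpu-list <domain>` (needs root on the host)."""
    result = subprocess.run(
        ["xl", "vcpu-list", domain],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_vcpu_times(text):
    """Map (domain name, vcpu number) -> cumulative CPU seconds.

    Assumes whitespace-separated columns in the order
    Name, ID, VCPU, CPU, State, Time(s), Affinity...
    """
    times = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with "Name".
        if len(fields) < 6 or fields[0] == "Name":
            continue
        try:
            times[(fields[0], int(fields[2]))] = float(fields[5])
        except ValueError:
            # Not a data row in the layout assumed above; ignore it.
            continue
    return times


def stuck_vcpus(before, after):
    """Return the vcpus whose CPU time did not increase between snapshots."""
    t0 = parse_vcpu_times(before)
    t1 = parse_vcpu_times(after)
    return sorted(key for key in t0 if key in t1 and t1[key] <= t0[key])
```

The idea would be to call `snapshot()` twice, some seconds apart, and
pass both results to `stuck_vcpus()`.  A vcpu whose time is not moving
is not necessarily hung (an idle, blocked vcpu looks the same), so this
only narrows things down alongside the State column and the sysrq
output.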

As a kind of informal fallback, today I downgraded the hot backup
guest's host to 4.9 again.  The downgrade went fine, that guest is now
running on 4.9 all the way, so it "should" never stall again.  As a
hot backup, its traffic is much lower, so it has only stalled once;
this was really just about seeing whether I could successfully
downgrade production if I needed to.

Today I also downgraded the test host to 4.10 (my only convenient
option <4.12).  I then launched the guest and started the stress
testing.  This next statement is interesting but otherwise useless:
Under the same amount of stress testing the guest's CPU load average
is about half what it was (hovering around 2-3, whereas before it was
hovering around 6).  This is for entertainment value only.

If Tomas' experience applies to me, this should mean that my test
guest will not stall anymore.  I am going to let it run for 7 days
under this load, and report back either at that time, or sooner if it
stalls.

At that time, I'll also have a map for how to proceed.  If the guest survives,
I'm going to roll my client forward to this configuration, because
they need to be on the new OS for a number of other reasons.  So if
this proves to be "stable enough", we'll go forward.

If this guest does NOT survive, I'm going to downgrade the current
production host back to 4.9, putting us back to the place we were
before the trouble started.

Either way, I'll then be left with a pair of machines that are broken
(whichever pair my client "leaves behind") and then I can start much
more aggressively testing everything you've asked me to - because my
client will be happy, and I will be able to (literally) sleep at
night.

In the meantime, if anything else happens, I will report, and if you
or anyone has other thoughts, please tell me.  I definitely want to
get this resolved and fixed for everyone, to the extent I can help - I
just have to deal with the paying client first before I can turn back
to this completely.

(Test guest load average just touched 1.94 before coming back to 2.36
- no idea what this means but I have hope!)

Thank you thank you!


