
Re: [Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load



Hi Sarah -

Thank you for your email!

On Fri, Feb 14, 2020 at 10:36 AM Sarah Newman <srn@xxxxxxxxx> wrote:
> On 2/14/20 9:00 AM, Glen wrote:
> > On Fri, Feb 14, 2020 at 12:07 AM Tomas Mozes <hydrapolic@xxxxxxxxx> wrote:
> >> The symptoms seem similar:
> >> - xen 4.12
> >> - 2 cpu
> >> - high load
> >> - dom0 is ok, domU stalls
> > Yes, same here.
> > My guest machine survived 24 hours under a very high load test using
> > tsc_mode="always_emulate" in the guest machine config.  I realize that
> > 24 hours is hardly conclusive, but given Sarah's suggestion to try to
> > eliminate things quickly,
> But how long does it take to reproduce under the original conditions?
> If you want confidence that something staying up for 24 hours is a fix, you 
> want a test that takes much less than 24 hours to reproduce the failure
> repeatedly under the original conditions.
> If it takes 24 hours on average to reproduce, I would say you want to run for 
> at least 48 hours, if not significantly longer, to have any confidence
> in a fix.

You're entirely correct - and you've put your finger on the problem,
which is that I don't know how long it takes to reproduce.

Since upgrading to Xen 4.12, *most* of my guests have no problem.  The
guests which do have a problem are my busiest guests, by which I mean
the guests that are the most highly-used in terms of web traffic, FTP
and rsync traffic, email list messages, and the like.

If I do nothing, and just let all my guests run normally, my less-used
guests never have a problem.  My high-use guests will invariably
stall, and will do so on average after about 4 days of use.  However,
I've seen them stall as quickly as 36 hours, and I've seen them last
as long as 14 days.

If I take a guest and stress-test it (meaning I make a copy of my
busy production guest, put it up elsewhere on an internal test
network so it's not getting any outside traffic at all, and start
4-6 of those rsync jobs I mentioned previously), I've seen the guest
stall in as little as 90 minutes, and last as long as 48 hours before
stalling.

So in my case, what I would want would be to find a solution under
which one of my stress-test guests lasts for a full 7 days minimum,
which would hopefully imply that a normal guest would last "forever"
(for some value of "forever").
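
As a rough illustration of why the run length matters, here is a
minimal sketch that assumes the stalls are memoryless (exponentially
distributed).  That model is only an assumption, suggested by the lack
of any visible pattern, and the 24-hour mean below is a hypothetical
placeholder rather than a measured value (observed stress-test stalls
ranged from 90 minutes to 48 hours).  It estimates how likely a guest
that is still broken would be to survive a test run purely by luck.

```python
import math


def survival_probability(run_hours: float, mean_hours_to_stall: float) -> float:
    """Chance that an unfixed guest survives run_hours without stalling,
    assuming exponentially distributed (memoryless) stall times."""
    if run_hours < 0 or mean_hours_to_stall <= 0:
        raise ValueError("run_hours must be >= 0 and the mean must be > 0")
    return math.exp(-run_hours / mean_hours_to_stall)


def hours_needed(mean_hours_to_stall: float, max_false_pass: float = 0.05) -> float:
    """Run length after which an unfixed guest would survive by luck
    with probability at most max_false_pass (same exponential assumption)."""
    if mean_hours_to_stall <= 0 or not 0 < max_false_pass < 1:
        raise ValueError("mean must be > 0 and max_false_pass in (0, 1)")
    return -mean_hours_to_stall * math.log(max_false_pass)


# Hypothetical mean time to stall under stress; NOT a measured figure.
ASSUMED_MEAN_HOURS = 24.0

for label, hours in (("24 hours", 24), ("48 hours", 48), ("7 days", 168)):
    p = survival_probability(hours, ASSUMED_MEAN_HOURS)
    print(f"{label:>8}: chance a broken guest survives by luck = {p:.4f}")

print(f"hours for <=5% false pass: {hours_needed(ASSUMED_MEAN_HOURS):.1f}")
```

Under those assumptions, a broken guest would survive a 24-hour run
roughly 37% of the time, a 48-hour run roughly 14% of the time, and a
7-day run about 0.1% of the time, which is why a single 24-hour
survival says little and a 7-day run is far more persuasive.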

I'm going to add your iperf3 thing to my ongoing stress testing mix to
see if it helps things go faster.  Today, I was able to run it against
a guest for two hours, and it did not make the guest crash.  So I'll
keep trying.

> If you haven't read the entirety of
> https://xenbits.xen.org/docs/unstable/man/xen-tscmode.7.html
> you probably want to then.

Done, thank you!  And, again with the admission that I'm down to
grasping at straws here, the reason I'm looking at this at all is
because I'm trying *anything* I can to restore stability here.  The
guests make comments in their logs about "clocksource", and I have no
idea if it's relevant or not, but when on that page I see phrases like
"emulated means ... apps will always run correctly", I feel like that
was an avenue worth checking.
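
For anyone wanting to correlate those "clocksource" log messages with
what the guest kernel is actually using, the following is a minimal
sketch that reads the standard Linux sysfs clocksource files from
inside a guest.  The function name and the dictionary layout are
illustrative choices, and whether the reported clocksource has
anything to do with the stalls is still unknown.

```python
from pathlib import Path
from typing import Dict, List, Optional

# Standard Linux sysfs location for clocksource information.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base: Path = CLOCKSOURCE_DIR) -> Dict[str, Optional[List[str]]]:
    """Return the current and available clocksources as lists of names.

    A value of None means the file could not be read (for example, when
    this is run on a system without the sysfs entry)."""
    result: Dict[str, Optional[List[str]]] = {}
    for key in ("current_clocksource", "available_clocksource"):
        try:
            result[key] = (base / key).read_text().split()
        except OSError:
            result[key] = None
    return result


if __name__ == "__main__":
    info = read_clocksource()
    print("current:  ", info["current_clocksource"])
    print("available:", info["available_clocksource"])
```

Running this in a guest before and after changing tsc_mode would at
least show whether the kernel's choice of clocksource changed.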

Right now, however, because it required a physical data center visit,
I have proceeded to try to bracket this further.  Today I downgraded
Xen on my test host to 4.10, and am now stress-testing the guest
again.  I will have to let it run for 7 days to meet my arbitrary
standard, unless it crashes sooner, so I now have to sit and wait,
which is frustrating.  I'll throw in iperf3 as well, but unless
it crashes all I can do is wait.

That's the problem here overall:  There is no *pattern* to when a
guest stalls.  No apparent spike in load.  No consistent length of
time before it goes.  It's effectively random... it just stalls,
like a jogger dying while sitting on the couch.

> I was thinking that if it was network load, you might be able to reproduce 
> within a few minutes using iperf, which would make subsequent testing much
> more easy.

I will do as you say, and hope that it helps!

Additional thoughts/comments/guidance welcome and wanted if it occurs
to you or anyone!

I will report back as soon as anything happens, or 7 days of uptime
under stress pass.

THANK YOU!
Glen

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users

 

