Re: [Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load

On Thu, Feb 13, 2020 at 7:06 PM Sarah Newman <srn@xxxxxxxxx> wrote:
> > I tried both xl network-detach followed by a network-attach (feeding
> > back in the parameters from my guest machine.)
> OK. Were you able to check if the network device went away in the domU? It 
> should have, but you won't see anything in dmesg necessarily.

Alas no.  The guest was unresponsive when I did this, and console
functionality was very limited (as in, "sync" might have worked, but
nothing else did.)  I can only report that the commands didn't seem to
help, or fix the problem, or have any visible impact on the guest.

> You could try the old scheduler:
> https://xenbits.xen.org/docs/unstable/features/sched_credit.html
> I am skeptical this is the problem, but you could try the old one.

Okay, noted, and added to my list.

> Anything about your setup that's out of the ordinary is a reasonable place to 
> start looking for problems. It may not solve your immediate issue but if
> it means a developer can reproduce, that gives you a chance of the bug 
> actually getting fixed.

Absolutely, and that's what I want.  I obviously want to solve my
immediate problem and just get my setup to be stable, even if that
means running on older Xen... but I take Xen seriously, and even if I
get my situation stabilized, I will still work on this as long as
anyone here wants to listen to me.  :-)

> I'd recommend you start by attempting to reproduce the problem as fast as 
> possible, with the setup as-is, before changing anything. 4 days is too long
> to have any certainty.

Right.  In this case, a failure becomes good (for debugging).

I've got 20 simultaneous tar processes running against my guest right
now (which is way more than I've ever needed or attempted - because
I'm grumpy and trying to do exactly what you say - make the guest
crash as fast as possible so I can eliminate possibilities), with that
tsc_mode="always_emulate" setting.  It's survived that for 10 hours so
far, which is far more than I expected.  I can't imagine that *that*
might solve this, but... I'll continue to watch it as long as I can,
and report either way in the morning to see how it does after 24 hours.
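
For anyone trying to reproduce this, that kind of load could be
approximated with something like the sketch below.  It assumes Python 3
and a tar binary on the PATH.  The function names (tar_once,
run_tar_load) are hypothetical, and this is not the exact set of
commands used in the test described above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tar_once(src_dir):
    """Archive src_dir into a pipe, discard the bytes, return (exit code, bytes read)."""
    # The archive goes to a pipe rather than /dev/null: as far as I know,
    # GNU tar skips reading file contents when its archive is /dev/null,
    # which would generate almost no disk or network load.
    cmd = ["tar", "-cf", "-", "-C", str(src_dir), "."]
    total = 0
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL) as proc:
        while True:
            chunk = proc.stdout.read(1 << 20)  # read up to 1 MiB at a time
            if not chunk:
                break
            total += len(chunk)
    # Leaving the with-block waits for tar to exit, so returncode is set.
    return proc.returncode, total


def run_tar_load(src_dir, workers=20):
    """Run `workers` tar processes at the same time and collect their results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tar_once, [src_dir] * workers))
```

Calling run_tar_load() on a directory that lives on the guest (for
example an NFS mount of it, which is an assumption about the setup)
starts 20 readers at once.  Wrapping the call in a loop keeps the
pressure on until the guest stalls.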

I can leave the host alone and test against that setting more, to see
if I can crash the guest without it faster (again) at the higher load.
I feel like downgrading to Xen 4.10 will probably fix *my* problem,
but mask *the* problem, and I really want both fixed.  :-)

> BTW, if it's the domU network load - you would probably reproduce fastest by 
> running testing between 2 domUs on the same dom0, if you can.

Not under my current setup, no.  This is a huge guest and it (almost)
maxes the host.  I've got a pair of servers committed to this already,
just for testing.  But even if I can solve my immediate issue, I'll
still have another pair (the current production group) of servers I
can mess with, and I'll have more flexibility to change things then.


