Re: [Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load

Anecdotal informational item for entertainment value only:

I had said previously that I had loaded about 2.5TB of data onto my
guest machine, and was testing by rsync'ing (and then tar'ing) all
that data from the test guest to an external source. I was running
multiple jobs simultaneously to try to simulate the heavy load that
was clearly causing my production machines to stall.

Under 4.12, my guests would crumple after 24-36 hours of this type of loading.

Under 4.10, my guest has now been up for 68 hours and, after about 64
hours, the above transfers *completed*.  This has never happened before.  No
guest under 4.12 has ever survived 4+ simultaneous transfers of 2.5TB
of data since I encountered this problem.  They would all stall well
before the transfers could complete.  In contrast, under 4.10, my same
guest ran at (subjectively) about half the load average, and the
transfers all completed, in their entirety, without any stalls.

I have now restarted the testing, at 12 simultaneous transfers instead
of 4, plus Sarah's iperf3 suggestion thrown into the mix as well.
After 30 minutes, my guest is still showing a noticeably lower load
average than it did under 4.12 with just 4 simultaneous transfers.  I
will report on how this goes.

I understand that none of this is particularly objective data.  I'm
sending it only in the hopes that it sparks something for someone
while we wait.  If this guest continues to survive high stress testing
under 4.10 - and I'm starting to have hope that it will - I'm moving
my client over to it this weekend.  Then, starting next week, I'll be
able to do the directed, specific testing Sarah and others suggested,
without any business-side time pressure.
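
For anyone who wants to turn the "subjective" load-average impressions
into numbers while a stress run is going, here is a minimal sketch of a
logger that could run inside the guest. It is not something I used for
the runs above; the function names, the CSV layout and the default
60-second interval are all my own assumptions. It only relies on
Python's standard os.getloadavg(), and it syncs after every sample so
the last line written should still be on disk if the guest stalls.

```python
import csv
import os
import time
from datetime import datetime, timezone


def sample_loadavg():
    """Return one timestamped row with the 1/5/15-minute load averages."""
    one, five, fifteen = os.getloadavg()
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return [stamp, f"{one:.2f}", f"{five:.2f}", f"{fifteen:.2f}"]


def log_loadavg(path, interval_s=60, samples=None):
    """Append load-average samples to a CSV file.

    samples=None keeps sampling until interrupted (hypothetical default);
    pass an integer to take a fixed number of samples.
    """
    # Write the header only when the file is new or empty.
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["utc_time", "load1", "load5", "load15"])
        taken = 0
        while samples is None or taken < samples:
            writer.writerow(sample_loadavg())
            # Flush and sync so the most recent sample survives a stall.
            fh.flush()
            os.fsync(fh.fileno())
            taken += 1
            if samples is None or taken < samples:
                time.sleep(interval_s)
```

Calling log_loadavg("loadavg-4.10.csv") under one Xen version and
log_loadavg("loadavg-4.12.csv") under the other, with the same transfer
jobs running, would give two files that can be compared side by side.
The timestamp of the final row also gives a rough time of the stall.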

But it seems clear to me that there really *is* a problem in 4.12,
where guests seem (still subjectively so far) to stall, and certainly
to run slower, or at a higher load (for some value of slower TBD),
than they do under earlier Xen versions.  I agree that we need more
data, and I'm going to get it.  I hope that any others out there
experiencing this will chime in (I know Tomas is working on this too),
because we need to bracket this well enough to file a useful bug
report so the developers can get this fixed.

More to follow - thank you all for your ongoing attention and support.


