
Re: [Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load



On 2/14/20 9:00 AM, Glen wrote:
On Fri, Feb 14, 2020 at 12:07 AM Tomas Mozes <hydrapolic@xxxxxxxxx> wrote:
Hello Glen, thanks for your report.

Thank you!

The symptoms seem similar:
- xen 4.12
- 2 cpu
- high load
- dom0 is ok, domU stalls

Yes, same here.

I've just upgraded one of my machines to xen 4.12 (again) with sched=credit, 
I'll report back if it helps.

Thanks.  Right now I'm focused on tsc_mode, since it's what I was
working on before Sarah's responses yesterday.

My guest machine survived 24 hours under a very high load test using
tsc_mode="always_emulate" in the guest machine config.  I realize that
24 hours is hardly conclusive, but given Sarah's suggestion to try to
eliminate things quickly,

But how long does it take to reproduce under the original conditions?

If you want confidence that staying up for 24 hours means something is fixed, you need a test that reliably reproduces the failure in much less than 24 hours under the original conditions.

If it takes 24 hours on average to reproduce, I would say you want to run for at least 48 hours, if not significantly longer, to have any confidence in a fix.

You could use some probability theory to put better numbers on this.
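
As a rough sketch of those numbers: assume (and this is only an assumption, nothing measured in this thread) that the stall shows up at a constant random rate, so the time to failure is exponentially distributed with some mean. The function names below are made up for illustration. The chance that a still-broken guest survives a test of a given length by luck alone is then exp(-test_hours / mean_hours):

```python
import math


def survival_probability(test_hours, mean_hours_to_failure):
    """Chance an UNFIXED system runs test_hours with no failure.

    Assumes exponentially distributed time to failure (constant hazard
    rate), which is a modelling assumption, not a measured property.
    """
    return math.exp(-test_hours / mean_hours_to_failure)


def hours_needed(mean_hours_to_failure, confidence):
    """Failure-free run length after which an unfixed system would
    already have failed with probability `confidence`."""
    return -mean_hours_to_failure * math.log(1.0 - confidence)


# Mean time to failure of 24 hours is a hypothetical input here.
for hours in (24, 48, 72):
    print(hours, round(survival_probability(hours, 24.0), 3))

print(round(hours_needed(24.0, 0.95), 1))
```

Under that model, with a 24 hour mean time to failure, a clean 24 hour run still happens by chance about 37% of the time, a clean 48 hour run about 14% of the time, and it takes roughly 72 hours (three times the mean) before a clean run is only about 5% likely on an unfixed system. The same factor of three applies if the failure can be made to reproduce in minutes.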

I've now switched the guest to what I think
is the opposite tsc_mode="native", and I'm trying it there for a
little while to see if it crashes.   (Neither of these is the default
- the default seems to be a hybrid of the two - but I will return and
test that once more if I can.)

If you haven't read the entirety of

https://xenbits.xen.org/docs/unstable/man/xen-tscmode.7.html

you probably want to.


In addition to my normal test methods, I have at Sarah's suggestion
thrown in an iperf3 at maximum speed, continuous repeat from the host
to the guest, just to see if it helps stall the machine faster.   It's
pushing data at 14GBps right now, so we'll see.

If the vif rate limit is the issue, note that it affects data outbound from the guest, not inbound to the guest. It's easy enough to confirm which direction your test traffic is flowing by checking the interface tx/rx counters.
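
One way to check the counters, as a minimal sketch: on Linux the per-interface byte counters are exposed under /sys/class/net/<iface>/statistics/. The interface name "eth0" in the usage note below is an assumption (substitute whatever the guest actually uses), and the helper names are made up for illustration. Run it inside the guest so that "tx" means outbound from the guest.

```python
import time
from pathlib import Path


def read_byte_counters(iface, root="/sys/class/net"):
    """Return (rx_bytes, tx_bytes) for a Linux network interface."""
    stats = Path(root) / iface / "statistics"
    rx = int((stats / "rx_bytes").read_text())
    tx = int((stats / "tx_bytes").read_text())
    return rx, tx


def dominant_direction(before, after):
    """Given two (rx, tx) samples, report which counter grew more."""
    rx_delta = after[0] - before[0]
    tx_delta = after[1] - before[1]
    return "rx" if rx_delta >= tx_delta else "tx"


def sample_direction(iface, seconds=10):
    """Sample the counters twice, `seconds` apart, while the test runs."""
    before = read_byte_counters(iface)
    time.sleep(seconds)
    after = read_byte_counters(iface)
    return dominant_direction(before, after)
```

Calling sample_direction("eth0") inside the guest while iperf3 is running returns "rx" if the traffic is mostly inbound to the guest and "tx" if it is mostly outbound. If it reports "rx", the test is not exercising the direction the vif rate limit applies to.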

I was thinking that if it was network load, you might be able to reproduce within a few minutes using iperf, which would make subsequent testing much easier.

--Sarah

