Re: [Xen-users] Xen 4.12 DomU hang / freeze / stall under high network/disk load
Hi Sarah - Thank you for your email!

On Fri, Feb 14, 2020 at 10:36 AM Sarah Newman <srn@xxxxxxxxx> wrote:
> On 2/14/20 9:00 AM, Glen wrote:
> > On Fri, Feb 14, 2020 at 12:07 AM Tomas Mozes <hydrapolic@xxxxxxxxx> wrote:
> >> The symptoms seem similar:
> >> - xen 4.12
> >> - 2 cpu
> >> - high load
> >> - dom0 is ok, domU stalls
> > Yes, same here.
> > My guest machine survived 24 hours under a very high load test using
> > tsc_mode="always_emulate" in the guest machine config. I realize that
> > 24 hours is hardly conclusive, but given Sarah's suggestion to try to
> > eliminate things quickly,
> But how long does it take to reproduce under the original conditions?
> If you want confidence that something staying up for 24 hours is a fix, you
> want a test that takes much less than 24 hours to reproduce the failure
> repeatedly under the original conditions.
> If it takes 24 hours on average to reproduce, I would say you want to run for
> at least 48 hours, if not significantly longer, to have any confidence
> in a fix.

You're entirely correct - and you've put your finger on the problem, which is that I don't know how long it takes to reproduce.

Since upgrading to Xen 4.12, *most* of my guests have no problem. The guests which do have a problem are my busiest guests, by which I mean the guests that are the most highly-used in terms of web traffic, FTP and rsync traffic, email list messages, and the like.

If I do nothing, and just let all my guests run normally, my less-used guests never have a problem. My high-use guests will invariably stall, and will do so on average after about 4 days of use. However, I've seen them stall as quickly as 36 hours, and I've seen them last as long as 14 days.
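
As a rough way to put numbers on the "how long is long enough" question, here is a minimal sketch. It assumes stalls arrive at random with a constant rate (an exponential model). That is purely an assumption for illustration, not something measured on these hosts, and the function names are hypothetical. Under that assumption it estimates how likely a still-broken guest is to survive a clean test run by luck alone.

```python
import math


def survival_probability(test_hours: float, mean_hours_to_stall: float) -> float:
    """Chance that an UNFIXED guest survives test_hours without stalling.

    Assumes exponentially distributed time-to-stall (illustrative only).
    """
    if test_hours < 0 or mean_hours_to_stall <= 0:
        raise ValueError("test_hours must be >= 0 and mean_hours_to_stall > 0")
    return math.exp(-test_hours / mean_hours_to_stall)


def hours_needed(mean_hours_to_stall: float, max_false_pass: float) -> float:
    """Test length after which a lucky survival is at most max_false_pass likely."""
    if mean_hours_to_stall <= 0 or not 0 < max_false_pass < 1:
        raise ValueError("need mean_hours_to_stall > 0 and 0 < max_false_pass < 1")
    return -mean_hours_to_stall * math.log(max_false_pass)


if __name__ == "__main__":
    mean = 4 * 24  # "on average after about 4 days" for the busy production guests
    for days in (1, 2, 4, 7, 14):
        p = survival_probability(days * 24, mean)
        print(f"{days:>2} clean days: {p:.0%} chance an unfixed guest just got lucky")
    print(f"hours for <=5% false pass: {hours_needed(mean, 0.05):.0f}")
```

Under this model, running for twice the average time to failure (48 hours against a 24-hour average) still leaves roughly a 13.5% chance (e^-2) that an unfixed guest survives, which fits the advice to run "at least 48 hours, if not significantly longer". If the real stall times are not exponential, the numbers change, but the shape of the argument stays the same.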

If I take a guest and stress test it (meaning I make a copy of my busy production guest, put it up elsewhere on an internal test network so it's not getting any outside traffic at all, and start 4-6 of those rsync jobs I mentioned previously), I've seen the guest stall in as little as 90 minutes, and last as long as 48 hours before stalling.

So in my case, what I would want would be to find a solution under which one of my stress-test guests lasts for a full 7 days minimum, which would hopefully imply that a normal guest would last "forever" (for some value of "forever").

I'm going to add your iperf3 thing to my ongoing stress testing mix to see if it helps things go faster. Today, I was able to run it against a guest for two hours, and it did not make the guest crash. So I'll keep trying.

> If you haven't read the entirety of
> https://xenbits.xen.org/docs/unstable/man/xen-tscmode.7.html
> you probably want to then.

Done, thank you! And, again with the admission that I'm down to grasping at straws here, the reason I'm looking at this at all is because I'm trying *anything* I can to restore stability here. The guests make comments in their logs about "clocksource", and I have no idea if it's relevant or not, but when on that page I see phrases like "emulated means ... apps will always run correctly", I feel like that was an avenue worth checking.

Right now, however, because it required a physical data center visit, I have proceeded to try to bracket this further. Today I downgraded Xen on my test host to 4.10, and am now stress-testing the guest again. I will have to let it run for 7 days to meet my arbitrary standard, unless it crashes sooner, so I now have to sit and wait for this, which is frustrating. I'll throw in iperf3 as well, but unless it crashes all I can do is wait.

That's the problem here overall: There is no *pattern* to when a guest stalls. No apparent spike in load. No "length of time" before it goes. It's literally random...
it just, literally.... stalls... like a jogger dying while sitting on the couch.

> I was thinking that if it was network load, you might be able to reproduce
> within a few minutes using iperf, which would make subsequent testing much
> more easy.

I will do as you say, and hope that it helps!

Additional thoughts/comments/guidance welcome and wanted if it occurs to you or anyone! I will report back as soon as anything happens, or once 7 days of uptime under stress have passed.

THANK YOU!
Glen

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users