
Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler



On 18.02.20 01:39, Glen wrote:
Hello Sander -

If I might chime in, I'm also experiencing what we believe is the same
problem, and hope I'm not breaking any protocol by sharing a few quick
details...

On Mon, Feb 17, 2020 at 3:46 PM Sander Eikelenboom <linux@xxxxxxxxxxxxxx> wrote:
On 17/02/2020 20:58, Sarah Newman wrote:
On 1/7/20 6:25 AM, Alastair Browne wrote:
So in conclusion, the tests indicate that credit2 might be unstable.
For the time being, we are using credit as the chosen scheduler. We
I don't think there are, but have there been any patches since the 4.13.0
release which might have fixed problems with the credit2 scheduler? If not,
what would the next step be in isolating the problem - a debug build of Xen or
something else?
If there are no merged or proposed fixes soon, it may be worth considering 
making the credit scheduler the default again until problems with the
credit2 scheduler are resolved.
I did take a look at Alastair Browne's report you replied to 
(https://lists.xen.org/archives/html/xen-devel/2020-01/msg00361.html)
and I do see some differences:
     - Alastair's machine has multiple sockets, my machines don't.
     - It seems Alastair's config is using ballooning 
(dom0_mem=4096M,max:16384M)? For me that has been a source of trouble in the 
past, so my configs don't.

My configuration has ballooning disabled, we do not use it, and we
still have the problem.
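For anyone wanting to rule ballooning out the same way, pinning dom0 to a fixed allocation is a one-line change. A sketch assuming a Debian-style GRUB layout - the file path, variable name and memory size here are assumptions and vary by distribution:

```shell
# /etc/default/grub -- give dom0 a fixed 4 GiB; max equals the initial
# allocation, so dom0 cannot balloon up or down
GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=4096M,max:4096M"
# then regenerate the boot config and reboot, e.g.:
#   update-grub && reboot
```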

     - The kernels tested are quite old (4.19.67 (latest upstream is 4.19.104), 
4.9.189 (latest upstream is 4.9.214)) and no really new kernel was tested
       (5.4 is available in Debian backport for buster).
     - Alastair, are you using pv, hvm or pvh guests? The report seems to be 
missing the guest configs (I'm primarily using PVH, a few HVMs, and no PV 
except for dom0).

The problem appears to occur for both HVM and PV guests.

A report by Tomas
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
provides his config for his HVM setup.

My initial report
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
contains my PV guest config.

Anyhow, it could be worthwhile to test without ballooning, and to test a recent 
kernel, to rule out an issue with (missing) kernel backports.

Thanks to guidance from Sarah, we've had lots of discussion on the
users list about this, especially this past week (pasting in
https://lists.xenproject.org/archives/html/xen-users/2020-02/ just for
your clicking convenience since I'm there as I type this) and it seems
like we've been able to narrow things down a bit:

* Alastair's config is on very large machines.  Tomas can duplicate
this on a much smaller scale, and I can duplicate it on a single DomU
running as the only guest on a Dom0 host.  So overall host
size/capacity doesn't seem to be very important, nor does the number of
guests on the host.

* I'm using the Linux 4.12.14 kernel on both host and guest with Xen
4.12.1 - for me, the act of just going back to a previous version of
Xen (in my case to Xen 4.10) eliminates the problem.  Tomas is on
kernel 4.14.159, and he reports that even moving back just to Xen 4.11
resolves his issue, whereas the issue seems to still exist in Xen
4.13.  So changing the Xen version without changing kernel versions
seems to resolve this.

* We've had another user mention that "When I switched to openSUSE Xen
4.13.0_04 packages with KernelStable (atm, 5.5.3-25.gd654690), Guests
of all 'flavors' became *much* better behaved.", so we think maybe
something in very recent Xen 4.13 might have helped (or possibly that
latest kernel, although from our limited point of view, the fact that
changing Xen versions back to pre-4.12 solves this without any kernel
changes seems compelling).

* Tomas has already tested, and I am still testing, Xen 4.12 with just
the sched=credit change.  For him that has eliminated the problem as
well; I am still stress-testing my guest under Xen 4.12 sched=credit,
so I cannot report yet, but I am hopeful.

I believe this is why Sarah asked about patches to 4.13... it is
looking to us just on the user level like this is possibly
kernel-independent, but at least Xen-version-dependent, and likely
credit-scheduler-dependent.

I apologize if I should be doing something different here, but it
looks like a few more of us are having what we believe to be the same
problem, and, based only on what I've seen, I've already changed over
all of my production hosts (I run about 20) to sched=credit as a
precautionary measure.
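For anyone wanting to make the same switch, the scheduler is selected by a hypervisor boot parameter. A sketch assuming a Debian-style GRUB layout - the file path and variable name are assumptions and vary by distribution:

```shell
# /etc/default/grub -- boot Xen with the legacy credit scheduler
GRUB_CMDLINE_XEN_DEFAULT="sched=credit"
# regenerate the boot config and reboot, e.g.:
#   update-grub && reboot
```

After rebooting, `xl sched-credit` should list the running domains with their weight/cap, confirming the credit scheduler is active.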

Any thoughts, insights or guidance would be greatly appreciated!

Can you check whether all vcpus of a hanging guest are consuming time
(via xl vcpu-list)?
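A quick way to check this is to sample the output twice and compare the accumulated CPU time. A sketch - replace <domid> with the hung guest's domain id:

```shell
# Print per-vcpu state and accumulated CPU time for the guest, twice,
# a few seconds apart; a vcpu whose Time(s) column keeps growing is
# burning CPU even though the guest appears hung.
xl vcpu-list <domid>
sleep 5
xl vcpu-list <domid>
```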

It would be interesting to see where the vcpus are running around. Can
you please copy the domU's /boot/System.map-<kernel-version> to dom0
and then issue:

/usr/lib/xen/bin/xenctx -C -S -s <domu-system-map> <domid>

This should give a backtrace for all vcpus of <domid>. To recognize a
loop you should issue that multiple times.
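The repeated sampling described above can be scripted so the samples are easy to compare. A sketch - <domu-system-map> and <domid> are placeholders to fill in as above, and the sample count and interval are arbitrary:

```shell
# Take five backtrace samples of all vcpus of the hung domU; stacks
# that look the same in every sample point at where the guest loops.
for i in 1 2 3 4 5; do
    /usr/lib/xen/bin/xenctx -C -S -s <domu-system-map> <domid>
    sleep 2
done
```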


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
