On 02/01/14 06:46, Manohar Vanga wrote:
Hi all,
I've spent the last few weeks trying to debug a weird issue
with a new scheduler I'm developing for Xen. I have written a
barebones round-robin scheduler which seems to work fine when
starting up Dom0, but then at some point during the boot
everything just hangs (somewhat deterministically from what I
can tell from a week of debugging; see below).
I've inlined my source code below. I don't expect anyone to
read the whole thing (although it's quite minimal) so here are
the key points:
- I've implemented the following callbacks: init_domain,
destroy_domain, insert_vcpu, remove_vcpu, sleep, wake,
yield, pick_cpu, do_schedule, init, deinit, alloc_vdata,
free_vdata, alloc_pdata, free_pdata, alloc_domdata,
free_domdata. Most of these are minimal (or in some cases
do nothing). Am I missing anything critical?
- The hang occurs even if I'm running Dom0 with just a
single vcpu. Nothing hangs if I choose a stock scheduler.
Either I'm doing something foolish that is causing a
deadlock (less likely since the code structure is borrowed
from sched_credit.c) or I'm *not* doing something leading
to Dom0 crashing and the vcpu just dying.
If you do suspect some specific issue please let me know.
Below are some of the possible issues that I've investigated
but hit dead ends on:
- Checking if my debug printk statements were leading to a
deadlock due to sleeps in interrupt mode. This doesn't
seem to be the case since Dom0 hangs during boot even if I
disable all debug output.
- I suspected incorrect queuing operations that might be
corrupting memory somewhere. However, my debug logs tell
me that this is not the case. There is at most one element
in the runqueue at all times (I use Dom0 with 1 vcpu).
- I also suspected a deadlock due to incorrect locking.
However, based on what the credit scheduler does in
sched_credit.c, I don't seem to be doing anything
significantly different. In general though, which
callbacks run in interrupt context?
- In the end, I stuck debug statements in tick_suspend and
tick_resume. After the hang, those get called endlessly,
which suggests that the physical CPU has gone
idle. Is this correct? In that case, *what am I doing
wrong in the scheduler* to cause Dom0 to crash?
- The hang occurs around 3-5 seconds into the boot process
quite deterministically. Could it be some periodic timer
going off and interfering with my code in weird and wonderful
ways?
Also, how do the sleep/wake/yield callbacks work? When do
they get called? Is there any documentation on the different
callbacks with regards to when they are called? If I
understand everything correctly after this, I would gladly
create a wiki page explaining this (and perhaps a tutorial
on writing a simple scheduler; something I wish existed!).
I hope the description was enough to help understand my
problem. If not, feel free to ask for more details :-)
Thanks for reading this far! Source code follows
Using printk()s in the code is going to skew the timing terribly.
A serial console and the 'q' debug key are probably a good start, to
see some vcpu state.
'watchdog' on the Xen command line will enable NMI watchdogs which
will catch deadlocks, but as I don't see a single use of spinlocks
in your code, I doubt this is your issue.
Beyond that, writing a custom keyhandler to dump all of the xfair
state is probably the next thing to try.
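
As a rough illustration of what such a dump could report, here is a
user-space sketch in C. Everything in it is an assumption made for the
example: the struct names (xfair_vcpu, xfair_pcpu), their fields, and
the helper functions are hypothetical and not taken from the xfair
source. In Xen itself the text would go out through printk() from a
handler hooked up with register_keyhandler(), whose exact signature
differs between Xen versions, so that part is deliberately left out.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-vcpu scheduler state; field names are illustrative. */
struct xfair_vcpu {
    int domid;                /* owning domain id */
    int vcpu_id;              /* vcpu index within the domain */
    int asleep;               /* 1 if sleep ran more recently than wake */
    struct xfair_vcpu *next;  /* next entry on the runqueue, or NULL */
};

/* Hypothetical per-pcpu scheduler state. */
struct xfair_pcpu {
    int cpu;                  /* physical CPU number */
    struct xfair_vcpu *curr;  /* vcpu currently running, NULL if idle */
    struct xfair_vcpu *runq;  /* head of the runqueue, NULL if empty */
};

/* Append formatted text to buf without ever overflowing it. */
static void append(char *buf, size_t len, size_t *off, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (*off >= len)
        return;
    va_start(ap, fmt);
    n = vsnprintf(buf + *off, len - *off, fmt, ap);
    va_end(ap);
    if (n < 0)
        return;
    *off += (size_t)n;
    if (*off >= len)
        *off = len - 1;       /* output was truncated; stay on the NUL */
}

/*
 * Describe one pcpu: what is running and what is queued. A vcpu that is
 * queued while marked asleep is flagged, since that kind of mismatch is
 * one way a CPU can end up idling with work outstanding.
 */
static void xfair_dump_pcpu(const struct xfair_pcpu *pc, char *buf, size_t len)
{
    size_t off = 0;
    const struct xfair_vcpu *v;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    if (pc->curr != NULL)
        append(buf, len, &off, "cpu%d: curr=d%dv%d",
               pc->cpu, pc->curr->domid, pc->curr->vcpu_id);
    else
        append(buf, len, &off, "cpu%d: curr=idle", pc->cpu);

    append(buf, len, &off, " runq=[");
    for (v = pc->runq; v != NULL; v = v->next)
        append(buf, len, &off, "%sd%dv%d%s",
               v == pc->runq ? "" : " ",
               v->domid, v->vcpu_id,
               v->asleep ? "(asleep!)" : "");
    append(buf, len, &off, "]");
}
```

Calling xfair_dump_pcpu() once per pcpu after the hang would show
whether the Dom0 vcpu is still queued, running, or marked asleep at
the point where the CPU goes idle.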
~Andrew