
Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split



On 16.02.2011 15:11, Juergen Gross wrote:
On 02/16/11 14:54, George Dunlap wrote:
Andre (and Juergen), can you try again with the attached patch?
George, Juergen, thanks for all your work on this!
I will try the patch as soon as I am back in the office this afternoon.

Regards,
Andre.


What the patch basically does is try to make "cpu_disable_scheduler()"
do what it seems to say it does. :-)  Namely, the various
scheduler-related interrupts (both per-cpu ticks and the master tick)
are part of the scheduler, so disable them before doing anything, and
don't enable them until the cpu is really ready to go again.

To be precise:
* cpu_disable_scheduler() disables ticks
* schedule_cpu_switch() only enables ticks if adding a cpu to a pool,
and does it after inserting the idle vcpu
* Modify semantics so that {alloc,free}_pdata() don't actually start or
stop tickers
   + Call tick_{resume,suspend} in cpu_{up,down}, respectively

I tried this before :-)
It didn't work for Andre, but maybe there were some bits missing.

* Modify credit1's tick_{suspend,resume} to handle the master ticker as well.
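The ordering described in the list above can be sketched as a small toy model. This is not Xen code: the struct and the toy_* helpers are hypothetical names made up for illustration, and the per-cpu tick and master tick are lumped into a single flag. It only shows the invariant the patch is aiming for, namely that a tick can never run on a cpu that is between pools or has no idle vcpu yet.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these types or helpers exist in Xen. */
struct toy_cpu {
    bool ticks_enabled;   /* per-cpu tick and master tick, lumped together */
    bool idle_vcpu_ready; /* idle vcpu inserted into the target scheduler */
    int  pool;            /* -1 means "not assigned to any pool" */
};

/* Step 1: quiesce the scheduler's timers before touching anything else. */
static void toy_disable_scheduler(struct toy_cpu *c)
{
    c->ticks_enabled = false;
    c->idle_vcpu_ready = false;
}

/* Step 2: move the cpu. Ticks restart only when the cpu joins a pool,
 * and only after the idle vcpu is in place. */
static void toy_cpu_switch(struct toy_cpu *c, int new_pool)
{
    assert(!c->ticks_enabled);  /* caller must have disabled the scheduler */
    c->pool = new_pool;
    if (new_pool < 0)
        return;                 /* removed from a pool: timers stay off */
    c->idle_vcpu_ready = true;  /* insert the idle vcpu first ... */
    c->ticks_enabled = true;    /* ... then let the timers run */
}

/* A tick is safe only if it cannot observe a half-initialised cpu. */
static bool toy_tick_is_safe(const struct toy_cpu *c)
{
    return !c->ticks_enabled || (c->pool >= 0 && c->idle_vcpu_ready);
}
```

In this model, toy_tick_is_safe() holds after every step of a disable, remove, add sequence, which is the property the real patch tries to establish for the credit scheduler's tickers.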

With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
on one pcpu), I can perform thousands of operations successfully.

Nice. I'll try later. At the moment I'm testing another patch (attached
for review, if you like). I think I've identified two possible races.


Juergen


(NB this is not ready for application yet, I just wanted to check to
see if it fixes Andre's problem)

   -George

On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx> wrote:
Okay, I have some more data.

I activated cpupool_dprintk() and included checks in sched_credit.c to
test for weight inconsistencies. To reduce race possibilities I've added
my patch to execute cpu assigning/unassigning always in a tasklet on the
cpu to be moved.

Here is the result:

(XEN) cpupool_unassign_cpu(pool=0,cpu=6)
(XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
(XEN) cpupool_unassign_cpu(pool=0,cpu=6)
(XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
(XEN) cpupool_assign_cpu(pool=0,cpu=1)
(XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
(XEN) cpupool_assign_cpu(cpu=1) ret 0
(XEN) cpupool_assign_cpu(pool=1,cpu=4)
(XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
(XEN) cpupool_assign_cpu(cpu=4) ret 0
(XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
(XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
(XEN) Xen BUG at sched_credit.c:570
(XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    4
(XEN) RIP:    e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
(XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
(XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
(XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
(XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
(XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
(XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff830839dcfde8:
(XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
(XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
(XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
(XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
(XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
(XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
(XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
(XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
(XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
(XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
(XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
(XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
(XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
(XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
(XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
(XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c4801197d7>] csched_tick+0x186/0x37f
(XEN)    [<ffff82c480126204>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
(XEN)    [<ffff82c480123647>] __do_softirq+0x88/0x99
(XEN)    [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 4:
(XEN) Xen BUG at sched_credit.c:570
(XEN) ****************************************

As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The BUG_ON
triggered in csched_acct() is a logical result of this.
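To see why the numbers in the log are inconsistent, here is a minimal sketch of the arithmetic. It assumes the check has the form quoted elsewhere in this thread, (sdom->weight * sdom->active_vcpu_count) > weight_left, and that a pool's total weight is the sum over the domains active in it. The struct and the toy_* functions are hypothetical stand-ins, not the definitions from sched_credit.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fields printed in the log. */
struct toy_sdom {
    unsigned int weight;
    unsigned int active_vcpu_count;
};

/* Assumed model: a pool's weight is the sum over its active domains. */
static unsigned int toy_pool_weight(const struct toy_sdom *doms, int n)
{
    unsigned int total = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        total += doms[i].weight * doms[i].active_vcpu_count;
    return total;
}

/* True when the quoted condition would fire for this domain. */
static bool toy_weight_check_fails(const struct toy_sdom *sdom,
                                   unsigned int weight_left)
{
    return sdom->weight * sdom->active_vcpu_count > weight_left;
}
```

With the logged values (weight 256, one active vcpu) and a pool that has no domains and therefore weight 0, the comparison is 256 > 0, so the check fires as soon as a Dom0 vcpu is accounted on a cpu of that pool.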

I don't know yet how this can happen.
Does anyone have an idea? I'll keep searching...


Juergen

On 02/15/11 08:22, Juergen Gross wrote:

On 02/14/11 18:57, George Dunlap wrote:

The good news is, I've managed to reproduce this on my local test
hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
attached script. It's time to go home now, but I should be able to
dig something up tomorrow.

To use the script:
* Rename cpupool0 to "p0", and create an empty second pool, "p1"
* You can modify elements by adding "arg=val" as arguments.
* Arguments are:
+ dryrun={true,false} Do the work, but don't actually execute any xl
commands. Default false.
+ left: Number of commands to execute. Default 10.
+ maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
8 cpus).
+ verbose={true,false} Print what you're doing. Default is true.

The script sometimes attempts to remove the last cpu from cpupool0; in
this case, libxl will print an error. If the script gets an error
under that condition, it will ignore it; under any other condition, it
will print diagnostic information.
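The attached script itself is not reproduced in this archive. As a rough illustration of the "arg=val" convention and the defaults listed above, here is a sketch of such an option parser. The struct and function names are hypothetical, and the real script is a shell script, so this only mirrors its documented interface, not its implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the four documented options. */
struct toy_opts {
    bool dryrun;
    long left;
    long maxcpus;
    bool verbose;
};

/* Defaults as stated in the description: false, 10, 7, true. */
static struct toy_opts toy_defaults(void)
{
    struct toy_opts o = { false, 10, 7, true };
    return o;
}

/* Apply one "name=value" argument; returns false if it is not recognised. */
static bool toy_apply_arg(struct toy_opts *o, const char *arg)
{
    assert(o != NULL && arg != NULL);
    const char *eq = strchr(arg, '=');
    if (eq == NULL)
        return false;
    size_t n = (size_t)(eq - arg);   /* length of the name part */
    const char *val = eq + 1;

    if (n == 6 && strncmp(arg, "dryrun", n) == 0)
        o->dryrun = strcmp(val, "true") == 0;
    else if (n == 4 && strncmp(arg, "left", n) == 0)
        o->left = strtol(val, NULL, 10);
    else if (n == 7 && strncmp(arg, "maxcpus", n) == 0)
        o->maxcpus = strtol(val, NULL, 10);
    else if (n == 7 && strncmp(arg, "verbose", n) == 0)
        o->verbose = strcmp(val, "true") == 0;
    else
        return false;
    return true;
}
```

Feeding it "verbose=false", "left=10000" and "maxcpus=11" yields the settings used in the reproduction runs mentioned in this thread, with dryrun left at its default.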

What finally crashed it for me was this command:
# ./cpupool-test.sh verbose=false left=1000

Nice!
With your script I finally managed to get the error, too. On my box (2
sockets with 6 cores each) I had to use

./cpupool-test.sh verbose=false left=10000 maxcpus=11

to trigger it.
Looking for more data now...


Juergen


-George

On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara <andre.przywara@xxxxxxx> wrote:

Juergen Gross wrote:

On 02/10/11 15:18, Andre Przywara wrote:

Andre Przywara wrote:

On 02/10/2011 07:42 AM, Juergen Gross wrote:

On 02/09/11 15:21, Juergen Gross wrote:

Andre, George,


What seems to be interesting: I think the problem always occurred when
a new cpupool was created and the first cpu was moved to it.

I think my previous assumption regarding the master_ticker was not too bad.
I think somehow the master_ticker of the new cpupool is becoming active
before the scheduler is really initialized properly. This could happen if
enough time is spent between alloc_pdata for the cpu to be moved and the
critical section in schedule_cpu_switch().

The solution should be to activate the timers only if the scheduler is
ready for them.

George, do you think the master_ticker should be stopped in suspend_ticker
as well? I still see potential problems for entering deep C-States. I think
I'll prepare a patch which will keep the master_ticker active for the
C-State case and migrate it for the schedule_cpu_switch() case.

Okay, here is a patch for this. It ran on my 4-core machine without any
problems.
Andre, could you give it a try?

Did, but unfortunately it crashed as always. Tried twice and made sure
I booted the right kernel. Sorry.
The idea of a race between the timer and the state change sounded very
appealing; actually, that was suspicious to me from the beginning.

I will add some code to dump the state of all cpupools to the BUG_ON
to see in which situation we are when the bug triggers.

OK, here is a first try at this. The patch iterates over all CPU pools
and outputs some data if the
BUG_ON((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
triggers:
(XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
(XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
(XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
(XEN) Xen BUG at sched_credit.c:1010
....
The masks look proper (6 cores per node), the bug triggers when the
first CPU is about to be(?) inserted.
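The pool masks in the output above are easier to check once decoded into cpu ranges. The following sketch does that and also computes which cpus belong to no pool. The toy_* names are made up for this example, and the total of 48 cpus is an assumption inferred from the width of the pool #0 mask, not something stated in the log.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Print the cpus set in a mask as ranges and return how many are set. */
static int toy_print_mask(const char *label, uint64_t mask)
{
    int count = 0;
    const char *sep = " ";
    int cpu = 0;

    printf("%s:", label);
    while (cpu < 64) {
        if (((mask >> cpu) & 1) == 0) {
            cpu++;
            continue;
        }
        int start = cpu;                      /* first cpu of this run */
        while (cpu < 64 && ((mask >> cpu) & 1) != 0)
            cpu++;
        if (cpu - 1 > start)
            printf("%s%d-%d", sep, start, cpu - 1);
        else
            printf("%s%d", sep, start);
        sep = ",";
        count += cpu - start;
    }
    printf("\n");
    return count;
}

/* Cpus below ncpus that appear in none of the given pool masks. */
static uint64_t toy_unassigned(const uint64_t *pools, int npools, int ncpus)
{
    assert(ncpus > 0 && ncpus <= 64);
    uint64_t all = (ncpus == 64) ? UINT64_MAX : (UINT64_C(1) << ncpus) - 1;
    uint64_t used = 0;
    for (int i = 0; i < npools; i++)
        used |= pools[i];
    return all & ~used;
}
```

Applied to the three masks, pool #0 decodes to cpus 0-5 and 18-47, pool #1 to cpus 6-11, and pool #2 to cpu 12. Under the 48-cpu assumption the unassigned mask is 3e000, i.e. cpus 13-17 are in no pool at the time of the crash.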

Sure? I'm missing the cpu with mask 2000.
I'll try to reproduce the problem on a larger machine here (24 cores, 4 numa
nodes).
Andre, can you give me your xen boot parameters? Which xen changeset are you
running, and do you have any additional patches in use?

The grub lines:
kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0

All of my experiments use c/s 22858 as a base.
If you use an AMD Magny-Cours box for your experiments (socket C32 or G34),
you should add the following patch (removing the line)
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
         __clear_bit(X86_FEATURE_SKINIT % 32, &c);
         __clear_bit(X86_FEATURE_WDT % 32, &c);
         __clear_bit(X86_FEATURE_LWP % 32, &c);
-        __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
         __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
         break;
     case 5: /* MONITOR/MWAIT */

This is not necessary (in fact it reverts my patch c/s 22815), but it raises
the probability of triggering the bug, probably because it increases the
pressure on the Dom0 scheduler. If you cannot trigger it with Dom0, try to
create a guest with many VCPUs and squeeze it into a small CPU-pool.

Good luck ;-)
Andre.

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel





--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html





--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany




 

