[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Questions about GPLPV stability tests



On Thu, Dec 8, 2011 at 11:46 PM, Konrad Rzeszutek Wilk
<konrad@xxxxxxxxxx> wrote:
> On Wed, Dec 07, 2011 at 08:44:19PM +0000, Roderick Colenbrander wrote:
>> On Wed, Dec 7, 2011 at 8:38 PM, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> 
>> wrote:
>> >> It took about a week, but the systems went down again. Linux is down,
>> >> but the hypervisor is still reachable on the serial console. Is there
>> >> anything interesting to dump from there?
>> >>
>> > Just the interrupts information. I think that is Ctrl-A, Ctrl-A,
>> > Ctrl-A, and then '*' to capture everything (I don't remember the right
>> > one for just interrupts).
>>
>> Here's the interrupt information:
>> (XEN) [2011-12-06 19:13:37] [i: dump interrupt bindings]
>> (XEN) [2011-12-06 19:13:37] Guest interrupt information:
>> (XEN) [2011-12-06 19:13:37]    IRQ:   0
>> affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge
>>   status=00000000 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:   1
>> affinity:00000000,00000000,00000000,00000001 vec:30 type=IO-APIC-edge
>>   status=00000014 in-flight=0 domain-list=0:  1(-S--),
>> (XEN) [2011-12-06 19:13:37]    IRQ:   2
>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC
>>   status=00000000 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:   3
>> affinity:00000000,00000000,00000000,00000001 vec:38 type=IO-APIC-edge
>>   status=00000006 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:   4
>> affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:   5
>> affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge
>>   status=00000000 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:   6
>> affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:   7
>> affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:   8
>> affinity:00000000,00000000,00000000,00000001 vec:58 type=IO-APIC-edge
>>   status=00000010 in-flight=0 domain-list=0:  8(-S--),
>> (XEN) [2011-12-06 19:13:37]    IRQ:   9
>> affinity:00000000,00000000,00000000,00000001 vec:60 type=IO-APIC-level
>>   status=00000010 in-flight=0 domain-list=0:  9(-S--),
>> (XEN) [2011-12-06 19:13:37]    IRQ:  10
>> affinity:00000000,00000000,00000000,00000001 vec:68 type=IO-APIC-edge
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:  11
>> affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:  12
>> affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge
>>   status=00000010 in-flight=0 domain-list=0: 12(-S--),
>> (XEN) [2011-12-06 19:13:37]    IRQ:  13
>> affinity:00000000,00000000,00000000,00000001 vec:88 type=IO-APIC-edge
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:  14
>> affinity:00000000,00000000,00000000,00000001 vec:90 type=IO-APIC-edge
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:  15
>> affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:37]    IRQ:  16
>> affinity:00000000,00000000,00000000,00000001 vec:c8 type=IO-APIC-level
>>   status=00000010 in-flight=0 domain-list=0: 16(-S--),1201: 16(--M-),
>> (XEN) [2011-12-06 19:13:37]    IRQ:  18
>> affinity:00000000,00000000,00000000,00000001 vec:51 type=IO-APIC-level
>>   status=00000010 in-flight=0 domain-list=0: 18(-S--),
>> (XEN) [2011-12-06 19:13:37]    IRQ:  19
>> affinity:00000000,00000000,00000000,00000001 vec:d0 type=IO-APIC-level
>>   status=00000010 in-flight=0 domain-list=0: 19(-S--),
>> (XEN) [2011-12-06 19:13:37]    IRQ:  21
>> affinity:00000000,00000000,00000000,00000001 vec:61 type=IO-APIC-level
>>   status=00000010 in-flight=0 domain-list=0: 21(-S--),
>> (XEN) [2011-12-06 19:13:37]    IRQ:  23
>> affinity:00000000,00000000,00000000,00000001 vec:59 type=IO-APIC-level
>>   status=00000010 in-flight=1 domain-list=0: 23(PS-M),
>> (XEN) [2011-12-06 19:13:37]    IRQ:  24
>> affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI
>>   status=00000000 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  25
>> affinity:00000000,00000000,00000000,00000001 vec:a0 type=PCI-MSI
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  26
>> affinity:00000000,00000000,00000000,00000001 vec:a8 type=PCI-MSI
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  27
>> affinity:00000000,00000000,00000000,00000001 vec:b0 type=PCI-MSI
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  28
>> affinity:00000000,00000000,00000000,00000001 vec:b8 type=PCI-MSI
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  29
>> affinity:00000000,00000000,00000000,00000001 vec:c0 type=PCI-MSI
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  30
>> affinity:00000000,00000000,00000000,00000001 vec:d8 type=PCI-MSI
>>   status=00000014 in-flight=0 domain-list=0:274(PS--),
>> (XEN) [2011-12-06 19:13:38]    IRQ:  31
>> affinity:00000000,00000000,00000000,00000001 vec:21 type=PCI-MSI
>>   status=00000010 in-flight=0 domain-list=0:273(PS--),
>> (XEN) [2011-12-06 19:13:38]    IRQ:  32
>> affinity:00000000,00000000,00000000,00000001 vec:29 type=PCI-MSI
>>   status=00000010 in-flight=0 domain-list=0:272(PS--),
>> (XEN) [2011-12-06 19:13:38]    IRQ:  33
>> affinity:00000000,00000000,00000000,00000001 vec:31 type=PCI-MSI
>>   status=00000010 in-flight=0 domain-list=0:271(PS--),
>> (XEN) [2011-12-06 19:13:38]    IRQ:  34
>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:39 type=PCI-MSI
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  35
>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:41 type=PCI-MSI
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  36
>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:49 type=PCI-MSI
>>   status=00000002 mapped, unbound
>> (XEN) [2011-12-06 19:13:38]    IRQ:  37
>> affinity:00000000,00000000,00000000,00000010 vec:65 type=PCI-MSI
>>   status=00000010 in-flight=0 domain-list=1201: 55(--M-),
>> (XEN) [2011-12-06 19:13:38] IO-APIC interrupt information:
>> (XEN) [2011-12-06 19:13:38]     IRQ  0 Vec240:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  2: vector=240,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ  1 Vec 48:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  1: vector=48,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ  3 Vec 56:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  3: vector=56,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ  4 Vec 64:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  4: vector=64,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ  5 Vec241:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  5: vector=241,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ  6 Vec 72:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  6: vector=72,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ  7 Vec 80:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  7: vector=80,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ  8 Vec 88:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  8: vector=88,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ  9 Vec 96:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  9: vector=96,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=level, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 10 Vec104:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 10: vector=104,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 11 Vec112:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 11: vector=112,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 12 Vec120:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 12: vector=120,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 13 Vec136:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 13: vector=136,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 14 Vec144:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 14: vector=144,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 15 Vec152:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 15: vector=152,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>> irr=0, trigger=edge, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 16 Vec200:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 16: vector=200,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1,
>> irr=0, trigger=level, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 18 Vec 81:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 18: vector=81,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1,
>> irr=0, trigger=level, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 19 Vec208:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 19: vector=208,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1,
>> irr=0, trigger=level, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 21 Vec 97:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 21: vector=97,
>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1,
>> irr=0, trigger=level, mask=0, dest_id:1
>> (XEN) [2011-12-06 19:13:38]     IRQ 23 Vec 89:
>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 23: vector=89,
>> delivery_mode=1, dest_mode=logical, delivery_status=1, polarity=1,
>> irr=1, trigger=level, mask=0, dest_id:1
>>
>> I noticed some potential interesting things. The system in question is
>> using PCI passthrough. On the serial console I remember seeing that
>> the PCI device got unmapped from DomU and got mapped again in Dom0.
>> The serial console log still had a lot of information about this DomU
>> which I guess was going down. I guess it wasn't fully down yet.
>
> One of the debug informations that gets printed with '*' is the guest
> PCI passthrough data. Such as which IRQ, BAR, etc are mapped in. Did any
> of those exist?

There is not much PCI related information. Just interrupt stuff, no
PCI BARs or other interesting PCI stuff:
(XEN) [2011-12-06 19:13:20] 03:00.0 - dom 1201 - MSIs < 37 >
(XEN) [2011-12-06 19:13:20]  MSI    37 vec=65 lowest  edge   assert
log lowest dest=00000010 mask=0/0/-1


> My original thought of what is going on is that the interrupts either
> stopped completly working (does not look like - it looks like the
> interrupts are firring and the event channels that are bound to it have
> the bits set saying - hey I am pending). Oddly there are bunch of MSI
> ones that are masked which usually means they are disabled.
>
> Then there is the
>  affinity:00000000,00000000,00000000,00000001 vec:c8 type=IO-APIC-level
>   status=00000010 in-flight=0 domain-list=0: 16(-S--),1201: 16(--M-),
>  affinity:00000000,00000000,00000000,00000010 vec:65 type=PCI-MSI
>   status=00000010 in-flight=0 domain-list=1201: 55(--M-),
>
> The guest has masked both interrupts - so it is off//dying, but somehow
> the unmapping has not happend. In other words, what you just analyzed
> and found out.
>
> Sadly, there is no smoking gun..
>
> So a couple of things that I would do is to ensure that the Xen
> hypervisor boots with 'sync_console console_to_ring' and when this
> crash happens see if I can get a stack trace from both the Xen
> hypervisor (there are some debug parameters to get that - I think it is
> 'r'?), and also from the Linux kernel.

I set up some new tests, so that will take some days to get a crash.
For what it is worth, I do have a '*' log left from this system and it
has some stack trace. If you think it is useful, I could send it
gzipped, but I don't want to spam this list if it may not be useful.

One thing I did notice was the following:
(XEN) [2011-12-06 19:13:39] active vcpus:
(XEN) [2011-12-06 19:13:39]       1: [1201.2] pri=-2 flags=0 cpu=6
credit=-2411 [w=256]
(XEN) [2011-12-06 19:13:39]       2: [1201.1] pri=-2 flags=0 cpu=5
credit=-2460 [w=256]
(XEN) [2011-12-06 19:13:39]       3: [0.2] pri=0 flags=0 cpu=2
credit=142 [w=256]
(XEN) [2011-12-06 19:13:39]       4: [0.1] pri=-2 flags=0 cpu=1
credit=-2612 [w=256]
(XEN) [2011-12-06 19:13:39] CPU[00]  sort=12511062,
sibling=00000000,00000000,00000000,00000003,
core=00000000,00000000,00000000,000000ff
(XEN) [2011-12-06 19:13:39]     run: [32767.0] pri=0 flags=0 cpu=0
(XEN) [2011-12-06 19:13:39]       1: [0.0] pri=-1 flags=0 cpu=0
credit=333 [w=256]
(XEN) [2011-12-06 19:13:39] CPU[01]  sort=12511063,
sibling=00000000,00000000,00000000,00000003,
core=00000000,00000000,00000000,000000ff
(XEN) [2011-12-06 19:13:39]     run: [0.1] pri=-2 flags=0 cpu=1
credit=-2914 [w=256]
(XEN) [2011-12-06 19:13:39]       1: [32767.1] pri=-64 flags=0 cpu=1
(XEN) [2011-12-06 19:13:39] CPU[02]  sort=12511063,
sibling=00000000,00000000,00000000,0000000c,
core=00000000,00000000,00000000,000000ff
(XEN) [2011-12-06 19:13:39]     run: [32767.2] pri=-64 flags=0 cpu=2
(XEN) [2011-12-06 19:13:39] CPU[03]  sort=12511063,
sibling=00000000,00000000,00000000,0000000c,
core=00000000,00000000,00000000,000000ff
(XEN) [2011-12-06 19:13:39]     run: [32767.3] pri=-64 flags=0 cpu=3
(XEN) [2011-12-06 19:13:39] CPU[04]  sort=12511063,
sibling=00000000,00000000,00000000,00000030,
core=00000000,00000000,00000000,000000ff
(XEN) [2011-12-06 19:13:39]     run: [32767.4] pri=-64 flags=0 cpu=4
(XEN) [2011-12-06 19:13:39] CPU[05]  sort=12511063,
sibling=00000000,00000000,00000000,00000030,
core=00000000,00000000,00000000,000000ff
(XEN) [2011-12-06 19:13:39]     run: [1201.1] pri=-2 flags=0 cpu=5
credit=-3617 [w=256]
(XEN) [2011-12-06 19:13:39]       1: [32767.5] pri=-64 flags=0 cpu=5
(XEN) [2011-12-06 19:13:39] CPU[06]  sort=12511063,
sibling=00000000,00000000,00000000,000000c0,
core=00000000,00000000,00000000,000000ff
(XEN) [2011-12-06 19:13:39]     run: [1201.2] pri=-2 flags=0 cpu=6
credit=-3918 [w=256]
(XEN) [2011-12-06 19:13:39]       1: [32767.6] pri=-64 flags=0 cpu=6
(XEN) [2011-12-06 19:13:39] CPU[07]  sort=12511063,
sibling=00000000,00000000,00000000,000000c0,
core=00000000,00000000,00000000,000000ff
(XEN) [2011-12-06 19:13:39]     run: [32767.7] pri=-64 flags=0 cpu=7

The system in question has a 4-core i7 CPU where Dom0 is pinned to
core 0-3 and DomU is pinned 4-7. Aren't the negative credit numbers
quite big?

>
> But also see if the problem disappears with the latest 4.1.x hypervisor.
> And it might also be worth upgrading the dom0 to a 3.0. Hmm, it would be
> very interesting to see where the dom0 _is_ stuck at. The hypervisor is
> running fine so it all points to dom0 crashing somewhere..
>
> Can you make sure that dom0 runs with 'debug loglevel=8' as well. That
> should give some wealth of information when/if a crash happens.
> Oh, and try to pass in Ctrl-Alt-SysRQ-t (on serial concentrators I think
> you just need to type 'break' on the telnet line)and then t. But I am
> not entirely sure about it - might want to double check Google and see
> how to enable Alt-SysRQ.

I tried to get SysRq working. It worked fine on some of my other
machines (sysrq= 'shift-control-o'), but not on this broken box.
Apparently the kernel is stuck in some place where SysRq doesn't work.

I'm going to retest with the relevant logging turned on. I also
upgraded to the latest 2.6.32 kernel (.48 I think). I will also
upgrade to the latest 4.1.x build from the mercurial repository. It
will take a few days before I get crash.

Thanks so far!
Roderick

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.