[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Questions about GPLPV stability tests



One interesting observation. This morning I had another of my stress
machines with the problem. I never had it on this problem before and
it didn't have any of the new logging / software updates yet.

The system was in the same state as the other machine which I reported
about before. I tried SysRq stuff and other things. While I was about
to reboot the system, a login prompt appeared again on the VGA. I
don't know whether any of the stuff I did triggered it or not. Anyway
it means Linux is not really dead. I tried logging, but I don't even
see characters appearing. The system feels to be very, very, very
slow.

The other tests are still running.

Roderick

On Thu, Dec 8, 2011 at 5:33 PM, Roderick Colenbrander
<thunderbird2k@xxxxxxxxx> wrote:
> On Thu, Dec 8, 2011 at 11:46 PM, Konrad Rzeszutek Wilk
> <konrad@xxxxxxxxxx> wrote:
>> On Wed, Dec 07, 2011 at 08:44:19PM +0000, Roderick Colenbrander wrote:
>>> On Wed, Dec 7, 2011 at 8:38 PM, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> 
>>> wrote:
>>> >> It took about a week, but the systems went down again. Linux is down,
>>> >> but the hypervisor is still reachable on the serial console. Is there
>>> >> anything interesting to dump from there?
>>> >>
>>> > Just the interrupts information. I think that is Ctrl-A, Ctrl-A,
>>> > Ctrl-A, and then '*' to capture everything (I don't remember the right
>>> > one for just interrupts).
>>>
>>> Here's the interrupt information:
>>> (XEN) [2011-12-06 19:13:37] [i: dump interrupt bindings]
>>> (XEN) [2011-12-06 19:13:37] Guest interrupt information:
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   0
>>> affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge
>>>   status=00000000 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   1
>>> affinity:00000000,00000000,00000000,00000001 vec:30 type=IO-APIC-edge
>>>   status=00000014 in-flight=0 domain-list=0:  1(-S--),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   2
>>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC
>>>   status=00000000 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   3
>>> affinity:00000000,00000000,00000000,00000001 vec:38 type=IO-APIC-edge
>>>   status=00000006 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   4
>>> affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   5
>>> affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge
>>>   status=00000000 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   6
>>> affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   7
>>> affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   8
>>> affinity:00000000,00000000,00000000,00000001 vec:58 type=IO-APIC-edge
>>>   status=00000010 in-flight=0 domain-list=0:  8(-S--),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:   9
>>> affinity:00000000,00000000,00000000,00000001 vec:60 type=IO-APIC-level
>>>   status=00000010 in-flight=0 domain-list=0:  9(-S--),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  10
>>> affinity:00000000,00000000,00000000,00000001 vec:68 type=IO-APIC-edge
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  11
>>> affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  12
>>> affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge
>>>   status=00000010 in-flight=0 domain-list=0: 12(-S--),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  13
>>> affinity:00000000,00000000,00000000,00000001 vec:88 type=IO-APIC-edge
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  14
>>> affinity:00000000,00000000,00000000,00000001 vec:90 type=IO-APIC-edge
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  15
>>> affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  16
>>> affinity:00000000,00000000,00000000,00000001 vec:c8 type=IO-APIC-level
>>>   status=00000010 in-flight=0 domain-list=0: 16(-S--),1201: 16(--M-),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  18
>>> affinity:00000000,00000000,00000000,00000001 vec:51 type=IO-APIC-level
>>>   status=00000010 in-flight=0 domain-list=0: 18(-S--),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  19
>>> affinity:00000000,00000000,00000000,00000001 vec:d0 type=IO-APIC-level
>>>   status=00000010 in-flight=0 domain-list=0: 19(-S--),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  21
>>> affinity:00000000,00000000,00000000,00000001 vec:61 type=IO-APIC-level
>>>   status=00000010 in-flight=0 domain-list=0: 21(-S--),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  23
>>> affinity:00000000,00000000,00000000,00000001 vec:59 type=IO-APIC-level
>>>   status=00000010 in-flight=1 domain-list=0: 23(PS-M),
>>> (XEN) [2011-12-06 19:13:37]    IRQ:  24
>>> affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI
>>>   status=00000000 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  25
>>> affinity:00000000,00000000,00000000,00000001 vec:a0 type=PCI-MSI
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  26
>>> affinity:00000000,00000000,00000000,00000001 vec:a8 type=PCI-MSI
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  27
>>> affinity:00000000,00000000,00000000,00000001 vec:b0 type=PCI-MSI
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  28
>>> affinity:00000000,00000000,00000000,00000001 vec:b8 type=PCI-MSI
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  29
>>> affinity:00000000,00000000,00000000,00000001 vec:c0 type=PCI-MSI
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  30
>>> affinity:00000000,00000000,00000000,00000001 vec:d8 type=PCI-MSI
>>>   status=00000014 in-flight=0 domain-list=0:274(PS--),
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  31
>>> affinity:00000000,00000000,00000000,00000001 vec:21 type=PCI-MSI
>>>   status=00000010 in-flight=0 domain-list=0:273(PS--),
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  32
>>> affinity:00000000,00000000,00000000,00000001 vec:29 type=PCI-MSI
>>>   status=00000010 in-flight=0 domain-list=0:272(PS--),
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  33
>>> affinity:00000000,00000000,00000000,00000001 vec:31 type=PCI-MSI
>>>   status=00000010 in-flight=0 domain-list=0:271(PS--),
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  34
>>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:39 type=PCI-MSI
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  35
>>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:41 type=PCI-MSI
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  36
>>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:49 type=PCI-MSI
>>>   status=00000002 mapped, unbound
>>> (XEN) [2011-12-06 19:13:38]    IRQ:  37
>>> affinity:00000000,00000000,00000000,00000010 vec:65 type=PCI-MSI
>>>   status=00000010 in-flight=0 domain-list=1201: 55(--M-),
>>> (XEN) [2011-12-06 19:13:38] IO-APIC interrupt information:
>>> (XEN) [2011-12-06 19:13:38]     IRQ  0 Vec240:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  2: vector=240,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ  1 Vec 48:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  1: vector=48,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ  3 Vec 56:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  3: vector=56,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ  4 Vec 64:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  4: vector=64,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ  5 Vec241:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  5: vector=241,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ  6 Vec 72:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  6: vector=72,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ  7 Vec 80:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  7: vector=80,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ  8 Vec 88:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  8: vector=88,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ  9 Vec 96:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin  9: vector=96,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=level, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 10 Vec104:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 10: vector=104,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 11 Vec112:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 11: vector=112,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 12 Vec120:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 12: vector=120,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 13 Vec136:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 13: vector=136,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 14 Vec144:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 14: vector=144,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 15 Vec152:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 15: vector=152,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0,
>>> irr=0, trigger=edge, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 16 Vec200:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 16: vector=200,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1,
>>> irr=0, trigger=level, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 18 Vec 81:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 18: vector=81,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1,
>>> irr=0, trigger=level, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 19 Vec208:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 19: vector=208,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1,
>>> irr=0, trigger=level, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 21 Vec 97:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 21: vector=97,
>>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1,
>>> irr=0, trigger=level, mask=0, dest_id:1
>>> (XEN) [2011-12-06 19:13:38]     IRQ 23 Vec 89:
>>> (XEN) [2011-12-06 19:13:38]       Apic 0x00, Pin 23: vector=89,
>>> delivery_mode=1, dest_mode=logical, delivery_status=1, polarity=1,
>>> irr=1, trigger=level, mask=0, dest_id:1
>>>
>>> I noticed some potential interesting things. The system in question is
>>> using PCI passthrough. On the serial console I remember seeing that
>>> the PCI device got unmapped from DomU and got mapped again in Dom0.
>>> The serial console log still had a lot of information about this DomU
>>> which I guess was going down. I guess it wasn't fully down yet.
>>
>> One of the debug informations that gets printed with '*' is the guest
>> PCI passthrough data. Such as which IRQ, BAR, etc are mapped in. Did any
>> of those exist?
>
> There is not much PCI related information. Just interrupt stuff, no
> PCI BARs or other interesting PCI stuff:
> (XEN) [2011-12-06 19:13:20] 03:00.0 - dom 1201 - MSIs < 37 >
> (XEN) [2011-12-06 19:13:20]  MSI    37 vec=65 lowest  edge   assert
> log lowest dest=00000010 mask=0/0/-1
>
>
>> My original thought of what is going on is that the interrupts either
>> stopped completly working (does not look like - it looks like the
>> interrupts are firring and the event channels that are bound to it have
>> the bits set saying - hey I am pending). Oddly there are bunch of MSI
>> ones that are masked which usually means they are disabled.
>>
>> Then there is the
>>  affinity:00000000,00000000,00000000,00000001 vec:c8 type=IO-APIC-level
>>   status=00000010 in-flight=0 domain-list=0: 16(-S--),1201: 16(--M-),
>>  affinity:00000000,00000000,00000000,00000010 vec:65 type=PCI-MSI
>>   status=00000010 in-flight=0 domain-list=1201: 55(--M-),
>>
>> The guest has masked both interrupts - so it is off//dying, but somehow
>> the unmapping has not happend. In other words, what you just analyzed
>> and found out.
>>
>> Sadly, there is no smoking gun..
>>
>> So a couple of things that I would do is to ensure that the Xen
>> hypervisor boots with 'sync_console console_to_ring' and when this
>> crash happens see if I can get a stack trace from both the Xen
>> hypervisor (there are some debug parameters to get that - I think it is
>> 'r'?), and also from the Linux kernel.
>
> I set up some new tests, so that will take some days to get a crash.
> For what it is worth, I do have a '*' log left from this system and it
> has some stack trace. If you think it is useful, I could send it
> gzipped, but I don't want to spam this list if it may not be useful.
>
> One thing I did notice was the following:
> (XEN) [2011-12-06 19:13:39] active vcpus:
> (XEN) [2011-12-06 19:13:39]       1: [1201.2] pri=-2 flags=0 cpu=6
> credit=-2411 [w=256]
> (XEN) [2011-12-06 19:13:39]       2: [1201.1] pri=-2 flags=0 cpu=5
> credit=-2460 [w=256]
> (XEN) [2011-12-06 19:13:39]       3: [0.2] pri=0 flags=0 cpu=2
> credit=142 [w=256]
> (XEN) [2011-12-06 19:13:39]       4: [0.1] pri=-2 flags=0 cpu=1
> credit=-2612 [w=256]
> (XEN) [2011-12-06 19:13:39] CPU[00]  sort=12511062,
> sibling=00000000,00000000,00000000,00000003,
> core=00000000,00000000,00000000,000000ff
> (XEN) [2011-12-06 19:13:39]     run: [32767.0] pri=0 flags=0 cpu=0
> (XEN) [2011-12-06 19:13:39]       1: [0.0] pri=-1 flags=0 cpu=0
> credit=333 [w=256]
> (XEN) [2011-12-06 19:13:39] CPU[01]  sort=12511063,
> sibling=00000000,00000000,00000000,00000003,
> core=00000000,00000000,00000000,000000ff
> (XEN) [2011-12-06 19:13:39]     run: [0.1] pri=-2 flags=0 cpu=1
> credit=-2914 [w=256]
> (XEN) [2011-12-06 19:13:39]       1: [32767.1] pri=-64 flags=0 cpu=1
> (XEN) [2011-12-06 19:13:39] CPU[02]  sort=12511063,
> sibling=00000000,00000000,00000000,0000000c,
> core=00000000,00000000,00000000,000000ff
> (XEN) [2011-12-06 19:13:39]     run: [32767.2] pri=-64 flags=0 cpu=2
> (XEN) [2011-12-06 19:13:39] CPU[03]  sort=12511063,
> sibling=00000000,00000000,00000000,0000000c,
> core=00000000,00000000,00000000,000000ff
> (XEN) [2011-12-06 19:13:39]     run: [32767.3] pri=-64 flags=0 cpu=3
> (XEN) [2011-12-06 19:13:39] CPU[04]  sort=12511063,
> sibling=00000000,00000000,00000000,00000030,
> core=00000000,00000000,00000000,000000ff
> (XEN) [2011-12-06 19:13:39]     run: [32767.4] pri=-64 flags=0 cpu=4
> (XEN) [2011-12-06 19:13:39] CPU[05]  sort=12511063,
> sibling=00000000,00000000,00000000,00000030,
> core=00000000,00000000,00000000,000000ff
> (XEN) [2011-12-06 19:13:39]     run: [1201.1] pri=-2 flags=0 cpu=5
> credit=-3617 [w=256]
> (XEN) [2011-12-06 19:13:39]       1: [32767.5] pri=-64 flags=0 cpu=5
> (XEN) [2011-12-06 19:13:39] CPU[06]  sort=12511063,
> sibling=00000000,00000000,00000000,000000c0,
> core=00000000,00000000,00000000,000000ff
> (XEN) [2011-12-06 19:13:39]     run: [1201.2] pri=-2 flags=0 cpu=6
> credit=-3918 [w=256]
> (XEN) [2011-12-06 19:13:39]       1: [32767.6] pri=-64 flags=0 cpu=6
> (XEN) [2011-12-06 19:13:39] CPU[07]  sort=12511063,
> sibling=00000000,00000000,00000000,000000c0,
> core=00000000,00000000,00000000,000000ff
> (XEN) [2011-12-06 19:13:39]     run: [32767.7] pri=-64 flags=0 cpu=7
>
> The system in question has a 4-core i7 CPU where Dom0 is pinned to
> core 0-3 and DomU is pinned 4-7. Aren't the negative credit numbers
> quite big?
>
>>
>> But also see if the problem disappears with the latest 4.1.x hypervisor.
>> And it might also be worth upgrading the dom0 to a 3.0. Hmm, it would be
>> very interesting to see where the dom0 _is_ stuck at. The hypervisor is
>> running fine so it all points to dom0 crashing somewhere..
>>
>> Can you make sure that dom0 runs with 'debug loglevel=8' as well. That
>> should give some wealth of information when/if a crash happens.
>> Oh, and try to pass in Ctrl-Alt-SysRQ-t (on serial concentrators I think
>> you just need to type 'break' on the telnet line)and then t. But I am
>> not entirely sure about it - might want to double check Google and see
>> how to enable Alt-SysRQ.
>
> I tried to get SysRq working. It worked fine on some of my other
> machines (sysrq= 'shift-control-o'), but not on this broken box.
> Apparently the kernel is stuck in some place where SysRq doesn't work.
>
> I'm going to retest with the relevant logging turned on. I also
> upgraded to the latest 2.6.32 kernel (.48 I think). I will also
> upgrade to the latest 4.1.x build from the mercurial repository. It
> will take a few days before I get crash.
>
> Thanks so far!
> Roderick

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.