[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Questions about GPLPV stability tests
One interesting observation. This morning I had another of my stress machines with the problem. I never had it on this problem before and it didn't have any of the new logging / software updates yet. The system was in the same state as the other machine which I reported about before. I tried SysRq stuff and other things. While I was about to reboot the system, a login prompt appeared again on the VGA. I don't know whether any of the stuff I did triggered it or not. Anyway it means Linux is not really dead. I tried logging, but I don't even see characters appearing. The system feels to be very, very, very slow. The other tests are still running. Roderick On Thu, Dec 8, 2011 at 5:33 PM, Roderick Colenbrander <thunderbird2k@xxxxxxxxx> wrote: > On Thu, Dec 8, 2011 at 11:46 PM, Konrad Rzeszutek Wilk > <konrad@xxxxxxxxxx> wrote: >> On Wed, Dec 07, 2011 at 08:44:19PM +0000, Roderick Colenbrander wrote: >>> On Wed, Dec 7, 2011 at 8:38 PM, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> >>> wrote: >>> >> It took about a week, but the systems went down again. Linux is down, >>> >> but the hypervisor is still reachable on the serial console. Is there >>> >> anything interesting to dump from there? >>> >> >>> > Just the interrupts information. I think that is Ctrl-A, Ctrl-A, >>> > Ctrl-A, and then '*' to capture everything (I don't remember the right >>> > one for just interrupts). >>> >>> Here's the interrupt information: >>> (XEN) [2011-12-06 19:13:37] [i: dump interrupt bindings] >>> (XEN) [2011-12-06 19:13:37] Guest interrupt information: >>> (XEN) [2011-12-06 19:13:37] IRQ: 0 >>> affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge >>> status=00000000 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 1 >>> affinity:00000000,00000000,00000000,00000001 vec:30 type=IO-APIC-edge >>> status=00000014 in-flight=0 domain-list=0: 1(-S--), >>> (XEN) [2011-12-06 19:13:37] IRQ: 2 >>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC >>> status=00000000 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 3 >>> affinity:00000000,00000000,00000000,00000001 vec:38 type=IO-APIC-edge >>> status=00000006 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 4 >>> affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 5 >>> affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge >>> status=00000000 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 6 >>> affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 7 >>> affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 8 >>> affinity:00000000,00000000,00000000,00000001 vec:58 type=IO-APIC-edge >>> status=00000010 in-flight=0 domain-list=0: 8(-S--), >>> (XEN) [2011-12-06 19:13:37] IRQ: 9 >>> affinity:00000000,00000000,00000000,00000001 vec:60 type=IO-APIC-level >>> status=00000010 in-flight=0 domain-list=0: 9(-S--), >>> (XEN) [2011-12-06 19:13:37] IRQ: 10 >>> affinity:00000000,00000000,00000000,00000001 vec:68 type=IO-APIC-edge >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 11 >>> affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 12 >>> affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge >>> status=00000010 in-flight=0 domain-list=0: 12(-S--), >>> (XEN) [2011-12-06 19:13:37] IRQ: 13 >>> affinity:00000000,00000000,00000000,00000001 vec:88 type=IO-APIC-edge >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 14 >>> affinity:00000000,00000000,00000000,00000001 vec:90 type=IO-APIC-edge >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 15 >>> affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:37] IRQ: 16 >>> affinity:00000000,00000000,00000000,00000001 vec:c8 type=IO-APIC-level >>> status=00000010 in-flight=0 domain-list=0: 16(-S--),1201: 16(--M-), >>> (XEN) [2011-12-06 19:13:37] IRQ: 18 >>> affinity:00000000,00000000,00000000,00000001 vec:51 type=IO-APIC-level >>> status=00000010 in-flight=0 domain-list=0: 18(-S--), >>> (XEN) [2011-12-06 19:13:37] IRQ: 19 >>> affinity:00000000,00000000,00000000,00000001 vec:d0 type=IO-APIC-level >>> status=00000010 in-flight=0 domain-list=0: 19(-S--), >>> (XEN) [2011-12-06 19:13:37] IRQ: 21 >>> affinity:00000000,00000000,00000000,00000001 vec:61 type=IO-APIC-level >>> status=00000010 in-flight=0 domain-list=0: 21(-S--), >>> (XEN) [2011-12-06 19:13:37] IRQ: 23 >>> affinity:00000000,00000000,00000000,00000001 vec:59 type=IO-APIC-level >>> status=00000010 in-flight=1 domain-list=0: 23(PS-M), >>> (XEN) [2011-12-06 19:13:37] IRQ: 24 >>> affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI >>> status=00000000 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 25 >>> affinity:00000000,00000000,00000000,00000001 vec:a0 type=PCI-MSI >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 26 >>> affinity:00000000,00000000,00000000,00000001 vec:a8 type=PCI-MSI >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 27 >>> affinity:00000000,00000000,00000000,00000001 vec:b0 type=PCI-MSI >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 28 >>> affinity:00000000,00000000,00000000,00000001 vec:b8 type=PCI-MSI >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 29 >>> affinity:00000000,00000000,00000000,00000001 vec:c0 type=PCI-MSI >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 30 >>> affinity:00000000,00000000,00000000,00000001 vec:d8 type=PCI-MSI >>> status=00000014 in-flight=0 domain-list=0:274(PS--), >>> (XEN) [2011-12-06 19:13:38] IRQ: 31 >>> affinity:00000000,00000000,00000000,00000001 vec:21 type=PCI-MSI >>> status=00000010 in-flight=0 domain-list=0:273(PS--), >>> (XEN) [2011-12-06 19:13:38] IRQ: 32 >>> affinity:00000000,00000000,00000000,00000001 vec:29 type=PCI-MSI >>> status=00000010 in-flight=0 domain-list=0:272(PS--), >>> (XEN) [2011-12-06 19:13:38] IRQ: 33 >>> affinity:00000000,00000000,00000000,00000001 vec:31 type=PCI-MSI >>> status=00000010 in-flight=0 domain-list=0:271(PS--), >>> (XEN) [2011-12-06 19:13:38] IRQ: 34 >>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:39 type=PCI-MSI >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 35 >>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:41 type=PCI-MSI >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 36 >>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:49 type=PCI-MSI >>> status=00000002 mapped, unbound >>> (XEN) [2011-12-06 19:13:38] IRQ: 37 >>> affinity:00000000,00000000,00000000,00000010 vec:65 type=PCI-MSI >>> status=00000010 in-flight=0 domain-list=1201: 55(--M-), >>> (XEN) [2011-12-06 19:13:38] IO-APIC interrupt information: >>> (XEN) [2011-12-06 19:13:38] IRQ 0 Vec240: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 2: vector=240, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 1 Vec 48: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 1: vector=48, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 3 Vec 56: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 3: vector=56, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 4 Vec 64: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 4: vector=64, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 5 Vec241: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 5: vector=241, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 6 Vec 72: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 6: vector=72, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 7 Vec 80: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 7: vector=80, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 8 Vec 88: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 8: vector=88, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 9 Vec 96: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 9: vector=96, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=level, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 10 Vec104: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 10: vector=104, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 11 Vec112: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 11: vector=112, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 12 Vec120: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 12: vector=120, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 13 Vec136: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 13: vector=136, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 14 Vec144: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 14: vector=144, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 15 Vec152: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 15: vector=152, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, >>> irr=0, trigger=edge, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 16 Vec200: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 16: vector=200, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1, >>> irr=0, trigger=level, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 18 Vec 81: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 18: vector=81, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1, >>> irr=0, trigger=level, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 19 Vec208: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 19: vector=208, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1, >>> irr=0, trigger=level, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 21 Vec 97: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 21: vector=97, >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1, >>> irr=0, trigger=level, mask=0, dest_id:1 >>> (XEN) [2011-12-06 19:13:38] IRQ 23 Vec 89: >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 23: vector=89, >>> delivery_mode=1, dest_mode=logical, delivery_status=1, polarity=1, >>> irr=1, trigger=level, mask=0, dest_id:1 >>> >>> I noticed some potential interesting things. The system in question is >>> using PCI passthrough. On the serial console I remember seeing that >>> the PCI device got unmapped from DomU and got mapped again in Dom0. >>> The serial console log still had a lot of information about this DomU >>> which I guess was going down. I guess it wasn't fully down yet. >> >> One of the debug informations that gets printed with '*' is the guest >> PCI passthrough data. Such as which IRQ, BAR, etc are mapped in. Did any >> of those exist? > > There is not much PCI related information. Just interrupt stuff, no > PCI BARs or other interesting PCI stuff: > (XEN) [2011-12-06 19:13:20] 03:00.0 - dom 1201 - MSIs < 37 > > (XEN) [2011-12-06 19:13:20] MSI 37 vec=65 lowest edge assert > log lowest dest=00000010 mask=0/0/-1 > > >> My original thought of what is going on is that the interrupts either >> stopped completly working (does not look like - it looks like the >> interrupts are firring and the event channels that are bound to it have >> the bits set saying - hey I am pending). Oddly there are bunch of MSI >> ones that are masked which usually means they are disabled. >> >> Then there is the >> affinity:00000000,00000000,00000000,00000001 vec:c8 type=IO-APIC-level >> status=00000010 in-flight=0 domain-list=0: 16(-S--),1201: 16(--M-), >> affinity:00000000,00000000,00000000,00000010 vec:65 type=PCI-MSI >> status=00000010 in-flight=0 domain-list=1201: 55(--M-), >> >> The guest has masked both interrupts - so it is off//dying, but somehow >> the unmapping has not happend. In other words, what you just analyzed >> and found out. >> >> Sadly, there is no smoking gun.. >> >> So a couple of things that I would do is to ensure that the Xen >> hypervisor boots with 'sync_console console_to_ring' and when this >> crash happens see if I can get a stack trace from both the Xen >> hypervisor (there are some debug parameters to get that - I think it is >> 'r'?), and also from the Linux kernel. > > I set up some new tests, so that will take some days to get a crash. > For what it is worth, I do have a '*' log left from this system and it > has some stack trace. If you think it is useful, I could send it > gzipped, but I don't want to spam this list if it may not be useful. > > One thing I did notice was the following: > (XEN) [2011-12-06 19:13:39] active vcpus: > (XEN) [2011-12-06 19:13:39] 1: [1201.2] pri=-2 flags=0 cpu=6 > credit=-2411 [w=256] > (XEN) [2011-12-06 19:13:39] 2: [1201.1] pri=-2 flags=0 cpu=5 > credit=-2460 [w=256] > (XEN) [2011-12-06 19:13:39] 3: [0.2] pri=0 flags=0 cpu=2 > credit=142 [w=256] > (XEN) [2011-12-06 19:13:39] 4: [0.1] pri=-2 flags=0 cpu=1 > credit=-2612 [w=256] > (XEN) [2011-12-06 19:13:39] CPU[00] sort=12511062, > sibling=00000000,00000000,00000000,00000003, > core=00000000,00000000,00000000,000000ff > (XEN) [2011-12-06 19:13:39] run: [32767.0] pri=0 flags=0 cpu=0 > (XEN) [2011-12-06 19:13:39] 1: [0.0] pri=-1 flags=0 cpu=0 > credit=333 [w=256] > (XEN) [2011-12-06 19:13:39] CPU[01] sort=12511063, > sibling=00000000,00000000,00000000,00000003, > core=00000000,00000000,00000000,000000ff > (XEN) [2011-12-06 19:13:39] run: [0.1] pri=-2 flags=0 cpu=1 > credit=-2914 [w=256] > (XEN) [2011-12-06 19:13:39] 1: [32767.1] pri=-64 flags=0 cpu=1 > (XEN) [2011-12-06 19:13:39] CPU[02] sort=12511063, > sibling=00000000,00000000,00000000,0000000c, > core=00000000,00000000,00000000,000000ff > (XEN) [2011-12-06 19:13:39] run: [32767.2] pri=-64 flags=0 cpu=2 > (XEN) [2011-12-06 19:13:39] CPU[03] sort=12511063, > sibling=00000000,00000000,00000000,0000000c, > core=00000000,00000000,00000000,000000ff > (XEN) [2011-12-06 19:13:39] run: [32767.3] pri=-64 flags=0 cpu=3 > (XEN) [2011-12-06 19:13:39] CPU[04] sort=12511063, > sibling=00000000,00000000,00000000,00000030, > core=00000000,00000000,00000000,000000ff > (XEN) [2011-12-06 19:13:39] run: [32767.4] pri=-64 flags=0 cpu=4 > (XEN) [2011-12-06 19:13:39] CPU[05] sort=12511063, > sibling=00000000,00000000,00000000,00000030, > core=00000000,00000000,00000000,000000ff > (XEN) [2011-12-06 19:13:39] run: [1201.1] pri=-2 flags=0 cpu=5 > credit=-3617 [w=256] > (XEN) [2011-12-06 19:13:39] 1: [32767.5] pri=-64 flags=0 cpu=5 > (XEN) [2011-12-06 19:13:39] CPU[06] sort=12511063, > sibling=00000000,00000000,00000000,000000c0, > core=00000000,00000000,00000000,000000ff > (XEN) [2011-12-06 19:13:39] run: [1201.2] pri=-2 flags=0 cpu=6 > credit=-3918 [w=256] > (XEN) [2011-12-06 19:13:39] 1: [32767.6] pri=-64 flags=0 cpu=6 > (XEN) [2011-12-06 19:13:39] CPU[07] sort=12511063, > sibling=00000000,00000000,00000000,000000c0, > core=00000000,00000000,00000000,000000ff > (XEN) [2011-12-06 19:13:39] run: [32767.7] pri=-64 flags=0 cpu=7 > > The system in question has a 4-core i7 CPU where Dom0 is pinned to > core 0-3 and DomU is pinned 4-7. Aren't the negative credit numbers > quite big? > >> >> But also see if the problem disappears with the latest 4.1.x hypervisor. >> And it might also be worth upgrading the dom0 to a 3.0. Hmm, it would be >> very interesting to see where the dom0 _is_ stuck at. The hypervisor is >> running fine so it all points to dom0 crashing somewhere.. >> >> Can you make sure that dom0 runs with 'debug loglevel=8' as well. That >> should give some wealth of information when/if a crash happens. >> Oh, and try to pass in Ctrl-Alt-SysRQ-t (on serial concentrators I think >> you just need to type 'break' on the telnet line)and then t. But I am >> not entirely sure about it - might want to double check Google and see >> how to enable Alt-SysRQ. > > I tried to get SysRq working. It worked fine on some of my other > machines (sysrq= 'shift-control-o'), but not on this broken box. > Apparently the kernel is stuck in some place where SysRq doesn't work. > > I'm going to retest with the relevant logging turned on. I also > upgraded to the latest 2.6.32 kernel (.48 I think). I will also > upgrade to the latest 4.1.x build from the mercurial repository. It > will take a few days before I get crash. > > Thanks so far! > Roderick _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |