[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Questions about GPLPV stability tests
On Fri, Dec 09, 2011 at 02:02:31PM -0800, Roderick Colenbrander wrote: > One interesting observation. This morning I had another of my stress > machines with the problem. I never had it on this problem before and > it didn't have any of the new logging / software updates yet. > > The system was in the same state as the other machine which I reported > about before. I tried SysRq stuff and other things. While I was about > to reboot the system, a login prompt appeared again on the VGA. I > don't know whether any of the stuff I did triggered it or not. Anyway > it means Linux is not really dead. I tried logging, but I don't even > see characters appearing. The system feels to be very, very, very > slow. > Hmm.. I wonder if this is the same issue I'm sometimes seeing on my laptop.. suddenly it starts becoming slower, and after a while it's *very* slow.. not dead, but unusably slow.. I haven't had time to (try to) debug it.. -- Pasi > The other tests are still running. > > Roderick > > On Thu, Dec 8, 2011 at 5:33 PM, Roderick Colenbrander > <thunderbird2k@xxxxxxxxx> wrote: > > On Thu, Dec 8, 2011 at 11:46 PM, Konrad Rzeszutek Wilk > > <konrad@xxxxxxxxxx> wrote: > >> On Wed, Dec 07, 2011 at 08:44:19PM +0000, Roderick Colenbrander wrote: > >>> On Wed, Dec 7, 2011 at 8:38 PM, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx> > >>> wrote: > >>> >> It took about a week, but the systems went down again. Linux is down, > >>> >> but the hypervisor is still reachable on the serial console. Is there > >>> >> anything interesting to dump from there? > >>> >> > >>> > Just the interrupts information. I think that is Ctrl-A, Ctrl-A, > >>> > Ctrl-A, and then '*' to capture everything (I don't remember the right > >>> > one for just interrupts). > >>> > >>> Here's the interrupt information: > >>> (XEN) [2011-12-06 19:13:37] [i: dump interrupt bindings] > >>> (XEN) [2011-12-06 19:13:37] Guest interrupt information: > >>> (XEN) [2011-12-06 19:13:37] IRQ: 0 > >>> affinity:00000000,00000000,00000000,00000001 vec:f0 type=IO-APIC-edge > >>> status=00000000 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 1 > >>> affinity:00000000,00000000,00000000,00000001 vec:30 type=IO-APIC-edge > >>> status=00000014 in-flight=0 domain-list=0: 1(-S--), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 2 > >>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:e2 type=XT-PIC > >>> status=00000000 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 3 > >>> affinity:00000000,00000000,00000000,00000001 vec:38 type=IO-APIC-edge > >>> status=00000006 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 4 > >>> affinity:00000000,00000000,00000000,00000001 vec:40 type=IO-APIC-edge > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 5 > >>> affinity:00000000,00000000,00000000,00000001 vec:f1 type=IO-APIC-edge > >>> status=00000000 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 6 > >>> affinity:00000000,00000000,00000000,00000001 vec:48 type=IO-APIC-edge > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 7 > >>> affinity:00000000,00000000,00000000,00000001 vec:50 type=IO-APIC-edge > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 8 > >>> affinity:00000000,00000000,00000000,00000001 vec:58 type=IO-APIC-edge > >>> status=00000010 in-flight=0 domain-list=0: 8(-S--), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 9 > >>> affinity:00000000,00000000,00000000,00000001 vec:60 type=IO-APIC-level > >>> status=00000010 in-flight=0 domain-list=0: 9(-S--), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 10 > >>> affinity:00000000,00000000,00000000,00000001 vec:68 type=IO-APIC-edge > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 11 > >>> affinity:00000000,00000000,00000000,00000001 vec:70 type=IO-APIC-edge > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 12 > >>> affinity:00000000,00000000,00000000,00000001 vec:78 type=IO-APIC-edge > >>> status=00000010 in-flight=0 domain-list=0: 12(-S--), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 13 > >>> affinity:00000000,00000000,00000000,00000001 vec:88 type=IO-APIC-edge > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 14 > >>> affinity:00000000,00000000,00000000,00000001 vec:90 type=IO-APIC-edge > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 15 > >>> affinity:00000000,00000000,00000000,00000001 vec:98 type=IO-APIC-edge > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:37] IRQ: 16 > >>> affinity:00000000,00000000,00000000,00000001 vec:c8 type=IO-APIC-level > >>> status=00000010 in-flight=0 domain-list=0: 16(-S--),1201: 16(--M-), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 18 > >>> affinity:00000000,00000000,00000000,00000001 vec:51 type=IO-APIC-level > >>> status=00000010 in-flight=0 domain-list=0: 18(-S--), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 19 > >>> affinity:00000000,00000000,00000000,00000001 vec:d0 type=IO-APIC-level > >>> status=00000010 in-flight=0 domain-list=0: 19(-S--), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 21 > >>> affinity:00000000,00000000,00000000,00000001 vec:61 type=IO-APIC-level > >>> status=00000010 in-flight=0 domain-list=0: 21(-S--), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 23 > >>> affinity:00000000,00000000,00000000,00000001 vec:59 type=IO-APIC-level > >>> status=00000010 in-flight=1 domain-list=0: 23(PS-M), > >>> (XEN) [2011-12-06 19:13:37] IRQ: 24 > >>> affinity:00000000,00000000,00000000,00000001 vec:28 type=DMA_MSI > >>> status=00000000 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 25 > >>> affinity:00000000,00000000,00000000,00000001 vec:a0 type=PCI-MSI > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 26 > >>> affinity:00000000,00000000,00000000,00000001 vec:a8 type=PCI-MSI > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 27 > >>> affinity:00000000,00000000,00000000,00000001 vec:b0 type=PCI-MSI > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 28 > >>> affinity:00000000,00000000,00000000,00000001 vec:b8 type=PCI-MSI > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 29 > >>> affinity:00000000,00000000,00000000,00000001 vec:c0 type=PCI-MSI > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 30 > >>> affinity:00000000,00000000,00000000,00000001 vec:d8 type=PCI-MSI > >>> status=00000014 in-flight=0 domain-list=0:274(PS--), > >>> (XEN) [2011-12-06 19:13:38] IRQ: 31 > >>> affinity:00000000,00000000,00000000,00000001 vec:21 type=PCI-MSI > >>> status=00000010 in-flight=0 domain-list=0:273(PS--), > >>> (XEN) [2011-12-06 19:13:38] IRQ: 32 > >>> affinity:00000000,00000000,00000000,00000001 vec:29 type=PCI-MSI > >>> status=00000010 in-flight=0 domain-list=0:272(PS--), > >>> (XEN) [2011-12-06 19:13:38] IRQ: 33 > >>> affinity:00000000,00000000,00000000,00000001 vec:31 type=PCI-MSI > >>> status=00000010 in-flight=0 domain-list=0:271(PS--), > >>> (XEN) [2011-12-06 19:13:38] IRQ: 34 > >>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:39 type=PCI-MSI > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 35 > >>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:41 type=PCI-MSI > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 36 > >>> affinity:ffffffff,ffffffff,ffffffff,ffffffff vec:49 type=PCI-MSI > >>> status=00000002 mapped, unbound > >>> (XEN) [2011-12-06 19:13:38] IRQ: 37 > >>> affinity:00000000,00000000,00000000,00000010 vec:65 type=PCI-MSI > >>> status=00000010 in-flight=0 domain-list=1201: 55(--M-), > >>> (XEN) [2011-12-06 19:13:38] IO-APIC interrupt information: > >>> (XEN) [2011-12-06 19:13:38] IRQ 0 Vec240: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 2: vector=240, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 1 Vec 48: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 1: vector=48, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 3 Vec 56: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 3: vector=56, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 4 Vec 64: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 4: vector=64, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 5 Vec241: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 5: vector=241, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 6 Vec 72: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 6: vector=72, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 7 Vec 80: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 7: vector=80, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 8 Vec 88: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 8: vector=88, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 9 Vec 96: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 9: vector=96, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=level, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 10 Vec104: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 10: vector=104, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 11 Vec112: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 11: vector=112, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 12 Vec120: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 12: vector=120, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 13 Vec136: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 13: vector=136, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 14 Vec144: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 14: vector=144, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 15 Vec152: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 15: vector=152, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=0, > >>> irr=0, trigger=edge, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 16 Vec200: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 16: vector=200, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1, > >>> irr=0, trigger=level, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 18 Vec 81: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 18: vector=81, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1, > >>> irr=0, trigger=level, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 19 Vec208: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 19: vector=208, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1, > >>> irr=0, trigger=level, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 21 Vec 97: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 21: vector=97, > >>> delivery_mode=1, dest_mode=logical, delivery_status=0, polarity=1, > >>> irr=0, trigger=level, mask=0, dest_id:1 > >>> (XEN) [2011-12-06 19:13:38] IRQ 23 Vec 89: > >>> (XEN) [2011-12-06 19:13:38] Apic 0x00, Pin 23: vector=89, > >>> delivery_mode=1, dest_mode=logical, delivery_status=1, polarity=1, > >>> irr=1, trigger=level, mask=0, dest_id:1 > >>> > >>> I noticed some potential interesting things. The system in question is > >>> using PCI passthrough. On the serial console I remember seeing that > >>> the PCI device got unmapped from DomU and got mapped again in Dom0. > >>> The serial console log still had a lot of information about this DomU > >>> which I guess was going down. I guess it wasn't fully down yet. > >> > >> One of the debug informations that gets printed with '*' is the guest > >> PCI passthrough data. Such as which IRQ, BAR, etc are mapped in. Did any > >> of those exist? > > > > There is not much PCI related information. Just interrupt stuff, no > > PCI BARs or other interesting PCI stuff: > > (XEN) [2011-12-06 19:13:20] 03:00.0 - dom 1201 - MSIs < 37 > > > (XEN) [2011-12-06 19:13:20] MSI 37 vec=65 lowest edge assert > > log lowest dest=00000010 mask=0/0/-1 > > > > > >> My original thought of what is going on is that the interrupts either > >> stopped completly working (does not look like - it looks like the > >> interrupts are firring and the event channels that are bound to it have > >> the bits set saying - hey I am pending). Oddly there are bunch of MSI > >> ones that are masked which usually means they are disabled. > >> > >> Then there is the > >> affinity:00000000,00000000,00000000,00000001 vec:c8 type=IO-APIC-level > >> status=00000010 in-flight=0 domain-list=0: 16(-S--),1201: 16(--M-), > >> affinity:00000000,00000000,00000000,00000010 vec:65 type=PCI-MSI > >> status=00000010 in-flight=0 domain-list=1201: 55(--M-), > >> > >> The guest has masked both interrupts - so it is off//dying, but somehow > >> the unmapping has not happend. In other words, what you just analyzed > >> and found out. > >> > >> Sadly, there is no smoking gun.. > >> > >> So a couple of things that I would do is to ensure that the Xen > >> hypervisor boots with 'sync_console console_to_ring' and when this > >> crash happens see if I can get a stack trace from both the Xen > >> hypervisor (there are some debug parameters to get that - I think it is > >> 'r'?), and also from the Linux kernel. > > > > I set up some new tests, so that will take some days to get a crash. > > For what it is worth, I do have a '*' log left from this system and it > > has some stack trace. If you think it is useful, I could send it > > gzipped, but I don't want to spam this list if it may not be useful. > > > > One thing I did notice was the following: > > (XEN) [2011-12-06 19:13:39] active vcpus: > > (XEN) [2011-12-06 19:13:39] 1: [1201.2] pri=-2 flags=0 cpu=6 > > credit=-2411 [w=256] > > (XEN) [2011-12-06 19:13:39] 2: [1201.1] pri=-2 flags=0 cpu=5 > > credit=-2460 [w=256] > > (XEN) [2011-12-06 19:13:39] 3: [0.2] pri=0 flags=0 cpu=2 > > credit=142 [w=256] > > (XEN) [2011-12-06 19:13:39] 4: [0.1] pri=-2 flags=0 cpu=1 > > credit=-2612 [w=256] > > (XEN) [2011-12-06 19:13:39] CPU[00] sort=12511062, > > sibling=00000000,00000000,00000000,00000003, > > core=00000000,00000000,00000000,000000ff > > (XEN) [2011-12-06 19:13:39] run: [32767.0] pri=0 flags=0 cpu=0 > > (XEN) [2011-12-06 19:13:39] 1: [0.0] pri=-1 flags=0 cpu=0 > > credit=333 [w=256] > > (XEN) [2011-12-06 19:13:39] CPU[01] sort=12511063, > > sibling=00000000,00000000,00000000,00000003, > > core=00000000,00000000,00000000,000000ff > > (XEN) [2011-12-06 19:13:39] run: [0.1] pri=-2 flags=0 cpu=1 > > credit=-2914 [w=256] > > (XEN) [2011-12-06 19:13:39] 1: [32767.1] pri=-64 flags=0 cpu=1 > > (XEN) [2011-12-06 19:13:39] CPU[02] sort=12511063, > > sibling=00000000,00000000,00000000,0000000c, > > core=00000000,00000000,00000000,000000ff > > (XEN) [2011-12-06 19:13:39] run: [32767.2] pri=-64 flags=0 cpu=2 > > (XEN) [2011-12-06 19:13:39] CPU[03] sort=12511063, > > sibling=00000000,00000000,00000000,0000000c, > > core=00000000,00000000,00000000,000000ff > > (XEN) [2011-12-06 19:13:39] run: [32767.3] pri=-64 flags=0 cpu=3 > > (XEN) [2011-12-06 19:13:39] CPU[04] sort=12511063, > > sibling=00000000,00000000,00000000,00000030, > > core=00000000,00000000,00000000,000000ff > > (XEN) [2011-12-06 19:13:39] run: [32767.4] pri=-64 flags=0 cpu=4 > > (XEN) [2011-12-06 19:13:39] CPU[05] sort=12511063, > > sibling=00000000,00000000,00000000,00000030, > > core=00000000,00000000,00000000,000000ff > > (XEN) [2011-12-06 19:13:39] run: [1201.1] pri=-2 flags=0 cpu=5 > > credit=-3617 [w=256] > > (XEN) [2011-12-06 19:13:39] 1: [32767.5] pri=-64 flags=0 cpu=5 > > (XEN) [2011-12-06 19:13:39] CPU[06] sort=12511063, > > sibling=00000000,00000000,00000000,000000c0, > > core=00000000,00000000,00000000,000000ff > > (XEN) [2011-12-06 19:13:39] run: [1201.2] pri=-2 flags=0 cpu=6 > > credit=-3918 [w=256] > > (XEN) [2011-12-06 19:13:39] 1: [32767.6] pri=-64 flags=0 cpu=6 > > (XEN) [2011-12-06 19:13:39] CPU[07] sort=12511063, > > sibling=00000000,00000000,00000000,000000c0, > > core=00000000,00000000,00000000,000000ff > > (XEN) [2011-12-06 19:13:39] run: [32767.7] pri=-64 flags=0 cpu=7 > > > > The system in question has a 4-core i7 CPU where Dom0 is pinned to > > core 0-3 and DomU is pinned 4-7. Aren't the negative credit numbers > > quite big? > > > >> > >> But also see if the problem disappears with the latest 4.1.x hypervisor. > >> And it might also be worth upgrading the dom0 to a 3.0. Hmm, it would be > >> very interesting to see where the dom0 _is_ stuck at. The hypervisor is > >> running fine so it all points to dom0 crashing somewhere.. > >> > >> Can you make sure that dom0 runs with 'debug loglevel=8' as well. That > >> should give some wealth of information when/if a crash happens. > >> Oh, and try to pass in Ctrl-Alt-SysRQ-t (on serial concentrators I think > >> you just need to type 'break' on the telnet line)and then t. But I am > >> not entirely sure about it - might want to double check Google and see > >> how to enable Alt-SysRQ. > > > > I tried to get SysRq working. It worked fine on some of my other > > machines (sysrq= 'shift-control-o'), but not on this broken box. > > Apparently the kernel is stuck in some place where SysRq doesn't work. > > > > I'm going to retest with the relevant logging turned on. I also > > upgraded to the latest 2.6.32 kernel (.48 I think). I will also > > upgrade to the latest 4.1.x build from the mercurial repository. It > > will take a few days before I get crash. > > > > Thanks so far! > > Roderick > > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxxxxxxxx > http://lists.xensource.com/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |