[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [Xen-users] kernel 3.9.2 - xen 4.2.2/4.3rc1 => BUG unable to handle kernel paging request netif_poll+0x49c/0xe8
On 07/04/2013 05:01 PM, Wei Liu wrote: > >> I am running into this issue as well with the openSUSE 12.3 >> distribution. This is with their 3.7.10-1.16-xen kernel and Xen version >> 4.2.1_12-1.12.10. On the net I see some discussion of people hitting >> this issue but not that much. E.g., one of the symptoms is that a guest >> crashes when running zypper install or zypper update when the Internet >> connection is fast enough. >> > Do you have references to other reports? I will gather them and post them later. > >> OpenSUSE 3.4.X kernels are running ok as guest on top of the openSUSE >> 12.3 Xen distribution, but apparently since 3.7.10 and higher there is >> this issue. >> >> I spent already quite some time in getting grip on the issue. I added a >> bug to bugzilla.novell.com but no response. See >> https://bugzilla.novell.com/show_bug.cgi?id=826374 for details. >> Apparently for hitting this bug (i.e. make it all the way to the crash), >> it is required to use some hardware which performs not too slow. With >> this I mean it is easy to find hardware which is unable to reproduce the > able? >> issue. >> > I'm not quite sure about what you mean. Do you mean this bug can only > be triggered when your receive path has real hardware NIC invloved? > > And reading your test case below it doesn't seem so. Dom0 to DomU > transmission crashes the guest per your example. Yes, a physical network card is not required. If you do send data to the guest over a physical Ethernet card, it is required to operate it in 1 GbE mode. With a 100 MbE link I am unable to crash the guest. If you do use vif interfaces only, the data rate will be high enough to crash it. However, I have also openSUSE 12.3 Xen configurations running which do not have this issue. It is my feeling that smaller systems (in the sense of less CPU cores and/or less memory bandwidth) do not reveal the issue. > >> In one of my recent experiments I changed the SLAB allocater to SLUB >> which provides more detailed kernel logging. Here is the log output >> after the first detected issue regarding xennet: >> > But the log below is not about SLUB. I cannot understand why SLAB v.s > SLUB makes a difference. I switched to SLUB from SLAB for its debugging functionality. The openSUSE stock kernel used SLAB. > >> Too many frags >> 2013-07-03T23:51:27.092147+02:00 domUA kernel: [ 108.094615] netfront: >> Too many frags >> 2013-07-03T23:51:27.492112+02:00 domUA kernel: [ 108.494255] netfront: >> Too many frags >> 2013-07-03T23:51:27.520194+02:00 domUA kernel: [ 108.522445] > "Too many frags" means your frontend is generating malformed packets. > This is not normal. And apparently you didn't use the latest kernel in > tree because the log message should be "Too many slots" in the latest > OpenSuSE kernel. Yes, I have seen that, but I used the latest openSUSE kernel which belongs to openSUSE 12.3. >> network_alloc_rx_buffers+0x76/0x5f0 [xennet] >> 2013-07-03T23:51:27.679476+02:00 domUA kernel: [ 108.671781] >> netif_poll+0xcf4/0xf30 [xennet] >> 2013-07-03T23:51:27.679478+02:00 domUA kernel: [ 108.671783] >> net_rx_action+0xf0/0x2e0 >> > Seems like there's memory corruption in guest RX path. As Jan already mentioned, it could be related to the kernel panics I obtain, however it may be a different issue as well. >> >> >> I am happy to assist in more kernel probing. It is even possible for me >> to setup access for someone to this machine. >> > Excellent. Last time Jan suspected that we potentially overrun the frag > list of a skb (which would corrupt memory) but it has not been verified. > > I also skimmed your bug report on novell bugzilla which did suggest > memory corruption. > > I wrote a patch to crash the kernel immediately when looping over the > frag list, probably we could start from there? (You might need to adjust > context, but it is only a one-liner which should be easy). > > > Wei. > > ====== > diff --git a/drivers/xen/netfront/netfront.c b/drivers/xen/netfront/netfront.c > index 6e5d233..9583011 100644 > --- a/drivers/xen/netfront/netfront.c > +++ b/drivers/xen/netfront/netfront.c > @@ -1306,6 +1306,7 @@ static RING_IDX xennet_fill_frags(struct netfront_info > *np, > struct sk_buff *nskb; > > while ((nskb = __skb_dequeue(list))) { > + BUG_ON(nr_frags >= MAX_SKB_FRAGS); > struct netif_rx_response *rx = > RING_GET_RESPONSE(&np->rx, ++cons); > Integrated the patch. I obtained a crash dump and the log in it did not show this BUG_ON. Here is the relevant section from the log var/lib/xen/dump/domUA # crash /root/vmlinux-p1 2013-0705-1347.43-domUA.1.core [ 7.670132] Adding 4192252k swap on /dev/xvda1. Priority:-1 extents:1 across:4192252k SS [ 10.204340] NET: Registered protocol family 17 [ 481.534979] netfront: Too many frags [ 487.543946] netfront: Too many frags [ 491.049458] netfront: Too many frags [ 491.491153] ------------[ cut here ]------------ [ 491.491628] kernel BUG at drivers/xen/netfront/netfront.c:1295! [ 491.492056] invalid opcode: 0000 [#1] SMP [ 491.492056] Modules linked in: af_packet autofs4 xennet xenblk cdrom [ 491.492056] CPU 0 [ 491.492056] Pid: 1471, comm: sshd Not tainted 3.7.10-1.16-dbg-p1-xen #8 [ 491.492056] RIP: e030:[<ffffffffa0023aef>] [<ffffffffa0023aef>] netif_poll+0xe4f/0xf90 [xennet] [ 491.492056] RSP: e02b:ffff8801f5803c60 EFLAGS: 00010202 [ 491.492056] RAX: ffff8801f5803da0 RBX: ffff8801f1a082c0 RCX: 0000000180200010 [ 491.492056] RDX: ffff8801f5803da0 RSI: ffff8801fe83ec80 RDI: ffff8801f03b2900 [ 491.492056] RBP: ffff8801f5803e20 R08: 0000000000000001 R09: 0000000000000000 [ 491.492056] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801f03b3400 [ 491.492056] R13: 0000000000000011 R14: 000000000004327e R15: ffff8801f06009c0 [ 491.492056] FS: 00007fc519f3d7c0(0000) GS:ffff8801f5800000(0000) knlGS:0000000000000000 [ 491.492056] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 491.492056] CR2: 00007fc51410c400 CR3: 00000001f1430000 CR4: 0000000000002660 [ 491.492056] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 491.492056] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 491.492056] Process sshd (pid: 1471, threadinfo ffff8801f1264000, task ffff8801f137bf00) [ 491.492056] Stack: [ 491.492056] ffff8801f5803d60 ffffffff8008503e ffff8801f0600a40 ffff8801f0600000 [ 491.492056] 0004328000000040 0000001200000000 ffff8801f5810570 ffff8801f0600a78 [ 491.492056] 0000000000000000 ffff8801f0601fb0 0004326e00000012 ffff8801f5803d00 [ 491.492056] Call Trace: [ 491.492056] [<ffffffff8041ee35>] net_rx_action+0xd5/0x250 [ 491.492056] [<ffffffff800376d8>] __do_softirq+0xe8/0x230 [ 491.492056] [<ffffffff8051151c>] call_softirq+0x1c/0x30 [ 491.492056] [<ffffffff80008a75>] do_softirq+0x75/0xd0 [ 491.492056] [<ffffffff800379f5>] irq_exit+0xb5/0xc0 [ 491.492056] [<ffffffff8036c225>] evtchn_do_upcall+0x295/0x2d0 [ 491.492056] [<ffffffff8051114e>] do_hypervisor_callback+0x1e/0x30 [ 491.492056] [<00007fc519f97700>] 0x7fc519f976ff [ 491.492056] Code: ff 0f 1f 00 e8 a3 c1 40 e0 85 c0 90 75 69 44 89 ea 4c 89 f6 4c 89 ff e8 f0 cb ff ff c7 85 80 fe ff ff ea ff ff ff e9 7c f4 ff ff <0f> 0b ba 12 00 00 00 48 01 d0 48 39 c1 0f 82 bd fc ff ff e9 e9 [ 491.492056] RIP [<ffffffffa0023aef>] netif_poll+0xe4f/0xf90 [xennet] [ 491.492056] RSP <ffff8801f5803c60> [ 491.511975] ---[ end trace c9e37475f12e1aaf ]--- [ 491.512877] Kernel panic - not syncing: Fatal exception in interrupt In the mean time Jan took the bug in bugzilla (https://bugzilla.novell.com/show_bug.cgi?id=826374) and created a first patch. I propose we continue the discussion there and post the conclusion in this list to finish this thread here as well. Dion _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |