
[Xen-devel] BUG(?): multipathd confusion leads to kernel panic in Xen 3.2.1-rc2

To: xen-devel@xxxxxxxxxxxxxxxxxxx
From: "Ray Barnes" <tical.net@xxxxxxxxx>
Date: Sat, 5 Apr 2008 06:29:12 -0400
Delivery-date: Wed, 23 Apr 2008 06:16:40 -0700
List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Hi all. While playing with iSCSI Enterprise Target + multipathd on CentOS 5.1 (both the target and the initiator/multipath/Xen box run CentOS 5.1), I encountered a strange fault condition that leads to a kernel panic in a build of Xen 3.2.1-rc2 pulled from hg a couple of days ago.

My lab consists of two Clovertown machines with dual GigE into separate switches. The target box is software RAID 5 (although I was able to reproduce this using a single drive on the target), and runs a default config of IET, i.e. 'yum -y install scsi-target-utils ; /etc/init.d/tgtd start'. The config on the initiator needed to reproduce this is the default, to the best of my recollection. Xen was compiled with 'make world XEN_TARGET_X86_PAE=y vmxassist=n'. /etc/multipath.conf is as follows:

defaults {
        udev_dir                /dev
        polling_interval        2
        selector                "round-robin 0"
        path_grouping_policy    multibus
#       getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        path_checker            readsector0
        rr_min_io               10
        rr_weight               priorities
        failback                2
        no_path_retry           fail
        user_friendly_name      no
}

On the initiator, the default node.conn[0].timeo.noop_out_interval = 10 tells it to "ping" the target once every 10 seconds, and node.conn[0].timeo.noop_out_timeout = 15 makes it wait 15 seconds for a reply before marking the target down. So most of the time it can figure out that a path is down in about 20 seconds, but if you catch it just wrong it takes 25 seconds. Add to that the 2-second polling interval in multipathd. What seems to happen is that when I yank an Ethernet cable, multipathd gets confused and takes 30+ seconds to figure things out (this could be a bug in multipathd). When that happens, a kernel panic ensues (see below).

I have been able to reproduce this in the version of Xen that comes with CentOS 5.1, as well as in 3.2.0 and in 3.2.1-rc2 pulled from hg a couple of days ago with a fresh pull of 2.6.18.8. I can reproduce it very easily, every time, while installing CentOS 5.1 into a domU; it has probably happened 10 times so far. I can also reproduce it easily with a 'dd' inside a domU that gets its filesystem from the initiator/multipathd, simply by yanking and replugging one of the Ethernet cables a few times. I also reproduced it once by running 'dd' directly against the multipathed target device in /dev/mapper from within the dom0.

However, I tried very hard to reproduce this with the latest non-Xen kernel of CentOS 5.1 and could not. It appears to be a Xen issue, and under no circumstances should this crash the entire box.

In a final effort to add more substance and background, I yanked both cables while running 'dd' in the dom0 against the target. Although it threw a bunch of errors, I could not make it panic. After multipathd marked both paths down, the 'dd' process failed with an I/O error, which is expected behavior. Same thing inside the domU.
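To make the timeout arithmetic above concrete, here is a minimal Python sketch. It only models the three values quoted in this message. The simple additive model, and the function name detection_window, are assumptions for illustration, not how open-iscsi or multipathd compute anything internally.

```python
# Values quoted in the message above; the additive model is an assumption.
NOOP_OUT_INTERVAL = 10  # node.conn[0].timeo.noop_out_interval, in seconds
NOOP_OUT_TIMEOUT = 15   # node.conn[0].timeo.noop_out_timeout, in seconds
POLLING_INTERVAL = 2    # multipathd polling_interval, in seconds


def detection_window(interval, timeout, polling):
    """Estimate how long it takes to notice a dead path (hypothetical helper).

    Returns (best, typical, worst, worst_with_polling) in seconds.
    """
    best = timeout              # link dies just before the next ping goes out
    worst = interval + timeout  # link dies just after a ping was answered
    typical = (best + worst) / 2
    # multipathd may need up to one more polling cycle to react.
    return best, typical, worst, worst + polling


if __name__ == "__main__":
    best, typical, worst, with_poll = detection_window(
        NOOP_OUT_INTERVAL, NOOP_OUT_TIMEOUT, POLLING_INTERVAL
    )
    print(f"iSCSI detects a dead path in {best}-{worst} s (typically {typical:.0f} s)")
    print(f"Including one multipathd poll: up to {with_poll} s")
```

Under this model the expected upper bound is 27 seconds, so the 30+ seconds observed is longer than these three settings alone would explain.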

Hopefully this helps *someone*. Rather than filing a bug report first, I wanted to describe this here so you guys could maybe blame it on multipath or tell me to go jump in Lake Minnetonka. If I can provide any more background on this, please let me know, as I should have this lab set up for several more days.

Sincerely,

Ray Barnes

p.s. I'm *extremely* pleased with the quality and quantity of good work going on with Xen in the public domain nowadays; keep up the good work!

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c0302709
27935000 -> *pde = 00000001:17898001
27298000 -> *pme = 00000000:00000000
Oops: 0002 [#1]
SMP
Modules linked in: xt_physdev iptable_filter ip_tables bridge autofs4 hidp rfcomm l2cap bluetooth sunrpc ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 ib_iser rdma_cm ib_addr ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi scsi_transport_iscsi binfmt_misc dm_mirror dm_round_robin dm_multipath dm_mod video thermal sbs processor i2c_ec fan container button battery asus_acpi ac lp nvram sg evdev e1000 parport_pc parport i2c_i801 i2c_core pcspkr piix serio_raw sisfb shpchp pci_hotplug 8250_pnp 8250 serial_core rtc ide_disk ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd usbcore
CPU:    1
EIP:    0061:[<c0302709>]    Not tainted VLI
EFLAGS: 00010286   (2.6.18.8-xen #1)
EIP is at iret_exc+0xc6a/0x105e
eax: 00000000   ebx: 00000000   ecx: 00000007   edx: ed470b40
esi: ed470b50   edi: e68c6490   ebp: 000001f0   esp: ed7a3c84
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, ti=ed7a2000 task=ed79f080 task.ti=ed7a2000)
Stack: 00000034 000001f0 ed470000 c0296f70 ed470b20 e68c6460 000001f0 00000000
       00000000 00000000 e68c6460 00000034 e7de98ac 00000000 00000034 00000514
       e7f8d594 c0294149 e68c642c ed7a3dbc e7f73440 00000000 c02df3d4 00000224
Call Trace:
[<c0296f70>] skb_copy_and_csum_bits+0x140/0x320
[<c0294149>] sock_alloc_send_skb+0x169/0x1c0
[<c02df3d4>] icmp_glue_bits+0x34/0xa0
[<c02be7b3>] ip_append_data+0x623/0xa60
[<c02df3a0>] icmp_glue_bits+0x0/0xa0
[<c02df286>] icmp_push_reply+0x56/0x170
[<c02b7ea1>] ip_route_output_flow+0x21/0x90
[<c02dfc7d>] icmp_send+0x2cd/0x3f0
[<c013d260>] hrtimer_wakeup+0x0/0x20
[<c02b5eec>] ipv4_link_failure+0x1c/0x50
[<c02dd49c>] arp_error_report+0x1c/0x30
[<c02a4158>] neigh_timer_handler+0xf8/0x2c0
[<c012fb0b>] run_timer_softirq+0x13b/0x1f0
[<c02a4060>] neigh_timer_handler+0x0/0x2c0
[<c012a562>] __do_softirq+0x92/0x130
[<c012a679>] do_softirq+0x79/0x80
[<c0107714>] do_IRQ+0x44/0xa0
[<c0248540>] evtchn_do_upcall+0xe0/0x1f0
[<c0105bbd>] hypervisor_callback+0x3d/0x45
[<c0108c7a>] raw_safe_halt+0x9a/0x120
[<c0104709>] xen_idle+0x29/0x50
[<c01036dd>] cpu_idle+0x6d/0xc0
Code: ff e9 f7 6f ea ff 8b 1d 80 32 41 c0 e9 ea c5 ea ff 8b 1d 80 32 41 c0 e9 ff c5 ea ff 8b 15 80 32 41 c0 e9 14 c6 ea ff 8b 5c 24 20 <c7> 03 f2 ff ff ff 8b 7c 24 14 8b 4c 24 18 31 c0 f3 aa e9 60 a7
EIP: [<c0302709>] iret_exc+0xc6a/0x105e SS:ESP 0069:ed7a3c84
<0>Kernel panic - not syncing: Fatal exception in interrupt
(XEN) Domain 0 crashed: 'noreboot' set - not rebooting.
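
For anyone reading the trace above, the "Oops: 0002" field is the x86 page-fault error code. Below is a minimal Python sketch that decodes its low three bits, assuming the standard x86 meanings (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The helper name decode_page_fault_code is made up for illustration and is not part of any kernel tooling.

```python
def decode_page_fault_code(code):
    """Decode the low three bits of an x86 page-fault error code.

    Hypothetical helper for illustration; assumes the standard x86 bit layout.
    """
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


if __name__ == "__main__":
    # "Oops: 0002" from the trace above, parsed as hexadecimal.
    for field, value in decode_page_fault_code(int("0002", 16)).items():
        print(f"{field}: {value}")
```

Decoded this way, 0002 is a kernel-mode write to a page that is not present, which matches the NULL pointer dereference at address 00000000 reported on the first line of the trace.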

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
