Xen project Mailing List

[Xen-users] xen, iscsi and resilience to short network outages

From: "Steve Feehan" <sfeehan@xxxxxxxxx>

Date: Thu, 9 Nov 2006 14:41:18 -0500

Delivery-date: Thu, 09 Nov 2006 11:42:00 -0800

List-id: Xen user discussion <xen-users.lists.xensource.com>

Hi.

Here is the short version: If dom0 experiences a short (< 120 second) network outage, the guests whose disks are on iSCSI LUNs get (seemingly) unrecoverable IO errors. Is it possible to make Xen more resilient to such problems?

And now the full version:

We're testing Xen on iSCSI LUNs. The hardware/software configuration is:

* Dom0 and guest OS: SLES10 x86_64
* iSCSI LUN on NetApp filer

We connect to the LUN through dom0 and then "map" the device to a guest like so:

disk = [ 'phy:/dev/disk/by-id/scsi-360a9800043346863483437714a643833,hda,w' ]

At around noon today (though it's happened a few times in the last few weeks) one of our switches was powered off. At that time, here is what I see in the syslog of dom0:

Nov 9 12:06:04 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:13 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:21 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:29 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:38 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:46 egovxen1 iscsid: connect failed (113)
Nov 9 12:06:55 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:03 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:04 egovxen1 kernel: tg3: peth0: Link is up at 1000 Mbps, full duplex.
Nov 9 12:07:04 egovxen1 kernel: tg3: peth0: Flow control is off for TX and off for RX.
Nov 9 12:07:04 egovxen1 kernel: xenbr0: port 2(peth0) entering learning state
Nov 9 12:07:04 egovxen1 kernel: xenbr0: topology change detected, propagating
Nov 9 12:07:04 egovxen1 kernel: xenbr0: port 2(peth0) entering forwarding state
Nov 9 12:07:10 egovxen1 kernel: session0: iscsi: session recovery timed out after 120 secs
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: scsi: Device offlined - not ready after error recovery
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: scsi: Device offlined - not ready after error recovery
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: SCSI error: return code = 0x20000
Nov 9 12:07:10 egovxen1 kernel: end_request: I/O error, dev sdd, sector 23349467
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: SCSI error: return code = 0x20000
Nov 9 12:07:10 egovxen1 kernel: end_request: I/O error, dev sdc, sector 6573193
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:3: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov 9 12:07:10 egovxen1 kernel: sd 0:0:0:2: rejecting I/O to offline device
Nov 9 12:07:11 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:20 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:28 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:36 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:44 egovxen1 iscsid: connection0:0 is operational after recovery (19 attempts)

So it looks like the iSCSI connection was dropped at 12:06:04 and reestablished at 12:07:44. But during this time the guests whose disks were on the iSCSI LUNs get IO errors and do not recover. Here is what I got when I connected to the console:

sfeehan@egovxen1:~> sudo xm console xenlb2
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: cannot execute "/sbin/mingetty"
INIT: Id "1" respawning too fast: disabled for 5 minutes

Is it possible to adjust a timeout or otherwise make Xen a bit more tolerant of short network outages?

Thanks.

-- 
Steve Feehan

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
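
The timing in the dom0 syslog excerpt can be checked with a little arithmetic. The following is only a rough sketch in Python, not anything from the original setup: the function names (`parse_ts`, `first_match`, `timeline`) are made up for illustration, the year 2006 is taken from the mail's Date header because syslog lines carry no year, and it assumes the 120 second recovery timer started exactly 120 seconds before the "session recovery timed out" message.

```python
from datetime import datetime, timedelta

# Three key events copied from the dom0 syslog excerpt in the message.
LOG = """\
Nov 9 12:06:04 egovxen1 iscsid: connect failed (113)
Nov 9 12:07:10 egovxen1 kernel: session0: iscsi: session recovery timed out after 120 secs
Nov 9 12:07:44 egovxen1 iscsid: connection0:0 is operational after recovery (19 attempts)
"""

RECOVERY_TIMEOUT = timedelta(seconds=120)  # value reported in the kernel message


def parse_ts(line, year=2006):
    """Parse the leading 'Mon D HH:MM:SS' syslog timestamp (year is assumed)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def first_match(log, needle):
    """Return the timestamp of the first log line containing `needle`."""
    for line in log.splitlines():
        if needle in line:
            return parse_ts(line)
    raise ValueError(f"no log line contains {needle!r}")


def timeline(log):
    first_fail = first_match(log, "connect failed")
    timed_out = first_match(log, "session recovery timed out")
    recovered = first_match(log, "is operational after recovery")
    # Assumption: the timer began exactly RECOVERY_TIMEOUT before it expired.
    timer_start = timed_out - RECOVERY_TIMEOUT
    return {
        "timer_start": timer_start,
        "first_fail_to_recovery": recovered - first_fail,
        "timer_start_to_recovery": recovered - timer_start,
        "recovery_after_timeout": recovered - timed_out,
    }


if __name__ == "__main__":
    for name, value in timeline(LOG).items():
        print(f"{name}: {value}")
```

Under those assumptions, the session would have entered recovery around 12:05:10, before the first logged "connect failed", and the connection came back 34 seconds after the 120 second timer had already expired and the devices were offlined. If the initiator is open-iscsi (an assumption, suggested by the iscsid messages), the 120 second figure matches the default of its node.session.timeo.replacement_timeout setting in iscsid.conf, which may be the timeout worth looking at.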
