
Re: [Xen-users] Dell Poweredge 2650 - heavy IO hangs domU machines; xen 2.0.7, xen kernel 2.6.11.12



Hello:

At the off-list suggestion of another user, we have tried adding
'noirqbalance' to the xen start line in grub, we've disabled USB in the
system BIOS, and we've added 'nousb' to the kernel parameters.

The problem is still there, exactly as before, even with all those changes.

*All* the virtual machines lose network connectivity, not just the ones
involved in the backup. We have an LDAP server VM running on this
hardware that is totally idle when this hang happens. We cannot ping or
ssh into them. We can get a console using 'xm console', but after
entering the userid, the login times out (after 60 seconds) before we
ever get a password prompt.

I still suspect an interrupt problem: it appears that the login process
on the tty cannot complete the disk read it needs for authentication. At
the same time, the tape backup process hangs.
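
One way to test the interrupt theory would be to compare two snapshots of
/proc/interrupts on dom0 while the backup is running. What follows is only
a minimal Python sketch under stated assumptions: it assumes a Python 3
interpreter and the usual /proc/interrupts layout (a CPU header row, then
one "label: per-CPU counts ..." row per interrupt), and the function names
are made up for illustration.

```python
import time


def parse_interrupts(text):
    """Return {irq_label: count summed across all CPUs}."""
    counts = {}
    for line in text.splitlines()[1:]:  # first line is the CPU header row
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interrupt row
        total = 0
        for token in rest.split():
            if not token.isdigit():
                break  # reached the controller/device name columns
            total += int(token)
        counts[label.strip()] = total
    return counts


def interrupt_deltas(before, after):
    """Return how much each IRQ count grew between two snapshots."""
    return {irq: after[irq] - before.get(irq, 0) for irq in after}


def sample_deltas(interval=5.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the deltas."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return interrupt_deltas(before, after)
```

Running sample_deltas() once before the backup and once during the hang
might show whether the IRQ shared by the RAID controller stalls or climbs
much faster than the others; that would be a hint, not proof.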

If we kill the bacula storage daemon on dom0, all of the virtual
machines release and we can log in again. At no point does anything
reboot -- it just hangs, and it's not a fatal hang. If the backup
process stops, whether through a timeout or by forcibly stopping the
storage daemon, the virtual machines are again pingable and we can log
in with either ssh or 'xm console'.
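
To pin down exactly when the virtual machines drop off and come back
relative to the backup, a small probe could log reachability changes. This
is a rough Python sketch, assuming a Python 3 interpreter on some machine
that can normally reach the guests and that each guest accepts TCP
connections on the ssh port (22); the function names are hypothetical.

```python
import socket


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def transitions(samples):
    """Given [(timestamp, reachable), ...], keep only the state changes.

    The first sample is always kept, since it establishes the initial state.
    """
    changes = []
    previous = None
    for stamp, up in samples:
        if up != previous:
            changes.append((stamp, up))
            previous = up
    return changes
```

Calling is_reachable() for each guest every few seconds during a test
backup and passing the collected results to transitions() would give the
times at which the guests went away and returned, which could then be
compared against the bacula log.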

We tried monitoring the memory usage during the backup test by running
'top' in separate console windows. Loads were actually modest and there
was plenty of memory remaining on all the virtual machines (over 1 GB of
free RAM in one case).

To recap: this is a Dell *2650*, not a 2850. It has a ServerWorks
chipset, not an Intel one. The RAID controller is a PERC 3 DC (LSI
Logic), which uses the megaraid driver. The controller firmware has been
upgraded to 3.35/1.07, the most recent available.

Note also -- dom0 is unaffected. We can still interact with dom0 without
trouble. This hang affects only the virtual machines.

Cheers,

-Stephen-


Stephen Bosch wrote:
> Hello:
> 
> We are running three domU machines on a Dell 2650 and using Bacula to do
> backups to an Exabyte VXA SCSI tape drive attached to the external
> channel of a PERC 3 DC, with a RAID 1 running on the internal channel.
> 
> Xen version is 2.0.7
> Kernel is xen-kernel-2.6.11.12
> 
> We have the bacula storage daemon running on dom0.
> 
> When we begin a large backup (several gigabytes), all of the domU
> machines will lock up, regardless of whether they are involved in the
> backup or not.
> 
> Characteristics of the lockup:
> - We lose all network connectivity to all of them. We cannot ping or ssh
> to them -- you cannot do anything. Even an nmap fails.
> 
> - the dom0 is still running fine.
> 
> - We can 'xm console' to the affected domU's and get a login prompt, but
> we can only enter the login id; the login times out waiting for the
> password prompt.
> 
> 
> Eventually, the bacula backup will time out: at this point, the machines
> come back to life. This takes about 15 - 20 minutes. The backup,
> however, does not complete successfully. In fact, very little happens on
> the backup at all :)
> 
> We're very puzzled by this -- we suspect an interrupt issue, but we
> really don't have a clue where to start looking. Other people seem to
> have reported similar IO-related problems.



_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users


 

