On Fri, 2015-09-04 at 10:57 +0100, Keith Roberts wrote:
Hi all.
I recently updated a box from openSUSE 12.3 to openSUSE 13.1 evergreen,
You might find it beneficial to ask on an openSUSE list or forum or whatever. … Is libvirt using xend or libxl as the underlying toolstack in your configuration?
If libvirtd is using libxl then you _must_ stop the xend daemon altogether since they do not play nice together (although the bugs do not look like the log messages you have here IIRC).
If libvirt is using xend then you obviously need xend running, but I have no idea how the domains from libvirt vs direct ones will interact.
-------------------------------------------------
Here’s an example of /var/log/messages from the Dom-0 VM host server:
[ 3929.511206] blktap_device_fail_pending_requests: 252:7: failing pending read of 11 pages
[ 3929.520454] end_request: I/O error, dev tapdevh, sector 21018928
[ 3929.529812] blktap_device_fail_pending_requests: 252:7: failing pending read of 11 pages
[ 3929.539240] end_request: I/O error, dev tapdevh, sector 21019016
[ 3929.539250] end_request: I/O error, dev tapdevh, sector 21020040
[ 3929.539272] end_request: I/O error, dev tapdevh, sector 21020128
[ 3929.539290] end_request: I/O error, dev tapdevh, sector 21020216
[ 3929.539307] end_request: I/O error, dev tapdevh, sector 21020304
[ 3929.539325] end_request: I/O error, dev tapdevh, sector 21020392
[ 3929.539346] end_request: I/O error, dev tapdevh, sector 21020480
[ 3929.539365] end_request: I/O error, dev tapdevh, sector 21020568
[ 3929.539387] end_request: I/O error, dev tapdevh, sector 21020656
These might not even be toolstack related, they are from tapdisk. Maybe something broke with that in the upgrade? Or maybe the old and new toolstacks choose different disk backends and the new one has chosen tapdisk which was always buggy but you didn't notice?
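To see which block devices these errors come from, the end_request entries can be tallied per device. A minimal sketch in Python, assuming the kernel log format shown above; the function name count_io_errors is made up for illustration and is not part of any Xen or libvirt tool:

```python
import re
from collections import Counter

# Matches kernel entries such as:
#   [ 3929.520454] end_request: I/O error, dev tapdevh, sector 21018928
# The format is assumed from the log excerpt above.
IO_ERROR = re.compile(r"end_request: I/O error, dev ([^,\s]+), sector (\d+)")


def count_io_errors(log_text):
    """Return a Counter mapping device name to number of I/O error entries."""
    return Counter(match.group(1) for match in IO_ERROR.finditer(log_text))
```

Reading /var/log/messages into a string and passing it to count_io_errors would show whether the errors are confined to tapdev* devices (which would point at tapdisk) or also hit the underlying physical disks.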
Thanks for those pointers Ian.
I will take another look at this and report back with my findings soon.
Well I have tried to replicate this issue on a spare server.
I have cloned the OS image on which the VMs were freezing to the spare server, which ran overnight, but no error messages like those above have appeared in the host server logs, and there are no errors in the test PV VM logs.
I’ve had another look at this and have managed to replicate the issue on a test server.
I started 9 VMs and installed bonnie++ on each PV VM to stress-test the I/O for all running VMs.
This is the GRUB boot command:
title Xen -- suse 13.1 production OS image - using Device Mapper ID's - 3.11.10-29-xen
    root (hd1,6)
    kernel /boot/xen.gz loglvl=all guest_loglvl=all
    module /boot/vmlinuz-3.11.10-29-xen root=/dev/disk/by-id/scsi-36848f690ee6632001d2fb03018befc6c-part7 nomodeset
    module /boot/initrd-3.11.10-29-xen
This reproduced the production server's issue on the test server, producing a similar I/O error with similar results, i.e. the VMs freezing and locking up, and the host machine becoming unusable. I had to push the power button to shut down the server, then fsck the RAID drives and re-install the 9 VMs.
2015-09-10T09:28:54.130433+01:00 xen-cpp kernel: [70572.192002] BUG: soft lockup - CPU#3 stuck for 22s! [tapdisk2:4115]
2015-09-10T09:28:54.130450+01:00 xen-cpp kernel: [70572.192002] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables af_packet bridge stp llc nbd blktap blktap2 pciback usbbk xen_scsibk blkbk blkback_pagemap netbk xenbus_be gntdev evtchn coretemp crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel joydev ablk_helper iTCO_wdt iTCO_vendor_support cryptd hid_generic lrw gpio_ich usbhid dcdbas gf128mul tg3 glue_helper domctl aes_x86_64 libphy ses enclosure lpc_ich sb_edac pcspkr acpi_power_meter edac_core sr_mod button mfd_core mei_me mei ptp wmi pps_core ntb 8250 serial_core shpchp sg dm_mod autofs4 ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea ehci_pci ehci_hcd usbcore usb_common processor thermal_sys hwmon scsi_dh_hp_sw scsi_dh_emc scsi_dh_rdac scsi_dh_alua scsi_dh xenblk cdrom xennet megaraid_sas
2015-09-10T09:28:54.130453+01:00 xen-cpp kernel: [70572.192002] CPU: 3 PID: 4115 Comm: tapdisk2 Not tainted 3.11.10-29-xen #1
Having Googled for this error and similar errors, it appears the issue could be one of the following:
1) Duplicate domain definitions in xend and libvirt.
2) Barriers and device mapper issues.
3) Kernel bug.
So I will work on trying each of these options to see if/what fixes the issue.
Thanks for all the pointers so far.
Kind Regards,
Keith