
Re: [Xen-devel] phy disks and vifs timing out in DomU



On 29/07/2011 21:01, Konrad Rzeszutek Wilk wrote:
> [Ian, I copied you on this b/c of the netbk issue - read on]
>
>>>>>>> On Thu, Jul 28, 2011 at 7:24 AM, Anthony Wright 
>>>>>>> <anthony@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> I have a 32 bit 3.0 Dom0 kernel running Xen 4.1. I am trying to run a 
>>>>>>>> 32 bit PV DomU with two tap:aio disks, two phy disks & 1 vif. The two 
>>>>>>>> tap:aio disks are working fine, but the phy disks and the vif don't 
>>>>>>>> work and I get the following error messages from the DomU kernel 
>>>>>>>> during boot:
>>>>>>>>
>>>>>>>> [    1.783658] Using IPI No-Shortcut mode
>>>>>>>> [   11.880061] XENBUS: Timeout connecting to device: device/vbd/51729 (state 3)
>>>>>>>> [   11.880072] XENBUS: Timeout connecting to device: device/vbd/51745 (state 3)
> Hm, which kernel version were these DomUs? I wonder if this is related
> to the 'feature-barrier' that is not supported with 3.0. Do you see
> anything in the DomU logs about the disks, or from xen-blkfront? Can
> you run the guests with 'initcall_debug loglevel=8 debug' to see if
> blkfront is actually running on those disks.
I have attached the domU console output with these options set.
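
(For reference, I set those options via the guest kernel command line in
the DomU config - the config file name and kernel path below are just
illustrative:

    # DomU config, e.g. /etc/xen/domu.cfg: append the debug options
    # to the guest's kernel command line
    kernel = "/path/to/domU-vmlinuz"
    extra  = "initcall_debug loglevel=8 debug"

The extra initcall output then appears on the guest console, which is
what is captured in the attached log.)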

I have also spent a fair amount of time trying to narrow down the
conditions that cause it, with lots of hardware switching & disk
imaging. The conclusion I came to is that it's not hardware related;
rather, there is a subtle interaction with LVM causing the problem,
though I'm struggling to work out how to narrow it down any further
than that.

I started with a setup that worked (Machine 1 with HDD 1, IDE) and a
setup that didn't (Machine 2 with HDD 2, SATA). Machine 2 has an IDE
port, so I unplugged HDD 2 and put HDD 1 in Machine 2; that setup
worked, which excludes most of the hardware. Next I imaged HDD 3 (SATA)
from HDD 1 (IDE), unplugged HDD 1 and put HDD 3 in Machine 2; that
setup also worked, which excludes an IDE/SATA issue and gave me a disk
I could safely experiment with.

The disks are organised into two partitions: partition 1 is for Dom0,
and partition 2 is an LVM volume group used for the DomUs. One LV in
this volume group (called Main) is used by Dom0 to hold the DomU
kernels, config information and other static data & executables; the
rest of the VG is issued as LVs to the various DomUs as needed, with a
fair amount of free space left in the VG.

I took the Main LV from HDD 2 (the broken setup), imaged it onto HDD 3,
and with some judicious LV renaming booted against this image - and the
setup failed. Great, I thought - it looks like a very subtle config
issue. Next I created a third LV, this time imaged from the Main LV
that worked, giving me three Main LVs (I called them Main-Works,
Main-Broken & Main-Testing), and simply used lvrename to select the one
I wanted as active. However, now I couldn't get the setup to work with
any of the three Main LVs, including the one that originally worked.
Removing the LVs I had recently created and going back to the original
Main LV, the setup started working again.
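
For reference, the imaging and switching was done roughly like this
(the VG name vg0, the LV size, and the assumption that the DomU config
refers to an LV literally named Main are all illustrative):

    # image a known-good Main LV into a new LV of the same size
    lvcreate -L 10G -n Main-Testing vg0
    dd if=/dev/vg0/Main-Works of=/dev/vg0/Main-Testing bs=4M

    # exactly one copy is named Main at a time; to switch, park the
    # active copy under its suffixed name, then rename the copy to test
    lvrename vg0 Main Main-Works
    lvrename vg0 Main-Testing Main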

I'm going to try an up-to-date version of LVM (the one I'm using is a
little out of date) to see if that makes any difference, though the
version I have at the moment has worked without problems in the past.
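
(For comparison purposes, the exact version in use can be pulled with:

    lvm version

which prints the LVM, library and device-mapper driver versions.)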
> Any idea where the source for those DomUs is? If it is an issue with
> 'feature-barrier', it looks like the frontend can't cope with that
> option not being visible, which it should.
>
We build the DomUs with a tightly controlled internal build system, so
I have a full manifest for the DomU.
>>> What device does that correspond to (hint: run xl block-list or xm 
>>> block-list)?
>>>
>> The output from block-list is:
>>
>> Vdev   BE  handle  state  evt-ch  ring-ref  BE-path
>> 51729  0   764     3      10      10        /local/domain/0/backend/vbd/764/51729
>> 51745  0   764     3      11      11        /local/domain/0/backend/vbd/764/51745
>> 51713  0   764     4      8       8         /local/domain/0/backend/qdisk/764/51713
>> 51714  0   764     4      9       9         /local/domain/0/backend/qdisk/764/51714
>>
>> The two vbds map to two LVM logical volumes in two different volume groups.
> qdisk... OK, so it does swap over to the QEMU internal AIO path. From
> the output it looks like the ones that hang are the 'phy' type - is
> that right?
>
The ones that hang are phy and are the first two, with vdev numbers of
51729 & 51745.
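
(Decoding those vdev numbers, on the assumption that they follow the
standard xvd scheme of 202 * 256 + 16 * disk + partition: 51713 and
51714 are xvda1 and xvda2, the two working qdisks, while
51729 = 51712 + 16 + 1 is xvdb1 and 51745 = 51712 + 32 + 1 is xvdc1,
the two phy-backed LVs.)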
>> On 29/07/2011 17:06, Konrad Rzeszutek Wilk wrote:
>>>>> I have installed virtually identical systems on two physical machines -
>>>>> identical (and I mean identical) xen, dom0, domU with possibly a
>>> md5sum match?
>> Yes - md5sum match on all the key components, i.e. xen, the dom0
>> kernel, 99.9% of the root filesystem, the domU kernel & 99.9% of the
>> domU filesystem. The only places without a precise match are some of
>> the config files. I don't think these should have any effect, but I
>> will have a go at mirroring the disks (I can't swap the disks since
>> one is SATA & the other IDE).
>>
>> I was also having problems with the vif device, and got a kernel bug
>> report that could potentially relate to it. I've attached two syslogs.
> Yeah, that is bad. I actually see a similar issue if I forcibly kill
> the guests. I haven't yet narrowed it down. You look to be using 4.1,
> but not 4.1.1, right?
I started with 4.1.0 but upgraded to 4.1.1 in the hope that it might
fix the problem. The vif timeouts have happened with both versions. I
think the kernel errors have only been happening since I upgraded to
Xen 4.1.1, though I'm not sure; I've also had a number of kernel oopses
in place of the kernel errors.
> Can you describe to me how you get the netbk crash?
When the DomU realises it has a problem with one of its disks, it
issues a warning message and then shuts itself down. The netbk crash
happens partway through that shutdown process, but not at a point where
the DomU is touching the network (as far as I know) - it is issuing
SIGKILLs to all processes. The crash always occurs at the same point in
the shutdown, but the shutdown pauses there for quite a while, and
since it doesn't touch the network I'm not convinced it's triggered by
something the DomU is doing. The netbk crash only happens the first
time the DomU starts up and shuts down; it doesn't happen on subsequent
startup/shutdown cycles, and it also doesn't happen if the disks work
correctly. I do have a setup that consistently reproduces it.
>> 2011 Jul 29 07:02:10 kernel: [   33.242680] vbd vbd-1-51745: 1 mapping ring-ref 11 port 11
>> 2011 Jul 29 07:02:10 kernel: [   33.253038] vif vif-1-0: vif1.0: failed to map tx ring. err=-12 status=-1
>> 2011 Jul 29 07:02:10 kernel: [   33.253065] vif vif-1-0: 1 mapping shared-frames 768/769 port 12
>> 2011 Jul 29 07:02:43 kernel: [   66.103514] vif vif-1-0: 2 reading script
>> 2011 Jul 29 07:02:43 kernel: [   66.106265] br-internal: port 1(vif1.0) entering disabled state
>> 2011 Jul 29 07:02:43 kernel: [   66.106309] libfcoe_device_notification: NETDEV_UNREGISTER vif1.0
>> 2011 Jul 29 07:02:43 kernel: [   66.106333] br-internal: port 1(vif1.0) entering disabled state
>> 2011 Jul 29 07:02:43 kernel: [   66.106372] br-internal: mixed no checksumming and other settings.
>> 2011 Jul 29 07:02:43 kernel: [   66.114097] ------------[ cut here ]------------
>> 2011 Jul 29 07:02:43 kernel: [   66.114878] kernel BUG at mm/vmalloc.c:2164!
>> 2011 Jul 29 07:02:43 kernel: [   66.115058] invalid opcode: 0000 [#1] SMP
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] Modules linked in:
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] Pid: 20, comm: xenwatch Not tainted 3.0.0 #1 MSI MS-7309/MS-7309
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] EIP: 0061:[<c0494bff>] EFLAGS: 00010203 CPU: 1
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] EIP is at free_vm_area+0xf/0x19
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] EAX: 00000000 EBX: cf866480 ECX: 00000018 EDX: 00000000
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] ESI: cfa06800 EDI: d076c400 EBP: cfa06c00 ESP: d0ce7eb4
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] Process xenwatch (pid: 20, ti=d0ce6000 task=d0c55140 task.ti=d0ce6000)
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] Stack:
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  cfa06c00 c09e87aa fffc6e63 c0c4bd65 d0ce7ecc cfa06844 d0ce7ecc d0ce7ecc
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  cfa06c00 cfa06800 d076c400 cfa06c94 c09eace0 d04cd380 00000000 fffffffe
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  d0ce7f9c c061fe74 d04cd2e0 d076c420 d076c400 d0ce7f9c c09e9f8c d076c400
>> 2011 Jul 29 07:02:43 kernel: [   66.115376] Call Trace:
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c09e87aa>] ? xen_netbk_unmap_frontend_rings+0xbf/0xd3
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c0c4bd65>] ? netdev_run_todo+0x1b7/0x1cc
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c09eace0>] ? xenvif_disconnect+0xd0/0xe4
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c061fe74>] ? xenbus_rm+0x37/0x3e
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c09e9f8c>] ? netback_remove+0x40/0x5d
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c062075d>] ? xenbus_dev_remove+0x2c/0x3d
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c06620e6>] ? __device_release_driver+0x42/0x79
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c06621ac>] ? device_release_driver+0xf/0x17
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c0661818>] ? bus_remove_device+0x75/0x84
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c0660693>] ? device_del+0xe6/0x125
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c06606da>] ? device_unregister+0x8/0x10
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c06205f0>] ? xenbus_dev_changed+0x71/0x129
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c0405394>] ? check_events+0x8/0xc
>> 2011 Jul 29 07:02:43 kernel: [   66.115376]  [<c061f711>] ? xenwatch_thread+0xeb/0x113
>> 2011 Jul 29 07:02:43 kernel: [   66.129624]  [<c044792c>] ? wake_up_bit+0x53/0x53
>> 2011 Jul 29 07:02:43 kernel: [   66.129624]  [<c061f626>] ? xenbus_thread+0x1cc/0x1cc
>> 2011 Jul 29 07:02:43 kernel: [   66.129624]  [<c0447616>] ? kthread+0x63/0x68
>> 2011 Jul 29 07:02:43 kernel: [   66.129624]  [<c04475b3>] ? kthread_worker_fn+0x122/0x122
>> 2011 Jul 29 07:02:43 kernel: [   66.129624]  [<c0e0f036>] ? kernel_thread_helper+0x6/0x10
>> 2011 Jul 29 07:02:43 kernel: [   66.129624] Code: c1 00 00 00 01 89 f0 e8 a1 ff ff ff 81 6b 08 00 10 00 00 eb 02 31 db 89 d8 5b 5e c3 53 89 c3 8b 40 04 e8 9b ff ff ff 39 d8 74 04 <0f> 0b eb fe 5b e9 73 95 00 00 57 89 d7 56 31 f6 53 89 c3 eb 09
>> 2011 Jul 29 07:02:43 kernel: [   66.129624] EIP: [<c0494bff>] free_vm_area+0xf/0x19 SS:ESP 0069:d0ce7eb4
>> 2011 Jul 29 07:02:43 kernel: [   66.129624] ---[ end trace 7bb110af96f32256 ]---

Attachment: bootlog-debug
Description: Text document
