
Re: [Xen-devel] The strange case of xen_netback not returning ARP replies



On Wed, May 16, 2012 at 02:18:27PM +0200, Joanna Rutkowska wrote:
> Hello,
> 
> I'm facing a rather strange problem with the netback interface. My setup
> involves a netvm, which has some physical network interfaces assigned,
> and a client VM where a net front is running (exposed as eth0) and which
> is connected to that netvm (via vif42.0 interface, as seen in the netvm
> on the dumps below).
> 
> Now, the netvm has two physical network interfaces assigned:
> 1) A standard Intel AGN (iwlwifi module, interface wlan0) -- this is
> just an assigned PCI device
> 
> 2) A USB 3G modem (cdc_ncm module, usb0 interface) -- this has been made
> available to the netvm by assigning the whole USB controller to which the
> 3G modem is connected. This works fine.

There are some patches posted about netback and SKB slots that might
apply to the problem you guys are seeing.

> 
> We do NAT in netvm for the traffic coming on vif* and send it out
> through the default outgoing interface, e.g. wlan0. Now, as long as I
> use the wlan0 for networking all works great. I've been using this setup
> for years, no problem here.
> 
> However, when I switch to usb0 as a default outgoing interface in the
> netvm, something strange happens. The networking works fine via usb0 for
> some time (a few minutes typically), yet suddenly, after enough packets
> got exchanged, the networking stops working.
> 
> When I run tcpdump on the vif* interface I can see that suddenly there
> is nobody (in the netvm) to reply to the ARP requests from the client
> VM (the client vm has Xen ID = 42 in this dump, and IP = .5, and gateway
> = .1):
> 
> [root@netvm user]# tcpdump -ni vif42.0 arp
> tcpdump: WARNING: vif42.0: no IPv4 address assigned
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
> 13:41:55.031819 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
> 13:41:56.031860 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
> 13:41:57.031794 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
> 13:41:59.287308 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
> 13:42:00.283853 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
> 13:42:01.283816 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
> 13:42:03.231324 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length
> 
> ... and this now continues without end.
> 
> For comparison, this is how it looks when I use networking via wlan0:
> 
> [root@netvm user]# tcpdump -ni vif42.0 arp
> tcpdump: WARNING: vif42.0: no IPv4 address assigned
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
> 13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
> 13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28
> 13:39:21.799844 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
> 13:39:21.799869 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28
> 
> We can see that every once in a while an ARP request for 10.137.1.1
> appears (a gateway for clientvm, so the netvm), yet this is immediately
> being answered (by netvm, as I understand).
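
To spot the moment the replies stop without reading the dump by eye, the
request/reply pairing described above could be checked with a small script.
This is only a sketch: the function name and the regular expressions are my
own assumptions, written against the tcpdump line format shown in the two
dumps, and it treats any later reply for an IP as answering all earlier
requests for that IP.

```python
import re

# Patterns for the two ARP line shapes seen in the tcpdump output above, e.g.
#   13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
#   13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28
REQUEST_RE = re.compile(r"^(\S+) ARP, Request who-has (\S+) tell ")
REPLY_RE = re.compile(r"^(\S+) ARP, Reply (\S+) is-at ")


def unanswered_arp_requests(lines):
    """Return (timestamp, target_ip) for every request never followed by a reply.

    Hypothetical helper: a reply for an IP clears all earlier pending
    requests for that same IP; whatever is left at the end is unanswered.
    """
    pending = []
    for raw in lines:
        line = raw.strip()
        request = REQUEST_RE.match(line)
        if request:
            pending.append((request.group(1), request.group(2)))
            continue
        reply = REPLY_RE.match(line)
        if reply:
            answered_ip = reply.group(2)
            pending = [(ts, ip) for ts, ip in pending if ip != answered_ip]
    return pending
```

Fed the saved lines of the usb0 capture, this would list every request as
unanswered; fed the wlan0 capture, it would return an empty list.
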
> 
> Now, this behavior seems really strange, because:
> 
> 1) AFAIU, the ARP replies are/should be generated by the netback
> interface in the netvm (vif*).
> 
> 2) It shouldn't matter, for the netback code, how the packets are later
> routed (via wlan0 vs. usb0) to provide this (dummy) arp response?
> 
> 3) ...yet, for some reason, in the case when packets are later routed
> through usb0, the netback is not willing to generate arp response???
> 
> Or am I misunderstanding this, and it is somebody else who is generating
> the arp responses? The final NIC?
> 
> Some additional notes:
> 1) We make sure to set /proc/sys/net/ipv4/conf/vif*/proxy_arp to 1
> 
> 2) When this "arp hang" happens, the networking (via usb0) is still
> working fine in the netvm (i.e. I can do ping google.com from the netvm)
> 
> 3) This has been tested on various VM kernels (in the netvm): 3.0.4,
> 3.2.7, and 3.3.5 -- all exhibit the same behavior.
> 
> 4) Nothing spectacular in the logs of the netvm, however, I can often
> see this warning trace in the *client* VM:
> 
> [ 1257.228761] ------------[ cut here ]------------
> [ 1257.228767] WARNING: at
> /home/user/qubes-src/kernel/kernel-3.3.5/linux-3.3.5/fs/sysfs/file.c:498
> sysfs_attr_ns+0x93/0xa0()
> [ 1257.228776] sysfs: kobject eth0 without dirent
> [ 1257.228780] Modules linked in: iptable_raw bnep bluetooth rfkill
> ipt_MASQUERADE ipt_REJECT xt_state xt_tcpudp xen_netback iptable_filter
> iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4
> ip_tables x_tables xen_netfront microcode pcspkr u2mfn(O) xen_blkback
> xen_evtchn autofs4 ext4 jbd2 crc16 dm_snapshot xen_blkfront [last
> unloaded: scsi_wait_scan]
> [ 1257.228819] Pid: 11, comm: xenwatch Tainted: G        W  O
> 3.3.5-1.pvops.qubes.x86_64 #1
> [ 1257.228825] Call Trace:
> [ 1257.228830]  [<ffffffff810495aa>] warn_slowpath_common+0x7a/0xb0
> [ 1257.228836]  [<ffffffff81049681>] warn_slowpath_fmt+0x41/0x50
> [ 1257.228842]  [<ffffffff81057ba7>] ? lock_timer_base+0x37/0x70
> [ 1257.228850]  [<ffffffff811a7433>] sysfs_attr_ns+0x93/0xa0
> [ 1257.228856]  [<ffffffff811a7aef>] sysfs_remove_file+0x1f/0x40
> [ 1257.228862]  [<ffffffff812e5622>] device_remove_file+0x12/0x20
> [ 1257.228870]  [<ffffffffa00faf5a>] xennet_remove+0x84/0xac [xen_netfront]
> [ 1257.228875]  [<ffffffff812b5c82>] xenbus_dev_remove+0x42/0xa0
> [ 1257.228881]  [<ffffffff812e85a7>] __device_release_driver+0x77/0xd0
> [ 1257.228887]  [<ffffffff812e86e8>] device_release_driver+0x28/0x40
> [ 1257.228895]  [<ffffffff812e790f>] bus_remove_device+0x10f/0x180
> [ 1257.228901]  [<ffffffff812e5808>] device_del+0x118/0x1c0
> [ 1257.228906]  [<ffffffff812e58cd>] device_unregister+0x1d/0x60
> [ 1257.228914]  [<ffffffff812b5a46>] xenbus_dev_changed+0x96/0x1b0
> [ 1257.228920]  [<ffffffff812b74b4>] frontend_changed+0x24/0x50
> [ 1257.228926]  [<ffffffff812b4221>] xenwatch_thread+0xb1/0x170
> [ 1257.228933]  [<ffffffff8106aea0>] ? wake_up_bit+0x40/0x40
> [ 1257.228939]  [<ffffffff812b4170>] ? xenbus_thread+0x40/0x40
> [ 1257.228944]  [<ffffffff8106a9a6>] kthread+0x96/0xa0
> [ 1257.228951]  [<ffffffff81465724>] kernel_thread_helper+0x4/0x10
> [ 1257.228959]  [<ffffffff8145c7fc>] ? retint_restore_args+0x5/0x6
> [ 1257.228964]  [<ffffffff81465720>] ? gs_change+0x13/0x13
> [ 1257.228968] ---[ end trace 75286ef58ce0391f ]---
> 
> But this seems rather irrelevant, as it seems like it is the netvm that
> is failing here, i.e. it doesn't generate ARP responses?
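
Since note 1) relies on proxy_arp being set on every vif* interface, it may
be worth confirming the value at the moment the hang occurs rather than only
at setup time. A minimal sketch, assuming the standard
/proc/sys/net/ipv4/conf/<iface>/proxy_arp layout mentioned above; the function
name and its parameters are hypothetical:

```python
from pathlib import Path


def proxy_arp_settings(conf_root="/proc/sys/net/ipv4/conf", prefix="vif"):
    """Map each interface whose name starts with `prefix` to its proxy_arp flag.

    Hypothetical helper: `conf_root` is a parameter only so the function can
    be pointed at a different directory tree when testing.
    """
    settings = {}
    for iface_dir in sorted(Path(conf_root).glob(prefix + "*")):
        flag_file = iface_dir / "proxy_arp"
        if flag_file.is_file():
            # The kernel exposes the flag as the text "0" or "1".
            settings[iface_dir.name] = flag_file.read_text().strip() == "1"
    return settings
```

Running proxy_arp_settings() in the netvm while the ARP requests go
unanswered would show whether the flag is still set for the affected vif.
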
> 
> I would appreciate any help with this issue!
> 
> Thanks,
> joanna.
> 



> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel




 

