RE: [Xen-devel] arp during live migration



>  > In my case, I NEVER see the gratuitous ARP being sent (confirmed using
>  > tcpdump on peth0 in Dom0) and the return value from dev_queue_xmit is
>  > sometimes 0 and sometimes 2 (that's PLUS 2 -- congestion notification
>  > [NET_XMIT_CN]).
>
> I am seeing the same error, indeed it looks like it is NET_XMIT_CN. I
> also see 100% percent loss, the ARP never makes it to the wire in any of
> my tests.
>

So, I have a little more info now. It seems that the ARP is being
assembled and passed to the backend driver, but the backend ignores it
because the VIF link state is down (netif_carrier_ok() returns FALSE).
The link comes up shortly afterwards, but by then the packet has already
been dropped.

The actual sequence of events is also a little strange (but *very*
reproducible):

. In the DomU, I see the following at the end of migration:
   . First, netfront sees the backend state change to InitWait. This
     causes it to attempt to connect the rings and send the ARP (even
     though the current state is actually Connected).
   . Next, the resume processing runs in netfront (I think this is
     expected to run first, but it does not).
   . Now it sees the backend state change to InitWait a second time and
     attempts to send the ARP a second time.

. In Dom0:
  . The first attempt to send the ARP is completely ignored since the
    backend is not connected yet (specifically, it hasn't set up the
    softirq handler).
  . The first thing we see is the frontend state changing to Connected.
    This causes the backend to initialize the connection and set up the
    irq handler.
  . Now we see an irq signaled, BUT it is ignored by the backend because
    netif_carrier_ok() returns FALSE.
  . The very next thing is that the link becomes ready and the backend
    completes its state change to the Connected state.
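
To make the ordering concrete, here is a toy Python model of the Dom0
sequence above. It is only a sketch, not netback code: the VifModel
class, the event names and replay() are all made up for illustration,
and it assumes that a transmit attempt which arrives before the handler
is set up, or while the carrier is down, is simply lost (which matches
what I observe).

```python
from dataclasses import dataclass


@dataclass
class VifModel:
    """Hypothetical stand-in for the backend vif state (not a real netback struct)."""
    handler_ready: bool = False  # backend has set up its irq handling
    carrier_ok: bool = False     # models what netif_carrier_ok() would report
    delivered: int = 0
    dropped: int = 0


def replay(events):
    """Apply events in order and count ARPs that would reach the wire."""
    vif = VifModel()
    for ev in events:
        if ev == "front_tx_arp":
            # Assumption: the ARP only gets through if the backend is fully up.
            if vif.handler_ready and vif.carrier_ok:
                vif.delivered += 1
            else:
                vif.dropped += 1
        elif ev == "front_connected":
            vif.handler_ready = True
        elif ev == "carrier_on":
            vif.carrier_ok = True
        else:
            raise ValueError(f"unknown event: {ev}")
    return vif


# Order seen in Dom0: both ARP attempts arrive before the carrier is up.
OBSERVED = ["front_tx_arp", "front_connected", "front_tx_arp", "carrier_on"]

# Hypothetical order in which the ARP is only sent once the link is ready.
REORDERED = ["front_connected", "carrier_on", "front_tx_arp"]
```

Replaying OBSERVED leaves the delivered count at zero with both attempts
lost, while REORDERED gets the single ARP through. The only difference
between the two is where the carrier-on step falls relative to the
transmit.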

It seems to me that the problem lies in the fact that the backend sees
the ARP packet before it has finished setting up the vif, and so ignores
it.

I don't know if this is relevant, but Dom0 is running with 2 VCPUs in
this configuration, so it's possible that this timing window is not hit
when Dom0 runs as a uniprocessor...

/simgr

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel