
RE: [Xen-devel] arp during live migration


  • To: <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Graham, Simon" <Simon.Graham@xxxxxxxxxxx>
  • Date: Tue, 6 Mar 2007 17:59:31 -0500
  • Delivery-date: Tue, 06 Mar 2007 14:58:40 -0800
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: AcdfPJTYp3d71aDxRxK0gccT+4xaGgAQl6OgADB089A=
  • Thread-topic: [Xen-devel] arp during live migration

> >  > In my case, I NEVER see the gratuitous ARP being sent (confirmed
> >  > using tcpdump on peth0 in Dom0) and the return value from
> >  > dev_queue_xmit is sometimes 0 and sometimes 2 (that's PLUS 2 --
> >  > congestion notification [NET_XMIT_CN]).
> >
> > I am seeing the same error, indeed it looks like it is NET_XMIT_CN.
> > I also see 100% percent loss, the ARP never makes it to the wire in
> > any of my tests.
> >
> 

I guess no one else is seeing this problem? 

Anyway -- after a fair amount of stumbling around I think I know what
the problem is (but I don't have a solution). For a while I thought it
was an SMP bug in the netfront/netback interaction but, although there
is some dodgy code there, it does seem that netfront always sends the
gratuitous ARP and the backend always picks it up.

The real problem seems to be in the bridge in Dom0; it seems that the
VIF port to the bridge is always in the disabled state when the ARP is
sent, so it simply gets dropped. 

Why is this? Well, the bridge doesn't enable the port until the VIF is
both up AND has link (netif_carrier_on() has been called) -- this latter
call is not made until netfront connects to netback.

What's more, this change is not passed to the bridge code until the next
time the linkwatch worker runs, which could be up to 1s after
netif_carrier_on() is called... at least, that's how it looks to me...
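
To make the sequence concrete, here is a toy model of it in Python. This
is only a sketch of my reading of the behaviour, not kernel code: the
BridgePort class, its method names and the 1s NOTIFY_DELAY_S constant
are all made up for illustration, and the real delay may well be
shorter than the worst case.

```python
NOTIFY_DELAY_S = 1.0  # assumption: worst-case lag before the bridge hears about carrier


class BridgePort:
    """Hypothetical model of a bridge port backed by a VIF."""

    def __init__(self):
        self.admin_up = False          # interface has been brought up
        self.carrier = False           # netif_carrier_on() has been called
        self.enabled = False           # what the bridge currently believes
        self.pending_notify_at = None  # when the deferred notification fires

    def set_carrier_on(self, now):
        # Carrier changes at once, but the bridge is only told later.
        self.carrier = True
        self.pending_notify_at = now + NOTIFY_DELAY_S

    def tick(self, now):
        # Deliver the deferred carrier notification once it is due.
        if self.pending_notify_at is not None and now >= self.pending_notify_at:
            self.pending_notify_at = None
            self.enabled = self.admin_up and self.carrier

    def forward(self, frame, now):
        # A disabled port drops the frame (modelled as returning None).
        self.tick(now)
        return frame if self.enabled else None


def simulate():
    port = BridgePort()
    port.admin_up = True
    port.set_carrier_on(now=0.0)                     # netfront connects to netback
    arp = port.forward("gratuitous ARP", now=0.01)   # sent right after resume
    later = port.forward("data frame", now=1.5)      # sent after the notification
    return arp, later
```

In this model simulate() drops the gratuitous ARP, because the port is
still disabled 10ms after carrier comes up, while the frame sent at 1.5s
is forwarded.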

All of this leads to a ~1s delay in setting up the network path and,
because the gratuitous ARP is dropped, the network blackout can be MUCH
longer than that. If you are trying to get sub-second blackout on
migration this is a big problem!

It seems to me that the right thing to do here is to have the link up on
the VIF in advance of the domain resuming on the target but I'm guessing
that this would cause netback to have conniptions...

All suggestions welcome...
Simon

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

