
Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL



On Thu, Oct 22, 2015 at 11:39:39AM +0100, Ian Campbell wrote:
> On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote:
> > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote:
> > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030:
> > > > > > regressions - FAIL"):
> > > > > > > From mere code inspection and the documentation of lwip 1.3.0
> > > > > > > I think mini-os does send gratuitous ARP.
> > > > > > 
> > > > > > The guest is using the PVHVM drivers at this point, with the
> > > > > > backend directly in dom0, so it is the guest's gratuitous arp
> > > > > > which is needed, I think.
> > > > > 
> > > > > It would be worth investigating whether mini-os's gratuitous ARP
> > > > > might also be occurring and confusing things, e.g. by coming after
> > > > > and therefore taking precedence over the one coming from the guest.
> > > > > 
> > > > 
> > > > Several observations:
> > > > 
> > > > 1. The guest doesn't always send gratuitous arp -- but this might
> > > >    not be the cause of this failure. Guest works fine when using
> > > >    qemu-trad only.
> > > 
> > > As in it always sends the arp when using qemu-trad, or that it is fine
> > > irrespective of not always sending it?
> > > 
> > 
> > Whether or not stubdom is in use, the guest behaves the same -- it
> > doesn't always send gratuitous arp.
> > 
> > When using qemu-trad alone, it's always fine when it doesn't send
> > gratuitous arp, because either there is a cache entry in dom0 that
> > already has the guest mac address or the guest responds instantly to
> > dom0's arp request.
> 
> Where has this cache entry come from? Any preexisting ARP cache would be
> associated with vifX.0 and would go away when that device was destroyed and
> replaced with vif(X+1).0.
> 

No, the vif-bridge script has two runes for off-lining a vif:
  brctl delif $bridge $vif
  ifconfig $vif down

Neither of these causes the cache entry to be flushed.
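
To check this, one could look at dom0's neighbour table before and after
the vif goes away. What follows is only a rough sketch, assuming
iproute2's `ip` is available in dom0 and that dom0's address sits on the
bridge, so the entry is keyed on the bridge rather than on the vif. The
bridge name and guest address are placeholders, and the helper names are
made up for illustration:

```shell
#!/bin/sh
# Sketch only: print (or, with DRY_RUN=0, run) commands to inspect and drop
# dom0's neighbour entry for a guest. Bridge and address are placeholders.

run() {
    # Echo the command by default so the sketch is safe to try unprivileged.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

show_guest_neigh() {
    # $1 = bridge device (assumed to hold dom0's address), e.g. xenbr0
    run ip neigh show dev "$1"
}

drop_guest_neigh() {
    # $1 = bridge device, $2 = guest IP address (placeholder)
    run ip neigh del "$2" dev "$1"
}

show_guest_neigh xenbr0
drop_guest_neigh xenbr0 192.0.2.10
```

With the default DRY_RUN=1 this only prints the two `ip neigh` commands;
setting DRY_RUN=0 as root would actually run them.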

> Also this only works for localhost migration. If the domain actually moved
> to another host then the ARP is required in order for the physical switch
> to learn the new location.
> 
> Thus it seems to me that not always sending the gratuitous ARP is the most
> important thing to get to the bottom of here.
> 

That's another issue, but it would cause a different error (no route to
host) rather than a timeout. The failure here is a timeout -- let's do
one thing at a time.

> > So it comes down to this: the responsiveness of the guest is the key.
> > 
> [...]
> > > > 3. When using stubdom, guest is a lot less responsive. See two
> > > >    experiments and analysis below.
> > > 
> > > Less responsive in use or only while migrating, or to ssh after
> > > migration, or to something else?
> > > 
> > 
> > For every activity after migration for a period of time, including both
> > arp request / reply and ssh connection.
> > 
> > > > Scenario 1:
> > > >   xl shows "Migration successful."
> > > >   ...30s...
> > > >   xenbr0 receives gratuitous arp
> > > >   ...1s...
> > > >   ssh date command comes back
> > > > 
> > > > Scenario 2:
> > > >   xenbr0 receives gratuitous arp
> > > >   ...1s...
> > > >   xl shows "Migration successful."
> > > >   ssh date command comes back
> > > > 
> > > > When stubdom was not present I never saw scenario 1.
> 
> So in that case you only saw Scenario 2, which includes a "receives
> gratuitous ARP". But above you state that even in the non-stubdom case the
> gratuitous ARP is sometimes not sent. Is this a 3rd case which isn't
> mentioned here?
> 

Scenario 3:
  xl shows "Migration successful."
  dom0 sends arp request because arp cache entry not available
  guest takes a long time to respond when using stubdom or responds
    instantly when not using stubdom

Scenario 4:
  xl shows "Migration successful."
  (arp cache entry still available)
  guest takes a long time to respond to ssh when using stubdom or
    responds instantly when not using stubdom
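
One way to put numbers on "a long time" versus "instantly" would be to
time a probe from dom0 right after xl prints "Migration successful.".
This is only a sketch: the function name, the timeout and the guest
address are placeholders, and it has one-second resolution at best.

```shell
#!/bin/sh
# Sketch only: report how many seconds pass before a probe command succeeds.

time_until_ok() {
    # $1 = timeout in seconds; remaining args = probe command to retry
    limit=$1; shift
    start=$(date +%s)
    while ! "$@" >/dev/null 2>&1; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$limit" ]; then
            echo "timeout after ${limit}s"
            return 1
        fi
        sleep 1
    done
    echo "ok after $(( $(date +%s) - start ))s"
}

# Hypothetical usage from dom0 (address is a placeholder):
#   time_until_ok 60 ping -c 1 -W 1 192.0.2.10
```

Running it once with and once without stubdom would make the
responsiveness gap in scenarios 3 and 4 directly comparable.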

> > > It would be worth looking at the possibility of a delay between
> > > "Migration successful" and the target domain actually running. A 30s
> > > delay between the guest restarting and it sending the ARP would be
> > > pretty strange IMHO
> > > 
> > 
> > The guest is in a weird state.
> > 
> > xl list shows the stubdom is in "b" state while guest has no state at
> > all, heh.
> 
> Has it actually been started/unpaused then?
> 

Yes, of course -- otherwise the state would have been "p". And I
observed the transition from "p" to "weird state".

Wei.

> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel