
Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL



On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> > > From mere code inspection and document of lwip 1.3.0 I think mini-os
> > > does send gratuitous ARP.
> > 
> > The guest is using the PVHVM drivers at this point, with the backend
> > directly in dom0, so it is the guest's gratuitous arp which is needed,
> > I think.
> 
> It would be worth investigating whether mini-os's gratuitous ARP might
> also be occurring and confusing things, e.g. by coming after and
> therefore taking precedence over the one coming from the guest.
> 

Several observations:

1. The guest doesn't always send a gratuitous ARP -- but this might not be
   the cause of this failure. The guest works fine when using qemu-trad
   only.
2. The guest sends at most one gratuitous ARP.
3. When using stubdom, the guest is a lot less responsive. See the two
   experiments and analysis below.

I statically added an ARP entry for the guest interface because the ARP
entry sometimes gets deleted. Note that this does not cover up the root
cause of the failure, because the ARP entry is normally deleted only after
a few migration iterations, whereas the failures on merlot* mostly occur
on the first iteration. Also, when the ARP entry is not available, the
error from ssh should be "No route to host", not "timed out".
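
For reference, pinning such an entry in dom0 could look roughly like the
sketch below. This is an assumption, not a record of what was actually run:
the message does not say which command was used. The sketch assumes
iproute2 is installed; the function name, its arguments and the DRY_RUN
switch are hypothetical, and xenbr0 is only taken as the default because it
is the bridge named elsewhere in this mail.

```shell
# Sketch only: pin a permanent neighbour (ARP) entry for the guest in dom0.
# Arguments: guest IP, guest MAC, optional bridge name (default xenbr0).
# DRY_RUN=1 (hypothetical switch) prints the command instead of running it.
pin_guest_arp() {
    ip_addr=$1
    mac=$2
    dev=${3:-xenbr0}
    cmd="ip neigh replace $ip_addr lladdr $mac dev $dev nud permanent"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$cmd"
    else
        # Needs root; word splitting of $cmd is intentional here.
        $cmd
    fi
}
```

With DRY_RUN=1 set, calling the function with the guest's IP and MAC only
prints the command, which is a convenient way to check it before running it
as root.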

Furthermore, when the ARP entry is not available, dom0 naturally sends an
ARP request to the guest. When stubdom is not in use, the guest responds
instantly; when stubdom is in use, the guest is a lot less responsive.

I use a script to repeat migration and ssh:

  i=1
  while true; do
      echo "#### iteration $i"
      ssh localhost xl migrate wheezy-hvm localhost
      # Save the exit status first; "[ ... ]" below would overwrite $?.
      st=$?
      if [ $st != 0 ]; then
          echo "migration failed $st"
          exit 1
      fi
      # timeout(1) gives up after 40s if ssh has not returned by then.
      timeout 40 ssh -o BatchMode=yes -o ConnectTimeout=100 \
          -o ServerAliveInterval=100 root@xxxxxxxxxxxx date
      st=$?
      if [ $st != 0 ]; then
          echo "failed $st"
          exit 1
      fi
      i=$((i+1))
  done

At the same time I run:
  tcpdump -i xenbr0 arp and host $GUEST_IP

When stubdom is present, I see one of the following two scenarios.

Scenario 1:
  xl shows "Migration successful."
  ...30s...
  xenbr0 receives gratuitous arp
  ...1s...
  ssh date command comes back

Scenario 2:
  xenbr0 receives gratuitous arp
  ...1s...
  xl shows "Migration successful."
  ssh date command comes back

When stubdom was not present I never saw scenario 1.

Note that my machine is relatively old (>6 years). It would never pass
the test in osstest because the timeout in osstest is 10s.
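
To put a number on that delay, something like the following sketch can be
used. It is an illustration only and not what osstest does: the function
name is hypothetical, and it assumes a `date` that supports `+%s`. It
retries a command once per second and prints how many seconds passed before
the command first succeeded, or fails once the given deadline is reached.

```shell
# Sketch: retry a command until it succeeds or DEADLINE seconds have passed.
# Prints the elapsed whole seconds on success; returns 1 on timeout.
wait_until_ok() {
    deadline=$1
    shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            return 0
        fi
        now=$(date +%s)
        if [ $(( now - start )) -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
}
```

For example, running
`wait_until_ok 10 ssh -o BatchMode=yes -o ConnectTimeout=2 root@"$GUEST_IP" date`
right after "Migration successful." mimics a 10s limit. In scenario 1 above,
where the guest only becomes reachable after roughly 30s, it would return 1.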

The slowness in osstest seems to be host-specific, because all failures
in the guest migrate test occurred on merlot*. It's not only linux-4.1 that
is failing; other branches fail the same test step on merlot*, too.

Wei.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

