
RE: [Xen-users] Problems with live migrate on xen-testing



OK, I've managed to reproduce this: Under 2.0.5 I can make a domain
crash in the same fashion if it is under extreme network receive load
during the migration. 
This used to work fine, so we've obviously introduced a bug in the last
few months.

I'll investigate when I get a chance. We seem to get stuck in a page
fault loop writing to an skb's shared info area, passing through the
vmalloc fault section of do_page_fault. It looks like the PTE is
read-only, which is very odd. We just need to figure out how it got that way.
This smells like the first real Xen-internal bug in the stable series
for several months...
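
For context, the fixup in question looks roughly like this (a simplified
fragment modelled on the stock arch/i386/mm/fault.c path, not our exact
Xen-patched source):

/*
 * A fault on a vmalloc address is normally fixed up by copying the
 * kernel master page-table entries into the current page table and
 * retrying the access.
 */
vmalloc_fault:
{
        pgd_t *pgd, *pgd_k;
        pmd_t *pmd, *pmd_k;
        pte_t *pte_k;

        pgd   = pgd_offset(current->active_mm, address);
        pgd_k = pgd_offset_k(address);          /* master kernel PGD */
        if (!pgd_present(*pgd_k))
                goto no_context;                /* genuinely bad address */
        set_pgd(pgd, *pgd_k);                   /* propagate the entry */

        pmd   = pmd_offset(pgd, address);
        pmd_k = pmd_offset(pgd_k, address);
        if (!pmd_present(*pmd_k))
                goto no_context;
        set_pmd(pmd, *pmd_k);

        pte_k = pte_offset(pmd_k, address);
        if (!pte_present(*pte_k))
                goto no_context;
        /* Only presence is checked here, never writability.  If the
         * propagated PTE is read-only and the faulting access is a
         * write (e.g. to an skb's shared info area), the retried
         * access faults again immediately: the loop described above. */
        return;
}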

Ian

> The patch did not make a difference. I do have a few more 
> data points though. Irrespective of whether the patch is 
> applied, migration without the --live switch works. Further, 
> even --live seems to work if all the memory pages are copied 
> in one iteration. However, if xfrd.log shows that a second 
> iteration has been started, live migration will fail.
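
That pattern fits the shape of the pre-copy loop: iteration 1 sends
every page, and later iterations resend only the pages dirtied in the
meantime, so a bug in the dirty-page resend path would show up exactly
when a second iteration starts. A minimal runnable sketch of that shape
(illustrative only; the real logic lives in the xc save code, and the
names and numbers below are made up):

#include <stdio.h>

#define NR_PAGES  1000   /* illustrative domain size */
#define THRESHOLD   50   /* stop pre-copying below this many dirty pages */
#define MAX_ITERS    4

/* stand-in for querying the shadow-mode dirty bitmap */
static int pages_dirtied_during(int iter)
{
    return NR_PAGES >> (2 * iter);
}

int main(void)
{
    int iter, to_send = NR_PAGES, total = 0;

    for (iter = 1; iter <= MAX_ITERS; iter++) {
        total += to_send;               /* iteration 1 sends everything */
        printf("iter %d: sent %d pages, %.0f%% of memory so far\n",
               iter, to_send, 100.0 * total / NR_PAGES);
        to_send = pages_dirtied_during(iter);   /* resend dirty pages */
        if (to_send < THRESHOLD)
            break;                      /* dirty set small enough */
    }

    /* final stop-and-copy: suspend the domain, send the remainder */
    total += to_send;
    printf("suspend + final copy of %d pages, %.0f%% total\n",
           to_send, 100.0 * total / NR_PAGES);
    return 0;
}

Run, it also shows why the page counter naturally passes 100%: each
pass after the first re-sends pages dirtied during the previous one.
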
> 
> > > After the error on the source machine, while the VM shows up on 
> > > xm list at the destination, xfrd.log (on the destination) shows 
> > > that it's trying to reload memory pages beyond 100%. The number 
> > > of memory pages reloaded keeps on going up until I use 'xm destroy'.
> > 
> > Going beyond 100% is normal behaviour (pages dirtied while an 
> > iteration is in progress get re-sent on the next pass, so the 
> > running total can exceed the domain's memory size), but obviously 
> > it should terminate eventually, after doing a number of iterations.
> > 
> > Posting the xfrd log (after applying the patch) would be 
> > interesting.
> > 
> 
> Applying the patch did not make a difference. However, 
> reducing the amount of memory provided to the migrated domain 
> changes the error message. While the live migration still 
> fails, it no longer keeps loading pages past 100%. It now 
> fails with a message like "Frame number in type 1 page 
> table is out of range".
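
That message corresponds to a range check the restore side applies to
each received L1 ("type 1") page table. A minimal sketch of such a
check (illustrative only, not the actual restore code; the helper name,
frame numbers and the 512 MB sizing are invented for the example):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT     12
#define PTES_PER_PAGE 1024              /* 32-bit non-PAE L1 page table */

/* Every present PTE in a received L1 page table must reference a
 * frame that the restored domain actually owns. */
static int check_l1_table(const uint32_t *l1, unsigned long nr_frames)
{
    int i;

    for (i = 0; i < PTES_PER_PAGE; i++) {
        unsigned long frame;

        if (!(l1[i] & 1))               /* PTE not present: skip */
            continue;
        frame = l1[i] >> PAGE_SHIFT;
        if (frame >= nr_frames) {
            fprintf(stderr, "Frame number in type 1 page table is out "
                    "of range (slot %d, frame %lu >= %lu)\n",
                    i, frame, nr_frames);
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    uint32_t l1[PTES_PER_PAGE] = { 0 };

    /* one PTE pointing beyond a 512 MB (131072-frame) domain */
    l1[3] = (200000UL << PAGE_SHIFT) | 1;
    return check_l1_table(l1, 131072) ? 1 : 0;
}

If a stray frame reference is present in the transferred page tables, a
smaller domain would plausibly trip this check up front instead of the
receiver wandering off past 100% reloading pages.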
> 
> The xfrd logs from the sender and receiver are attached for the 
> 512 MB and 256 MB domain configurations.
> 
> 
> Also, a few other smaller things.
> 
> a) When doing a migrate (without --live), the 'xm migrate' 
> command does not return control to the shell even after a 
> successful migration. Pressing Control-C gives the following trace:
> 
> Traceback (most recent call last):
>   File "/usr/sbin/xm", line 9, in ?
>     main.main(sys.argv)
>   File "/usr/lib/python/xen/xm/main.py", line 808, in main
>     xm.main(args)
>   File "/usr/lib/python/xen/xm/main.py", line 106, in main
>     self.main_call(args)
>   File "/usr/lib/python/xen/xm/main.py", line 124, in main_call
>     p.main(args[1:])
>   File "/usr/lib/python/xen/xm/main.py", line 309, in main
>     migrate.main(args)
>   File "/usr/lib/python/xen/xm/migrate.py", line 49, in main
>     server.xend_domain_migrate(dom, dst, opts.vals.live, opts.vals.resource)
>   File "/usr/lib/python/xen/xend/XendClient.py", line 249, in xend_domain_migrate
>     {'op'         : 'migrate',
>   File "/usr/lib/python/xen/xend/XendClient.py", line 148, in xendPost
>     return self.client.xendPost(url, data)
>   File "/usr/lib/python/xen/xend/XendProtocol.py", line 79, in xendPost
>     return self.xendRequest(url, "POST", args)
>   File "/usr/lib/python/xen/xend/XendProtocol.py", line 143, in xendRequest
>     resp = conn.getresponse()
>   File "/usr/lib/python2.3/httplib.py", line 778, in getresponse
>     response.begin()
>   File "/usr/lib/python2.3/httplib.py", line 273, in begin
>     version, status, reason = self._read_status()
>   File "/usr/lib/python2.3/httplib.py", line 231, in _read_status
>     line = self.fp.readline()
>   File "/usr/lib/python2.3/socket.py", line 323, in readline
>     data = recv(1)
> 
> 
> b) I noticed that if the sender migrates a VM (without 
> --live) and has a console attached to the domain, CPU 
> utilization hits 100% after migration until the console is 
> disconnected.
> 
> Niraj
> 
> > Best,
> > Ian
> > 
> > > The sequence of xm save, scp <saved image>, and xm restore works 
> > > fine on the machine though. Any ideas why live migration would 
> > > not work? The combined usage of dom0 and domU is around half the 
> > > physical memory present in the machine.
> > >
> > > The traceback from xend.log
> > >
> > > Traceback (most recent call last):
> > >   File "/usr/lib/python2.3/site-packages/twisted/internet/defer.py", line 308, in _startRunCallbacks
> > >     self.timeoutCall.cancel()
> > >   File "/usr/lib/python2.3/site-packages/twisted/internet/base.py", line 82, in cancel
> > >     raise error.AlreadyCalled
> > > AlreadyCalled: Tried to cancel an already-called event.
> > >
> > > xfrd.log on the sender is full of "Retry suspend domain (120)" 
> > > before it says "Unable to suspend domain. (120)" and "Domain 
> > > appears not to have suspended: 120".
> > >
> > > Niraj
> > >
> > > --
> > > http://www.cs.cmu.edu/~ntolia
> > >
> > 
> 
> 
> --
> http://www.cs.cmu.edu/~ntolia
> 

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users