
Re: [Xen-users] Problems with live migrate on xen-testing


  • To: Ian Pratt <m+Ian.Pratt@xxxxxxxxxxxx>
  • From: Niraj Tolia <ntolia@xxxxxxxxx>
  • Date: Tue, 12 Apr 2005 16:49:08 -0400
  • Cc: ian.pratt@xxxxxxxxxxxx, xen-users@xxxxxxxxxxxxxxxxxxx
  • Delivery-date: Tue, 12 Apr 2005 20:49:39 +0000
  • List-id: Xen user discussion <xen-users.lists.xensource.com>

On Apr 8, 2005 6:20 AM, Ian Pratt <m+Ian.Pratt@xxxxxxxxxxxx> wrote:
> > I am using xen-testing and trying to migrate a VM between two
> > machines. However, the migrate command fails with the error "Error:
> > errors: suspend failed, Callback timed out".
> 
> It might be worth trying Charles Coffing's patch to the iostream
> handling. I'm planning on applying it when I get a chance.
> 
Hi Ian,

The patch did not make a difference. I do have a few more data points
though. Irrespective of whether the patch is applied, migration
without the --live switch works. Further, even --live seems to work if
all the memory pages are copied in one iteration. However, if xfrd.log
shows that a second iteration has been started, live migration will
fail.
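
For anyone following along, a toy model of iterative pre-copy may help show why a second iteration starts at all, and why the total number of pages sent can exceed the size of the domain. This is only a sketch under stated assumptions, not the xfrd code: the function name, the dirty_percent parameter, and the stop threshold are all hypothetical and chosen for illustration.

```python
def simulate_precopy(total_pages, dirty_percent, max_rounds=30, stop_threshold=50):
    """Toy model of iterative pre-copy (NOT Xen's actual implementation).

    dirty_percent is an assumed figure: the share of the pages sent in one
    round that the guest dirties again while that round is being copied.
    Returns the number of pages sent in each round, with the final
    stop-and-copy round as the last entry.
    """
    rounds = []
    to_send = total_pages  # the first round sends every page
    for _ in range(max_rounds):
        rounds.append(to_send)
        # Pages dirtied during this round have to be sent again.
        to_send = min(total_pages, to_send * dirty_percent // 100)
        if to_send <= stop_threshold:
            break  # few enough dirty pages left to suspend and finish
    # Final round: the guest is suspended and the remaining pages are sent.
    rounds.append(to_send)
    return rounds


if __name__ == "__main__":
    for pct in (0, 10, 100):
        sent = simulate_precopy(1000, pct, max_rounds=5)
        print(pct, sent, "total sent: %d%%" % (sum(sent) * 100 // 1000))
```

In this model an idle guest (0% dirtying) finishes after a single full pass, a moderately busy guest needs extra passes and ends up above 100% of its pages sent, and a guest that dirties pages as fast as they are copied only stops because of the round limit. The final round depends on the domain actually suspending, which is the step that fails here.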

> > After the error on the source machine, while the VM shows up
> > on xm list at the destination, xfrd.log (on the destination)
> > shows that it is trying to reload memory pages beyond 100%. The
> > number of memory pages reloaded keeps on going up until I use
> > 'xm destroy'.
> 
> Going beyond 100% is normal behaviour, but obviously it should terminate
> eventually, after doing a number of iterations.
> 
> Posting the xfrd log (after applying the patch) would be interesting.
> 

Applying the patch did not make a difference. However, reducing the
amount of memory given to the migrated domain changes the error
message. The live migration still fails, but the receiver no longer
keeps reloading pages indefinitely. It now fails with a message like
"Frame number in type 1 page table is out of range".

The xfrd logs from the sender and receiver are attached for both the
512 MB and 256 MB domain configurations.


Also, a few smaller things.

a) On doing a migrate (without --live), the 'xm migrate' command does
not return control to the shell even after a successful migration.
Pressing Control-C gives the following trace:

Traceback (most recent call last):
  File "/usr/sbin/xm", line 9, in ?
    main.main(sys.argv)
  File "/usr/lib/python/xen/xm/main.py", line 808, in main
    xm.main(args)
  File "/usr/lib/python/xen/xm/main.py", line 106, in main
    self.main_call(args)
  File "/usr/lib/python/xen/xm/main.py", line 124, in main_call
    p.main(args[1:])
  File "/usr/lib/python/xen/xm/main.py", line 309, in main
    migrate.main(args)
  File "/usr/lib/python/xen/xm/migrate.py", line 49, in main
    server.xend_domain_migrate(dom, dst, opts.vals.live, opts.vals.resource)
  File "/usr/lib/python/xen/xend/XendClient.py", line 249, in xend_domain_migrate
    {'op'         : 'migrate',
  File "/usr/lib/python/xen/xend/XendClient.py", line 148, in xendPost
    return self.client.xendPost(url, data)
  File "/usr/lib/python/xen/xend/XendProtocol.py", line 79, in xendPost
    return self.xendRequest(url, "POST", args)
  File "/usr/lib/python/xen/xend/XendProtocol.py", line 143, in xendRequest
    resp = conn.getresponse()
  File "/usr/lib/python2.3/httplib.py", line 778, in getresponse
    response.begin()
  File "/usr/lib/python2.3/httplib.py", line 273, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.3/httplib.py", line 231, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.3/socket.py", line 323, in readline
    data = recv(1)


b) I noticed that if the sender migrates a VM (without --live) and has
a console attached to the domain, CPU utilization hits 100% after
migration until the console is disconnected.

Niraj

> Best,
> Ian
> 
> > The process of xm save/scp <saved image>/xm restore the
> > machine works fine though. Any ideas why live migration would
> > not work? The combined usage of dom0 and domU is around half
> > the physical memory present in the machine.
> >
> > The traceback from xend.log
> >
> > Traceback (most recent call last):
> >   File "/usr/lib/python2.3/site-packages/twisted/internet/defer.py", line 308, in _startRunCallbacks
> >     self.timeoutCall.cancel()
> >   File "/usr/lib/python2.3/site-packages/twisted/internet/base.py", line 82, in cancel
> >     raise error.AlreadyCalled
> > AlreadyCalled: Tried to cancel an already-called event.
> >
> > xfrd.log on the sender is full of "Retry suspend domain
> > (120)" before it says "Unable to suspend domain. (120)" and
> > "Domain appears not to have suspended: 120".
> >
> > Niraj
> >
> > --
> > http://www.cs.cmu.edu/~ntolia
> >
> > _______________________________________________
> > Xen-users mailing list
> > Xen-users@xxxxxxxxxxxxxxxxxxx
> > http://lists.xensource.com/xen-users
> >
> 


-- 
http://www.cs.cmu.edu/~ntolia

Attachment: xfrd.receiver.256MB.log
Description: Binary data

Attachment: xfrd.receiver.512MB.log
Description: Binary data

Attachment: xfrd.sender.256MB.log
Description: Binary data

Attachment: xfrd.sender.512MB.log
Description: Binary data

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
