RE: [Xen-users] Problems with live migrate on xen-testing
OK, I've managed to reproduce this: under 2.0.5 I can make a domain
crash in the same fashion if it is under extreme network receive load
during the migration. This used to work fine, so we've obviously
introduced a bug in the last few months. I'll investigate when I get a
chance.

We seem to get stuck in a page-fault loop writing to an skb's shared
info area, passing through the vmalloc fault section of do_page_fault.
It looks like the PTE is read-only, which is very odd. We just need to
figure out how it got that way.

This smells like the first real Xen-internal bug in the stable series
for several months...

Ian

> The patch did not make a difference. I do have a few more data
> points though. Irrespective of whether the patch is applied,
> migration without the --live switch works. Further, even --live
> seems to work if all the memory pages are copied in one iteration.
> However, if xfrd.log shows that a second iteration has been started,
> live migration will fail.
>
> > > After the error on the source machine, while the VM shows up on
> > > xm list at the destination, xfrd.log (on the destination) shows
> > > that it's trying to reload memory pages beyond 100%. The number
> > > of memory pages reloaded keeps on going up until I use
> > > 'xm destroy'.
> >
> > Going beyond 100% is normal behaviour, but obviously it should
> > terminate eventually, after doing a number of iterations.
> >
> > Posting the xfrd log (after applying the patch) would be
> > interesting.
>
> Applying the patch did not make a difference. However, reducing the
> amount of memory provided to the migrated domain changes the error
> message. While the live migration still fails, it does not keep
> trying to load pages. It now fails with a message like "Frame number
> in type 1 page table is out of range".
>
> The xfrd logs from the sender and receiver are attached for 512 MB
> and 256 MB domain configurations.
>
> Also, a few other smaller things.
>
> a) On doing a migrate (without --live), the 'xm migrate' command
> does not return control to the shell even after a successful
> migration. A Control-C gives the following trace:
>
> Traceback (most recent call last):
>   File "/usr/sbin/xm", line 9, in ?
>     main.main(sys.argv)
>   File "/usr/lib/python/xen/xm/main.py", line 808, in main
>     xm.main(args)
>   File "/usr/lib/python/xen/xm/main.py", line 106, in main
>     self.main_call(args)
>   File "/usr/lib/python/xen/xm/main.py", line 124, in main_call
>     p.main(args[1:])
>   File "/usr/lib/python/xen/xm/main.py", line 309, in main
>     migrate.main(args)
>   File "/usr/lib/python/xen/xm/migrate.py", line 49, in main
>     server.xend_domain_migrate(dom, dst, opts.vals.live,
>       opts.vals.resource)
>   File "/usr/lib/python/xen/xend/XendClient.py", line 249, in xend_domain_migrate
>     {'op' : 'migrate',
>   File "/usr/lib/python/xen/xend/XendClient.py", line 148, in xendPost
>     return self.client.xendPost(url, data)
>   File "/usr/lib/python/xen/xend/XendProtocol.py", line 79, in xendPost
>     return self.xendRequest(url, "POST", args)
>   File "/usr/lib/python/xen/xend/XendProtocol.py", line 143, in xendRequest
>     resp = conn.getresponse()
>   File "/usr/lib/python2.3/httplib.py", line 778, in getresponse
>     response.begin()
>   File "/usr/lib/python2.3/httplib.py", line 273, in begin
>     version, status, reason = self._read_status()
>   File "/usr/lib/python2.3/httplib.py", line 231, in _read_status
>     line = self.fp.readline()
>   File "/usr/lib/python2.3/socket.py", line 323, in readline
>     data = recv(1)
>
> b) I noticed that if the sender migrates a VM (without --live) and
> has a console attached to the domain, CPU utilization hits 100%
> after migration until the console is disconnected.
>
> Niraj
>
> > Best,
> > Ian
> >
> > > The process of xm save / scp <saved image> / xm restore works
> > > fine though. Any ideas why live migration would not work?
> > > The combined usage of dom0 and domU is around half the physical
> > > memory present in the machine.
> > >
> > > The traceback from xend.log:
> > >
> > > Traceback (most recent call last):
> > >   File "/usr/lib/python2.3/site-packages/twisted/internet/defer.py", line 308, in _startRunCallbacks
> > >     self.timeoutCall.cancel()
> > >   File "/usr/lib/python2.3/site-packages/twisted/internet/base.py", line 82, in cancel
> > >     raise error.AlreadyCalled
> > > AlreadyCalled: Tried to cancel an already-called event.
> > >
> > > xfrd.log on the sender is full of "Retry suspend domain (120)"
> > > before it says "Unable to suspend domain. (120)" and "Domain
> > > appears not to have suspended: 120".
> > >
> > > Niraj
> > >
> > > --
> > > http://www.cs.cmu.edu/~ntolia
> > >
> > > _______________________________________________
> > > Xen-users mailing list
> > > Xen-users@xxxxxxxxxxxxxxxxxxx
> > > http://lists.xensource.com/xen-users
>
> --
> http://www.cs.cmu.edu/~ntolia

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
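[Editorial note for readers unfamiliar with pre-copy migration: the behaviour discussed in this thread, where xfrd reports more than 100% of pages transferred across several iterations, follows from the pre-copy algorithm. The sketch below is a hypothetical model, not the actual xfrd/Xen code; the function name, round limit, and dirty-page threshold are invented for illustration. Pages dirtied while a round is in flight must be resent, so the cumulative count can exceed the domain's total page count.]

```python
# Minimal sketch of iterative pre-copy live migration (hypothetical,
# not the real xfrd implementation). Each round resends the pages
# dirtied during the previous round; when the dirty set is small
# enough, the domain is suspended and the remainder is copied.

def live_migrate(pages, get_dirty, send, max_rounds=29, threshold=50):
    """Return the cumulative number of pages sent (may exceed len(pages))."""
    total_sent = 0
    to_send = set(pages)            # round 0: every page in the domain
    for _ in range(max_rounds):
        for p in to_send:
            send(p)
        total_sent += len(to_send)
        to_send = get_dirty()       # pages re-dirtied during this round
        if len(to_send) <= threshold:
            break                   # small enough: stop-and-copy
    # final round: domain is suspended, send whatever is still dirty
    for p in to_send:
        send(p)
    total_sent += len(to_send)
    return total_sent
```

A guest that dirties pages faster than they can be copied never shrinks `to_send` below the threshold, so later iterations keep resending pages until the round limit is hit, which is consistent with the second-iteration failures reported above.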