Re: [Xen-devel] slow live magration / xc_restore on xen4 pvops
On Thursday, 03 June 2010 at 11:01, Ian Jackson wrote:
> Brendan Cully writes ("Re: [Xen-devel] slow live magration / xc_restore on
> xen4 pvops"):
> > 2. in normal migration, the sender should close the fd after sending
> > all data, immediately triggering an IO error on the receiver and
> > completing the restore.
>
> This is not true.  In normal migration, the fd is used by the
> machinery which surrounds xc_domain_restore (in xc_save and also in xl
> or xend).  In any case it would be quite wrong for a library function
> like xc_domain_restore to eat the fd.

The sender closes the fd, as it always has. xc_domain_restore has
always consumed the entire contents of the fd, because the qemu tail
has no length header under normal migration. There's no behavioral
difference here that I can see.

> It's not necessary for xc_domain_restore to behave this way in all
> cases; all that's needed is parameters to tell it how to behave.

I have no objection to a more explicit interface. The current form is
simply Remus trying to be as invisible as possible to the rest of the
tool stack.

> > I did try to avoid disturbing regular live migration as much as
> > possible when I wrote the code. I suspect some other regression has
> > crept in, and I'll investigate.
>
> The short timeout is another regression.  A normal live migration or
> restore should not fall over just because no data is available for
> 100ms.

(the timeout is 1s, by the way)

For some reason you clipped the bit of my previous message where I say
this doesn't happen:

1. reads are only supposed to be able to time out after the entire
first checkpoint has been received (IOW this wouldn't kick in until
normal migration had already completed)

Let's take a look at read_exact_timed in xc_domain_restore:

    if ( completed ) {
        /* expect a heartbeat every HEARBEAT_MS ms maximum */
        tv.tv_sec = HEARTBEAT_MS / 1000;
        tv.tv_usec = (HEARTBEAT_MS % 1000) * 1000;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        len = select(fd + 1, &rfds, NULL, NULL, &tv);
        if ( !FD_ISSET(fd, &rfds) ) {
            fprintf(stderr, "read_exact_timed failed (select returned %zd)\n",
                    len);
            return -1;
        }
    }

'completed' is not set until the first entire checkpoint (i.e., the
entirety of a non-Remus migration) has completed. So, no issue.

I see no evidence that Remus has anything to do with the live
migration performance regression discussed in this thread, and I
haven't seen any other reported issues either. I think the mlock
issue is a much more likely candidate.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
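[Editor's note: for readers without the libxc source at hand, below is a
minimal, self-contained sketch of the pattern the quoted snippet relies on.
It is not the actual xc_domain_restore code: the HEARTBEAT_MS value (assumed
1s, per the "timeout is 1s" remark), the read_exact_timed name reused here,
and the surrounding loop are illustrative assumptions. The point it shows is
that the select() timeout is only armed once 'completed' is set, so a plain
non-Remus restore blocks in read() and can never hit the timeout path.]

    /*
     * Sketch of a select-gated read wrapper (assumed names/values, not
     * the libxc implementation).
     */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    #define HEARTBEAT_MS 1000   /* assumption: 1s, matching the thread */

    static int completed;       /* set after the first full checkpoint */

    static int read_exact_timed(int fd, void *buf, size_t size)
    {
        size_t offset = 0;

        while ( offset < size )
        {
            ssize_t len;

            if ( completed )
            {
                /* Only after the first checkpoint: wait at most
                 * HEARTBEAT_MS for more data before giving up. */
                struct timeval tv = {
                    .tv_sec  = HEARTBEAT_MS / 1000,
                    .tv_usec = (HEARTBEAT_MS % 1000) * 1000,
                };
                fd_set rfds;

                FD_ZERO(&rfds);
                FD_SET(fd, &rfds);
                if ( select(fd + 1, &rfds, NULL, NULL, &tv) <= 0 ||
                     !FD_ISSET(fd, &rfds) )
                {
                    fprintf(stderr, "heartbeat timeout after %d ms\n",
                            HEARTBEAT_MS);
                    return -1;
                }
            }

            /* Before 'completed' is set this read simply blocks, so a
             * normal (non-Remus) migration never reaches the timeout. */
            len = read(fd, (char *)buf + offset, size - offset);
            if ( len == -1 && errno == EINTR )
                continue;
            if ( len <= 0 )
                return -1;
            offset += len;
        }

        return 0;
    }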