Re: [Xen-devel] [PATCH 00 of 10] Teach xm save to checkpoint a running domain
I think maybe I forgot to mention that I have successfully checkpointed
domains and restored them from checkpoints (with file-system activity
between checkpoints). It seems to work pretty well. I'll try to put
together a demo of this next week.

Regarding full device disconnection, my understanding is that guest
domains are already prepared to deal with back-end driver crashes (by
maintaining shadows of the ring etc.), so a forced reconnect on resume
should be able to recover even if there wasn't an orderly shutdown
before the suspend. I thought when I looked over the code that the
reconnect path did a paranoid forced disconnect first anyway (e.g.
checking for existing event channels and resetting them). On the other
hand, if checkpoints are taken more frequently than they are restored,
it seems odd to be constantly detaching and reattaching back-ends in
the parent.

But if this is unsafe, it should be fairly easy to make the code do a
full disconnect before suspend. It might be as easy as changing xm save
to write 'suspend' to control/shutdown instead of 'checkpoint'.

On Friday, 15 December 2006 at 08:07, Steven Hand wrote:
>
> >I'm not too sure about the last couple of patches in this
> >series. Because the checkpointing domain doesn't disconnect before
> >calling suspend, it retains a few references to pages it doesn't
> >own. These trigger a PT race detector in xc_linux_save, which causes
> >it to abort. So the last couple of patches explicitly identify the
> >references I've found so far (shared_info and some grant table shared
> >pages) and simply zero those PTEs during save, since they'll be
> >recreated on restore. Finding the grant table pages is a bit fragile -
> >I walk the page table loaded in CR3 at the time of suspend looking for
> >the virtual address I've stowed in the suspend record. I've only got
> >code for two-level page tables at the moment, since I'm not convinced
> >this is the right approach. Under what circumstances would a non-live
> >save have an unsafe PTE race?
>
> Pretty much any PT race in a non-live save/migrate is a bug; the
> domain is (in theory) suspended at this point, and all of the
> devices are disconnected. Since you've chosen not to 'disconnect'
> the devices, you'll get random updates occurring to any shared
> pages (shared via grants or directly shared with Xen).
>
> > Maybe it's fine to simply zero these ptes without checking them.
>
> I'd think not.

To clarify, the pages that have caused races in my experiments are
always the same 5: shared_info and four grant table shared pages. The
reason these don't cause races in plain save is simply that they are
unmapped before suspend is called. Since I've adjusted the kernel to
recreate these specific pages on restore (but not in the parent when
checkpoint returns), my patches do just zero out the PTEs (simulating
in the save code what had previously been done in the guest).

Finding the guest grant table pages is a little annoying though. I
ended up having the guest put the virtual address of its mapping into
an unused field in the suspend record, then walking the page table to
find the MFN. I was thinking it might be better either to get Xen to
export a list of pages that the guest has references to, or to assume
that any unowned MFNs in the page tables are pages that will be
recreated on restore anyway and just zero them out.

In short, I wonder how often that PT race code has stopped a non-live
save. If the answer is 'never', then zeroing out the PTEs might be
fine, especially since the original domain is still intact after the
checkpoint.

Thanks again for looking this over.
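To make the two options above concrete, here is a minimal Python sketch
of the idea. It is not the patch code. It models guest memory as a dict
from MFN to a 1024-entry table, assumes classic two-level (non-PAE) x86
paging with 4 KiB pages only, and ignores large pages. All names
(lookup_mfn, zero_unowned_ptes, demo_frames, the frame numbers) are
hypothetical and chosen for illustration.

```python
PAGE_SHIFT = 12      # 4 KiB pages
L2_SHIFT = 22        # top 10 bits of a 32-bit address index the directory
INDEX_MASK = 0x3FF   # 10-bit index at each level
PRESENT = 0x1        # bit 0 of an entry: mapping is present


def entry_mfn(entry):
    """Frame number stored in a directory or table entry."""
    return entry >> PAGE_SHIFT


def lookup_mfn(frames, cr3_mfn, vaddr):
    """Walk the two-level table rooted at cr3_mfn.

    Returns the MFN mapped at vaddr, or None if nothing is mapped.
    """
    pde = frames[cr3_mfn][(vaddr >> L2_SHIFT) & INDEX_MASK]
    if not pde & PRESENT:
        return None
    pte = frames[entry_mfn(pde)][(vaddr >> PAGE_SHIFT) & INDEX_MASK]
    if not pte & PRESENT:
        return None
    return entry_mfn(pte)


def zero_unowned_ptes(frames, cr3_mfn, owned_mfns):
    """Clear every present leaf entry whose MFN is not in owned_mfns.

    This is the "assume unowned pages get recreated on restore" option.
    Returns a list of (vaddr, mfn) pairs for the entries it cleared.
    """
    cleared = []
    for i, pde in enumerate(frames[cr3_mfn]):
        if not pde & PRESENT:
            continue
        table = frames[entry_mfn(pde)]
        for j, pte in enumerate(table):
            if pte & PRESENT and entry_mfn(pte) not in owned_mfns:
                cleared.append(((i << L2_SHIFT) | (j << PAGE_SHIFT),
                                entry_mfn(pte)))
                table[j] = 0
    return cleared


def demo_frames():
    """Toy address space: directory at MFN 100, one leaf table at MFN 101.

    0xC0001000 maps an owned page (MFN 500); 0xC0002000 maps a page the
    domain does not own (MFN 9000), standing in for shared_info.
    """
    frames = {100: [0] * 1024, 101: [0] * 1024}
    frames[100][0xC0001000 >> L2_SHIFT] = (101 << PAGE_SHIFT) | PRESENT
    frames[101][1] = (500 << PAGE_SHIFT) | PRESENT
    frames[101][2] = (9000 << PAGE_SHIFT) | PRESENT
    return frames
```

With the toy address space, lookup_mfn(frames, 100, 0xC0002000) finds
the foreign frame from the stowed virtual address, while
zero_unowned_ptes(frames, 100, {100, 101, 500}) needs no stowed address
at all, at the cost of trusting that every unowned mapping really is
recreated on restore.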