Re: [Xen-devel] [PATCH v5][XSA-97] x86/paging: make log-dirty operations preemptible
On 12/09/2014 13:18, Jan Beulich wrote:
> On 11.09.14 at 20:27, <andrew.cooper3@xxxxxxxxxx> wrote:
>> On inspection of the libxc logs, I am feeling quite glad I left this
>> debugging message in:
>>
>> xenguest-75-save[11876]: xc: detail: Bitmap contained more entries than expected...
>> xenguest-83-save[32123]: xc: detail: Bitmap contained more entries than expected...
>> xenguest-84-save[471]: xc: detail: Bitmap contained more entries than expected...
>> xenguest-88-save[3823]: xc: detail: Bitmap contained more entries than expected...
>> xenguest-89-save[4656]: xc: detail: Bitmap contained more entries than expected...
>> xenguest-95-save[9379]: xc: detail: Bitmap contained more entries than expected...
>> xenguest-98-save[11784]: xc: detail: Bitmap contained more entries than expected...
>>
>> This means that periodically, a XEN_DOMCTL_SHADOW_OP_{CLEAN,PEEK}
>> hypercall gives us back a bitmap with more set bits than
>> stats.dirty_count, which it hands back at the same time.
>
> Which has quite a simple explanation, at least when (as I assume is the
> case) XEN_DOMCTL_SHADOW_OP_CLEAN is being used: we clear
> d->arch.paging.log_dirty.{dirty,fault}_count on each step of the
> continuation instead of just on the final one.  That code needs moving
> down.

Agreed.
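To make the ordering concrete, here is a minimal standalone sketch of the
restructuring being discussed, i.e. only resetting the counter once the
final step of the continuation has run.  The names, the byte-per-page
bitmap and the chunking are illustrative stand-ins, not the actual
paging_log_dirty_op() code:

/*
 * Toy model, not Xen source: a chunked CLEAN that leaves the counter
 * alone on preempted steps and only resets it on the final one.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGES  64
#define CHUNK  16                    /* pages handled per continuation step */

struct log_dirty {
    unsigned int  dirty_count;       /* what stats.dirty_count is copied from */
    unsigned char bitmap[PAGES];     /* one byte per page, for simplicity */
};

/* One step of a CLEAN: snapshot and clear a chunk of the bitmap. */
static bool clean_step(struct log_dirty *ld, unsigned int *pos,
                       unsigned char *out)
{
    unsigned int i, end = *pos + CHUNK;

    for ( i = *pos; i < end && i < PAGES; i++ )
    {
        out[i] = ld->bitmap[i];
        ld->bitmap[i] = 0;
    }
    *pos = i;

    if ( i < PAGES )
        return false;  /* preempted: dirty_count must stay intact */

    /*
     * Final step: only here is it valid to reset the counter that is
     * handed back alongside the bitmap.  Clearing it on every step
     * under-reports pages dirtied while the operation was in progress,
     * which matches the "more entries than expected" symptom above.
     */
    ld->dirty_count = 0;
    return true;
}

int main(void)
{
    struct log_dirty ld = { .dirty_count = 3,
                            .bitmap = { [5] = 1, [20] = 1, [43] = 1 } };
    unsigned char snapshot[PAGES] = { 0 };
    unsigned int i, pos = 0, reported = 0;

    /* A real caller would return -ERESTART and resume via a continuation. */
    while ( !clean_step(&ld, &pos, snapshot) )
        ;

    for ( i = 0; i < PAGES; i++ )
        reported += snapshot[i];
    printf("snapshot contains %u dirty pages\n", reported);
    return 0;
}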
> This, however, points at a broader problem: the necessary dropping of
> the paging lock between continuation steps implies that subsequent
> setting of bits in the map would need to distinguish between ranges
> where the bitmap already got copied and those still outstanding when
> it comes to updating dirty_count.  I.e. no matter whether we leave the
> current copying/clearing where it is or move it down, the counts
> wouldn't be precise.  Having a "logdirty dying" or "logdirty shutdown"
> state could compensate for this, by no longer having paging_log_dirty()
> true while the continuations are in progress.
>
> Now, both from your description and from looking at current code, I
> conclude that this is with your new migration code, since current code
> uses both counts only for printing messages.  Which in turn raises the
> question whether use of these counts, getting exported just as
> statistics, for anything other than statistics is actually appropriate.
> OTOH, if your new migration code doesn't use the counts for
> non-statistical, non-logging purposes, I still can't see why things go
> wrong for you but not for me or osstest.

It is indeed migration v2, which is necessary in XenServer given our
recent switch from 32bit dom0 to 64bit.  The counts are only used for
logging and debugging purposes; all movement of pages is based on the
bits in the bitmap alone.

In particular, the dirty count is used as the basis of the statistics
for the present iteration of migration.  While getting it wrong is not
the end of the world, it would certainly be preferable for the count to
be accurate.
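For reference, the message in the logs comes from a sanity check of this
general shape: count the bits set in the bitmap returned by
XEN_DOMCTL_SHADOW_OP_{CLEAN,PEEK} and compare against the dirty count
handed back in the stats.  This is a hedged sketch only -- the helper
names and exact message formatting are illustrative, not the migration
v2 source:

#include <stdio.h>

#define BITS_PER_WORD  (8 * sizeof(unsigned long))

/* Number of set bits in an nr_bits-wide bitmap. */
static unsigned long bitmap_weight(const unsigned long *bitmap,
                                   unsigned long nr_bits)
{
    unsigned long i, count = 0;
    unsigned long whole = nr_bits / BITS_PER_WORD, rem = nr_bits % BITS_PER_WORD;

    for ( i = 0; i < whole; i++ )
        count += __builtin_popcountl(bitmap[i]);

    if ( rem )
        count += __builtin_popcountl(bitmap[whole] & ((1UL << rem) - 1));

    return count;
}

/* Warn if the bitmap disagrees with the dirty count reported alongside it. */
static void check_dirty_stats(const unsigned long *bitmap,
                              unsigned long nr_bits,
                              unsigned int stats_dirty_count)
{
    unsigned long entries = bitmap_weight(bitmap, nr_bits);

    if ( entries > stats_dirty_count )
        fprintf(stderr, "xc: detail: Bitmap contained more entries than "
                "expected (%lu > %u)\n", entries, stats_dirty_count);
}

int main(void)
{
    /* Example: three bits set in a 128-bit map, but stats claiming one. */
    unsigned long map[2] = { 0x9UL, 0x1UL };

    check_dirty_stats(map, 128, 1);
    return 0;
}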
As for the memory corruption: XenRT usually tests pairs of VMs at a time
(32 and 64bit variants) with all operations as back-to-back as possible.
Therefore, it is highly likely that a continued operation on one domain
intersects with other paging operations on another.

The results (now they have run fully) are 10 tests each: 10 passes
without this patch, and 10 failures in similar ways with the patch,
spread across a randomly selected set of hardware.  We also see failures
with HVM VMs, although there are no errors at all from Xen or the
toolstack components.  Symptoms range from BSODs to simply wedging with
no apparent cause (I have not had a live repro to investigate).

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel