
Re: [Xen-devel] [PATCH v5][XSA-97] x86/paging: make log-dirty operations preemptible



On 12/09/2014 13:18, Jan Beulich wrote:
On 11.09.14 at 20:27, <andrew.cooper3@xxxxxxxxxx> wrote:
On inspection of the libxc logs, I am feeling quite glad I left this
debugging message in:

xenguest-75-save[11876]: xc: detail: Bitmap contained more entries than
expected...
xenguest-83-save[32123]: xc: detail: Bitmap contained more entries than
expected...
xenguest-84-save[471]: xc: detail: Bitmap contained more entries than
expected...
xenguest-88-save[3823]: xc: detail: Bitmap contained more entries than
expected...
xenguest-89-save[4656]: xc: detail: Bitmap contained more entries than
expected...
xenguest-95-save[9379]: xc: detail: Bitmap contained more entries than
expected...
xenguest-98-save[11784]: xc: detail: Bitmap contained more entries than
expected...

This means that periodically, a XEN_DOMCTL_SHADOW_OP_{CLEAN,PEEK}
hypercall gives us back a bitmap with more set bits than the
stats.dirty_count it hands back at the same time.
Which has quite a simple explanation, at least when (as I assume is
the case) XEN_DOMCTL_SHADOW_OP_CLEAN is being used: We clear
d->arch.paging.log_dirty.{dirty,fault}_count on each step of the
continuation instead of just on the final one. That code needs moving
down.

Agreed.
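
To make the intended fix concrete, here is a toy model of the preemptible CLEAN path (illustrative only, with invented names -- not the actual paging_log_dirty_op() code): the counters may only be zeroed once the final chunk of the bitmap has been copied out.

    /* Toy model of the preemptible CLEAN continuation -- illustrative only. */
    #include <stdint.h>
    #include <string.h>

    #define CHUNK_WORDS 512          /* arbitrary preemption granularity */

    struct log_dirty {
        uint32_t dirty_count;        /* statistics exported to the tools */
        uint32_t fault_count;
        unsigned long *bitmap;       /* the dirty bitmap itself */
        unsigned long nr_words;
        unsigned long done_words;    /* continuation progress */
    };

    /* Returns 0 when the operation has completed, 1 when it needs to be
     * restarted (i.e. a continuation, with the paging lock dropped in
     * between). */
    static int log_dirty_clean_step(struct log_dirty *ld, unsigned long *dest)
    {
        unsigned long todo = ld->nr_words - ld->done_words;
        unsigned long chunk = todo < CHUNK_WORDS ? todo : CHUNK_WORDS;

        /* Copy out and clear one chunk of the bitmap. */
        memcpy(dest + ld->done_words, ld->bitmap + ld->done_words,
               chunk * sizeof(unsigned long));
        memset(ld->bitmap + ld->done_words, 0, chunk * sizeof(unsigned long));
        ld->done_words += chunk;

        if ( ld->done_words < ld->nr_words )
            return 1;                /* preempt: more chunks still to copy */

        /*
         * Only here, on the final step, has the whole bitmap been handed
         * back, so only here may the counters be reset.  Zeroing them on
         * every step is what lets the caller see more set bits than
         * dirty_count claims.
         */
        ld->dirty_count = 0;
        ld->fault_count = 0;
        ld->done_words = 0;
        return 0;
    }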


This, however, points at a broader problem: The necessary dropping
of the paging lock between continuation steps implies that subsequent
setting of bits in the map would need to distinguish between ranges
where the bitmap already got copied and those still outstanding when
it comes to updating dirty_count. I.e. no matter whether we leave
the current copying/clearing where it is or move it down, the counts
wouldn't be precise.

Having a "logdirty dying" or "logdirty shutdown" state could compensate for this, by no longer having paging_log_dirty() true while the continuations are in progress.
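
Roughly along these lines (a sketch of the idea only; the state and field names are invented, not existing code):

    /* Sketch only -- invented names, not the real struct paging_domain. */
    enum log_dirty_state {
        LOGDIRTY_OFF,                /* log-dirty mode not enabled */
        LOGDIRTY_ACTIVE,             /* enabled and fully operational */
        LOGDIRTY_SHUTDOWN,           /* teardown continuation in flight */
    };

    struct paging_sketch {
        enum log_dirty_state log_dirty_state;
    };

    /*
     * Faults would only mark pages dirty (and bump dirty_count) while the
     * mode is fully active; once the continuation has started and the
     * paging lock can be dropped between steps, the domain would no
     * longer be treated as log-dirty at all.
     */
    static int paging_log_dirty_sketch(const struct paging_sketch *p)
    {
        return p->log_dirty_state == LOGDIRTY_ACTIVE;
    }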


Now both from your description and from looking at current code
I conclude that this is with your new migration code, since current
code uses both counts only for printing messages. Which in turn
raises the question whether use of these counts, getting exported
just as statistics, for anything other than statistics is actually
appropriate. Otoh, if your new migration code doesn't use the
counts for non-statistical, non-logging purposes I still can't see why
things go wrong for you but not for me or osstest.

It is indeed migration v2, which is necessary in XenServer given our recent switch from a 32bit dom0 to 64bit. The counts are only used for logging and debugging purposes; all movement of pages is based off the bits in the bitmap alone. In particular, the dirty count is used as the basis of the statistics for the present iteration of migration. While getting it wrong is not the end of the world, it would certainly be preferable for the count to be accurate.
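
(For reference, the logging-only use boils down to something like the following -- placeholder names, not the real migration v2 code: the bits alone decide which pages get sent, and dirty_count only feeds the log lines.)

    /* Placeholder sketch, not the real libxc/migration v2 code. */
    #include <stdint.h>
    #include <stdio.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    static int test_bit(const unsigned long *bitmap, unsigned long bit)
    {
        return (bitmap[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1;
    }

    /* Walk the bitmap: page movement is driven by the set bits; the
     * dirty_count from the hypercall stats is used only for logging. */
    static void send_dirty_pages(const unsigned long *bitmap,
                                 unsigned long nr_pages, uint32_t dirty_count)
    {
        unsigned long sent = 0, pfn;

        for ( pfn = 0; pfn < nr_pages; ++pfn )
            if ( test_bit(bitmap, pfn) )
                ++sent;          /* a real implementation would send the page */

        if ( sent > dirty_count )
            fprintf(stderr, "xc: detail: Bitmap contained more entries than "
                    "expected...\n");
        fprintf(stderr, "xc: info: sent %lu pages (stats reported %u dirty)\n",
                sent, dirty_count);
    }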

As for the memory corruption, XenRT usually tests pairs of VMs at a time (32 and 64bit variants) and all operations as back-to-back as possible. Therefore, it is highly likely that a continued operation on one domain intersects with other paging operations on another.

The results (now that they have run fully) are 10 tests each: 10 passes without this patch, and 10 failures in similar ways with the patch, spread across a randomly selected set of hardware.

We also see failures with HVM VMs, although there are no errors at all from Xen or the toolstack components. Symptoms range from BSODs to simply wedging with no apparent cause (I have not had a live repro to investigate).

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

