
Re: [Xen-devel] [PATCH v5][XSA-97] x86/paging: make log-dirty operations preemptible



On 12/09/2014 13:18, Jan Beulich wrote:
On 11.09.14 at 20:27, <andrew.cooper3@xxxxxxxxxx> wrote:
On inspection of the libxc logs, I am feeling quite glad I left this
debugging message in:

xenguest-75-save[11876]: xc: detail: Bitmap contained more entries than
expected...
xenguest-83-save[32123]: xc: detail: Bitmap contained more entries than
expected...
xenguest-84-save[471]: xc: detail: Bitmap contained more entries than
expected...
xenguest-88-save[3823]: xc: detail: Bitmap contained more entries than
expected...
xenguest-89-save[4656]: xc: detail: Bitmap contained more entries than
expected...
xenguest-95-save[9379]: xc: detail: Bitmap contained more entries than
expected...
xenguest-98-save[11784]: xc: detail: Bitmap contained more entries than
expected...

This means that periodically, a XEN_DOMCTL_SHADOW_OP_{CLEAN,PEEK}
hypercall gives us back a bitmap with more set bits than the
stats.dirty_count it hands back at the same time.
Which has quite a simple explanation, at least when (as I assume is
the case) XEN_DOMCTL_SHADOW_OP_CLEAN is being used: We clear
d->arch.paging.log_dirty.{dirty,fault}_count on each step of the
continuation instead of just on the final one. That code needs moving
down.

Agreed.
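
To make the intended fix concrete, here is a toy model of the preemptible CLEAN path (illustrative only, with invented names -- not the actual paging_log_dirty_op() code): the counters may only be zeroed once the final chunk of the bitmap has been copied out.

    /* Toy model of the preemptible CLEAN continuation -- illustrative only. */
    #include <stdint.h>
    #include <string.h>

    #define CHUNK_WORDS 512          /* arbitrary preemption granularity */

    struct log_dirty {
        uint32_t dirty_count;        /* statistics exported to the tools */
        uint32_t fault_count;
        unsigned long *bitmap;       /* the dirty bitmap itself */
        unsigned long nr_words;
        unsigned long done_words;    /* continuation progress */
    };

    /* Returns 0 when the operation has completed, 1 when it needs to be
     * restarted (i.e. a continuation, with the paging lock dropped in
     * between). */
    static int log_dirty_clean_step(struct log_dirty *ld, unsigned long *dest)
    {
        unsigned long todo = ld->nr_words - ld->done_words;
        unsigned long chunk = todo < CHUNK_WORDS ? todo : CHUNK_WORDS;

        /* Copy out and clear one chunk of the bitmap. */
        memcpy(dest + ld->done_words, ld->bitmap + ld->done_words,
               chunk * sizeof(unsigned long));
        memset(ld->bitmap + ld->done_words, 0, chunk * sizeof(unsigned long));
        ld->done_words += chunk;

        if ( ld->done_words < ld->nr_words )
            return 1;                /* preempt: more chunks still to copy */

        /*
         * Only here, on the final step, has the whole bitmap been handed
         * back, so only here may the counters be reset.  Zeroing them on
         * every step is what lets the caller see more set bits than
         * dirty_count claims.
         */
        ld->dirty_count = 0;
        ld->fault_count = 0;
        ld->done_words = 0;
        return 0;
    }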


This, however, points at a broader problem: The necessary dropping
of the paging lock between continuation steps implies that subsequent
setting of bits in the map would need to distinguish between ranges
where the bitmap already got copied and those still outstanding when
it comes to updating dirty_count. I.e. no matter whether we leave
the current copying/clearing where it is or move it down, the counts
wouldn't be precise.

Having a "logdirty dying" or "logdirty shutdown" state could compensate for this, by no longer having paging_log_dirty() true while the continuations are in progress.
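
Roughly along these lines (a sketch of the idea only; the state and field names are invented, not existing code):

    /* Sketch only -- invented names, not the real struct paging_domain. */
    enum log_dirty_state {
        LOGDIRTY_OFF,                /* log-dirty mode not enabled */
        LOGDIRTY_ACTIVE,             /* enabled and fully operational */
        LOGDIRTY_SHUTDOWN,           /* teardown continuation in flight */
    };

    struct paging_sketch {
        enum log_dirty_state log_dirty_state;
    };

    /*
     * Faults would only mark pages dirty (and bump dirty_count) while the
     * mode is fully active; once the continuation has started and the
     * paging lock can be dropped between steps, the domain would no
     * longer be treated as log-dirty at all.
     */
    static int paging_log_dirty_sketch(const struct paging_sketch *p)
    {
        return p->log_dirty_state == LOGDIRTY_ACTIVE;
    }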


Now both from your description and from looking at current code
I conclude that this is with your new migration code, since current
code uses both counts only for printing messages. Which in turn
raises the question whether use of these counts, getting exported
just as statistics, for anything other than statistics is actually
appropriate. Otoh, if your new migration code doesn't use the
counts for non-statistical, non-logging purposes I still can't see why
things go wrong for you but not for me or osstest.

It is indeed migration v2, which is necessary in XenServer given our recent switch from a 32bit dom0 to 64bit. The counts are only used for logging and debugging purposes; all movement of pages is based off the bits in the bitmap alone. In particular, the dirty count is used as the basis of the statistics for the present iteration of migration. While getting it wrong is not the end of the world, it would certainly be preferable for the count to be accurate.
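
(For reference, the logging-only use boils down to something like the following -- placeholder names, not the real migration v2 code: the bits alone decide which pages get sent, and dirty_count only feeds the log lines.)

    /* Placeholder sketch, not the real libxc/migration v2 code. */
    #include <stdint.h>
    #include <stdio.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    static int test_bit(const unsigned long *bitmap, unsigned long bit)
    {
        return (bitmap[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1;
    }

    /* Walk the bitmap: page movement is driven by the set bits; the
     * dirty_count from the hypercall stats is used only for logging. */
    static void send_dirty_pages(const unsigned long *bitmap,
                                 unsigned long nr_pages, uint32_t dirty_count)
    {
        unsigned long sent = 0, pfn;

        for ( pfn = 0; pfn < nr_pages; ++pfn )
            if ( test_bit(bitmap, pfn) )
                ++sent;          /* a real implementation would send the page */

        if ( sent > dirty_count )
            fprintf(stderr, "xc: detail: Bitmap contained more entries than "
                    "expected...\n");
        fprintf(stderr, "xc: info: sent %lu pages (stats reported %u dirty)\n",
                sent, dirty_count);
    }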

As for the memory corruption, XenRT usually tests pairs of VMs at a time (32 and 64bit variants) and all operations as back-to-back as possible. Therefore, it is highly likely that a continued operation on one domain intersects with other paging operations on another.

The results (now that they have run fully) are 10 tests each: 10 passes without this patch, and 10 failures in similar ways with the patch, spread across a randomly selected set of hardware.

We also see failures with HVM VMs, although there are no errors at all from Xen or the toolstack components. Symptoms range from BSODs to simply wedging with no apparent cause (I have not had a live repro to investigate).

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

