Xen project Mailing List

Re: [Xen-devel] [PATCH v2] x86/mm: Suppresses vm_events caused by page-walks

To: "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx>

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Wed, 19 Sep 2018 08:31:15 -0600

Cc: Tamas K Lengyel <tamas@xxxxxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Razvan Cojocaru <rcojocaru@xxxxxxxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>, xen-devel@xxxxxxxxxxxxx, aisaila@xxxxxxxxxxxxxxx

Delivery-date: Wed, 19 Sep 2018 14:31:43 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

>>> On 19.09.18 at 15:41, <andrew.cooper3@xxxxxxxxxx> wrote: > On 19/09/18 09:53, Jan Beulich wrote: >>>>> On 18.09.18 at 20:20, <andrew.cooper3@xxxxxxxxxx> wrote: >>> On 18/09/18 11:17, Jan Beulich wrote: >>>>>>> On 18.09.18 at 11:47, <aisaila@xxxxxxxxxxxxxxx> wrote: >>>>> On Thu, 2018-09-13 at 08:17 -0600, Jan Beulich wrote: >>>>>>>>> On 12.09.18 at 11:47, <aisaila@xxxxxxxxxxxxxxx> wrote: >>>>>>> The original version of the patch emulated the current instruction >>>>>>> (which, as a side-effect, emulated the page-walk as well), however >>>>>>> we >>>>>>> need finer-grained control. We want to emulate the page-walk, but >>>>>>> still >>>>>>> get an EPT violation event if the current instruction would trigger >>>>>>> one. >>>>>>> This patch performs just the page-walk emulation. >>>>>> Rather than making this basically a revision log, could you please >>>>>> focus >>>>>> on what you actually want to achieve? >>>>>> >>>>>> As to the title: "Suppress ..." please. >>>>>> >>>>>>> @@ -149,6 +151,10 @@ guest_walk_tables(struct vcpu *v, struct >>>>>>> p2m_domain *p2m, >>>>>>> ar_and &= gflags; >>>>>>> ar_or |= gflags; >>>>>>> >>>>>>> + if ( set_ad && set_ad_bits(&l4p[guest_l4_table_offset(va)].l4, >>>>>>> + &gw->l4e.l4, false) ) >>>>>>> + accessed = true; >>>>>> It is in particular this seemingly odd (and redundant with what's >>>>>> done >>>>>> later in the function) which needs thorough explanation. >>>>> On this patch I've followed Andrew Cooper's suggestion on how to set >>>>> A/D Bits: >>>>> >>>>> "While walking down the levels, set any missing A bits and remember if we >>>>> set any. If we set A bits, consider ourselves complete and exit back to >>>>> the guest. If no A bits were set, and the access was a write (which we >>>>> know from the EPT violation information), then set the leaf D bit." >>>>> >>>>> If I misunderstood the comment please clarify. >>>> It doesn't look to me as if you misunderstood anything, but only Andrew >>>> can say for sure. However, none of this was in the description of your >>>> patch (neither as part of the description, nor as code comment), and I >>>> think you'd even have to greatly extend on this in order to explain to >>>> everyone why the resulting behavior is still architecturally correct. In no >>>> case should you assume anyone reading your patch (now or in the >>>> future) has participated in the earlier discussion. >>> The problem we have is that, while we know the EPT Violation was for a >>> write of an A or D bit to a write-protected guest pagetable, we don't >>> know if it was the A or the D bit which was attempting to be set. >>> >>> Furthermore (without emulating the instruction, which is what we are >>> trying to avoid), we can't reconstruct the access. >>> >>> Access bits are only written if they were missing before, but may be set >>> speculatively. Dirty bits are only set when a write is retired. From a >>> practical point of view, the pipeline sets A and D bits as separate actions. >>> >>> Following this logic (and assuming for now a single vcpu), if we get a >>> GPT EPT Violation, and there are missing access bits on the walk, then >>> the fault is definitely from setting an access bit. >> Definitely? > > Yes > >> Is there anything guaranteeing architecturally that an access >> bit related EPT violation would be delivered earlier than any other one >> on that same or a lower page table level? > > No, but why does that matter? > > Architecturally defined or not, we know that the action the processor > was trying to perform was to set an A/D bit, because we got a vmexit > telling us so. Well, as per what I had written further down in my earlier reply (starting with "Or wait ..."), nothing in what you said about the EPT violation made explicit that this was a violation from a write attempt. Without that, it very much matters (afaict), as there are other reasons for getting EPT violations from page table accesses. >> It doesn't matter much for >> the implementation (because of it being permissible to set the A bits >> speculatively, as you also say further down, and any other violation >> then re-occurring after exiting back to the guest once the A bits are >> all set), but since we're discussing here what exactly the patch >> description should contain, I think I'd prefer this to be fully correct >> there. >> >> Or wait - I think I can agree with "definitely", provided you further >> restrict the context: "..., if we get a GPT EPT Write Violation ...". But >> from what I can tell the patch'es change to p2m_mem_access_check() >> doesn't apply (or pass on) any of these qualifications at all. > > I've not looked at the patch in detail yet. I'm tempted to suggest > rearranging guest_walk_tables() to just set the access bits on the > decent, rather than at the end. This matches how some hardware behaves > when pulling entries into the paging structure cache. I'm not opposed to that; I believe this property has changed over the years, as I certainly do recall that early (386ish, 486ish) descriptions of page walks suggested that A bits indeed got set after the full walk only, and I further believe the original implementation simply took this behavior as reference. What I am not very happy about is the addition of a (poorly described) new parameter to the function, making it sort of bail in the middle. >>> Set all access bits >>> and call it done. If we get a GPT EPT Violation and all access bits >>> were set, then it was definitely from setting the Dirty bit. >>> >>> For multi-vcpu scenarios, things get racy. Setting all the Access bits >>> is safe because its a speculative action, but a speculatively load on >>> one vcpu can race with a write (to a read-only mapping) on the other >>> vcpu, and would trick this algorithm into setting the dirty bit when the >>> write would have faulted (and not set the dirty bit). >>> >>> Do we have numbers on how many the GPT EPT Violations are for (only) >>> access sets, and how many are for dirty tsets? Would the first half of >>> the algorithm (which is definitely not racy) still be a net perf win? >> Does Windows make use of A bits at all? I'd expect most OSes to >> simply set them right away, and actively use of the D bits. > > What gives you the expectation that OSes wouldn't use A bits? > > For paging out, the best options are non-accessed non-dirty page because > their contents can be discarded immediately and reread from disk at a > later point. Discard-immediately is solely bound to the D bit. The A bit only tells an OS whether a page was accessed after the bit was cleared the last time. I will surely agree this is a possible strategy, and I was merely wondering whether Windows (other than iirc Linux) actually makes use of this. The reason I wouldn't really expect their use is because repeatedly (and perhaps frequently) clearing A bits from all PTEs is sort of work intensive (as is locating a PTE with its A bit clear). Yet if you don't clear them frequently, the information a set A bit carries does not look to be overly useful. Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.