
Re: [Xen-devel] [PATCH v1] x86/mm: Suppresses vm_events caused by page-walks



On 8/27/18 4:02 PM, Andrew Cooper wrote:
> On 27/08/18 13:53, Razvan Cojocaru wrote:
>> On 8/27/18 3:37 PM, Andrew Cooper wrote:
>>> On 27/08/18 13:12, Jan Beulich wrote:
>>>>>> For NPT, isn't there an error code bit telling you whether the
>>>>>> request was a user or "system" one? If not, some cheating
>>>>>> would be needed (derive from CPL, accepting that e.g.
>>>>>> descriptor table accesses would get mis-attributed), but
>>>>>> that's still not going to involve looking at the PTE flags.
>>>>> The alternative would be to simply walk (without enforcing any flags,
>>>>> and so making the pfec walk parameter unnecessary) to the respective
>>>>> address, and query for _PAGE_ACCESSED and _PAGE_DIRTY only.
>>>>>
>>>>> If _PAGE_ACCESSED is not set, set it and exit.
>>>>> If _PAGE_ACCESSED is set, set _PAGE_DIRTY also and exit.
>>>> Since it's ambiguous in the NPT case - are you talking about
>>>> setting the flags in the guest or host page tables? The
>>>> former, I'm afraid, might not be acceptable (as not always
>>>> being architecturally correct). In any case, it feels as if we'd
>>>> been here before, in that this reminds me of you meaning
>>>> to imply from a second walk (with A already set) that it must
>>>> be a write access. I thought we had settled on such an
>>>> implication not being generally correct.
>>> The problem we are trying to solve is that when operating in
>>> non-root mode, the cpu pagewalk, when trying to set a guest A/D bit in a
>>> write-protected EPT page, takes an EPT violation for a write to a
>>> read-only page.
>>>
>>> Manually setting the A/D bits (as appropriate) and restarting the
>>> instruction is sufficient for it to complete correctly.
>>>
>>> At the moment, every time this happens, a request is sent to the
>>> introspection agent, and the agent calculates that it was due to
>>> pagetable protection, and instructs Xen to emulate the instruction. 
>>> This accounts for 97% (?) of the VMExits, and is unrelated to any of the
>>> real protections which introspection is trying to achieve.
>>>
>>> What Razvan is looking to do is to have Xen skip the "send to the
>>> introspection agent" part as an optimisation, because hardware tells Xen
>>> (as part of the VMExit) when this condition has occurred, and the
>>> vm_event logic has already asked Xen to try and fix up this condition
>>> automatically.
>>>
>>> What can actually be done depends on how A/D bits behave in real hardware.
>>>
>>> Setting access bits for non-leaf entries is definitely fine, and
>>> speculatively setting the access bit is also explicitly permitted by the
>>> spec.  However, I can't find any comment on speculative dirty bits (from
>>> either Intel or AMD), and I've not encountered such a behaviour with the
>>> pagetable work I've been doing.
>> Thanks for the reply!
>>
>> I've forgotten a piece of information that I really should have written
>> here: we would only set the D bit if A is already set and either the
>> page is writable (has _PAGE_RW set) or CR0.WP is 0 (the latter case is
>> admittedly more tricky).
> 
> How about a new function which works similarly to guest-walk-tables, but
> only ever sets A/D bits.
> 
> Given information from hardware, we know the linear address, and that it
> was a problem with the guest pagetables, from which we explicitly know
> that it was from writing an A/D bit to a guest PTE.
> 
> While walking down the levels, set any missing A bits and remember if we
> set any.  If we set A bits, consider ourselves complete and exit back to
> the guest.  If no A bits were set, and the access was a write (which we
> know from the EPT violation information), then set the leaf D bit.
> 
> This should be architecturally correct as it is exclusively derived from
> information provided by the VMExit, and won't cause dirty bits to be
> written in cases where the hardware wouldn't have written them
> (speculative or otherwise).  It does mean that an instruction which
> would need to set A and D bits in the walk will take two EPT violations
> to achieve the end result, but it probably is still quicker than sending
> the vm_event out.

Right, that's pretty much what we were proposing, a basic algorithm of:

if ((pte & A) && (pte & RW)) pte |= D;
pte |= A;

where the if probably becomes:

if ((pte & A) && ((pte & RW) || CR0.WP == 0)) pte |= D;
pte |= A;

for the CR0.WP case.
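
To make the above concrete, here is a minimal standalone sketch of that
leaf-PTE update in C. It is not existing Xen code: fixup_ad_bits() and
the PTE_* macros are hypothetical names chosen for illustration (the
bit positions are the architectural x86 ones), and the CR0.WP handling
is deliberately naive in that it does not look at CPL, so it only
models supervisor accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 PTE flag bits; the macro names are illustrative only. */
#define PTE_RW       (UINT64_C(1) << 1)
#define PTE_ACCESSED (UINT64_C(1) << 5)
#define PTE_DIRTY    (UINT64_C(1) << 6)

/*
 * Hypothetical helper (not an existing Xen function).
 *
 * Given a leaf PTE value and the guest's CR0.WP setting, return the PTE
 * value after one fixup pass:
 *   - if A is already set and the page may be written (RW set, or
 *     CR0.WP clear), also set D;
 *   - in every case, make sure A is set.
 *
 * Assumption: CPL is ignored, so the CR0.WP == 0 branch only models
 * supervisor-mode writes.
 */
static uint64_t fixup_ad_bits(uint64_t pte, bool cr0_wp)
{
    if ((pte & PTE_ACCESSED) && ((pte & PTE_RW) || !cr0_wp))
        pte |= PTE_DIRTY;

    pte |= PTE_ACCESSED;

    return pte;
}
```

With this shape, an entry that starts with neither A nor D set needs two
passes (and so two EPT violations) before D is set, which matches the
cost Andrew mentions above.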

As discussed privately, there's also the case where two VCPUs may try to
set A concurrently, which is what I assume is the case Jan has hinted at.

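Another small issue is that we do need to ignore the EPT violation
information as it pertains to reads or writes: for a violation caused by
a page walk, that will always be the page-walk access type, rw, and not
the type of the access the guest instruction was making.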


Thanks,
Razvan
