
Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs



On Mon, May 6, 2019 at 12:30 PM Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>
> On 06/05/2019 18:41, Tamas K Lengyel wrote:
> > Hi Andrew,
> > thanks for helping to brainstorm on this.
> >
> >> How exactly does DRAKVUF go about injecting silent breakpoints?  It 
> >> obviously has to allocate a new gfn from somewhere to begin with.  Do the 
> >> bifurcated frames end up in two different altp2ms, or one in the host p2m 
> >> and one in an alternative?  Does #VE ever get used?
> > I've posted a blog entry about it a while ago, it's still accurate:
> > https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.
>
> Talking of which, have we fixed the emulation of `sti`?  I don't recall any
> changes, but given our aim to get the emulator complete, we should fix it.
>
> > You can't add new frames to only some of the altp2m's - at least not
> > with the current interfaces. All the shadow pages are added to the
> > hostp2m and then in the altp2m the GFN is remapped to the mfn of the
> > shadow page with execute-only permissions.
>
> Ah - of course.  gfns only make sense in the context of the hostp2m.
>
> > This way the breakpoint
> > can be written into the shadow-page and any attempt to read it can be
> > safely handled on a per-vCPU basis by switching it back to the hostp2m
> > for the duration of a singlestep (with MTF). Setting up the shadow
> > pages is only safe to do during the initial setup while the altp2m
> > view is not used and the guest is paused. Once altp2m views are being
> > used adding new pages to the hostp2m results in losing all altp2m
> > settings. For the most part this limitation is not an issue because
> > all supported use-cases add the breakpoints once during the initial
> > setup and there are no breakpoints added later during runtime.
>
> What do the host p2m permissions get set to?  How do you cope with
> future reuse of the gfn for a different purpose later?

The hostp2m permissions aren't changed. The active view is always the
altp2m; the hostp2m is only used while singlestepping. A
write-violation in the altp2m always triggers a full recopy of the
page to its shadow and a redeployment of any breakpoints active on the
page once the singlestep has finished and we switch back to the
altp2m. Since we start monitoring the guest kernel after it has already
booted, the pages that are trapped are stable. Monitoring code that
may be paged out and later loaded back to a different physical location
is not supported at the moment.
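
To make the recopy-and-redeploy step concrete, here is a minimal toy
model in Python. It is only a sketch: the class and method names are
made up for illustration and are not DRAKVUF or Xen APIs, and the p2m
views are reduced to two byte buffers.

```python
INT3 = 0xCC  # x86 breakpoint opcode


class ShadowedPage:
    """Toy model: one guest page plus its breakpointed shadow copy.

    `original` stands in for the frame seen through the hostp2m,
    `shadow` for the execute-only frame the altp2m remaps the GFN to.
    All names here are hypothetical.
    """

    def __init__(self, original, breakpoints):
        self.original = bytearray(original)
        self.breakpoints = set(breakpoints)  # offsets within the page
        self.shadow = bytearray()
        self.resync()

    def resync(self):
        # Full recopy of the page, then redeploy every active breakpoint.
        self.shadow = bytearray(self.original)
        for offset in self.breakpoints:
            self.shadow[offset] = INT3

    def guest_fetch(self, offset):
        # Instruction fetches are served from the shadow (altp2m view).
        return self.shadow[offset]

    def guest_read(self, offset):
        # Reads are served from the unmodified page (hostp2m view,
        # used for the duration of a singlestep).
        return self.original[offset]

    def guest_write(self, offset, value):
        # A write violation: the write lands in the original page during
        # the singlestep, and the shadow is rebuilt before switching back.
        self.original[offset] = value
        self.resync()
```

With a page of NOPs and a breakpoint at offset 4, a fetch at offset 4
sees 0xCC while a read at the same offset still sees 0x90, and that
stays true after the guest writes elsewhere in the page.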

>
> >
> > We've noticed that trapping MOV-TO-CR3 with the latest version of
> > Windows 10 has a lot of issues in terms of overhead when KPTI is used,
> > so as a band-aid solution it can be disabled to improve performance
> > (which Mathieu already did).
>
> Meltdown isn't subtle with its perf problems...  What purpose are you
> trapping %cr3 writes for?  Simply auditing the pagetables in use?  If
> so, VT-x has (since forever, iirc) had the CR3 target list (of 4
> entries) which Xen can use to whitelist "safe" %cr3 values, which bypass
> the VMExit.  If all you care about is that the vcpu stays on known-good
> pagetables, this interface could be plumbed up to include the kernel and
> user pagetables, which will avoid all the vmexits from syscalls due to
> meltdown.
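
As a rough illustration of the CR3 target list behaviour described
above, the exit decision can be modelled as below. This is a sketch in
Python with hypothetical names and made-up CR3 values; it is not Xen
code, and it only models the "write to %cr3 exits unless the value is
on the list" rule.

```python
MAX_CR3_TARGETS = 4  # size of the list as described above


def mov_to_cr3_exits(new_cr3, cr3_load_exiting, cr3_targets):
    """Return True if a guest write of new_cr3 would cause a VMExit.

    cr3_load_exiting: whether %cr3 writes are being trapped at all.
    cr3_targets: the whitelisted "safe" %cr3 values (hypothetical input).
    """
    if len(cr3_targets) > MAX_CR3_TARGETS:
        raise ValueError("at most 4 CR3 target values are available")
    if not cr3_load_exiting:
        return False
    # Whitelisted values bypass the VMExit.
    return new_cr3 not in cr3_targets
```

With the kernel and user pagetable roots of a KPTI guest both on the
list, the syscall-path switches between them would no longer exit,
while a switch to any other pagetable still would.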


CR3 trapping is primarily used to keep track of where the KPCR is
for the active vCPU. This speeds up finding the thread/process base to
get standard info about the execution state (pid, process name, etc.).
I'm aiming to get rid of this, hence the recent patches that fix
shadow_gs and add gdtr_base to the vm_event structure so we can
gather these even if the process is in ring3. There are some plugins
that use it for other custom purposes but those aren't (as) critical.
So the CR3 whitelist isn't really something that applies to this
use case.
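
A minimal sketch of why shadow_gs helps here, in Python. It assumes a
64-bit Windows guest that follows the usual swapgs convention (kernel
GS base holds the KPCR, and is parked in the shadow GS base while in
user mode). The function and parameter names are hypothetical, and the
short window right after kernel entry, before swapgs has run, is
ignored.

```python
def kpcr_base(cs_sel, gs_base, shadow_gs):
    """Pick the register that should hold the KPCR address.

    cs_sel:    code segment selector; its low two bits give the
               current privilege level.
    gs_base:   the active GS base.
    shadow_gs: the inactive GS base that swapgs would switch in.
    """
    cpl = cs_sel & 0x3
    # Ring 0: the kernel GS base is active. Ring 3: it sits in the shadow.
    return gs_base if cpl == 0 else shadow_gs
```

With both values available in every event, the KPCR can be located
without having trapped the preceding MOV-TO-CR3.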

> Alternatively, in some copious free time, once I've got the CPUID/MSR
> interface in a better state, we could fake up MSR_ARCH_CAPS.RDCL_NO so
> the guest doesn't turn on its meltdown mitigations in the first place.

That would be a nicer work-around, although I would still prefer not
having to trick the guest into a state that could be easily
fingerprinted - i.e. I want that Windows install to be just like any
other Windows install under Xen. Otherwise it would be easy to spot
that "oh this is the sandbox version".

> >> Given how many EPT flushing bugs I've already found in this area, I 
> >> wouldn't be surprised if there are further ones lurking.  If it is an EPT 
> >> flushing bug, this delta should make it go away, but it will come with a 
> >> hefty perf hit.
> > My understanding is that the VPID implementation in Xen is such that
> > effectively all VMEXITs will trigger assignment of a new VPID to the
> > vCPU - which is likely a performance issue in itself - so flushing the
> > EPT is likely not going to make a difference. But it's worth a shot,
> > maybe it does :)
>
> Sadly, things are far more complicated than that.  For one, Intel still
> owe me a comment/correction to that section of the SDM on INVLPG
> emulation for guests.
>
> Xen's use of ASIDs as a common concept started from the AMD side.  AMD
> strictly only cache linear => host physical mappings, so after any
> change to the p2m, an ASID tick will guarantee to get you a fully clean
> TLB for future pagewalks to populate.
>
> The same is not true for Intel.  VPID and EPT were introduced together,
> and have several kinds of mappings which are cached.  The processor may
> cache:
> 1) linear => gpa mappings (tagged with current VPID and PCID values, and
> contain no information from EPT)
> 2) gpa => hpa mappings (tagged with the current EPTP, may contain other
> data such as the SPP vector, doesn't contain any data from the guest
> pagetables)
> 3) combined mappings which are linear => hpa mappings.
>
> In particular, ticking the VPID after an EPT modification *does not*
> invalidate the gpa=>hpa mappings, so the guest can continue to execute
> using stale mappings.  This is why we've got the logic in
> vmx_vmenter_helper() to calculate if an INVEPT instruction is necessary.
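
The three kinds of cached mappings listed above can be modelled with a
toy TLB to show why a VPID tick alone leaves stale gpa => hpa entries
behind. This is an illustrative Python sketch only: the class is made
up, PCID tagging is left out, and it says nothing about how Xen or the
hardware actually implement this.

```python
class ToyTLB:
    """Toy model of the three mapping kinds and their tags."""

    def __init__(self):
        self.linear = {}          # (vpid, va)       -> gpa
        self.guest_physical = {}  # (eptp, gpa)      -> hpa
        self.combined = {}        # (vpid, eptp, va) -> hpa

    def fill(self, vpid, eptp, va, gpa, hpa):
        # Cache one translation in all three forms.
        self.linear[(vpid, va)] = gpa
        self.guest_physical[(eptp, gpa)] = hpa
        self.combined[(vpid, eptp, va)] = hpa

    def lookup_va(self, vpid, eptp, va):
        # Hits only for entries tagged with the current VPID.
        return self.combined.get((vpid, eptp, va))

    def lookup_gpa(self, eptp, gpa):
        # Not tagged with the VPID at all, so a VPID tick cannot hide it.
        return self.guest_physical.get((eptp, gpa))

    def invept(self, eptp):
        # Drop everything derived from this EPT: gpa => hpa and combined.
        self.guest_physical = {
            k: v for k, v in self.guest_physical.items() if k[0] != eptp
        }
        self.combined = {
            k: v for k, v in self.combined.items() if k[1] != eptp
        }
```

In this model, moving the vCPU from VPID 1 to VPID 2 makes
lookup_va() miss, but lookup_gpa() keeps returning the old hpa until
invept() is called for that EPTP.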

Right, there is a problem if you do that while there are active vCPUs
- each has to trap to Xen to pick up a new VPID to see the EPT
modifications. But in our context we don't modify EPT after the
initial setup and during that setup the whole domain is paused. At
runtime we only perform switching of the EPT for a particular vCPU
which is trapped through vm_event, so it will pick up its new VPID
when it resumes.

> Hence my suggestion for identifying whether it is a real TLB flushing
> issue, or a logical error elsewhere. :)
>

Yeah, it's certainly worth a shot. Not like any of this is trivial so I
could be wrong :)

Tamas

