Xen project Mailing List

Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

From: Tamas K Lengyel <tamas@xxxxxxxxxxxxx>

Date: Mon, 6 May 2019 12:51:21 -0600

Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Mathieu Tarral <mathieu.tarral@xxxxxxxxxxxxxx>

Delivery-date: Mon, 06 May 2019 18:52:27 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, May 6, 2019 at 12:30 PM Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>
> On 06/05/2019 18:41, Tamas K Lengyel wrote:
> > Hi Andrew,
> > thanks for helping brainstorming on this.
> >
> >> How exactly does DRAKVUF go about injecting silent breakpoints? It
> >> obviously has to allocate a new gfn from somewhere to begin with. Do the
> >> bifurcated frames end up in two different altp2ms, or one in the host p2m
> >> and one in an alternative? Does #VE ever get used?
> > I've posted a blog entry about it a while ago, it's still accurate:
> > https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.
>
> Talking of, have we fixed the emulation of `sti`? I don't recall any
> changes, but given our aim to get the emulator complete, we should fix it.
>
> > You can't add new frames to only some of the altp2m's - at least not
> > with the current interfaces. All the shadow pages are added to the
> > hostp2m and then in the altp2m the GFN is remapped to the mfn of the
> > shadow page with an execute-only permissions.
>
> Ah - of course. gfns only make sense in the context of the hostp2m.
>
> > This way the breakpoint
> > can be written into the shadow-page and any attempt to read it can be
> > safely handled on a per-vCPU base by switching it back to the hostp2m
> > for the duration of a singlestep (with MTF). Setting up the shadow
> > pages is only safe to do during the initial setup while the altp2m
> > view is not used and the guest is paused. Once altp2m views are being
> > used adding new pages to the hostp2m results in losing all altp2m
> > settings. For the most part this limitation is not an issue because
> > all supported use-cases add the breakpoints once during the initial
> > setup and there are no breakpoints added later during runtime.
>
> What do the host p2m permissions get set to? How do you cope with
> future reuse of the gfn for a different purpose later?

The hostp2m permissions aren't changed.
The active view is always the altp2m; the hostp2m is only ever used
while singlestepping. A write-violation in the altp2m always triggers a
full recopy of the page to its shadow and redeployment of any
breakpoints active on the page after the singlestep is finished and we
are switching back to the altp2m. Since we are monitoring the guest
kernel after it's already booted up, the pages that are trapped are
stable. Monitoring code that may be paged in and loaded back to a
different physical location is not supported at the moment.

>
> >
> > We've noticed that trapping MOV-TO-CR3 with the latest version of
> > Windows 10 has a lot of issues in terms of overhead when KPTI is used,
> > so as a band-aid solution it can be disabled to improve performance
> > (which Mathieu already did).
>
> Meltdown isn't subtle with its perf problems... What purpose are you
> trapping %cr3 writes for? Simply auditing the pagetables in use? If
> so, VT-x has (since forever, iirc) had the CR3 target list (of 4
> entries) which Xen can use to whitelist "safe" %cr3 values, which bypass
> the VMExit. If all you care about is that the vcpu stays on known-good
> pagetables, this interface could be plumbed up to include the kernel and
> user pagetables, which will avoid all the vmexits from syscalls due to
> meltdown.

CR3 trapping is primarily used to just keep track of where the KPCR is
for the active vCPU. This speeds up finding the thread/process base to
get standard info about the execution state (pid, process name, etc).
I'm aiming to get rid of this, hence the recent patches that fix the
shadow_gs and add the gdtr_base to the vm_event structure so we can
gather these even if the process is in ring3. There are some plugins
that use it for other custom purposes but those aren't (as) critical. So
the CR3 whitelist isn't really something that applies to this use case.
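
As a rough illustration of the shadow-page scheme described earlier in
this message, here is a toy Python model. It is not DRAKVUF or Xen code:
ToyDomain, add_breakpoint, fetch, read, write and trace are names made
up for this sketch, pages are plain byte arrays, and the per-vCPU view
switch with MTF is reduced to an entry in a trace list. Under those
assumptions it shows execution seeing the breakpoint, reads seeing the
original byte, and a write causing a recopy plus redeployment.

```python
INT3 = 0xCC  # x86 breakpoint opcode


class ToyDomain:
    """Toy model: one hostp2m plus one altp2m view with shadow pages."""

    def __init__(self, frames):
        # hostp2m: gfn -> page contents as the guest really has them
        self.host = {gfn: bytearray(data) for gfn, data in frames.items()}
        self.shadow = {}       # gfn -> shadow copy that carries the breakpoints
        self.breakpoints = {}  # gfn -> set of breakpoint offsets
        self.trace = []        # records every temporary switch to the hostp2m

    def add_breakpoint(self, gfn, offset):
        # Setup phase only (guest paused, altp2m view not yet in use).
        if gfn not in self.shadow:
            self.shadow[gfn] = bytearray(self.host[gfn])
        self.shadow[gfn][offset] = INT3
        self.breakpoints.setdefault(gfn, set()).add(offset)

    def fetch(self, gfn, offset):
        # Instruction fetch in the altp2m: remapped gfns hit the shadow page.
        return self.shadow.get(gfn, self.host[gfn])[offset]

    def read(self, gfn, offset):
        # Read of an execute-only page: singlestep once on the hostp2m.
        if gfn in self.shadow:
            self.trace.append(("read", gfn))
        return self.host[gfn][offset]

    def write(self, gfn, offset, value):
        # The write lands in the hostp2m page during the singlestep, ...
        self.host[gfn][offset] = value
        if gfn in self.shadow:
            self.trace.append(("write", gfn))
            # ... then the shadow is recopied and breakpoints are redeployed.
            self.shadow[gfn] = bytearray(self.host[gfn])
            for off in self.breakpoints[gfn]:
                self.shadow[gfn][off] = INT3


dom = ToyDomain({0x10: b"\x90" * 16})
dom.add_breakpoint(0x10, 4)
assert dom.fetch(0x10, 4) == INT3   # execution sees the breakpoint
assert dom.read(0x10, 4) == 0x90    # a read sees the original byte
dom.write(0x10, 5, 0x41)
assert dom.fetch(0x10, 5) == 0x41   # shadow picked up the guest's write
assert dom.fetch(0x10, 4) == INT3   # breakpoint was redeployed
```

The model has no notion of adding shadow pages at runtime, which matches
the limitation above that they are only set up while the guest is paused.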

> Alternatively, in some copious free time, once I've got the CPUID/MSR
> interface in a better state, we could fake up MSR_ARCH_CAPS.RDCL_NO so
> the guest doesn't turn on its meltdown mitigations in the first place.

That would be a nicer work-around, although I would still prefer not
having to trick the guest into a state that could be easily
fingerprinted - ie. I want that Windows install to be just like any
other Windows install under Xen. Otherwise it would be easy to spot that
"oh this is the sandbox version".

> >> Given how many EPT flushing bugs I've already found in this area, I
> >> wouldn't be surprised if there are further ones lurking. If it is an EPT
> >> flushing bug, this delta should make it go away, but it will come with a
> >> hefty perf hit.
> > My understanding is that the VPID implementation in Xen is such that
> > effectively all VMEXITs will trigger assignment of a new VPID to the
> > vCPU - which is likely a performance issue in itself - so flushing the
> > EPT is likely not going to make a difference. But it's worth a shot,
> > maybe it does :)
>
> Sadly, things are far more complicated than that. For one, Intel still
> owe me a comment/correction to that section of the SDM on INVLPG
> emulation for guests.
>
> Xen's use of ASIDs as a common concept started from the AMD side. AMD
> strictly only cache linear => host physical mappings, so after any
> change to the p2m, an ASID tick will guarantee to get you a fully clean
> TLB for future pagewalks to populate.
>
> The same is not true for Intel. VPID and EPT were introduced together,
> and have several kinds of mappings which are cached. The processor may
> cache:
> 1) linear => gpa mappings (tagged with current VPID and PCID values, and
> contain no information from EPT)
> 2) gpa => hpa mappings (tagged with the current EPTP, may contain other
> data such as the SPP vector, doesn't contain any data from the guest
> pagetables)
> 3) combined mappings which are linear => hpa mappings.
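
To make the tagging in that list concrete, here is a toy Python model of
the three caches. Everything in it is an assumption for illustration:
ToyTLB, translate and invept are made-up names, the guest pagetable and
the EPT are flat dicts, and the model caches everything eagerly whereas
the processor only "may" cache. Under those assumptions it shows a
gpa => hpa entry, which is tagged only by EPTP, still being hit after
the VPID changes, and going away only once the entries for that EPTP are
dropped.

```python
class ToyTLB:
    """Toy model of the three kinds of cached mappings listed above."""

    def __init__(self):
        self.linear = {}      # (vpid, pcid, va) -> gpa
        self.guest_phys = {}  # (eptp, gpa) -> hpa
        self.combined = {}    # (vpid, pcid, eptp, va) -> hpa

    def translate(self, vpid, pcid, eptp, va, guest_pt, ept):
        key = (vpid, pcid, eptp, va)
        if key in self.combined:
            return self.combined[key]
        lin_key = (vpid, pcid, va)
        if lin_key not in self.linear:
            self.linear[lin_key] = guest_pt[va]   # walk guest pagetable
        gpa = self.linear[lin_key]
        gp_key = (eptp, gpa)
        if gp_key not in self.guest_phys:
            self.guest_phys[gp_key] = ept[gpa]    # walk EPT
        hpa = self.guest_phys[gp_key]
        self.combined[key] = hpa
        return hpa

    def invept(self, eptp):
        # Drop gpa => hpa and combined entries tagged with this EPTP.
        self.guest_phys = {k: v for k, v in self.guest_phys.items()
                           if k[0] != eptp}
        self.combined = {k: v for k, v in self.combined.items()
                         if k[2] != eptp}


guest_pt = {0x1000: 0x5000}   # va -> gpa
ept = {0x5000: 0xA000}        # gpa -> hpa
tlb = ToyTLB()
assert tlb.translate(1, 0, 7, 0x1000, guest_pt, ept) == 0xA000

ept[0x5000] = 0xB000          # hypervisor edits the EPT entry
# New VPID: linear and combined lookups miss, but the EPTP-tagged
# gpa => hpa entry is still hit, so the old hpa comes back.
assert tlb.translate(2, 0, 7, 0x1000, guest_pt, ept) == 0xA000

tlb.invept(7)
assert tlb.translate(2, 0, 7, 0x1000, guest_pt, ept) == 0xB000
```

The model ignores that real guest pagetable walks themselves go through
EPT, and it has no SPP or permission bits; it only tracks the tags.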
>
> In particular, ticking the VPID after an EPT modification *does not*
> invalidate the gpa=>hpa mappings, so the guest can continue to execute
> using stale mappings. This is why we've got the logic in
> vmx_vmenter_helper() to calculate if an INVEPT instruction is necessary.

Right, there is a problem if you do that while there are active vCPUs -
each has to trap to Xen to pick up a new VPID to see the EPT
modifications. But in our context we don't modify EPT after the initial
setup and during that setup the whole domain is paused. At runtime we
only perform switching of the EPT for a particular vCPU which is trapped
through vm_event, so it will pick up its new VPID when it resumes.

> Hence my suggestion for identifying whether it is a real TLB flushing
> issue, or a logical error elsewhere. :)
>

Yea, it's certainly worth a shot. Not like any of this is trivial so I
could be wrong :)

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
