
Re: [Xen-devel] [VMI] Possible race-condition in altp2m APIs

  • To: Tamas K Lengyel <tamas@xxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Mon, 6 May 2019 19:30:48 +0100
  • Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Mathieu Tarral <mathieu.tarral@xxxxxxxxxxxxxx>
  • Delivery-date: Mon, 06 May 2019 18:31:09 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: preference=signencrypt

On 06/05/2019 18:41, Tamas K Lengyel wrote:
> Hi Andrew,
> thanks for helping to brainstorm on this.
>> How exactly does DRAKVUF go about injecting silent breakpoints?  It 
>> obviously has to allocate a new gfn from somewhere to begin with.  Do the 
>> bifurcated frames end up in two different altp2ms, or one in the host p2m 
>> and one in an alternative?  Does #VE ever get used?
> I've posted a blog entry about it a while ago, it's still accurate:
> https://xenproject.org/2016/04/13/stealthy-monitoring-with-xen-altp2m.

Talking of, have we fixed the emulation of `sti`?  I don't recall any
changes, but given our aim to get the emulator complete, we should fix it.

> You can't add new frames to only some of the altp2m's - at least not
> with the current interfaces. All the shadow pages are added to the
> hostp2m and then in the altp2m the GFN is remapped to the mfn of the
> shadow page with execute-only permissions.

Ah - of course.  gfns only make sense in the context of the hostp2m.

> This way the breakpoint
> can be written into the shadow-page and any attempt to read it can be
> safely handled on a per-vCPU basis by switching it back to the hostp2m
> for the duration of a singlestep (with MTF). Setting up the shadow
> pages is only safe to do during the initial setup while the altp2m
> view is not used and the guest is paused. Once altp2m views are being
> used adding new pages to the hostp2m results in losing all altp2m
> settings. For the most part this limitation is not an issue because
> all supported use-cases add the breakpoints once during the initial
> setup and there are no breakpoints added later during runtime.

What do the host p2m permissions get set to?  How do you cope with
future reuse of the gfn for a different purpose later?
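
For readers following along, the shadow-page scheme Tamas describes can be
sketched as a toy model.  This is not Xen or DRAKVUF code: the class
ToyDomain, its methods, and the permission bits are all hypothetical names
made up for illustration, and the single-step (MTF) window is collapsed into
a one-off fallback to the hostp2m.

```python
# Toy model of "silent" breakpoints via a shadow page and an altp2m view.
# All names are hypothetical; this only illustrates the data flow.
R, W, X = 1, 2, 4   # toy permission bits
INT3 = 0xCC         # x86 breakpoint opcode


class ToyDomain:
    def __init__(self):
        self.memory = {}         # mfn -> bytearray (page contents)
        self.hostp2m = {}        # gfn -> (mfn, perms)
        self.altp2m = {}         # gfn -> (mfn, perms), overrides the hostp2m
        self.use_altp2m = False  # which view the (single) toy vCPU runs on

    def add_host_page(self, gfn, mfn, data, perms=R | W | X):
        # New frames can only be added through the hostp2m.
        self.memory[mfn] = bytearray(data)
        self.hostp2m[gfn] = (mfn, perms)

    def _translate(self, gfn, use_alt):
        if use_alt and gfn in self.altp2m:
            return self.altp2m[gfn]
        return self.hostp2m[gfn]

    def install_breakpoint(self, gfn, shadow_gfn, shadow_mfn, offset):
        # Copy the original page into a shadow page added to the hostp2m,
        # write the breakpoint into the copy, then remap the original gfn
        # to the shadow mfn in the altp2m view with execute-only access.
        orig_mfn, _ = self.hostp2m[gfn]
        self.add_host_page(shadow_gfn, shadow_mfn, self.memory[orig_mfn])
        self.memory[shadow_mfn][offset] = INT3
        self.altp2m[gfn] = (shadow_mfn, X)

    def fetch(self, gfn, offset):
        mfn, perms = self._translate(gfn, self.use_altp2m)
        if not perms & X:
            raise PermissionError("execute violation")
        return self.memory[mfn][offset]

    def read(self, gfn, offset):
        mfn, perms = self._translate(gfn, self.use_altp2m)
        if not perms & R:
            # Read violation: the monitor would switch this vCPU to the
            # hostp2m for one single-stepped instruction, then switch back.
            mfn, perms = self._translate(gfn, use_alt=False)
        return self.memory[mfn][offset]
```

With the altp2m view active, an instruction fetch at the patched offset sees
0xCC, while a data read of the same location sees the original byte, which is
what keeps the breakpoint hidden from the guest.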

> We've noticed that trapping MOV-TO-CR3 with the latest version of
> Windows 10 has a lot of issues in terms of overhead when KPTI is used,
> so as a band-aid solution it can be disabled to improve performance
> (which Mathieu already did).

Meltdown isn't subtle with its perf problems...  What purpose are you
trapping %cr3 writes for?  Simply auditing the pagetables in use?  If
so, VT-x has (since forever, iirc) had the CR3 target list (of 4
entries) which Xen can use to whitelist "safe" %cr3 values, which bypass
the VMExit.  If all you care about is that the vcpu stays on known-good
pagetables, this interface could be plumbed up to include the kernel and
user pagetables, which will avoid all the vmexits from syscalls due to

Alternatively, in some copious free time, once I've got the CPUID/MSR
interface in a better state, we could fake up MSR_ARCH_CAPS.RDCL_NO so
the guest doesn't turn on its meltdown mitigations in the first place.
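
As a rough illustration of the CR3 target list idea mentioned above, the
following toy model counts how many %cr3 writes would still cause a VMExit
once the kernel and user pagetables are whitelisted.  The class and function
names are hypothetical, the limit of 4 entries is taken from the description
above, and none of this reflects an existing Xen interface.

```python
# Toy model of the VT-x CR3 target list: a MOV-to-CR3 whose value matches
# a listed target does not exit.  Names here are illustrative only.
MAX_CR3_TARGETS = 4  # list size as described above


class Cr3TargetList:
    def __init__(self):
        self.targets = []

    def add(self, cr3):
        if cr3 in self.targets:
            return
        if len(self.targets) >= MAX_CR3_TARGETS:
            raise ValueError("CR3 target list is full")
        self.targets.append(cr3)

    def mov_to_cr3_exits(self, new_cr3):
        # A write exits only if the new value is not a listed target.
        return new_cr3 not in self.targets


def count_exits(target_list, cr3_writes):
    """Number of writes in cr3_writes that would cause a VMExit."""
    return sum(1 for v in cr3_writes if target_list.mov_to_cr3_exits(v))
```

For a guest that flips between one kernel and one user pagetable on every
syscall, listing both values leaves only writes of unknown %cr3 values
exiting, which are the ones a monitor auditing pagetables would care about.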

>> Given how many EPT flushing bugs I've already found in this area, I wouldn't 
>> be surprised if there are further ones lurking.  If it is an EPT flushing 
>> bug, this delta should make it go away, but it will come with a hefty perf 
>> hit.
> My understanding is that the VPID implementation in Xen is such that
> effectively all VMEXITs will trigger assignment of a new VPID to the
> vCPU - which is likely a performance issue in itself - so flushing the
> EPT is likely not going to make a difference. But it's worth a shot,
> maybe it does :)

Sadly, things are far more complicated than that.  For one, Intel still
owe me a comment/correction to that section of the SDM on INVLPG
emulation for guests.

Xen's use of ASIDs as a common concept started from the AMD side.  AMD
strictly only cache linear => host physical mappings, so after any
change to the p2m, an ASID tick will guarantee to get you a fully clean
TLB for future pagewalks to populate.

The same is not true for Intel.  VPID and EPT were introduced together,
and have several kinds of mappings which are cached.  The processor may cache:
1) linear => gpa mappings (tagged with current VPID and PCID values, and
contain no information from EPT)
2) gpa => hpa mappings (tagged with the current EPTP, may contain other
data such as the SPP vector, doesn't contain any data from the guest
3) combined mappings which are linear => hpa mappings.

In particular, ticking the VPID after an EPT modification *does not*
invalidate the gpa=>hpa mappings, so the guest can continue to execute
using stale mappings.  This is why we've got the logic in
vmx_vmenter_helper() to calculate if an INVEPT instruction is necessary.
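
The difference can be sketched with a toy model.  It is a deliberate
simplification under stated assumptions: combined mappings and PCID tagging
are left out, pagetables are flat dictionaries, and the class names
ToyAmdTlb and ToyIntelTlb are invented for illustration rather than taken
from Xen.

```python
# Toy model: why a new ASID is enough on AMD, but a new VPID is not enough
# on Intel after a p2m/EPT change.  Illustrative names and structure only.
class ToyAmdTlb:
    def __init__(self):
        self.linear_to_hpa = {}  # (asid, linear) -> hpa

    def translate(self, asid, linear, guest_pt, npt):
        key = (asid, linear)
        if key not in self.linear_to_hpa:
            self.linear_to_hpa[key] = npt[guest_pt[linear]]
        return self.linear_to_hpa[key]


class ToyIntelTlb:
    def __init__(self):
        self.linear_to_gpa = {}  # (vpid, linear) -> gpa, no EPT information
        self.gpa_to_hpa = {}     # (eptp, gpa) -> hpa, no guest information

    def translate(self, vpid, eptp, linear, guest_pt, ept):
        lkey = (vpid, linear)
        if lkey not in self.linear_to_gpa:
            self.linear_to_gpa[lkey] = guest_pt[linear]
        gpa = self.linear_to_gpa[lkey]
        gkey = (eptp, gpa)
        if gkey not in self.gpa_to_hpa:
            self.gpa_to_hpa[gkey] = ept[gpa]
        return self.gpa_to_hpa[gkey]

    def invept(self, eptp):
        # Drop every gpa => hpa mapping tagged with this EPTP.
        self.gpa_to_hpa = {
            k: v for k, v in self.gpa_to_hpa.items() if k[0] != eptp
        }
```

In this model, changing the EPT entry and then translating under a fresh
VPID still returns the old hpa, because the gpa => hpa cache is keyed by
EPTP only.  The new mapping is seen only after invept().  The AMD-style
cache, keyed by ASID alone, returns the new hpa as soon as the ASID changes.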

Hence my suggestion for identifying whether it is a real TLB flushing
issue, or a logical error elsewhere. :)

