Re: [Xen-devel] Question about VPID during MOV-TO-CR3

On Mon, Sep 26, 2016 at 12:24 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>> On 23.09.16 at 22:45, <tamas.lengyel@xxxxxxxxxxxx> wrote:
>> On Fri, Sep 23, 2016 at 9:50 AM, Tamas K Lengyel
>> <tamas.lengyel@xxxxxxxxxxxx> wrote:
>>> On Fri, Sep 23, 2016 at 9:39 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>>> On 23.09.16 at 17:26, <tamas.lengyel@xxxxxxxxxxxx> wrote:
>>>>> On Fri, Sep 23, 2016 at 2:24 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>>>>> On 22.09.16 at 19:18, <tamas.lengyel@xxxxxxxxxxxx> wrote:
>>>>>>> So I verified that when CPU-based load exiting is enabled, the TLB
>>>>>>> flush here is critical. Without it the guest kernel crashes at random
>>>>>>> points during boot. OTOH why does Xen trap every guest CR3 update
>>>>>>> unconditionally? While we have features such as the vm_event/monitor
>>>>>>> that may choose to subscribe to that event, Xen traps it even when
>>>>>>> that is not in use. Is that trapping necessary for something else?
>>>>>> Where do you see this being unconditional? construct_vmcs()
>>>>>> clearly avoids setting these intercepts when using EPT. Are you
>>>>>> perhaps suffering from
>>>>>>             /* Trap CR3 updates if CR3 memory events are enabled. */
>>>>>>             if ( v->domain->arch.monitor.write_ctrlreg_enabled &
>>>>>>                  monitor_ctrlreg_bitmask(VM_EVENT_X86_CR3) )
>>>>>>                 v->arch.hvm_vmx.exec_control |= 
>>>>>> in vmx_update_guest_cr()? That'll be rather something for you
>>>>>> or Razvan to explain. Outside of nested VMX I don't see any
>>>>>> other enabling of that intercept (didn't check AMD code on the
>>>>>> assumption that you're working on Intel hardware).
>>>>> So there seem to be two separate paths that lead to the TLB flushing.
>>>>> One is indeed the above case you cited when we enable CR3 monitoring
>>>>> through the monitor interface. However, during domain boot I also see
>>>>> this path being called that is not related to the
>>>>> (XEN) hap.c:739:d1v0 hap_update_paging_modes is calling hap_update_cr3
>>>>> (XEN) hap.c:701:d1v0 HAP update cr3 called
>>>>> (XEN) /src/xen/xen/include/asm/hvm/hvm.h:344:d1v0 HVM update guest cr3 called
>>>>> (XEN) vmx.c:1549:d1v0 Update guest CR3 value=0x7a7c4000
>>>>> This path seems to de-activate once the domain is fully booted.
>>>> This late? According to the CR0 handling in
>>>> vmx_update_guest_cr() I would understand it to be enabled only
>>>> while the guest is still in real mode (and even then only on old
>>>> hardware, i.e. without the Unrestricted Guest functionality).
>>> Right, with unrestricted guest support I would assume none of this
>>> would get called - but it does, and quite frequently during domain
>>> boot. The CPU is an Intel(R) Xeon(R) CPU E5-2430.
>> So I experimented with selectively disabling the flushing such that
>> it's done only when coming from a path other than CPU-based CR3 load
>> exiting. I've added a bool to struct vcpu that gets set to 0 every
>> time vmx_vmexit_handler is called, and only gets set to 1 when
>> vmx_cr_access reports a MOV-TO-CR3. Then in the vmx_update_guest_cr
>> the flush only happens as such:
>>         if ( !v->movtocr3 )
>>             hvm_asid_flush_vcpu(v);
>> In the guest I run a test application that allocates a page at a fixed
>> VA, writes a magic value to it, and then keeps spinning on reading the
>> magic value back from the page, checking if it's the same as
>> originally supplied. I launch this application twice with different
>> magic values, so that if the TLB invalidation is an issue one of the
>> test applications would read back the wrong magic value from the VA
>> using a stale TLB entry. I've verified that the same VA in the two
>> applications points to different pages and that those PTEs are not
>> marked global and no PCID is used.
>> [  724] test (struct addr:ffff88003730f330). PGD: 0x3731f000
>> VADDR 0x5000000 -> PADDR 0x73e35000. Global page: 0
>> [  727] test (struct addr:ffff88003681ea20). PGD: 0x777a6000
>> VADDR 0x5000000 -> PADDR 0x75043000. Global page: 0
> I'm surprised. As said before - a mov-to-CR3 cannot be emulated
> without a minimal amount of flushing. No experiments whatsoever
> are suitable to prove the contrary.

That's a pretty strong statement - can you tell me where exactly in the
SDM it says that? I've gone through it a couple of times already
and I can't find anything that explicitly says that the flushing has
to be performed by the VMM when mov-to-CR3 trapping is enabled. The
closest thing I found indicated the contrary. Furthermore, if the
flushing is necessary, then how would you explain that there were no
TLB mixups in the above experiment?
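
For reference, the test described above can be approximated by the sketch
below. This is not the program that was actually run; it only illustrates
the idea. The names spin_on_magic and TEST_VA are hypothetical, a 4096-byte
page is assumed, and the address is passed to mmap() as a hint rather than
forced with MAP_FIXED, so the function reports failure if the kernel places
the page elsewhere.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical fixed VA, chosen to match the VADDR in the log above. */
#define TEST_VA            ((void *)0x5000000UL)
/* Assumed page size; a real program would query sysconf(_SC_PAGESIZE). */
#define PAGE_SIZE_ASSUMED  4096UL

/*
 * Map one anonymous page at TEST_VA, store `magic` in it, and read it back
 * `iterations` times. Returns the number of reads that did not match, or -1
 * if the page could not be mapped at the requested address.
 */
long spin_on_magic(uint64_t magic, long iterations)
{
    assert(iterations >= 0);

    void *p = mmap(TEST_VA, PAGE_SIZE_ASSUMED, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (p != TEST_VA) {
        /* The hint was not honoured; the test is meaningless elsewhere. */
        munmap(p, PAGE_SIZE_ASSUMED);
        return -1;
    }

    /* volatile forces a real load from memory on every iteration. */
    volatile uint64_t *slot = p;
    long mismatches = 0;

    *slot = magic;
    for (long i = 0; i < iterations; i++)
        if (*slot != magic)
            mismatches++;

    munmap(p, PAGE_SIZE_ASSUMED);
    return mismatches;
}
```

Two separate processes would each call spin_on_magic() with a different
magic value; a stale translation for the shared VA would then show up as a
non-zero mismatch count in one of them.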

>> Both applications work as expected without the VPID flushing taking
>> place. So at least for CPU-based CR3 load exiting it seems that this
>> flush is not necessary. As for why this path gets called during domain
>> boot when the CPU supports Unrestricted Guest mode and it is properly
>> detected when Xen boots, I'm not sure. However, as we use CPU-based
>> CR3 load exiting quite often when doing VMI, I would prefer to disable
>> this flushing at least for this case. Any thoughts?
> As said before - you'd better direct this question to the VMX
> maintainers, and even better would be to first understand why
> the intercept remains enabled in the first place. After all it's
> quite obvious that most improvement can be expected from not
> enabling it at all, whenever possible. Only if it needs to stay
> enabled over extended periods of a guest's lifetime it would then
> become interesting to see whether the emulation path can be
> improved.

To clarify - mov-to-CR3 trapping is _not_ enabled by default on a
domain. I assumed it is the only path to vmx_update_guest_cr, but I
now further verified that vmx_cr_access does not get called for a
mov-to-CR3 when the domain boots, it only gets called when we enable
it through the monitor system. There is another path that leads to a call
to vmx_update_guest_cr for updating CR3 when the domain boots which
seems to require this flushing to happen. That other path I don't care
about - although it's rather odd in itself as well. Now when the
mov-to-CR3 path gets activated, the flushing does not seem to be
necessary, as my experiment shows, and the flush actively breaks
architectural features (global pages and PCID). When we do
introspection this trapping does get enabled and stays on for the
lifetime of the domain. So adding such a big and unnecessary
performance hit is very much undesirable.
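
To make the selective-flush logic from my experiment concrete, here is a
toy model of it in plain C. It is not Xen code: struct vcpu_model, the
exit_reason enum and both functions are hypothetical stand-ins, and the
flush counter merely takes the place of the hvm_asid_flush_vcpu(v) call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical exit reasons; only the MOV-TO-CR3 case matters here. */
enum exit_reason { EXIT_OTHER, EXIT_MOV_TO_CR3 };

/* Hypothetical stand-in for the relevant bits of struct vcpu. */
struct vcpu_model {
    bool movtocr3;          /* set only while handling a MOV-TO-CR3 exit */
    unsigned long flushes;  /* counts would-be hvm_asid_flush_vcpu() calls */
};

/* Models the top of the VM exit handler: clear the flag, then classify. */
void model_vmexit(struct vcpu_model *v, enum exit_reason reason)
{
    assert(v != NULL);
    v->movtocr3 = false;
    if (reason == EXIT_MOV_TO_CR3)
        v->movtocr3 = true;
}

/* Models the CR3 branch of the guest CR update: flush unless MOV-TO-CR3. */
void model_update_guest_cr3(struct vcpu_model *v)
{
    assert(v != NULL);
    if (!v->movtocr3)
        v->flushes++;
}
```

With this model, a CR3 update that follows a MOV-TO-CR3 exit leaves the
flush counter untouched, while an update reached through any other exit
increments it, which is the behaviour the experiment relied on.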

I've CC'd the VMX maintainers to see what their perspective is on this.


