[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Question about VPID during MOV-TO-CR3



On Tue, Sep 27, 2016 at 7:49 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>> On 26.09.16 at 18:12, <tamas.lengyel@xxxxxxxxxxxx> wrote:
>> On Mon, Sep 26, 2016 at 12:24 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>> On 23.09.16 at 22:45, <tamas.lengyel@xxxxxxxxxxxx> wrote:
>>>> On Fri, Sep 23, 2016 at 9:50 AM, Tamas K Lengyel
>>>> <tamas.lengyel@xxxxxxxxxxxx> wrote:
>>>>> On Fri, Sep 23, 2016 at 9:39 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>>>>> On 23.09.16 at 17:26, <tamas.lengyel@xxxxxxxxxxxx> wrote:
>>>>>>> On Fri, Sep 23, 2016 at 2:24 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>>>>>>> On 22.09.16 at 19:18, <tamas.lengyel@xxxxxxxxxxxx> wrote:
>>>>>>>>> So I verified that when CPU-based load exiting is enabled, the TLB
>>>>>>>>> flush here is critical. Without it the guest kernel crashes at random
>>>>>>>>> points during boot. OTOH why does Xen trap every guest CR3 update
>>>>>>>>> unconditionally? While we have features such as the vm_event/monitor
>>>>>>>>> that may choose to subscribe to that event, Xen traps it even when
>>>>>>>>> that is not in use. Is that trapping necessary for something else?
>>>>>>>>
>>>>>>>> Where do you see this being unconditional? construct_vmcs()
>>>>>>>> clearly avoids setting these intercepts when using EPT. Are you
>>>>>>>> perhaps suffering from
>>>>>>>>
>>>>>>>>             /* Trap CR3 updates if CR3 memory events are enabled. */
>>>>>>>>             if ( v->domain->arch.monitor.write_ctrlreg_enabled &
>>>>>>>>                  monitor_ctrlreg_bitmask(VM_EVENT_X86_CR3) )
>>>>>>>>                 v->arch.hvm_vmx.exec_control |= 
>>>>>>>> CPU_BASED_CR3_LOAD_EXITING;
>>>>>>>>
>>>>>>>> in vmx_update_guest_cr()? That'll be rather something for you
>>>>>>>> or Razvan to explain. Outside of nested VMX I don't see any
>>>>>>>> other enabling of that intercept (didn't check AMD code on the
>>>>>>>> assumption that you're working on Intel hardware).
>>>>>>>
>>>>>>> So there seems to be two separate paths that lead to the TLB flushing.
>>>>>>> One is indeed the above case you cited when we enable CR3 monitoring
>>>>>>> through the monitor interface. However, during domain boot I also see
>>>>>>> this path being called that is not related to the
>>>>>>> CPU_BASED_CR3_LOAD_EXITING:
>>>>>>>
>>>>>>> (XEN) hap.c:739:d1v0 hap_update_paging_modes is calling hap_update_cr3
>>>>>>> (XEN) hap.c:701:d1v0 HAP update cr3 called
>>>>>>> (XEN) /src/xen/xen/include/asm/hvm/hvm.h:344:d1v0 HVM update guest cr3
>>>> called
>>>>>>> (XEN) vmx.c:1549:d1v0 Update guest CR3 value=0x7a7c4000
>>>>>>>
>>>>>>> This path seems to de-activate once the domain is fully booted.
>>>>>>
>>>>>> This late? According to the CR0 handling in
>>>>>> vmx_update_guest_cr() I would understand it to be enabled only
>>>>>> while the guest is still in real mode (and even then only on old
>>>>>> hardware, i.e. without the Unrestricted Guest functionality).
>>>>>>
>>>>>
>>>>> Right, with unrestricted guest support I would assume none of this
>>>>> would get called - but it does, and quite frequently during domain
>>>>> boot. The CPU is a Intel(R) Xeon(R) CPU E5-2430.
>>>>>
>>>>
>>>> So I experimented with selectively disabling the flushing such that
>>>> it's done only when coming from a path other then CPU-based CR3 load
>>>> exiting. I've added a bool to struct vcpu that gets set to 0 every
>>>> time vmx_vmexit_handler is called, and only gets set to 1 when
>>>> vmx_cr_access reports a MOV-TO-CR3. Then in the vmx_update_guest_cr
>>>> the flush only happens as such:
>>>>
>>>>         if ( !v->movtocr3 )
>>>>             hvm_asid_flush_vcpu(v);
>>>>
>>>> In the guest I run a test application that allocates a page at a fixed
>>>> VA, writes a magic value to it, and then keeps spinning on reading the
>>>> magic value back from the page, checking if it's the same as
>>>> originally supplied. I lunch this application twice with different
>>>> magic values, so that if the TLB invalidation is an issue one of the
>>>> test applications would read back the wrong magic value from the VA
>>>> using a stale TLB entry. I've verified that same VA in the two
>>>> applications point to different pages and that those PTEs are not
>>>> marked global and no PCID is used.
>>>>
>>>> [  724] test (struct addr:ffff88003730f330). PGD: 0x3731f000
>>>> VADDR 0x5000000 -> PADDR 0x73e35000. Global page: 0
>>>> [  727] test (struct addr:ffff88003681ea20). PGD: 0x777a6000
>>>> VADDR 0x5000000 -> PADDR 0x75043000. Global page: 0
>>>
>>> I'm surprised. As said before - a mov-to-CR3 cannot be emulated
>>> without a minimal amount of flushing. No experiments whatsoever
>>> are suitable to prove the contrary.
>>
>> That's a pretty strong statement - can you tell me where in the SDM
>> does it say that exactly? I've went through it couple times already
>> and I can't find anything that explicitly says that the flushing has
>> to be performed by the VMM when mov-to-CR3 trapping is enabled.
>
> I though I had pointed you there already: Section "Instructions
> that cause VM exits". There's nothing said about flushes, but that's
> also not necessary: "... the instruction causing the VM exit does not
> execute and no processor state is updated by the instruction." Plus
> everything the sub-section "Relative Priority of Faults and VM Exits"
> says.
>
>> The
>> closest thing I found was indicating the contrary. Furthermore, if the
>> flushing is necessary, then how would you explain that there were no
>> TLB mixups in the above experiment?
>
> No idea. Perhaps there is some further flushing going on due to
> other reasons?

I've been digging more into this issue and indeed there are many
unrelated VPID flushes happening. One source of such VPID flush has
been SMP migration which is an obvious case that must use new tags.
Pinning the vCPU to a pCPU makes these flushes go away, as expected.
However, I've found two other sources that need more attention:

In x86/flushtlb.c the function flush_area_local invalidates all guest
TLBs as such:

 if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) )
    {
        if ( order == 0 )
        {
...
        }
        else
        {
            u32 t = pre_flush();
            unsigned long cr4 = read_cr4();

            hvm_flush_guest_tlbs();

This flush here to me seems to be only warranted when FLUSH_TLB_GLOBAL
is requested. However, since it is being called for simple TLB flush
requests as well it results in guest tlb flushes very often. When I
change the behavior to only issue this flush when the FLUSH_TLB_GLOBAL
flag is set, the number of flushes issued to all guests is
significantly reduced. So the question is, is this just an oversight
that should be fixed?

The other flush comes from the function write_cr3 also in
x86/flushtlb.c, which was introduced in the patch "[HVM][SVM] flush
all entries from guest ASIDs when xen writes CR3." commit id
eed63189dabd90abe422b0e94ab8854783329bed. From the commit message
however it is not entirely clear to me what exactly warrants having to
flush HVM guest TLBs and how that relates to shadow code. Commenting
this flush out made no difference to the guest or dom0, everything
works as expected. Of course, without understanding the real reason
for why this flush is here it is hard to judge whether this change
(re-)introduces some cornercase issue. It is worth noting this was
added even before VPID was introduced, so we might want to check
whether it is still required. AFAICT flushing the VPID in this case is
fine.

The fourth source for VPID flushing originates from MOV-TO-CR4 when
the guest changes certain flags related to paging. For example, during
boot Linux flips the PGE bit to force a complete TLB flush. This is
the source for the update_paging_mode calls I observed earlier, that
also performs a VPID flush. After Linux boots,these flushes go away.
Windows continues to use this method after boot as well.

Now, with these modifications performed and having Linux booted
completely there are no more spurious guest TLB flushes with VPID -
both the asid generation and the core asid generation is stable. I
repeated the experiment above with the two processes using the same VA
and enabled mov-to-cr3 trapping. There are no VPID flushes happening
as expected. The processes continue to run normally and no stale TLB
entries are observed.

>>>> Both applications work as expected without the VPID flushing taking
>>>> place. So at least for CPU-based CR3 load exiting it seems that this
>>>> flush is not necessary. As for why this path gets called during domain
>>>> boot when the CPU supports Unrestricted Guest mode and it is properly
>>>> detecting when Xen boots, I'm not sure. However, as we use CPU-based
>>>> CR3 load exiting quite often when doing VMI, I would prefer to disable
>>>> this flushing at least for this case. Any thoughts?
>>>
>>> As said before - you'd better direct this question to the VMX
>>> maintainers, and even better would be to first understand why
>>> the intercept remains enabled in the first place. After all it's
>>> quite obvious that most improvement can be expected from not
>>> enabling it at all, whenever possible. Only if it needs to stay
>>> enabled over extended periods of a guest's lifetime it would then
>>> become interesting to see whether the emulation path can be
>>> improved.
>>>
>>
>> To clarify - mov-to-CR3 trapping is _not_ enabled by default on a
>> domain. I assumed it is the only path to vmx_update_guest_cr, but I
>> now further verified that vmx_cr_access does not get called for a
>> mov-to-CR3 when the domain boots, it only gets called when we enable
>> it through the monitor system. There is another path leads to a call
>> to vmx_update_guest_cr for updating CR3 when the domain boots which
>> seems to require this flushing to happen. That other path I don't care
>> about - although it's rather odd in itself as well. Now when the
>> mov-to-CR3 path gets activated the flushing does not seem to be
>> necessary as my experiment shows and it actually actively breaks
>> architectural features (global pages and PCID).
>
> Once again - it does not break anything. Performance aspects are
> not architectural features. All you can say is that it makes these
> extended features useless.
>

Sure that's fair to say. Still worth exploring it a bit more in detail
as the performance benefits gained from tagged TLB features seems to
be fairly limited right now.

Tamas

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.