
Re: [Xen-devel] [PATCH 5/5] x86/HVM: cache emulated instruction for retry processing



Egger, Christoph wrote on 2014-01-07:
> On 07.01.14 09:54, Zhang, Yang Z wrote:
>> Jan Beulich wrote on 2014-01-07:
>>>>>> On 24.12.13 at 12:29, "Zhang, Yang Z" <yang.z.zhang@xxxxxxxxx> wrote:
>>>> Jan Beulich wrote on 2013-12-18:
>>>>>>>> On 18.12.13 at 10:40, "Zhang, Yang Z" <yang.z.zhang@xxxxxxxxx> wrote:
>>>>>> Jan Beulich wrote on 2013-12-18:
>>>>>>>>>> On 18.12.13 at 09:36, "Zhang, Yang Z" <yang.z.zhang@xxxxxxxxx> wrote:
>>>>>>>> Jan Beulich wrote on 2013-09-30:
>>>>>>>>> Rather than re-reading the instruction bytes upon retry
>>>>>>>>> processing, stash away and re-use what we already read. That
>>>>>>>>> way we can be certain that the retry won't do something
>>>>>>>>> different from what requested the retry, getting once again
>>>>>>>>> closer to real hardware behavior (where what we use retries for
>>>>>>>>> is simply a bus operation, not involving redundant decoding of
>>>>>>>>> instructions).
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> This patch doesn't consider the nested case.
>>>>>>>> For example, if the buffer saved an L2 instruction and we then
>>>>>>>> vmexit to L1, L1 may use the wrong instruction.
>>>>>>> 
>>>>>>> I'm having difficulty seeing how the two could get intermixed:
>>>>>>> There should be, at any given point in time, at most one
>>>>>>> instruction being emulated. Can you please give a more
>>>>>>> elaborate explanation of the situation where you see a
>>>>>>> (theoretical? practical?) problem?
>>>>>> 
>>>>>> I saw this issue when booting L1 hyper-v. I added some debug
>>>>>> info and observed this strange phenomenon:
>>>>>> 
>>>>>> (XEN) write to buffer: eip 0xfffff8800430bc80, size 16,
>>>>>> content:f7420c1f608488b 44000000011442c7
>>>>>> (XEN) read from buffer: eip 0xfffff800002f6138, size 16,
>>>>>> content:f7420c1f608488b 44000000011442c7
>>>>>> 
>>>>>> From the log, we can see different eip values but the same buffer
>>>>>> contents. Since I don't know how hyper-v works, I cannot give more
>>>>>> information on why this happens. And I only saw it with L1 hyper-v
>>>>>> (Xen on Xen and KVM on Xen don't have this issue).
>>>>> 
>>>>> But in order to validate that the fix is (a) needed and (b) correct,
>>>>> a proper understanding of the issue is necessary as the very first
>>>>> step. That doesn't require knowing the internals of Hyper-V; all you
>>>>> need is tracking of emulator state changes in Xen (which I realize is
>>>>> easier said than done, since you want to make sure you don't generate
>>>>> huge amounts of output before actually hitting the issue, making it
>>>>> close to impossible to analyze).
>>>> 
>>>> Ok. It should be an old issue that is merely exposed by your patch:
>>>> Sometimes L0 needs to decode an L2 instruction to handle an IO access
>>>> directly, for example if L1 passes a device through (without VT-d) to
>>>> L2. L0 may get X86EMUL_RETRY when handling this IO request. At the
>>>> same time, if there is a virtual vmexit pending (for example, an
>>>> interrupt to be injected into L1), the hypervisor will switch the
>>>> VCPU context from L2 to L1. We are now in L1's context, but since we
>>>> got X86EMUL_RETRY just before, the hypervisor will retry the IO
>>>> request later, and unfortunately that retry happens in L1's context.
>>>> Without your patch, L0 simply emulates an L1 instruction and
>>>> everything goes on. With your patch, L0 takes the instruction from
>>>> the buffer, which belongs to L2, and the problem arises.
>>>>
>>>> So the fix is: if there is a pending IO request, no virtual
>>>> vmexit/vmentry is allowed, which follows hardware's behavior.
>>>> 
>>>> diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
>>>> index 41db52b..c5446a9 100644
>>>> --- a/xen/arch/x86/hvm/vmx/vvmx.c
>>>> +++ b/xen/arch/x86/hvm/vmx/vvmx.c
>>>> @@ -1394,7 +1394,10 @@ void nvmx_switch_guest(void)
>>>>      struct vcpu *v = current;
>>>>      struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v);
>>>>      struct cpu_user_regs *regs = guest_cpu_user_regs();
>>>> +    ioreq_t *p = get_ioreq(v);
>>>> 
>>>> +    if ( p->state != STATE_IOREQ_NONE )
>>>> +        return;
>>>>      /*
>>>>       * a softirq may interrupt us between a virtual vmentry is
>>>>       * just handled and the true vmentry. If during this window,
>>> 
>>> This change looks much more sensible; question is whether simply
>>> ignoring the switch request is acceptable (and I don't know the
>>> nested HVM code well enough to tell). Furthermore I wonder whether
>>> that's really a VMX-only issue.
>> 
>> From hardware's point of view, an IO operation is handled atomically,
>> so the problem can never happen there. But in Xen, an IO operation is
>> divided into several steps. Without nested virtualization, the VCPU is
>> paused or loops until the IO emulation is finished. In a nested
>> environment, however, the VCPU will continue running even though the IO
>> emulation is not finished. So my patch checks this and allows the VCPU
>> to continue running only when there is no pending IO request, which
>> matches hardware's behavior.
>>
>> I guess SVM also has this problem, but since I don't know nested SVM
>> well, perhaps Christoph can help double-check.
> 
> For SVM this issue was fixed with commit
> d740d811925385c09553cbe6dee8e77c1d43b198
> 
> And there is a followup cleanup commit
> ac97fa6a21ccd395cca43890bbd0bf32e3255ebb
> 
> The change in nestedsvm.c in commit
> d740d811925385c09553cbe6dee8e77c1d43b198 is actually not SVM specific.
> 
> Move that into nhvm_interrupt_blocked() in hvm.c right before
> 
>     return hvm_funcs.nhvm_intr_blocked(v);
> and the fix applies to both SVM and VMX.
> 
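If I understand the suggestion correctly, it would amount to something
roughly like the sketch below (untested; I reuse the same ioreq test as
in my vvmx.c diff above, and I am guessing at the blocking reason to
return, the actual SVM commit may use a different one):

    enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v)
    {
        ioreq_t *p = get_ioreq(v);

        /* Keep interrupts "blocked" while an IO request emulated by L0
         * is still in flight, so no virtual vmexit gets injected in the
         * middle of the emulation. */
        if ( p && p->state != STATE_IOREQ_NONE )
            return hvm_intblk_shadow;  /* guessed blocking reason */

        return hvm_funcs.nhvm_intr_blocked(v);
    }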

I guess this is not enough. L2 may vmexit to L1 during IO emulation for
reasons other than an interrupt. I cannot give a concrete example right
now, but hyper-v cannot boot up with your suggestion, so I guess
considering only external interrupts is not enough. We should prohibit
the vmswitch if there is a pending IO emulation, whether it comes from
L1 or from L2 (the L1 case may never happen, but to match hardware's
behavior we should add the check).
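
Something along these lines is what I have in mind (untested sketch; the
helper name is made up, and the nested SVM switch path would need the
same call as nvmx_switch_guest()):

    /* Generic helper: is an IO request for this vcpu still being
     * emulated by L0?  Same test as in the vvmx.c diff above.  To be
     * called at the top of nvmx_switch_guest() and of the equivalent
     * nested SVM switch path, bailing out (i.e. deferring the virtual
     * vmexit/vmentry) when it returns true. */
    static bool_t nhvm_vcpu_io_pending(struct vcpu *v)
    {
        ioreq_t *p = get_ioreq(v);

        return p && p->state != STATE_IOREQ_NONE;
    }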

Best regards,
Yang


