
Re: [Xen-devel] [PATCH V2] x86/emulate: synchronize LOCKed instruction emulation



>>> On 23.03.17 at 11:21, <rcojocaru@xxxxxxxxxxxxxxx> wrote:
>>>>>> Sadly, I've now written this (rough) patch:
>>>>>>
>>>>>> http://pastebin.com/3DJ5WYt0 
>>>>>>
>>>>>> only to find that it does not solve our issue. With multiple processors
>>>>>> per guest and heavy emulation at boot time, the VM got stuck at roughly
>>>>>> the same point in its life as before the patch. Looks like more than
>>>>>> CMPXCHG needs synchronizing.
>>>>>>
>>>>>> So it would appear that the smp_lock patch is the only way to bring this
>>>>>> under control. Either that, or my patch misses something.
>>>>>> Single-processor guests work just fine.
>>>>>
>>>>> Well, first of all, the code to return RETRY is commented out. I
>>>>> would guess that you've tried with it uncommented, but I
>>>>> can't be sure.
>>>>
>>>> Indeed I have tried with it on, with the condition for success being at
>>>> first "val != old", and then, as it is in the pasted code, "val == new".
>>>> Both of these caused BSODs and random crashes. The guest only boots
>>>> without any apparent issues (and with 1 VCPU only) after I've stopped
>>>> returning RETRY.
>>>
>>> "val == new" is clearly wrong: Whether to retry depends on whether
>>> the cmpxchg was unsuccessful.
>> 
>> You're right: it looks like using "val == old" as the test for success
>> (which is of course the only correct test) and returning RETRY on
>> failure works (though I'm still seeing an occasional BSOD; I'll need to
>> test it carefully and see what's going on).
> 
> I've made a race more probable by resuming the VCPU only after
> processing all the events in the ring buffer and now all my Windows 7
> 32-bit guests BSOD with IRQL_NOT_LESS_OR_EQUAL every time at boot with
> the updated patch.
> 
> Changing the guests' configuration to only use 1 VCPU again solves the
> issue, and also resuming the VCPU after treating each vm_event makes the
> BSOD much less likely to appear (hence the illusion that the problem is
> fixed).

Well, I can only repeat: We need to understand what it is that
goes wrong.

> As for an instruction crossing a page boundary, I'm not sure how that
> would affect things here - we're simply emulating whatever instructions
> cause page faults on interesting pages at boot; we're not yet preventing
> writes at this point. And if that were the issue, would it have been
> fixed (as it is) by the smp_lock version of the patch?

Yes, it likely would have been: The problem with your code is that
you map just a single page even when the original access crosses
a page boundary, yet then access the returned pointer as if you
had mapped two adjacent pages.
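The condition being described can be sketched as a simple predicate (the
constants and the function name are illustrative, not taken from the patch
under review):

```c
#include <stdint.h>

#define PAGE_SIZE 4096u
#define PAGE_MASK (PAGE_SIZE - 1)

/* An access of 'bytes' bytes starting at guest address 'addr' crosses a
 * page boundary when the first and last byte fall in different pages.
 * A correct mapping path must then map two adjacent pages (or split the
 * access into two), rather than mapping only the first page and reading
 * or writing past its end. */
static int crosses_page(uint64_t addr, unsigned int bytes)
{
    return (addr & PAGE_MASK) + bytes > PAGE_SIZE;
}
```

For example, an 8-byte access at offset 0xffc within a page crosses into
the next page, while the same access at offset 0xff8 does not.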

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel