Xen project Mailing List

Re: [Xen-devel] nvmx deadlock with MSR bitmaps

To: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>

From: Jürgen Groß <jgross@xxxxxxxx>

Date: Thu, 12 Mar 2020 16:20:22 +0100

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Paul Durrant <paul@xxxxxxx>

Delivery-date: Thu, 12 Mar 2020 15:20:28 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 12.03.20 16:11, Jan Beulich wrote:

On 12.03.2020 14:44, Roger Pau Monné wrote:

On Thu, Mar 12, 2020 at 12:12:12PM +0100, Jürgen Groß wrote:

On 12.03.20 11:56, Roger Pau Monné wrote:

On Thu, Mar 12, 2020 at 09:59:48AM +0100, Jan Beulich wrote:

On 11.03.2020 19:04, Andrew Cooper wrote:

Specifically, this is a switch from an HVM vcpu, to a PV vcpu, where the
mapcache code tries to access the per-domain mappings on the HVM monitor
table.  It ends up trying to recursively acquire the mapcache lock while
trying to walk %cr2 to identify the source of the fault.

For nvmx->msr_merged, this needs to either be a xenheap page, or a
globally mapped domheap page.  I'll draft a patch in a moment.

For map_domain_page(), is there anything we can rationally do to assert
that it isn't called in the middle of a context switch?  This is the
kind of thing which needs to blow up reliably in a debug build.


Well, it's not inherently unsafe to do, it's just that
mapcache_current_vcpu() would need to avoid using current from
context_switch()'s call to set_current() through to
__context_switch()'s call to write_ptbase(). A possible
detection (if we don't want to make the case work) would
seem to be ASSERT(current == this_cpu(curr_vcpu)). But of course
there's also this extra logic in mapcache_current_vcpu() to deal
with a PV vCPU having a null v->arch.guest_table, which I'm once
again struggling to see under what conditions it might happen.
The Dom0 building case can't be meant with there being
mapcache_override_current() on that path. I'm wondering if the
comment there is misleading and it's really to cover the case
where, coming from a PV vCPU, current was already set to the
idle vCPU by context_switch() (which would have a null
v->arch.guest_table) - I wouldn't call this "we are running a
paravirtualised guest". But in such a case the logic here would
simply be a (too) special case of what you're describing as the
issue with nVMX.


Looking at the code in context_switch and __context_switch would it be
possible to set current to the next vCPU after all the from hooks have
been called?

Ie: set_current could be moved into __context_switch after the call to
pd->arch.ctxt_switch->from(p).


No, wouldn't work. When switching to idle __context_switch() is normally
not called in order to avoid switching the address space in case the
same vcpu will be scheduled again after idle. This is the reason why
current and curr_vcpu can be different.


Since the idle vCPU context switch is already a special case we could
maybe place the call to set_current in such special handling, while
leaving the call for the rest of vCPUs in __context_switch after the
->from hooks have been executed?

I'm also not sure I understand the difference between context_switch
and __context_switch, and how are callers supposed to use them.


Jan points out that __context_switch signals a pending context switch,
in which case my proposal to use set_current might not be suitable, as
we would be changing current without actually doing the context
switch?

I also wonder why __context_switch then needs to call the ->from hook
just to signal a pending context switch. It seems like
__context_switch does a lot of work just to signal a context switch,
which will then be redone when context_switch is actually called?


Well, "signal" was perhaps not the best choice of a word.

Basically you can say that context_switch() is switching vcpus logically (Xen internal bookkeeping), while __context_switch() is switching the runtime environment (address space, GDT settings, cr4, ...). Juergen _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.