[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] x86/PV32: restore PAE-extended-CR3 logic


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Tue, 4 Apr 2023 21:40:44 +0100
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=hKY0pwn6ZmXRuUf4xs2ZJ/Z+kpIVH1u23Ydggy2dA1g=; b=ELbxaJHSD+s/lc2W70HNi8YXaJhEjXt7TNn15AydiYWBTaCBHCFdB3al3VJxwNK78c1JhYD3DIMMzOqbYKwf8lMZ9hNyqydt6Lwr108wh98nH//CXTJTD+PBd26UT1vZ1EyarcHJqRWJP2ez0EnXJ4xzt56ycDfNnE11kV+wLR4suCPfHs0ha/qxWMJa10iGom50AEjJ9DsaGWCuSXZWpiRHZOzJ2ezPupZyILaQHyr1V3FicfWZhYmhrMH4lHxzpR6xkD5YMuEH6evyDQL8zh3RlyUF+ikwUfi4aCoPYeywFBqxO5r/s7xPcDP7MXNJ1PP0CyBWbBfMa51nH21W0Q==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=oDD7Lo4ulL5XlkD7IBdPqELv80gFlCsQk5bIvQqJUC0S/iZ49ZmUwIYxBEOW8n/BdnD3Kyyklo7ZwlAIZ+gWZqp3t21LyQwClzbmfpe6/jpGDFoG/V5wjy8kJ+ZBd3ywbo+lfnb9disbQKF0arcCYyL1KGSca2qWlbyDY0itKQaw8DprHheZruf/gaIuJ+r1BHW72sp+FzT8NWIrD9H+xihupgpALEXglqx7YwIjJNpt6LbXDLQ46ol4cC3JGLSlx4zItTAnTcWLH6pceuknR/1nFmOi613JGX4cSWoINa9bi9pP8IJMQg16B1/yIWfjiCbV0gWDgfAuJ7+cdyQHDg==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Tue, 04 Apr 2023 20:41:18 +0000
  • Ironport-data: A9a23:ErVnjq23vtZHkFjmCfbD5fFwkn2cJEfYwER7XKvMYLTBsI5bpzMAz 2VKUWGAMqyIZjOje9Bwbti18BwCuZ7UydFrHFY5pC1hF35El5HIVI+TRqvS04F+DeWYFR46s J9OAjXkBJppJpMJjk71atANlVEliefTAOK6ULWeUsxIbVcMYD87jh5+kPIOjIdtgNyoayuAo tq3qMDEULOf82cc3lk8tuTS+HuDgNyo4GlD5gBmOagR1LPjvyJ94Kw3dPnZw0TQGuG4LsbiL 87fwbew+H/u/htFIrtJRZ6iLyXm6paLVeS/oiI+t5qK23CulQRrukoPD9IOaF8/ttm8t4sZJ OOhF3CHYVxB0qXkwIzxWvTDes10FfUuFLTveRBTvSEPpqFvnrSFL/hGVSkL0YMkFulfAlpk3 NVJdgk2Y06gmPC9nre4Vusyv5F2RCXrFNt3VnBI6xj8VapjbbWdBqLA6JlfwSs6gd1IEbDGf c0FZDFzbRPGJRpSJlMQD5F4l+Ct7pX9W2QA9BTJ+uxouC6PlmSd05C0WDbRUvWMSd9YgQCzo WXe8n6iKhobKMae2XyO9XfEaurnxHunA9xKTO3inhJsqGfL9jFCWCI9bgOQouuZrWj9Ve8Bd nVBr0LCqoB3riRHVOLVXRe1vXqFtR40QMdLHqsx7wTl4rXQyxaUAC4DVDEpQMwrsoo6SCIn0 neNnsj1Hnp/vbuNU3Wf+7yI6zSoNkAowXQqYCYFSU4J5oflqYRr0hbXFI4/Suiyk8H/Hiz2z 3aSti8iir4PjMkNkaKm4VTAhDHqrZ/MJuIo2jjqsquexlsRTOaYi0aAszA3Md4owF6lc2S8
  • Ironport-hdrordr: A9a23:xw4ARKiWntTJhn2buqE+FBoHDXBQXgYji2hC6mlwRA09TyX4rb HUoB1/73TJYVkqNk3I9ersBEDCewK5yXcN2+gs1O6ZPDUO21HYTr2Kj7GSuwEIcheWnoRgPM FbAs1D4bbLYmSS4/yX3OD2KadG/DArytHPuc7Oi11WZUVBbaV46gdwDQyWVndxWBJNCfMCZf mhD4581kOdRUg=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 04/04/2023 3:21 pm, Jan Beulich wrote:
> On 04.04.2023 15:08, Andrew Cooper wrote:
>> On 15/02/2023 2:54 pm, Jan Beulich wrote:
>>> While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
>>> applies to guests also when run on a 64-bit hypervisor:
>> Is this really true?  Even when looking at Xen 4.2, 32bit guests are
>> required to pass a full 4k page, not a 32b quad.
> The full-page vs 32b-quad aspect is orthogonal. This VM-assist is solely
> about where that data structure is, not what size it is.
>
>> Which makes complete sense.  It was a hard requirement of 32bit non-PAE
>> guests, so it was a natural restriction to maintain into 32bit PAE guests.
>>
>> This is *only* a 32-on-64 issue, because this is the only case a 32bit
>> guest could in principle use an L3 placed above the 4G boundary.
> Not exactly. 32-bit Xen maintained a 4-entry "shadow" array below 4G
> that it would copy (massage) the guest entries into upon CR3 reload
> (just look for struct pae_l3_cache in the old sources). So above-4G
> page table base was possible there as well.

Oh eww, so while Xen never gained an optimisation to permit only a 32b
quad in place of a full 4k L3 table, it did support having the full
tables higher.

(This code is especially hard to follow with #ifdefary in the common
mm.c when there are perfectly good x86_{32,64}/mm.c's to use for
differing function implementations...)

>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -1520,6 +1520,23 @@ static int promote_l3_table(struct page_
>>>      unsigned int   partial_flags = page->partial_flags;
>>>      l3_pgentry_t   l3e = l3e_empty();
>>>  
>>> +    /*
>>> +     * PAE pgdirs above 4GB are unacceptable if a 32-bit guest does not
>>> +     * understand the weird 'extended cr3' format for dealing with 
>>> high-order
>>> +     * address bits. We cut some slack for control tools (before vcpu0 is
>>> +     * initialised).
>>> +     */
>>> +    if ( is_pv_32bit_domain(d) &&
>>> +         unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
>>> +         mfn_x(l3mfn) >= 0x100000 &&
>>> +         d->vcpu[0] && d->vcpu[0]->is_initialised )
>>> +    {
>>> +        gdprintk(XENLOG_WARNING,
>>> +                 "PAE pgd must be below 4GB (%#lx >= 0x100000)",
>>> +                 mfn_x(l3mfn));
>>> +        return -ERANGE;
>>> +    }
>> Having dug through source history, I see this is largely the form that
>> it used to be.
>>
>> But I'm unconvinced by the "cut control tools some slack".  I'm quite
>> tired of different bits of Xen taking on unnecessary complexity because
>> people are unwilling to fix the problem at the correct layer.
> But anything tools do before having created the first vCPU would not
> have had any means to engage the VM-assist. I.e. ...
>
>> A toolstack which has non-pae_extended_cr3 guest on its hand will know
>> this before any pagetables get allocated.
> ... this knowledge buys it nothing: It would need to move the table
> to below 4G irrespective of knowing that the guest can deal with
> bigger addresses, just to get past this check.

This just goes from bad to worse.  It is mad that the VMASSIST flags
can't be set ahead of a vcpu initialise hypercall.

But.

The code in xg_dom_x86.c unconditionally moves the L3 below the 4G
boundary, so the thing actually pinned as an L3 will always pass the check.

Which is just as well because it too blindly applies the extended-cr3
transform momentarily after conditionally setting
VMASST_TYPE_pae_extended_cr3...

So a 32bit PV guests will pass the check irrespective of their
pae_extended_cr3 setting.


I note looking at this code that it's absurd.  The single L3 ought to be
allocated with memf(32), rather than being allocated regularly and then
reallocated lower if it happened to be too high (which will be the
increasingly common case as it's getting harder and harder to find
systems with <4G of RAM).  Making the memf conditional on the
pae_extended_cr3 needs to come with some better xen<->tools APIs.

>> For this check specifically, I'd suggest prohibiting non-32p guests from
>> setting pae_extended_cr3 in the first place (I see no limit currently),
>> and then simplifying the check to just
>>
>> if ( unlikely(!VM_ASSIST(d, pae_extended_cr3)) &&
>>      mfn_x(l3mfn) >= PFN_DOWN(GB(4)) )
> Dropping the is_pv_32bit_domain() check isn't possible because we can't,
> all of the sudden, fail 64-bit guests' requests to enable this VM-
> assist (no matter that we know that it is of no use to them).

I'm not so sure about this.  This VMASSIST cannot credibly be set at
runtime, and making a restriction here is not usefully different from
prior patches of yours that relax checks in Xen that still break on
older builds.

But as I know you're going to argue with that position, I'll at least
note that ignoring a 64bit guest's request to set that bit would be less
bad than the current behaviour.

> Dropping
> the control-tools part of the condition is at least problematic as well,
> as per above. Albeit I'll admit I didn't check whether nowadays vCPU 0
> is initialized before page tables are built. But I think it's more
> sensible the other way around: CR3 setting (in the hypervisor) is less
> involved when the page was already validated as an L3 one.

All of this is before the guest starts running, so it doesn't matter.

The most efficient way (from Xen's point of view) is to pin the L1s,
then L2s, then L3s and then set vCR3, because this is the only order
where we don't have to do do recursive type acquisition.

But, the most efficient way for the toolstack to do this is the opposite
way around, because making Xen do recursive type acquisition is faster
than other ways, and turns all subsequent hypercalls into almost no-ops.

I doubt there is a relevant difference between these two approaches.


And it doesn't matter either.  The check won't ever trip from domain
creation (see above), nor from migration (we set vcpu context before
pinning the pagetables, and a non pae_extended_cr3 will have exploded on
the source side).

So there really are no toolstack codepaths that can trip the check. 
Future improvements that might trip the check can come with a less
broken hypercall as a prerequisite.

~Andrew



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.