[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/4] xen: Introduce non-broken hypercalls for the p2m pool size


  • To: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 27 Oct 2022 09:11:44 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=AGA40ZeBhhgDlmH+qfaS5dlLFuP3Lys15u+4bzXL95I=; b=VFMjFpp1/PePaDg2J7VoHHkCuF+6c0Lve3ySdzOtW9nPp4TX7Pi9qWo84NXdphTNMWUhU1Hylb0+3Nks+a/b8GLgdOQ8EqIljs8myLywQ0phRP8UQlQqOvrOsG3BIJiNpU+CQr46VofmuSBym1BR77R83TE2ZO9jycvFfuRtY0o9bdReZgub2qpCvhdzyNuvCRLyheuLuJCcI1HSjCuQTFsWWO2KyRBCccVuuYU1klyB8CqKKW8+m0ADWGd+et+kGNyCOyGyxkTqRmsP+kKneic5u2tWnHahLBv2RsOVjvZnFofnW4BeIowRxdbnvXAepA2WcVW0V8k7dd2w8hVsIw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=PaqXtTgVGE2OSJBIlnH6PYiufJqWKOWKfUZicw7nSx3kToFw1OdkEJRwZu4Md8PnZJuHo18FDpMRp58FypBhBZrddUOp9zvkUozfsaGYUCH0BftN7oOpsd4yZbJd+2c9x85+2wmdM0AbOox1JdQR0hGiPH8XK1vabkZGxuupBzWWLLHk1KbJ1NqAlwhaipZdIW/RBFeIAf1N0Dg74PtElPY4xrqdnKazFKKPHMEW8xKp3VKiMf+YfS0lvAtozB/k2xnlrix48IOOZrdbAWDMcE8Aga05m7TEHY4V5MVN43KZTJiO2qyV1QlEkejLcf0FVdRHiezIgmBKU26RzSg1oA==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: Roger Pau Monne <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Henry Wang <Henry.Wang@xxxxxxx>, Anthony Perard <anthony.perard@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 27 Oct 2022 07:11:59 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 26.10.2022 21:22, Andrew Cooper wrote:
> On 26/10/2022 14:42, Jan Beulich wrote:
>> On 26.10.2022 12:20, Andrew Cooper wrote:
>>> The existing XEN_DOMCTL_SHADOW_OP_{GET,SET}_ALLOCATION have problems:
>>>
>>>  * All set_allocation() flavours have an overflow-before-widen bug when
>>>    calculating "sc->mb << (20 - PAGE_SHIFT)".
>>>  * All flavours have a granularity of of 1M.  This was tolerable when the 
>>> size
>>>    of the pool could only be set at the same granularity, but is broken now
>>>    that ARM has a 16-page stopgap allocation in use.
>>>  * All get_allocation() flavours round up, and in particular turn 0 into 1,
>>>    meaning the get op returns junk before a successful set op.
>>>  * The x86 flavours reject the hypercalls before the VM has vCPUs allocated,
>>>    despite the pool size being a domain property.
>> I guess this is merely a remnant and could easily be dropped there.
> 
> It's intermixed the other shadow operations.  It wasn't trivially-safe
> enough to do here, and needs coming back to in future work.

Right, and I should have said that this is merely a remark, not a request
for any change here.

>>> Name subject to improvement.
>> paging_{get,set}_mempool_size() for the arch helpers (in particular
>> fitting better with them living in paging.c as well its multi-purpose use
>> on x86) and XEN_DOMCTL_{get,set}_paging_mempool_size? Perhaps even the
>> "mem" could be dropped?
> 
> Yeah, this was a placeholder for "what are we actually going to call it
> in Xen".
> 
> I went with mempool over just simply pool because pool has a very
> different meaning slightly higher in the toolstack where you talk about
> pools of servers.  Admittedly, that's code outside of xen.git, but the
> hypercall names do percolate up into those codebases.
> 
> paging isn't a great name.  While it's what we call the infrastructure
> in x86, it has nothing to do with paging things out to disk (the thing
> everyone associates the name with), nor the xenpaging infrastructure
> (Xen's version of what OS paging supposedly means).

Okay, "paging" can be somewhat misleading. But "p2m" also doesn't fit
the use(s) on x86. Yet we'd like to use a name clearly better than the
previous (and yet more wrong/misleading) "shadow". I have to admit that
I can't think of any other sensible name, and among the ones discussed
I still think "paging" is the one coming closest despite the
generally different meaning of the word elsewhere.

>>>  ABI not.
>> With the comment in the public header saying "Users of this interface are
>> required to identify the granularity by other means" I wonder why the
>> interface needs to be byte-granular. If the caller needs to know page size
>> by whatever means, it can as well pass in a page count.
> 
> Not all architectures have pagetable levels of uniform size.  Not all
> architectures have the mapping granularity equal to the pagetable size. 
> x86 has examples of both of these (and in a rogue move, one x86 hardware
> vendor is trying to add even more pagetable asymmetry).  Other
> architectures substantially more variety.
> 
> Even on x86, there are performance advantages from using 8k or 16k
> arrangements, which could cause us insist upon >4k requirements here. 
> (TBH, not actually for this usecase, but the principle is still valid.)

Perhaps, but that doesn't change the picture: The tool stack still needs
to know how many of the low bits in the request need to be clear (unless
you would accept to go back to rounding an unaligned input value). And
once it knows this value, it can still convert to a count of that-unit-
sized blocks of memory.

> The reason is because this is a size.  Sizes are in bytes, and that's
> how everyone thinks about them.  Its how the value is already specified
> in an xl cfg file, and it entirely unambiguous at all levels of the stack.
> 
> Every translation of the value in the software stack risks breaking
> things, even stuff as simple as debugging.  As proof, count the number
> of translation errors I've already identified in this patch alone.
> 
> This ABI does not require any changes at all (not even recompiling
> userspace) for ARM to decide to use 16k or 64k pagetables in Xen, or for
> x86 to decide that 8k or 16k is beneficial enough to actually require.
> 
> Attempting to compress this uint64_t into something smaller by any means
> will create bugs, or at increased complexity and a high risk of bugs. 
> There isn't enough money on earth right now to afford a 128bit processor
> with enough ram for this current ABI to need changing.

I didn't suggest to use a type other than uint64_t. I'm merely puzzled
by your insistence on byte granularity while at the same time requiring
inputs to be suitable multiples of a base granularity, obtaining of
which is not even specified alongside this new interface.

> This is going to be a reoccurring theme through fixing the ABIs.  Its
> one of a several areas where there is objectively one right answer, both
> in terms of ease of use, and compatibility to future circumstances.

Well, I wouldn't say using whatever base granularity as a unit is
"objectively" less right.

>>> @@ -157,6 +165,25 @@ int p2m_set_allocation(struct domain *d, unsigned long 
>>> pages, bool *preempted)
>>>      return 0;
>>>  }
>>>  
>>> +int arch_set_p2m_mempool_size(struct domain *d, uint64_t size)
>>> +{
>>> +    unsigned long pages = size >> PAGE_SHIFT;
>>> +    bool preempted = false;
>>> +    int rc;
>>> +
>>> +    if ( (size & ~PAGE_MASK) ||          /* Non page-sized request? */
>>> +         pages != (size >> PAGE_SHIFT) ) /* 32-bit overflow? */
>>> +        return -EINVAL;
>> Simply "(pages << PAGE_SHIFT) != size"? And then move the check into
>> common code?
> 
> These checks are deliberately not in common code.  That's just creating
> work that someone will need to undo in due course.

Would you mind clarifying why you think so? If the base unit isn't PAGE_SIZE
then all it takes is to introduce a suitable #define and/or global
specifying the intended per-arch value. Even if you expected this to become
a domain-dependent property, the corresponding value could still be a field
in (common) struct domain.

>>> --- a/xen/arch/x86/mm/paging.c
>>> +++ b/xen/arch/x86/mm/paging.c
>>> @@ -977,6 +977,45 @@ int __init paging_set_allocation(struct domain *d, 
>>> unsigned int pages,
>>>  }
>>>  #endif
>>>  
>>> +int arch_get_p2m_mempool_size(struct domain *d, uint64_t *size)
>>> +{
>>> +    int rc;
>>> +
>>> +    if ( is_pv_domain(d) )
>>> +        return -EOPNOTSUPP;
>>> +
>>> +    if ( hap_enabled(d) )
>>> +        rc = hap_get_allocation_bytes(d, size);
>>> +    else
>>> +        rc = shadow_get_allocation_bytes(d, size);
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +int arch_set_p2m_mempool_size(struct domain *d, uint64_t size)
>>> +{
>>> +    unsigned long pages = size >> PAGE_SHIFT;
>>> +    bool preempted = false;
>>> +    int rc;
>>> +
>>> +    if ( is_pv_domain(d) )
>>> +        return -EOPNOTSUPP;
>> Why? You do say "PV is weird" in a post-commit-message remark, but why
>> do you want to retain this weirdness? Even if today the tool stack
>> doesn't set the size when enabling log-dirty mode, I'd view this as a
>> bug which could be addressed purely in the tool stack if this check
>> wasn't there.
> 
> I want to clean up PV, but again, it wasn't sufficiently trivially-safe
> to do right now.
> 
> PV is weird because it is neither hap_enabled() (fundamentally), nor
> shadow_enabled() when logdirty isn't active.  While the freelist is
> suitably constructed, the get/set operations were previously rejected
> and cleanup is local to the disable op, not domain shutdown.
> 
> I could put in a /* TODO: relax in due course */ if you'd prefer?

Yes please - that would clarify this isn't a hard requirement.

>>> +    if ( size & ~PAGE_MASK )             /* Non page-sized request? */
>>> +        return -EINVAL;
>>> +
>>> +    ASSERT(paging_mode_enabled(d));
>> Not only with the PV aspect in mind - why? It looks reasonable to me
>> to set the pool size before enabling any paging mode.
> 
> Because this is how all the existing logic is expressed, and this patch
> wants backporting.

What do you mean by "is expressed"? I can't seem to be able to find a
similar check on the existing code paths. But given that yesterday I
almost overlooked the d->vcpu check in paging_domctl(), I can easily
accept that I might be overlooking something somewhere.

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.