
Re: [PATCH 1/4] xen: Introduce non-broken hypercalls for the p2m pool size


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>
  • Date: Wed, 26 Oct 2022 19:22:48 +0000
  • Cc: Roger Pau Monne <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Henry Wang <Henry.Wang@xxxxxxx>, Anthony Perard <anthony.perard@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 26 Oct 2022 19:23:04 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 26/10/2022 14:42, Jan Beulich wrote:
> On 26.10.2022 12:20, Andrew Cooper wrote:
>> The existing XEN_DOMCTL_SHADOW_OP_{GET,SET}_ALLOCATION have problems:
>>
>>  * All set_allocation() flavours have an overflow-before-widen bug when
>>    calculating "sc->mb << (20 - PAGE_SHIFT)".
>>  * All flavours have a granularity of 1M.  This was tolerable when the size
>>    of the pool could only be set at the same granularity, but is broken now
>>    that ARM has a 16-page stopgap allocation in use.
>>  * All get_allocation() flavours round up, and in particular turn 0 into 1,
>>    meaning the get op returns junk before a successful set op.
>>  * The x86 flavours reject the hypercalls before the VM has vCPUs allocated,
>>    despite the pool size being a domain property.
> I guess this is merely a remnant and could easily be dropped there.

It's intermixed with the other shadow operations.  It wasn't trivially
safe enough to do here, and needs coming back to in future work.

>
>>  * Even the hypercall names are long-obsolete.
>>
>> Implement an interface that doesn't suck, which can be first used to unit
>> test the behaviour, and subsequently correct a broken implementation.  The
>> old interface will be retired in due course.
>>
>> This is part of XSA-409 / CVE-2022-33747.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>> ---
>> CC: Xen Security Team <security@xxxxxxx>
>> CC: Jan Beulich <JBeulich@xxxxxxxx>
>> CC: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>> CC: Wei Liu <wl@xxxxxxx>
>> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
>> CC: Julien Grall <julien@xxxxxxx>
>> CC: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
>> CC: Bertrand Marquis <bertrand.marquis@xxxxxxx>
>> CC: Henry Wang <Henry.Wang@xxxxxxx>
>> CC: Anthony PERARD <anthony.perard@xxxxxxxxxx>
>>
>> Name subject to improvement.
> paging_{get,set}_mempool_size() for the arch helpers (in particular
> fitting better with them living in paging.c as well its multi-purpose use
> on x86) and XEN_DOMCTL_{get,set}_paging_mempool_size? Perhaps even the
> "mem" could be dropped?

Yeah, this was a placeholder for "what are we actually going to call it
in Xen".

I went with mempool over just simply pool because pool has a very
different meaning slightly higher in the toolstack where you talk about
pools of servers.  Admittedly, that's code outside of xen.git, but the
hypercall names do percolate up into those codebases.

paging isn't a great name.  While it's what we call the infrastructure
in x86, it has nothing to do with paging things out to disk (the thing
everyone associates the name with), nor the xenpaging infrastructure
(Xen's version of what OS paging supposedly means).

>
>>  ABI not.
> With the comment in the public header saying "Users of this interface are
> required to identify the granularity by other means" I wonder why the
> interface needs to be byte-granular. If the caller needs to know page size
> by whatever means, it can as well pass in a page count.

Not all architectures have pagetable levels of uniform size.  Not all
architectures have the mapping granularity equal to the pagetable size. 
x86 has examples of both of these (and in a rogue move, one x86 hardware
vendor is trying to add even more pagetable asymmetry).  Other
architectures have substantially more variety.

Even on x86, there are performance advantages from using 8k or 16k
arrangements, which could cause us to insist upon >4k requirements here.
(TBH, not actually for this usecase, but the principle is still valid.)


The reason is that this is a size.  Sizes are in bytes, and that's
how everyone thinks about them.  It's how the value is already specified
in an xl cfg file, and it is entirely unambiguous at all levels of the stack.

Every translation of the value in the software stack risks breaking
things, even stuff as simple as debugging.  As proof, count the number
of translation errors I've already identified in this patch alone.

This ABI does not require any changes at all (not even recompiling
userspace) for ARM to decide to use 16k or 64k pagetables in Xen, or for
x86 to decide that 8k or 16k is beneficial enough to actually require.

Attempting to compress this uint64_t into something smaller by any means
will create bugs, or at best increased complexity with a high risk of bugs.
There isn't enough money on earth right now to afford a 128-bit processor
with enough RAM for this current ABI to need changing.


This is going to be a recurring theme through fixing the ABIs.  It's
one of several areas where there is objectively one right answer, both
in terms of ease of use, and compatibility with future circumstances.



>
>> Future TODOs:
>>  * x86 shadow still rounds up.  This is buggy as it's a simultaneous equation
>>    with tot_pages which varies over time with ballooning.
>>  * x86 PV is weird.  There is no toolstack interaction with the shadow pool
>>    size, but the "shadow" pool does come into existence when logdirty (or
>>    pv-l1tf) is first enabled.
>>  * The shadow+hap logic is in desperate need of deduping.
> I have a tiny step towards this queued as post-XSA-410 work, folding HAP's
> and shadow's freelist, total_pages, free_pages, and p2m_pages. Here this
> would mean {hap,shadow}_get_allocation_bytes() could be done away with,
> having the logic exclusively in paging.c.

Thanks.  I'll drop that task from my todo list.

But really, it needs to be fully common, because RISC-V is going to need
it too.  (I'm told development on RISC-V will start back up any time now.)

>
>> --- a/xen/arch/arm/p2m.c
>> +++ b/xen/arch/arm/p2m.c
>> @@ -100,6 +100,14 @@ unsigned int p2m_get_allocation(struct domain *d)
>>      return ROUNDUP(nr_pages, 1 << (20 - PAGE_SHIFT)) >> (20 - PAGE_SHIFT);
>>  }
>>  
>> +/* Return the size of the pool, in bytes. */
>> +int arch_get_p2m_mempool_size(struct domain *d, uint64_t *size)
>> +{
>> +    *size = ACCESS_ONCE(d->arch.paging.p2m_total_pages) << PAGE_SHIFT;
> This may overflow for Arm32.

So it will.  I'll widen first.

>
>> @@ -157,6 +165,25 @@ int p2m_set_allocation(struct domain *d, unsigned long 
>> pages, bool *preempted)
>>      return 0;
>>  }
>>  
>> +int arch_set_p2m_mempool_size(struct domain *d, uint64_t size)
>> +{
>> +    unsigned long pages = size >> PAGE_SHIFT;
>> +    bool preempted = false;
>> +    int rc;
>> +
>> +    if ( (size & ~PAGE_MASK) ||          /* Non page-sized request? */
>> +         pages != (size >> PAGE_SHIFT) ) /* 32-bit overflow? */
>> +        return -EINVAL;
> Simply "(pages << PAGE_SHIFT) != size"? And then move the check into
> common code?

These checks are deliberately not in common code.  That's just creating
work that someone will need to undo in due course.

>
>> --- a/xen/arch/x86/mm/hap/hap.c
>> +++ b/xen/arch/x86/mm/hap/hap.c
>> @@ -345,6 +345,16 @@ unsigned int hap_get_allocation(struct domain *d)
>>              + ((pg & ((1 << (20 - PAGE_SHIFT)) - 1)) ? 1 : 0));
>>  }
>>  
>> +int hap_get_allocation_bytes(struct domain *d, uint64_t *size)
>> +{
>> +    unsigned long pages = (d->arch.paging.hap.total_pages +
>> +                           d->arch.paging.hap.p2m_pages);
> Unlike for Arm no ACCESS_ONCE() here? Also the addition can in
> principle overflow, because being done only in 32 bits.

I'm not actually convinced ARM needs ACCESS_ONCE() to begin with.  I
can't see any legal transformation of that logic which could result in a
torn load.

Both examples were written to match the existing code, because this
needs backporting to all security trees.

I forgot to mention the overflow on x86 in the future todo section. 
This code is rife with them.

>
>> --- a/xen/arch/x86/mm/paging.c
>> +++ b/xen/arch/x86/mm/paging.c
>> @@ -977,6 +977,45 @@ int __init paging_set_allocation(struct domain *d, 
>> unsigned int pages,
>>  }
>>  #endif
>>  
>> +int arch_get_p2m_mempool_size(struct domain *d, uint64_t *size)
>> +{
>> +    int rc;
>> +
>> +    if ( is_pv_domain(d) )
>> +        return -EOPNOTSUPP;
>> +
>> +    if ( hap_enabled(d) )
>> +        rc = hap_get_allocation_bytes(d, size);
>> +    else
>> +        rc = shadow_get_allocation_bytes(d, size);
>> +
>> +    return rc;
>> +}
>> +
>> +int arch_set_p2m_mempool_size(struct domain *d, uint64_t size)
>> +{
>> +    unsigned long pages = size >> PAGE_SHIFT;
>> +    bool preempted = false;
>> +    int rc;
>> +
>> +    if ( is_pv_domain(d) )
>> +        return -EOPNOTSUPP;
> Why? You do say "PV is weird" in a post-commit-message remark, but why
> do you want to retain this weirdness? Even if today the tool stack
> doesn't set the size when enabling log-dirty mode, I'd view this as a
> bug which could be addressed purely in the tool stack if this check
> wasn't there.

I want to clean up PV, but again, it wasn't sufficiently trivially-safe
to do right now.

PV is weird because it is neither hap_enabled() (fundamentally), nor
shadow_enabled() when logdirty isn't active.  While the freelist is
suitably constructed, the get/set operations were previously rejected
and cleanup is local to the disable op, not domain shutdown.

I could put in a /* TODO: relax in due course */ if you'd prefer?

>> +    if ( size & ~PAGE_MASK )             /* Non page-sized request? */
>> +        return -EINVAL;
>> +
>> +    ASSERT(paging_mode_enabled(d));
> Not only with the PV aspect in mind - why? It looks reasonable to me
> to set the pool size before enabling any paging mode.

Because this is how all the existing logic is expressed, and this patch
wants backporting.

There is sooo much to clean up...

>
>> +    paging_lock(d);
>> +    if ( hap_enabled(d) )
>> +        rc = hap_set_allocation(d, pages, &preempted);
>> +    else
>> +        rc = shadow_set_allocation(d, pages, &preempted);
> Potential truncation from the "unsigned long" -> "unsigned int"
> conversions.

I'd not even spotted that ARM and x86 were different in this regard.

More short term hacks, it seems.

~Andrew

 

