Re: [PATCH v12 5/8] xen/riscv: add minimal stuff to mm.h to build full Xen
On 11/06/2024 7:23 pm, Oleksii K. wrote:
> On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
>> On 30/05/2024 7:22 pm, Oleksii K. wrote:
>>> On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
>>>> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>>>>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
>>>>> Acked-by: Jan Beulich <jbeulich@xxxxxxxx>
>>>> This patch looks like it can go in independently? Or does it depend
>>>> on having bitops.h working in practice?
>>>>
>>>> However, one very strong suggestion...
>>>>
>>>>> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
>>>>> index 07c7a0abba..cc4a07a71c 100644
>>>>> --- a/xen/arch/riscv/include/asm/mm.h
>>>>> +++ b/xen/arch/riscv/include/asm/mm.h
>>>>> @@ -3,11 +3,246 @@
>>>>> <snip>
>>>>> +/* PDX of the first page in the frame table. */
>>>>> +extern unsigned long frametable_base_pdx;
>>>>> +
>>>>> +/* Convert between machine frame numbers and page-info structures. */
>>>>> +#define mfn_to_page(mfn)                                            \
>>>>> +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>>>>> +#define page_to_mfn(pg)                                             \
>>>>> +    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
>>>> Do yourself a favour and do not introduce frametable_base_pdx to begin
>>>> with.
>>>>
>>>> Every RISC-V board I can find has things starting from 0 in physical
>>>> address space, with RAM starting immediately after.
>>> I checked Linux kernel and grep there:
>>> [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ --exclude "*.tmp" -I
>>> arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33: memory@40000000 {
>>> arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: ddrc_cache: memory@1000000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: ddrc_cache_lo: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: ddrc_cache_hi: memory@1040000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: ddrc_cache_lo: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: ddrc_cache_hi: memory@1000000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: ddrc_cache_lo: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: ddrc_cache_hi: memory@1000000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: ddrc_cache_lo: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: ddrc_cache_hi: memory@1040000000 {
>>> arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: memory@0 {
>>> arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14: memory@0 {
>>> arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: memory@80000000 {
>>> arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: memory@80000000 {
>>> arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000 {
>>> arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000 {
>>> arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: memory@80000000 {
>>>
>>> And based on that, the majority of boards supported by the Linux kernel
>>> have RAM starting not from 0 in physical address space. Am I confusing
>>> something?
>>>
>>>> Taking the microchip board as an example, RAM actually starts at
>>>> 0x8000000,
>>> Today we had a conversation with the guy from SiFive in the xen-devel
>>> channel, and he mentioned that they are using "starfive visionfive2 and
>>> sifive unleashed platforms", which, based on the grep above, have RAM
>>> not at address 0.
>>>
>>> Also, QEMU uses 0x8000000.
>>>
>>>> which means that having frametable_base_pdx and assuming it does get
>>>> set to 0x8000 (which isn't even a certainty, given that I think you'll
>>>> need struct pages covering the PLICs), then what you are trading off is:
>>>>
>>>>  * Saving 32k of virtual address space only (no need to even allocate
>>>>    memory for this range of the frametable), by
>>>>  * Having an extra memory load and add/sub in every page <-> mfn
>>>>    conversion, which is a screaming hotpath all over Xen.
>>>>
>>>> It's a terribly short-sighted tradeoff.
>>>>
>>>> 32k of VA space might be worth saving in a 32bit build (I personally
>>>> wouldn't - especially as there's no need to share Xen's VA space with
>>>> guests, given no PV guests on ARM/RISC-V), but it's absolutely not at
>>>> all in a 64bit build with TB of VA space available.
>>>>
>>>> Even if we do find a board with the first interesting thing in the
>>>> frametable starting sufficiently away from 0 that it might be worth
>>>> considering this slide, then it should still be Kconfig-able in a
>>>> similar way to PDX_COMPRESSION.
>>> I find your tradeoffs reasonable, but I don't understand how it will
>>> work if RAM does not start from 0, as the frametable address and the RAM
>>> address are linked.
>>> I tried to look at the PDX_COMPRESSION config and couldn't find any
>>> "slide" there. Could you please clarify this for me?
>>> If we used this "slide", would it help to avoid the tradeoffs mentioned
>>> above?
>>>
>>> One more question: if we decide to go without frametable_base_pdx,
>>> would it be sufficient to simply remove mentions of it from the code
>>> (at least, for now)?
>> There is a relationship between system/host physical addresses (what Xen
>> calls maddr/mfn) and the frametable. The frametable has one entry per
>> mfn.
>>
>> In the most simple case, there's a 1:1 relationship, i.e. frametable[0]
>> = maddr(0), frametable[1] = maddr(4k), etc. This is very simple, and
>> very easy to calculate (page_to_mfn()/mfn_to_page()).
>>
>> The frametable is one big array. It starts at a compile-time fixed
>> address, and needs to be long enough to cover everything interesting in
>> memory. Therefore it potentially takes a large amount of virtual
>> address space.
>>
>> However, only interesting maddrs need to have data in the frametable, so
>> it's fine for the backing RAM to be sparsely allocated/mapped in the
>> frametable virtual addresses.
>>
>> For 64bit, that's really all you need, because there's always far more
>> virtual address space than physical RAM in the system, even when you're
>> looking at TB-scale giant servers.
>>
>>
>> For 32bit, virtual address space is a limited resource. (Also, to an
>> extent, 64bit x86 with PV guests, because we give 98% of the virtual
>> address space to the guest kernel to use.)
>>
>> There are two tricks to reduce the virtual address space used, but they
>> both cost performance in fastpaths.
>>
>> 1) PDX compression.
>>
>> PDX compression makes a non-linear mfn <-> maddr mapping. This is for a
>> usecase where you've got multiple RAM banks which are separated by a
>> large distance (and evenly spaced); then you can "compress" a single
>> range of 0's out of the middle of the system/host physical address.
>>
>> The cost is that all page <-> mfn conversions need to read two masks and
>> a shift-count from variables in memory, to split/shift/recombine the
>> address bits.
>>
>> 2) A slide, which is frametable_base_pdx in this context.
>>
>> When there's a big gap between 0 and the start of something interesting,
>> you could chop out that range by just subtracting base_pdx. What
>> qualifies as "big" is subjective, but Qemu starting at 128M certainly
>> does not qualify as big enough to warrant frametable_base_pdx.
>>
>> This is less expensive than PDX compression. It only adds one memory
>> read to the fastpath, but it also doesn't save as much virtual address
>> space as PDX compression.
>>
>>
>> When virtual address space is a major constraint (32 bit builds), both
>> of these techniques are worth doing. But when there's no constraint on
>> virtual address space (64 bit builds), there's no reason to use either,
>> and the performance will definitely improve as a result.
> Thanks for such a good explanation.
>
> For RISC-V we have the PDX config disabled, as I haven't seen multiple
> RAM banks on boards which have the hypervisor extension. Thereby
> mfn_to_pdx() and pdx_to_mfn() do nothing. The same for
> frametable_base_pdx; with PDX disabled, it is just an offset (or a
> slide).
>
> IIUC, you meant that it makes sense to map the frametable not to the
> address where RAM starts, but to 0, so then we don't need this
> +-frametable_base_pdx. The price for that is losing VA space for the
> range from 0 to the RAM start address.
>
> Right now, we are trying to support 3 boards with the following RAM
> addresses:
> 1. 0x8000_0000 - QEMU and SiFive board
> 2. 0x40_0000_0000 - Microchip board
>
> So if we map the frametable to 0 (not the RAM start) we will lose:
> 1. 0x8_0000 (number of page entries to cover the range [0, 0x8000_0000))
>    * 64 (size of struct page_info) = 32 MB
> 2. 0x400_0000 (number of page entries to cover the range
>    [0, 0x40_0000_0000)) * 64 (size of struct page_info) = 4 GB
>
> In terms of available virtual address space for RV-64, we can consider
> both options as acceptable.
For Qemu and SiFive, 32M is definitely not worth worrying about.

I personally wouldn't worry about Microchip either. That's 4G of 1T VA
space (assuming Sv39).

> [OPTION 1] If we accept losing 4 GB of VA then we could implement
> mfn_to_page() and page_to_mfn() in the following way:
> ```
> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
> index cc4a07a71c..fdac7e0646 100644
> --- a/xen/arch/riscv/include/asm/mm.h
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -107,14 +107,11 @@ struct page_info
>
>  #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
>
> -/* PDX of the first page in the frame table. */
> -extern unsigned long frametable_base_pdx;
> -
>  /* Convert between machine frame numbers and page-info structures. */
>  #define mfn_to_page(mfn)                                            \
> -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> +    (frame_table + mfn_x(mfn))
>  #define page_to_mfn(pg)                                             \
> -    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
> +    _mfn((unsigned long)((pg) - frame_table))
>
>  static inline void *page_to_virt(const struct page_info *pg)
>  {
> diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
> index 9c0fd80588..8f6dbdc699 100644
> --- a/xen/arch/riscv/mm.c
> +++ b/xen/arch/riscv/mm.c
> @@ -15,7 +15,7 @@
>  #include <asm/page.h>
>  #include <asm/processor.h>
>
> -unsigned long __ro_after_init frametable_base_pdx;
>  unsigned long __ro_after_init frametable_virt_end;
>
>  struct mmu_desc {
> ```

I firmly recommend option 1, especially at this point.

If specific boards really have a problem with losing 4G of VA space, then
option 2 can be added easily at a later point.

That said, I'd think carefully about doing option 2. Even subtracting a
constant - which is far better than subtracting a variable - is still
extra overhead in fastpaths. Option 2 needs careful consideration on a
board-by-board case.

~Andrew