Re: [PATCH v12 5/8] xen/riscv: add minimal stuff to mm.h to build full Xen
On 11/06/2024 7:23 pm, Oleksii K. wrote:
> On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
>> On 30/05/2024 7:22 pm, Oleksii K. wrote:
>>> On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
>>>> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>>>>> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
>>>>> Acked-by: Jan Beulich <jbeulich@xxxxxxxx>
>>>> This patch looks like it can go in independently? Or does it depend
>>>> on having bitops.h working in practice?
>>>>
>>>> However, one very strong suggestion...
>>>>
>>>>> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
>>>>> index 07c7a0abba..cc4a07a71c 100644
>>>>> --- a/xen/arch/riscv/include/asm/mm.h
>>>>> +++ b/xen/arch/riscv/include/asm/mm.h
>>>>> @@ -3,11 +3,246 @@
>>>>> <snip>
>>>>> +/* PDX of the first page in the frame table. */
>>>>> +extern unsigned long frametable_base_pdx;
>>>>> +
>>>>> +/* Convert between machine frame numbers and page-info structures. */
>>>>> +#define mfn_to_page(mfn)                                            \
>>>>> +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>>>>> +#define page_to_mfn(pg)                                             \
>>>>> +    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
>>>> Do yourself a favour and do not introduce frametable_base_pdx to begin
>>>> with.
>>>>
>>>> Every RISC-V board I can find has things starting from 0 in physical
>>>> address space, with RAM starting immediately after.
>>> I checked Linux kernel and grep there:
>>> [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ --exclude "*.tmp" -I
>>> arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33: memory@40000000 {
>>> arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: ddrc_cache: memory@1000000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: ddrc_cache_lo: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: ddrc_cache_hi: memory@1040000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: ddrc_cache_lo: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: ddrc_cache_hi: memory@1000000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: ddrc_cache_lo: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: ddrc_cache_hi: memory@1000000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: ddrc_cache_lo: memory@80000000 {
>>> arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: ddrc_cache_hi: memory@1040000000 {
>>> arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: memory@0 {
>>> arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14: memory@0 {
>>> arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: memory@80000000 {
>>> arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: memory@80000000 {
>>> arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000 {
>>> arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000 {
>>> arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: memory@80000000 {
>>>
>>> And based on that, the majority of boards supported by the Linux kernel
>>> have RAM starting not from 0 in physical address space. Am I confusing
>>> something?
>>>
>>>> Taking the microchip board as an example, RAM actually starts at
>>>> 0x8000000,
>>> Today we had a conversation with the guy from SiFive in the xen-devel
>>> channel, and he mentioned that they are using "starfive visionfive2 and
>>> sifive unleashed platforms", which, based on the grep above, have RAM
>>> not at address 0.
>>>
>>> Also, QEMU uses 0x8000000.
>>>
>>>> which means that having frametable_base_pdx and assuming it does get
>>>> set to 0x8000 (which isn't even a certainty, given that I think you'll
>>>> need struct pages covering the PLICs), then what you are trading off is:
>>>>
>>>>  * Saving 32k of virtual address space only (no need to even allocate
>>>>    memory for this range of the frametable), by
>>>>  * Having an extra memory load and add/sub in every page <-> mfn
>>>>    conversion, which is a screaming hotpath all over Xen.
>>>>
>>>> It's a terribly short-sighted tradeoff.
>>>>
>>>> 32k of VA space might be worth saving in a 32bit build (I personally
>>>> wouldn't - especially as there's no need to share Xen's VA space with
>>>> guests, given no PV guests on ARM/RISC-V), but it's absolutely not at
>>>> all in a 64bit build with TB of VA space available.
>>>>
>>>> Even if we do find a board with the first interesting thing in the
>>>> frametable starting sufficiently away from 0 that it might be worth
>>>> considering this slide, then it should still be Kconfig-able in a
>>>> similar way to PDX_COMPRESSION.
>>> I find your tradeoffs reasonable, but I don't understand how it will
>>> work if RAM does not start from 0, as the frametable address and the RAM
>>> address are linked.
>>> I tried to look at the PDX_COMPRESSION config and couldn't find any
>>> "slide" there. Could you please clarify this for me?
>>> If we used this "slide", would it help to avoid the tradeoffs mentioned
>>> above?
>>>
>>> One more question: if we decide to go without frametable_base_pdx,
>>> would it be sufficient to simply remove mentions of it from the code
>>> (at least, for now)?
>> There is a relationship between system/host physical addresses (what Xen
>> calls maddr/mfn) and the frametable. The frametable has one entry per
>> mfn.
>>
>> In the most simple case, there's a 1:1 relationship, i.e. frametable[0]
>> = maddr(0), frametable[1] = maddr(4k), etc. This is very simple, and
>> very easy to calculate (page_to_mfn()/mfn_to_page()).
>>
>> The frametable is one big array. It starts at a compile-time fixed
>> address, and needs to be long enough to cover everything interesting in
>> memory. Therefore it potentially takes a large amount of virtual
>> address space.
>>
>> However, only interesting maddrs need to have data in the frametable, so
>> it's fine for the backing RAM to be sparsely allocated/mapped in the
>> frametable virtual addresses.
>>
>> For 64bit, that's really all you need, because there's always far more
>> virtual address space than physical RAM in the system, even when you're
>> looking at TB-scale giant servers.
>>
>>
>> For 32bit, virtual address space is a limited resource. (Also, to an
>> extent, 64bit x86 with PV guests, because we give 98% of the virtual
>> address space to the guest kernel to use.)
>>
>> There are two tricks to reduce the virtual address space used, but they
>> both cost performance in fastpaths.
>>
>> 1) PDX compression.
>>
>> PDX compression makes a non-linear mfn <-> maddr mapping. This is for a
>> usecase where you've got multiple RAM banks which are separated by a
>> large distance (and evenly spaced); then you can "compress" a single
>> range of 0's out of the middle of the system/host physical address.
>>
>> The cost is that all page <-> mfn conversions need to read two masks and
>> a shift-count from variables in memory, to split/shift/recombine the
>> address bits.
>>
>> 2) A slide, which is frametable_base_pdx in this context.
>>
>> When there's a big gap between 0 and the start of something interesting,
>> you could chop out that range by just subtracting base_pdx. What
>> qualifies as "big" is subjective, but Qemu starting at 128M certainly
>> does not qualify as big enough to warrant frametable_base_pdx.
>>
>> This is less expensive than PDX compression. It only adds one memory
>> read to the fastpath, but it also doesn't save as much virtual address
>> space as PDX compression.
>>
>>
>> When virtual address space is a major constraint (32 bit builds), both
>> of these techniques are worth doing. But when there's no constraint on
>> virtual address space (64 bit builds), there's no reason to use either,
>> and the performance will definitely improve as a result.
> Thanks for such a good explanation.
>
> For RISC-V we have the PDX config disabled, as I haven't seen multiple
> RAM banks on boards which have the hypervisor extension. Thereby
> mfn_to_pdx() and pdx_to_mfn() do nothing. The same for
> frametable_base_pdx; with PDX disabled, it is just an offset (or a
> slide).
>
> IIUC, you meant that it makes sense to map the frametable not to the
> address where RAM starts, but to 0, so then we don't need this
> +-frametable_base_pdx. The price for that is losing VA space for the
> range from 0 to the RAM start address.
>
> Right now, we are trying to support 3 boards with the following RAM
> addresses:
> 1. 0x8000_0000 - QEMU and SiFive board
> 2. 0x40_0000_0000 - Microchip board
>
> So if we map the frametable to 0 (not the RAM start) we will lose:
> 1. 0x8_0000 (number of page entries to cover the range [0, 0x8000_0000))
>    * 64 (size of struct page_info) = 32 MB
> 2. 0x400_0000 (number of page entries to cover the range
>    [0, 0x40_0000_0000)) * 64 (size of struct page_info) = 4 GB
>
> In terms of available virtual address space for RV-64, we can consider
> both options as acceptable.
For Qemu and SiFive, 32M is definitely not worth worrying about.

I personally wouldn't worry about Microchip either. That's 4G of 1T VA
space (assuming Sv39).

> [OPTION 1] If we accept losing 4 GB of VA then we could implement
> mfn_to_page() and page_to_mfn() in the following way:
> ```
> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
> index cc4a07a71c..fdac7e0646 100644
> --- a/xen/arch/riscv/include/asm/mm.h
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -107,14 +107,11 @@ struct page_info
>
>  #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
>
> -/* PDX of the first page in the frame table. */
> -extern unsigned long frametable_base_pdx;
> -
>  /* Convert between machine frame numbers and page-info structures. */
>  #define mfn_to_page(mfn)                                            \
> -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
> +    (frame_table + mfn_x(mfn))
>  #define page_to_mfn(pg)                                             \
> -    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
> +    _mfn((unsigned long)((pg) - frame_table))
>
>  static inline void *page_to_virt(const struct page_info *pg)
>  {
> diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
> index 9c0fd80588..8f6dbdc699 100644
> --- a/xen/arch/riscv/mm.c
> +++ b/xen/arch/riscv/mm.c
> @@ -15,7 +15,7 @@
>  #include <asm/page.h>
>  #include <asm/processor.h>
>
> -unsigned long __ro_after_init frametable_base_pdx;
>  unsigned long __ro_after_init frametable_virt_end;
>
>  struct mmu_desc {
> ```

I firmly recommend option 1, especially at this point.

If specific boards really have a problem with losing 4G of VA space, then
option 2 can be added easily at a later point.

That said, I'd think carefully about doing option 2. Even subtracting a
constant - which is far better than subtracting a variable - is still
extra overhead in fastpaths. Option 2 needs careful consideration on a
board-by-board case.

~Andrew