Re: [PATCH v5 2/4] xen/riscv: introduce setup_initial_pages
On Thu, 2023-04-20 at 16:36 +0200, Jan Beulich wrote:
> On 19.04.2023 17:42, Oleksii Kurochko wrote:
> > --- a/xen/arch/riscv/include/asm/page-bits.h
> > +++ b/xen/arch/riscv/include/asm/page-bits.h
> > @@ -1,6 +1,16 @@
> >  #ifndef __RISCV_PAGE_BITS_H__
> >  #define __RISCV_PAGE_BITS_H__
> >
> > +#ifdef CONFIG_RISCV_64
> > +#define PAGETABLE_ORDER     (9)
> > +#else /* CONFIG_RISCV_32 */
> > +#define PAGETABLE_ORDER     (10)
> > +#endif
> > +
> > +#define PAGETABLE_ENTRIES   (1 << PAGETABLE_ORDER)
> > +
> > +#define PTE_PPN_SHIFT       10
> > +
> >  #define PAGE_SHIFT          12 /* 4 KiB Pages */
> >  #define PADDR_BITS          56 /* 44-bit PPN */
>
> Personally I think these two would better remain at the top. But maybe
> that's just me ...
I'm OK to keep them at the top.

> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/page.h
> > @@ -0,0 +1,63 @@
> > +#ifndef _ASM_RISCV_PAGE_H
> > +#define _ASM_RISCV_PAGE_H
> > +
> > +#include <xen/const.h>
> > +#include <xen/types.h>
> > +
> > +#define VPN_MASK                    ((unsigned long)(PAGETABLE_ENTRIES - 1))
> > +
> > +#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
> > +#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) + PAGE_SHIFT)
> > +#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) << XEN_PT_LEVEL_SHIFT(lvl))
> > +#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
> > +#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
> > +
> > +#define PTE_VALID                   BIT(0, UL)
> > +#define PTE_READABLE                BIT(1, UL)
> > +#define PTE_WRITABLE                BIT(2, UL)
> > +#define PTE_EXECUTABLE              BIT(3, UL)
> > +#define PTE_USER                    BIT(4, UL)
> > +#define PTE_GLOBAL                  BIT(5, UL)
> > +#define PTE_ACCESSED                BIT(6, UL)
> > +#define PTE_DIRTY                   BIT(7, UL)
> > +#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
> > +
> > +#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
> > +#define PTE_TABLE                   (PTE_VALID)
> > +
> > +/* Calculate the offsets into the pagetables for a given VA */
> > +#define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
> > +
> > +#define pt_index(lvl, va)   pt_linear_offset(lvl, (va) & XEN_PT_LEVEL_MASK(lvl))
> > +
> > +/* Page Table entry */
> > +typedef struct {
> > +#ifdef CONFIG_RISCV_64
> > +    uint64_t pte;
> > +#else
> > +    uint32_t pte;
> > +#endif
> > +} pte_t;
> > +
> > +#define pte_to_addr(x) (((x) >> PTE_PPN_SHIFT) << PAGE_SHIFT)
>
> This will be at risk of overflow for RV32 without a cast to paddr_t (which
> I expect would be a 64-bit type on RV32 just like it is on RV64).
Yeah, paddr_t is u64 for both RV32 and RV64. I'll add a cast to paddr_t to
avoid the possible overflow.

> > +#define addr_to_pte(x) (((x) >> PAGE_SHIFT) << PTE_PPN_SHIFT)
> > +
> > +static inline pte_t paddr_to_pte(const paddr_t paddr,
> > +                                 const unsigned long permissions)
> > +{
> > +    unsigned long tmp = addr_to_pte(paddr);
> > +    return (pte_t) { .pte = tmp | permissions };
> > +}
> > +
> > +static inline paddr_t pte_to_paddr(const pte_t pte)
> > +{
> > +    return pte_to_addr(pte.pte);
> > +}
> > +
> > +static inline bool pte_is_valid(const pte_t p)
> > +{
> > +    return p.pte & PTE_VALID;
> > +}
>
> A remark on all of the "const" here: It's fine if you want to keep them,
> but generally we care about const-correctness only for pointed-to types.
> Scalar and compound parameter values are owned by the called function
> anyway, so all the "const" achieves is that the function can't alter its
> own argument(s).
Then they don't buy us anything for the current case, and removing them
will simplify the paddr_to_pte() function. So I prefer to remove them.
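Something like the following is what I have in mind, just a sketch of the
direction rather than the final code (the cast could equally be done by the
callers):

    /*
     * Cast to paddr_t before shifting so the RV32 build can't truncate
     * the physical address, and drop the "const" from by-value params.
     */
    #define pte_to_addr(x) (((paddr_t)(x) >> PTE_PPN_SHIFT) << PAGE_SHIFT)
    #define addr_to_pte(x) (((x) >> PAGE_SHIFT) << PTE_PPN_SHIFT)

    static inline pte_t paddr_to_pte(paddr_t paddr, unsigned long permissions)
    {
        return (pte_t) { .pte = addr_to_pte(paddr) | permissions };
    }

    static inline paddr_t pte_to_paddr(pte_t pte)
    {
        return pte_to_addr(pte.pte);
    }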
> > --- /dev/null
> > +++ b/xen/arch/riscv/mm.c
> > @@ -0,0 +1,319 @@
> > +#include <xen/compiler.h>
> > +#include <xen/init.h>
> > +#include <xen/kernel.h>
> > +
> > +#include <asm/early_printk.h>
> > +#include <asm/config.h>
> > +#include <asm/csr.h>
> > +#include <asm/mm.h>
> > +#include <asm/page.h>
> > +#include <asm/processor.h>
> > +
> > +struct mmu_desc {
> > +    unsigned long num_levels;
>
> Isn't "unsigned int" sufficient here?
Yes, it will be sufficient.

> > +/*
> > + * It is expected that Xen won't be more then 2 MB.
> > + * The check in xen.lds.S guarantees that.
> > + * At least 4 page (in case when Sv48 or Sv57 are used )
> > + * tables are needed to cover 2 MB. One for each page level
> > + * table with PAGE_SIZE = 4 Kb
> > + *
> > + * One L0 page table can cover 2 MB
> > + * (512 entries of one page table * PAGE_SIZE).
> > + *
> > + * It might be needed one more page table in case when
> > + * Xen load address isn't 2 MB aligned.
> > + *
> > + */
> > +#define PGTBL_INITIAL_COUNT (5)
>
> On x86 we have a CONFIG_PAGING_LEVELS constant. If you had something
> like this as well, this 5 would better match the comment as
> ((CONFIG_PAGING_LEVELS - 1) + 1), avoiding making space for the two
> levels you won't need as long as only Sv39 is really meant to be used.
Thanks for the note. I'll use CONFIG_PAGING_LEVELS too.

> > +#define PGTBL_ENTRY_AMOUNT  (PAGE_SIZE / sizeof(pte_t))
> > +
> > +pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
> > +stage1_pgtbl_root[PGTBL_ENTRY_AMOUNT];
> > +
> > +pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
> > +stage1_pgtbl_nonroot[PGTBL_INITIAL_COUNT * PGTBL_ENTRY_AMOUNT];
> >
> > +#define HANDLE_PGTBL(curr_lvl_num)                                      \
> > +    index = pt_index(curr_lvl_num, page_addr);                          \
> > +    if ( pte_is_valid(pgtbl[index]) )                                   \
> > +    {                                                                   \
> > +        /* Find L{ 0-3 } table */                                       \
> > +        pgtbl = (pte_t *)pte_to_paddr(pgtbl[index]);                    \
> > +    }                                                                   \
> > +    else                                                                \
> > +    {                                                                   \
> > +        /* Allocate new L{0-3} page table */                            \
> > +        if ( mmu_desc->pgtbl_count == PGTBL_INITIAL_COUNT )             \
> > +        {                                                               \
> > +            early_printk("(XEN) No initial table available\n");         \
> > +            /* panic(), BUG() or ASSERT() aren't ready now. */          \
> > +            die();                                                      \
> > +        }                                                               \
> > +        mmu_desc->pgtbl_count++;                                        \
> > +        pgtbl[index] = paddr_to_pte((unsigned long)mmu_desc->next_pgtbl, \
> > +                                    PTE_VALID);                         \
> > +        pgtbl = mmu_desc->next_pgtbl;                                   \
> > +        mmu_desc->next_pgtbl += PGTBL_ENTRY_AMOUNT;                     \
> > +    }
> > +
> > +static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
> > +                                         unsigned long map_start,
> > +                                         unsigned long map_end,
> > +                                         unsigned long pa_start,
> > +                                         unsigned long permissions)
>
> Wouldn't the last one more sensibly be "unsigned int"?
It would. Thanks, I'll update the code.
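Coming back to PGTBL_INITIAL_COUNT: the kind of thing I mean to switch to,
assuming a CONFIG_PAGING_LEVELS Kconfig value gets introduced for RISC-V
(3 for Sv39, 4 for Sv48), analogous to the x86 one. Just a sketch, not the
final patch:

    /*
     * One page table per level below the root, plus one more in case the
     * load address isn't aligned to the first-level mapping size.
     */
    #define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) + 1)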
> > +{
> > +    unsigned int index;
> > +    pte_t *pgtbl;
> > +    unsigned long page_addr;
> > +    pte_t pte_to_be_written;
> > +    unsigned long paddr;
> > +    unsigned long tmp_permissions;
> > +
> > +    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
> > +    {
> > +        early_printk("(XEN) Xen should be loaded at 4k boundary\n");
> > +        die();
> > +    }
> > +
> > +    if ( map_start & ~XEN_PT_LEVEL_MAP_MASK(0) ||
> > +         pa_start & ~XEN_PT_LEVEL_MAP_MASK(0) )
> > +    {
> > +        early_printk("(XEN) map and pa start addresses should be aligned\n");
> > +        /* panic(), BUG() or ASSERT() aren't ready now. */
> > +        die();
> > +    }
> > +
> > +    page_addr = map_start;
> > +    while ( page_addr < map_end )
> > +    {
> > +        pgtbl = mmu_desc->pgtbl_base;
> > +
> > +        switch (mmu_desc->num_levels)
>
> Nit: Style (missing blanks).
Thanks. I'll update.

> > +        {
> > +            case 4: /* Level 3 */
>
> Nit: Indentation of case labels matches that of the opening brace in our
> style.
Sure. Missed that.

> > +                HANDLE_PGTBL(3);
> > +            case 3: /* Level 2 */
> > +                HANDLE_PGTBL(2);
> > +            case 2: /* Level 1 */
> > +                HANDLE_PGTBL(1);
> > +            case 1: /* Level 0 */
> > +                index = pt_index(0, page_addr);
> > +                paddr = (page_addr - map_start) + pa_start;
> > +
> > +                tmp_permissions = permissions;
> > +
> > +                if ( is_kernel_text(LINK_TO_LOAD(page_addr)) ||
> > +                     is_kernel_inittext(LINK_TO_LOAD(page_addr)) )
> > +                    tmp_permissions =
> > +                        PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
> > +
> > +                if ( is_kernel_rodata(LINK_TO_LOAD(page_addr)) )
> > +                    tmp_permissions = PTE_READABLE | PTE_VALID;
> > +
> > +                pte_to_be_written = paddr_to_pte(paddr, tmp_permissions);
> > +
> > +                if ( !pte_is_valid(pgtbl[index]) )
> > +                    pgtbl[index] = pte_to_be_written;
> > +                else
> > +                {
> > +                    /*
> > +                     * get an adresses of current pte and that one to
> > +                     * be written without permission flags
> > +                     */
> > +                    unsigned long curr_pte =
> > +                        pgtbl[index].pte & ~((1 << PTE_PPN_SHIFT) - 1);
> > +
> > +                    pte_to_be_written.pte &= ~((1 << PTE_PPN_SHIFT) - 1);
>
> If you mean to only compare addresses, why don't you use pte_to_paddr()?
Yes, it will be better to use pte_to_paddr().

> Question though is whether it's correct to ignore permission differences.
> I'd expect you only want to mask off PTE_ACCESSED and PTE_DIRTY.
Initially I wanted to do only a rough check, but you are probably right:
it will be better to mask off only PTE_ACCESSED and PTE_DIRTY.

> > +                    if ( curr_pte != pte_to_be_written.pte )
> > +                    {
> > +                        early_printk("PTE that is intended to write isn't the"
> > +                                      "same that the once are overwriting\n");
>
> Nit: One-off indentation. As to the message text - I take it that's
> temporary only anyway, and hence there's little point thinking about
> improving it?
It would probably be clearer as: "PTE overridden has occurred".

> > +                        /* panic(), <asm/bug.h> aren't ready now. */
> > +                        die();
> > +                    }
> > +                }
> > +        }
> > +
> > +        /* Point to next page */
> > +        page_addr += XEN_PT_LEVEL_SIZE(0);
>
> Seeing this as well as the loop heading - maybe more suitable as a
> for(;;) loop?
I am not sure I understand the benefits of changing
"while ( page_addr < map_end )" to "for(;;)". Could you please explain
what the benefits are?

> > +    }
> > +}
>
> Since HANDLE_PGTBL() is strictly for use above only, I think you'd better
> #undef it here.
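To make the masking point above concrete, the overwrite check could become
something like the following (a rough sketch only, not necessarily what the
next version will literally contain):

    if ( !pte_is_valid(pgtbl[index]) )
        pgtbl[index] = pte_to_be_written;
    else
    {
        /* Compare against the present PTE, ignoring only the A/D bits. */
        pte_t curr = pgtbl[index];

        curr.pte &= ~(PTE_ACCESSED | PTE_DIRTY);

        if ( curr.pte != pte_to_be_written.pte )
        {
            early_printk("(XEN) PTE overridden has occurred\n");
            /* panic(), BUG() or ASSERT() aren't ready now. */
            die();
        }
    }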
> > +static void __init calc_pgtbl_lvls_num(struct mmu_desc *mmu_desc)
> > +{
> > +    unsigned long satp_mode = RV_STAGE1_MODE;
> > +
> > +    /* Number of page table levels */
> > +    switch (satp_mode)
> > +    {
> > +    case SATP_MODE_SV32:
> > +        mmu_desc->num_levels = 2;
> > +        break;
> > +    case SATP_MODE_SV39:
> > +        mmu_desc->num_levels = 3;
> > +        break;
> > +    case SATP_MODE_SV48:
> > +        mmu_desc->num_levels = 4;
> > +        break;
> > +    default:
> > +        early_printk("(XEN) Unsupported SATP_MODE\n");
> > +        die();
> > +    }
> > +}
> > +
> > +static bool __init check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
> > +                                            unsigned long load_start,
> > +                                            unsigned long satp_mode)
> > +{
> > +    bool is_mode_supported = false;
> > +    unsigned int index;
> > +    unsigned int page_table_level = (mmu_desc->num_levels - 1);
> > +    unsigned level_map_mask = XEN_PT_LEVEL_MAP_MASK(page_table_level);
> > +
> > +    unsigned long aligned_load_start = load_start & level_map_mask;
> > +    unsigned long aligned_page_size = XEN_PT_LEVEL_SIZE(page_table_level);
> > +    unsigned long xen_size = (unsigned long)(_end - start);
> > +
> > +    if ( (load_start + xen_size) > (aligned_load_start + aligned_page_size) )
> > +    {
> > +        early_printk("please place Xen to be in range of PAGE_SIZE "
> > +                     "where PAGE_SIZE is XEN_PT_LEVEL_SIZE( {L3 | L2 | L1} ) "
> > +                     "depending on expected SATP_MODE \n"
> > +                     "XEN_PT_LEVEL_SIZE is defined in <asm/page.h>\n");
> > +        die();
> > +    }
> > +
> > +    index = pt_index(page_table_level, aligned_load_start);
> > +    stage1_pgtbl_root[index] = paddr_to_pte(aligned_load_start,
> > +                                            PTE_LEAF_DEFAULT | PTE_EXECUTABLE);
> > +
> > +    asm volatile("sfence.vma");
>
> Nit (here and several times again below): Style (missing three blanks, as
> "asm" is a keyword).
Thanks. I'll add them.

> > +    csr_write(CSR_SATP,
> > +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
> > +              satp_mode << SATP_MODE_SHIFT);
> > +
> > +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
> > +        is_mode_supported = true;
> > +
> > +    /* Clean MMU root page table and disable MMU */
> > +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
> > +
> > +    csr_write(CSR_SATP, 0);
> > +    asm volatile("sfence.vma");
>
> I guess what you do in this function could do with some more comments.
> Looks like you're briefly enabling the MMU to check that what you wrote
> to SATP you can also read back. (Isn't there a register reporting
> whether the feature is available?)
I supposed there should be one, but I couldn't find such a register in the
docs.

> If so, I think you cannot clear the stage1_pgtbl_root[] slot before
> you've disabled the MMU again.
>
> (As a minor aspect, I'd like to encourage you to write plain zero just
> as 0, not as 0x0, unless used in contexts where other hex numbers nearby
> make this desirable.)
You are right; it should be cleared after csr_write(CSR_SATP, 0).
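In other words, something like this (just a sketch of the intended
reordering, with a comment explaining the probe):

    /*
     * Probe the requested translation mode: the privileged spec says a
     * write of an unsupported MODE leaves satp unchanged, so writing the
     * mode and reading it back tells us whether it is implemented.
     */
    csr_write(CSR_SATP,
              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
              satp_mode << SATP_MODE_SHIFT);

    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
        is_mode_supported = true;

    /* Disable the MMU first ... */
    csr_write(CSR_SATP, 0);
    asm volatile ( "sfence.vma" );

    /* ... and only then drop the temporary root table entry. */
    stage1_pgtbl_root[index] = paddr_to_pte(0, 0);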
> > +    return is_mode_supported;
> > +}
> > +
> > +/*
> > + * setup_initial_pagetables:
> > + *
> > + * Build the page tables for Xen that map the following:
> > + *  1. Calculate page table's level numbers.
> > + *  2. Init mmu description structure.
> > + *  3. Check that linker addresses range doesn't overlap
> > + *     with load addresses range
> > + *  4. Map all linker addresses and load addresses ( it shouldn't
> > + *     be 1:1 mapped and will be 1:1 mapped only in case if
> > + *     linker address is equal to load address ) with
> > + *     RW permissions by default.
> > + *  5. Setup proper PTE permissions for each section.
> > + */
> > +void __init setup_initial_pagetables(void)
> > +{
> > +    struct mmu_desc mmu_desc = { 0, 0, NULL, 0 };
>
> Just {} ought to do as initializer, but if you want to spell things out,
> then please use 0 / NULL consistently for integral / pointer type fields.
Thanks.

> > +    /*
> > +     * Access to _{stard, end } is always PC-relative
>
> I guess you mean _start. For just a leading underscore I also don't think
> using this braced form is useful.
OK, then I'll update the comment without using the braced form.

> > +     * thereby when access them we will get load adresses
> > +     * of start and end of Xen
> > +     * To get linker addresses LOAD_TO_LINK() is required
> > +     * to use
> > +     */
> > +    unsigned long load_start = (unsigned long)_start;
> > +    unsigned long load_end = (unsigned long)_end;
> > +    unsigned long linker_start = LOAD_TO_LINK(load_start);
> > +    unsigned long linker_end = LOAD_TO_LINK(load_end);
> > +
> > +    if ( (linker_start != load_start) &&
> > +         (linker_start <= load_end) && (load_start <= linker_end) ) {
>
> Nit: Style (brace placement).
Thanks.

> > +        early_printk("(XEN) linker and load address ranges overlap\n");
> > +        die();
> > +    }
> > +
> > +    calc_pgtbl_lvls_num(&mmu_desc);
> > +
> > +    if ( !check_pgtbl_mode_support(&mmu_desc, load_start, RV_STAGE1_MODE) )
> > +    {
> > +        early_printk("requested MMU mode isn't supported by CPU\n"
> > +                     "Please choose different in <asm/config.h>\n");
> > +        die();
> > +    }
> > +
> > +    mmu_desc.pgtbl_base = stage1_pgtbl_root;
> > +    mmu_desc.next_pgtbl = stage1_pgtbl_nonroot;
> > +
> > +    setup_initial_mapping(&mmu_desc,
> > +                          linker_start,
> > +                          linker_end,
> > +                          load_start,
> > +                          PTE_LEAF_DEFAULT);
> > +}
> > +
> > +void __init noinline enable_mmu()
> > +{
> > +    /*
> > +     * Calculate a linker time address of the mmu_is_enabled
> > +     * label and update CSR_STVEC with it.
> > +     * MMU is configured in a way where linker addresses are mapped
> > +     * on load addresses so in a case when linker addresses are not equal
> > +     * to load addresses, after MMU is enabled, it will cause
> > +     * an exception and jump to linker time addresses.
> > +     * Otherwise if load addresses are equal to linker addresses the code
> > +     * after mmu_is_enabled label will be executed without exception.
> > +     */
> > +    csr_write(CSR_STVEC, LOAD_TO_LINK((unsigned long)&&mmu_is_enabled));
> > +
> > +    /* Ensure page table writes precede loading the SATP */
> > +    asm volatile("sfence.vma");
> > +
> > +    /* Enable the MMU and load the new pagetable for Xen */
> > +    csr_write(CSR_SATP,
> > +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
>
> Please try to avoid open-coding of existing constructs: Here you mean
> either PFN_DOWN() or paddr_to_pfn() (you see, we have even two). (As I
> notice I did overlook at least one similar earlier instance.)
Sure. Thanks.

> > +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > +
> > +    asm volatile(".align 2");
> > +    mmu_is_enabled:
>
> How in the world is one to spot this label? Yes, it shouldn't be entirely
> unindented. But it also should be indented less than the surrounding code
> (with the exception of switch() statement case labels, where a non-case
> label might be indented the same as case ones). Our rule of thumb is to
> indent such labels by a single space.
Thanks. It looks like I misunderstood the previous comment about label
indentation.
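So that part of enable_mmu() would turn into something like this (sketch
only, assuming PFN_DOWN() is usable here the same way as on the other
architectures):

    /* Enable the MMU and load the new pagetable for Xen */
    csr_write(CSR_SATP,
              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
              RV_STAGE1_MODE << SATP_MODE_SHIFT);

    asm volatile ( ".align 2" );
 mmu_is_enabled:
    /* ... */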
> > +    /*
> > +     * Stack should be re-inited as:
> > +     * 1. Right now an address of the stack is relative to load time
> > +     *    addresses what will cause an issue in case of load start address
> > +     *    isn't equal to linker start address.
> > +     * 2. Addresses in stack are all load time relative which can be an
> > +     *    issue in case when load start address isn't equal to linker
> > +     *    start address.
> > +     */
> > +    asm volatile ("mv sp, %0" : : "r"((unsigned long)cpu0_boot_stack + STACK_SIZE));
>
> Nit: Style (overly long line).
>
> What's worse - I don't think it is permitted to alter sp in the middle of
> a function. The compiler may maintain local variables on the stack which
> don't correspond to any programmer specified ones, including pointer ones
> which point into the stack frame. This is specifically why both x86 and
> Arm have switch_stack_and_jump() macros.
But the macro (from Arm) looks equivalent to the code mentioned above:

    #define switch_stack_and_jump(stack, fn) do {                            \
        asm volatile ("mov sp,%0; b " STR(fn) : :                            \
                      "r" (stack), "X" (fn) : "memory" );                    \
        unreachable();                                                       \
    } while ( false )

Here is part of the disassembled enable_mmu():

    ffffffffc004aedc:   18079073    csrw    satp,a5
    ffffffffc004aee0:   00016797    auipc   a5,0x16
    ffffffffc004aee4:   12078793    addi    a5,a5,288
    ffffffffc004aee8:   813e        mv      sp,a5
    ffffffffc004af00:   0f4000ef    jal     ra,ffffffffc004aff4 <cont_after_mmu_is_enabled>
    ...

> > +    /*
> > +     * We can't return to the caller because the stack was reseted
> > +     * and it may have stash some variable on the stack.
> > +     * Jump to a brand new function as the stack was reseted
> > +     */
>
> Nit: Style (indentation).
Thanks.

> > +    cont_after_mmu_is_enabled();
> > +}
> > +
> > diff --git a/xen/arch/riscv/riscv64/head.S b/xen/arch/riscv/riscv64/head.S
> > index 8887f0cbd4..b3309d902c 100644
> > --- a/xen/arch/riscv/riscv64/head.S
> > +++ b/xen/arch/riscv/riscv64/head.S
> > @@ -1,4 +1,5 @@
> >  #include <asm/asm.h>
> > +#include <asm/asm-offsets.h>
> >  #include <asm/riscv_encoding.h>
> >
> >          .section .text.header, "ax", %progbits
> > @@ -32,3 +33,4 @@ ENTRY(start)
> >          add     sp, sp, t0
> >
> >          tail    start_xen
> > +
>
> ??? Shouldn't it be just the one empty line at the end of a file?
>
> > --- a/xen/arch/riscv/xen.lds.S
> > +++ b/xen/arch/riscv/xen.lds.S
> > @@ -136,6 +136,7 @@ SECTIONS
> >      . = ALIGN(POINTER_ALIGN);
> >      __init_end = .;
> >
> > +    . = ALIGN(PAGE_SIZE);
> >      .bss : {                     /* BSS */
> >          __bss_start = .;
> >          *(.bss.stack_aligned)
>
> Why do you need this? You properly use __aligned(PAGE_SIZE) for the
> page tables you define, and PAGE_SIZE wouldn't be enough here anyway
> if STACK_SIZE > PAGE_SIZE (as .bss.stack_aligned comes first). The
> only time you'd need such an ALIGN() is if the following label
> (__bss_start in this case) needed to be aligned at a certain
> boundary. (I'm a little puzzled though that __bss_start isn't
> [immediately] preceded by ". = ALIGN(POINTER_ALIGN);" - didn't .bss
> clearing rely on such alignment?)
ALIGN(PAGE_SIZE) isn't needed anymore. I used it to get a 4k-aligned
physical address for the PTE when I mapped each section separately (that
was the case in the previous version of the MMU patch series).

Regarding ". = ALIGN(POINTER_ALIGN);": I would say it is enough to have
__bss_end aligned (which was done) to be sure that we can clear
__SIZEOF_POINTER__ bytes on each iteration of the .bss clearing loop,
without worrying that the size of .bss may not be divisible by
__SIZEOF_POINTER__.
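Roughly, the clearing relies on this pattern (a hypothetical C rendering
just for illustration; the real clearing is done in early boot code):

    /*
     * Zero .bss one pointer-sized word at a time.  The loop strides by
     * __SIZEOF_POINTER__, which is why the linker script keeps __bss_end
     * POINTER_ALIGN-aligned.
     */
    extern unsigned long __bss_start[], __bss_end[];

    static void clear_bss(void)
    {
        unsigned long *p;

        for ( p = __bss_start; p < __bss_end; p++ )
            *p = 0;
    }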
~ Oleksii