Re: [PATCH v3 8/9] xen/riscv: page table handling
On 24.07.2024 17:31, Oleksii Kurochko wrote:
> Introduces an implementation of map_pages_to_xen() which requires multiple
> functions to work with page tables/entries:
> - xen_pt_update()
> - xen_pt_mapping_level()
> - xen_pt_update_entry()
> - xen_pt_next_level()
> - xen_pt_check_entry()

I question the need for xen_ prefixes here.

> During the mentioned above function it is necessary to create Xen tables
> or map/unmap them. For that it were introduced the following functions:

You start the description with "Introduces ..." (yet better would be
"Introduce ..."), just to switch back to past tense here again. You want
to get used to describing what a patch does in imperative form. That'll
make it easier to tell those parts from parts describing present / past
behavior. (A patch description does specifically not describe what you
did prior to submitting the patch; it describes what the patch itself
does.)

> - create_xen_table()
> - xen_map_table()
> - xen_unmap_table()
>
> Also it was introduced various internal macros for convience started with
> _PAGE_* which almost duplicate the bits in PTE except _PAGE_BLOCK which
> actually doesn't present in PTE which indicates that page larger then 4k
> is needed. RISC-V doesn't have a specific bit in PTE and it detect if it
> is a superpage or not only by using pte.x and pte.r. From the RISC-V spec:
> ```
> ...
> 4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 5.
>    Otherwise, this PTE is a pointer to the next level of the page table.
> ... .
> 5. A leaf PTE has been found.
> ...
> ...
> ```
>
> Except that it was introduced flush_xen_tlb_range_va() for TLB flushing
> accross CPUs when PTE for requested mapping was properly updated.

On top of what I said above, I think here you mean "In addition ..." or
"Additionally ...".

> --- a/xen/arch/riscv/Kconfig
> +++ b/xen/arch/riscv/Kconfig
> @@ -2,6 +2,7 @@ config RISCV
>      def_bool y
>      select FUNCTION_ALIGNMENT_16B
>      select GENERIC_BUG_FRAME
> +    select GENERIC_PT
>      select HAS_DEVICE_TREE
>      select HAS_PMAP

Stray leftover?

> @@ -27,6 +29,19 @@ static inline void flush_xen_tlb_range_va_local(vaddr_t va,
>      }
>  }
>
> +/*
> + * Flush a range of VA's hypervisor mappings from the TLB of all
> + * processors in the inner-shareable domain.
> + */
> +static inline void flush_xen_tlb_range_va(vaddr_t va,
> +                                          unsigned long size)
> +{
> +    if ( sbi_has_rfence() )
> +        sbi_remote_sfence_vma(NULL, va, size);
> +    else
> +        BUG_ON("IPI support is need for remote TLB flush");
> +}

static inline void flush_xen_tlb_range_va(vaddr_t va, size_t size)
{
    BUG_ON(!sbi_has_rfence());
    sbi_remote_sfence_vma(NULL, va, size);
}

? Plus once again I'm uncertain about the usefulness of the "xen" infix.

> --- a/xen/arch/riscv/include/asm/mm.h
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -42,6 +42,8 @@ static inline void *maddr_to_virt(paddr_t ma)
>  #define virt_to_mfn(va)     __virt_to_mfn(va)
>  #define mfn_to_virt(mfn)    __mfn_to_virt(mfn)
>
> +#define pte_get_mfn(pte)    maddr_to_mfn(pte_to_paddr(pte))

Along the lines of the naming remark elsewhere: mfn_from_pte()?
pte_to_paddr() is ambiguous, too: What paddr is it that the construct
obtains? The one matching the address of the PTE (which may be of
interest when accessing page tables), or the one encoded in the PTE?
Imo paddr_from_pte() or pte_get_paddr() would again better express what
is being done. You may want to take a look at x86's page.h.
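To make this more concrete, the shape I have in mind would be something
like (just a sketch; neither helper exists today, and the exact names
are of course up for discussion):

#define paddr_from_pte(pte) pte_to_paddr(pte) /* the address encoded in the PTE */
#define mfn_from_pte(pte)   maddr_to_mfn(paddr_from_pte(pte))

with pte_to_paddr() then perhaps disappearing into paddr_from_pte()
altogether.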
> --- a/xen/arch/riscv/include/asm/page-bits.h
> +++ b/xen/arch/riscv/include/asm/page-bits.h
> @@ -3,6 +3,42 @@
>  #ifndef __RISCV_PAGE_BITS_H__
>  #define __RISCV_PAGE_BITS_H__
>
> +/*
> + * PTE format:
> + * | XLEN-1  10 | 9           8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
> + *       PFN      reserved for SW  D   A   G   U   X   W   R   V
> + */
> +
> +#define _PAGE_PRESENT   BIT(0, UL)

The acronym being V, would _PAGE_VALID maybe be a better name? Just like
it's PTE_VALID? Speaking of which: Why do you need both PTE_* and
_PAGE_*? How will one determine which one to use where?

Plus imo page-bits.h is the wrong header to put these in. In fact I
notice abuse of this header has already proliferated: Both PPC and
RISC-V include this header explicitly, when it was introduced for just
xen/page-size.h to include directly (and the definitions to put there
are solely ones related to what xen/page-size.h needs / provides). The
underlying idea was to simplify header dependencies. With what PPC and
RISC-V do, that's being undermined.

> +#define _PAGE_READ      BIT(1, UL)    /* Readable */
> +#define _PAGE_WRITE     BIT(2, UL)    /* Writable */
> +#define _PAGE_EXEC      BIT(3, UL)    /* Executable */
> +#define _PAGE_USER      BIT(4, UL)    /* User */
> +#define _PAGE_GLOBAL    BIT(5, UL)    /* Global */
> +#define _PAGE_ACCESSED  BIT(6, UL)    /* Set by hardware on any access */
> +#define _PAGE_DIRTY     BIT(7, UL)    /* Set by hardware on any write */
> +#define _PAGE_SOFT      BIT(8, UL)    /* Reserved for software */

The diagram says that's two bits.

> +/*
> + * There is no such bits in PTE format for RISC-V.
> + *
> + * _PAGE_BLOCK was introduced to have ability to tell that superpage
> + * should be allocated.
> + */
> +#define _PAGE_BLOCK     BIT(9, UL)

Imo, like on x86, super-page mappings should be the default whenever
possible. Callers _not_ wanting such can request so - see x86'es
MAP_SMALL_PAGES.

> +#define _PAGE_W_BIT     2
> +#define _PAGE_XN_BIT    3
> +#define _PAGE_RO_BIT    1
> +
> +/* TODO: move to somewhere generic part/header ? */
> +#define _PAGE_XN        (1U << _PAGE_XN_BIT)
> +#define _PAGE_RO        (1U << _PAGE_RO_BIT)
> +#define _PAGE_W         (1U << _PAGE_W_BIT)
> +#define PAGE_XN_MASK(x) (((x) >> _PAGE_XN_BIT) & 0x1U)
> +#define PAGE_RO_MASK(x) (((x) >> _PAGE_RO_BIT) & 0x1U)
> +#define PAGE_W_MASK(x)  (((x) >> _PAGE_W_BIT) & 0x1U)

What are all of these about? Why PAGE_XN when you have a positive X bit
in the PTEs? And why 0x1U when plain 1 would do? All the PAGE_*_MASK
ones are also badly named. Constructs of that name should extract the
bit in its position (i.e. not shifted to bit 0), or be the name of a
constant to use to do so.

> --- a/xen/arch/riscv/include/asm/page.h
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -34,6 +34,7 @@
>  #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
>  #define PTE_TABLE                   (PTE_VALID)
>
> +#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
>  #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE | PTE_WRITABLE)

No PAGE_HYPERVISOR_RX?
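(If one is wanted, it would presumably just follow the pattern of its
two siblings above - merely a sketch, assuming read+execute is indeed
the combination wanted for hypervisor text mappings:

#define PAGE_HYPERVISOR_RX          (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE)

)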
> @@ -43,13 +44,68 @@
>
>  #define pt_index(lvl, va)   (pt_linear_offset((lvl), (va)) & VPN_MASK)
>
> -/* Page Table entry */
> +#define FIRST_SIZE (XEN_PT_LEVEL_SIZE(2))
> +
> +#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & ((_AC(1, U) << PAGETABLE_ORDER) - 1))
> +
> +#if RV_STAGE1_MODE > SATP_MODE_SV48
> +#error "need to to update DECLARE_OFFSETS macros"
> +#else
> +
> +#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
> +#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
> +#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
> +#define l3_table_offset(va) TABLE_OFFSET(pt_linear_offset(3, va))
> +
> +/* Generate an array @var containing the offset for each level from @addr */
> +#define DECLARE_OFFSETS(var, addr)          \
> +    const unsigned int var[4] = {           \
> +        l0_table_offset(addr),              \
> +        l1_table_offset(addr),              \
> +        l2_table_offset(addr),              \
> +        l3_table_offset(addr)               \
> +    }

Why would this need to have 4 entries in SV39 mode?

> +#endif
> +
>  typedef struct {
> +    unsigned long v:1;
> +    unsigned long r:1;
> +    unsigned long w:1;
> +    unsigned long x:1;
> +    unsigned long u:1;
> +    unsigned long g:1;
> +    unsigned long a:1;
> +    unsigned long d:1;

bool for all of these?

> +    unsigned long rsw:2;
> +#if RV_STAGE1_MODE == SATP_MODE_SV39
> +    unsigned long ppn0:9;
> +    unsigned long ppn1:9;
> +    unsigned long ppn2:26;
> +    unsigned long rsw2:7;

"rsw" above appears to mean "reserved for software use". What does "rsw"
here mean? Should this field be "rsv"?

> +    unsigned long pbmt:2;
> +    unsigned long n:1;
> +#elif RV_STAGE1_MODE == SATP_MODE_SV48
> +    unsigned long ppn0:9;
> +    unsigned long ppn1:9;
> +    unsigned long ppn2:9;
> +    unsigned long ppn3:17;
> +    unsigned long rsw2:7;
> +    unsigned long pbmt:2;
> +    unsigned long n:1;
> +#else
> +#error "Add proper bits for SATP_MODE"
> +#endif
> +} pt_t;

I can't spot a use anywhere of e.g. ppn0. Would be nice to understand in
what contexts these bitfields are going to be used. I notice you
specifically don't use them in e.g. pte_is_table().

> +/* Page Table entry */
> +typedef union {
>  #ifdef CONFIG_RISCV_64
>      uint64_t pte;
>  #else
>      uint32_t pte;
>  #endif
> +pt_t bits;
>  } pte_t;

Struct field wants indenting.

> @@ -70,6 +126,21 @@ static inline bool pte_is_valid(pte_t p)
>      return p.pte & PTE_VALID;
>  }
>
> +inline bool pte_is_table(const pte_t p, unsigned int level)
> +{
> +    (void) level;
> +
> +    return (((p.pte) & (PTE_VALID

No need to parenthesize p.pte.

> +                       | PTE_READABLE
> +                       | PTE_WRITABLE
> +                       | PTE_EXECUTABLE)) == PTE_VALID);

Indentation of these lines looks to be off by 1.

> +}
> +
> +static inline bool pte_is_mapping(const pte_t pte, unsigned int level)
> +{
> +    return !pte_is_table(pte, level);
> +}

Don't you mean V=1 and (R=1 or X=1) here?

> --- /dev/null
> +++ b/xen/arch/riscv/pt.c
> @@ -0,0 +1,410 @@
> +#include <xen/bug.h>
> +#include <xen/domain_page.h>
> +#include <xen/errno.h>
> +#include <xen/mm.h>
> +#include <xen/mm-frame.h>
> +#include <xen/pmap.h>
> +#include <xen/spinlock.h>
> +
> +#include <asm/flushtlb.h>
> +#include <asm/page.h>
> +
> +static inline const mfn_t get_root_page(void)
> +{
> +    unsigned long root_maddr = csr_read(CSR_SATP) << PAGE_SHIFT;

You want to isolate the PPN part of the register before shifting.
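I.e. something along the lines of (untested; SATP_PPN_MASK is merely a
placeholder name here - I didn't check what constant, if any, the tree
already provides for the PPN field of the register):

    unsigned long root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;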
> +    return maddr_to_mfn(root_maddr);
> +}
> +
> +static inline void set_pte_permissions(pte_t *pte, unsigned int flags)
> +{
> +    pte->bits.r = PAGE_RO_MASK(flags);
> +    pte->bits.x = ~PAGE_XN_MASK(flags);
> +    pte->bits.w = PAGE_W_MASK(flags);
> +
> +    pte->pte |= PTE_ACCESSED | PTE_DIRTY;
> +}

As indicated elsewhere, imo objects of type pte_t should not be floating
around in the system without having permissions set on them. The helper
creating a pte_t should take both MFN and (permission) flags. Further,
with "flags" suitably arranged, all you need to do is to OR them into
the shifted PPN; there's no need for a whopping 4 assignments, even if
maybe the compiler can fold some of this.

> +/* Sanity check of the entry */
> +static bool xen_pt_check_entry(pte_t entry, mfn_t mfn, unsigned int level,
> +                               unsigned int flags)

The comment wants extending to indicate what the parameters mean wrt
what is going to be checked. For example, ...

> +{
> +    /* Sanity check when modifying an entry. */
> +    if ( mfn_eq(mfn, INVALID_MFN) )

... it's unclear to me why incoming INVALID_MFN would indicate
modification of an entry, whereas further down _PAGE_PRESENT supposedly
indicates insertion.

> +    {
> +        /* We don't allow modifying an invalid entry. */
> +        if ( !pte_is_valid(entry) )
> +        {
> +            printk("Modifying invalid entry is not allowed.\n");
> +            return false;
> +        }
> +
> +        /* We don't allow modifying a table entry */
> +        if ( !pte_is_mapping(entry, level) )

With what the comment says, why not pte_is_table()?

> +        {
> +            printk("Modifying a table entry is not allowed.\n");
> +            return false;
> +        }
> +    }
> +    /* Sanity check when inserting a mapping */
> +    else if ( flags & _PAGE_PRESENT )
> +    {
> +        /* We should be here with a valid MFN. */
> +        ASSERT(!mfn_eq(mfn, INVALID_MFN));
> +
> +        /*
> +         * We don't allow replacing any valid entry.
> +         *
> +         * Note that the function xen_pt_update() relies on this
> +         * assumption and will skip the TLB flush. The function will need
> +         * to be updated if the check is relaxed.
> +         */
> +        if ( pte_is_valid(entry) )
> +        {
> +            if ( pte_is_mapping(entry, level) )
> +                printk("Changing MFN for a valid entry is not allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
> +                        mfn_x(pte_get_mfn(entry)), mfn_x(mfn));

Nit: Indentation. (I'm once again worried by all of these printk()s
anyway.)

> +#define XEN_TABLE_MAP_FAILED 0
> +#define XEN_TABLE_SUPER_PAGE 1
> +#define XEN_TABLE_NORMAL_PAGE 2
> +
> +/*
> + * Take the currently mapped table, find the corresponding entry,
> + * and map the next table, if available.
> + *
> + * The read_only parameters indicates whether intermediate tables should
> + * be allocated when not present.

"read_only" would have a different meaning to me. Maybe use inverted
sense with a name like "alloc"?

> + * Return values:
> + *  XEN_TABLE_MAP_FAILED: Either read_only was set and the entry
> + *  was empty, or allocating a new page failed.
> + *  XEN_TABLE_NORMAL_PAGE: next level mapped normally
> + *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
> + */
> +static int xen_pt_next_level(bool read_only, unsigned int level,
> +                             pte_t **table, unsigned int offset)
> +{
> +    pte_t *entry;
> +    int ret;
> +    mfn_t mfn;
> +
> +    entry = *table + offset;
> +
> +    if ( !pte_is_valid(*entry) )
> +    {
> +        if ( read_only )
> +            return XEN_TABLE_MAP_FAILED;
> +
> +        ret = create_xen_table(entry);
> +        if ( ret )
> +            return XEN_TABLE_MAP_FAILED;
> +    }
> +
> +    if ( pte_is_mapping(*entry, level) )
> +    {
> +        return XEN_TABLE_SUPER_PAGE;
> +    }

Please be consistent with braces around single statements.

> +    mfn = pte_get_mfn(*entry);
> +
> +    xen_unmap_table(*table);
> +    *table = xen_map_table(mfn);
> +
> +    return XEN_TABLE_NORMAL_PAGE;
> +}

Normal page? Not normal table?

> +/* Update an entry at the level @target. */
> +static int xen_pt_update_entry(mfn_t root, unsigned long virt,
> +                               mfn_t mfn, unsigned int arch_target,
> +                               unsigned int flags)
> +{
> +    int rc;
> +    unsigned int level = HYP_PT_ROOT_LEVEL;
> +    unsigned int arch_level = level;

Why two level variables?

> +    unsigned int target = arch_target;
> +    pte_t *table;
> +    /*
> +     * The intermediate page tables are read-only when the MFN is not valid
> +     * This means we either modify permissions or remove an entry.
> +     */
> +    bool read_only = mfn_eq(mfn, INVALID_MFN);

I'm afraid I can't make a connection between the incoming MFN being
INVALID_MFN and intermediate tables being intended to remain unaltered.

> +    pte_t pte, *entry;
> +
> +    /* convenience aliases */
> +    DECLARE_OFFSETS(offsets, (paddr_t)virt);

Unless for a mode where translation is disabled, I don't think virtual
addresses should ever be converted to paddr_t.

> +    table = xen_map_table(root);
> +    for ( ; level > target; level--, arch_level = level )
> +    {
> +        rc = xen_pt_next_level(read_only, arch_level, &table, offsets[arch_level]);
> +        if ( rc == XEN_TABLE_MAP_FAILED )
> +        {
> +            /*
> +             * We are here because xen_pt_next_level has failed to map
> +             * the intermediate page table (e.g the table does not exist
> +             * and the pt is read-only). It is a valid case when
> +             * removing a mapping as it may not exist in the page table.
> +             * In this case, just ignore it.
> +             */
> +            if ( flags & _PAGE_PRESENT )
> +            {
> +                printk("%s: Unable to map level %u\n", __func__, arch_level);
> +                rc = -ENOENT;
> +                goto out;
> +            }
> +            else
> +            {
> +                rc = 0;
> +                goto out;
> +            }

Please pull out such identical "goto out".

> +        }
> +        else if ( rc != XEN_TABLE_NORMAL_PAGE ) {

Nit: Brace placement. Initially I was wondering how this would have
compiled for you, until I spotted the opening brace ...

> +            break;
> +        }

... matching this closing one (both of which are really unnecessary).

> +    }
> +
> +    if ( arch_level != arch_target )
> +    {
> +        printk("%s: Shattering superpage is not supported\n", __func__);
> +        rc = -EOPNOTSUPP;
> +        goto out;
> +    }
> +
> +    entry = table + offsets[arch_level];
> +
> +    rc = -EINVAL;
> +    if ( !xen_pt_check_entry(*entry, mfn, arch_level, flags) )
> +        goto out;
> +
> +    /* We are removing the page */
> +    if ( !(flags & _PAGE_PRESENT) )
> +        memset(&pte, 0x00, sizeof(pte));
> +    else
> +    {
> +        /* We are inserting a mapping => Create new pte. */
> +        if ( !mfn_eq(mfn, INVALID_MFN) )
> +        {
> +            pte = mfn_to_xen_entry(mfn, PTE_VALID);
> +        }
> +        else /* We are updating the permission => Copy the current pte. */
> +            pte = *entry;
> +
> +        set_pte_permissions(&pte, flags);
> +    }
> +
> +    write_pte(entry, pte);
> +
> +    rc = 0;
> +
> +out:

Nit: Style (label needs indenting).
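(To illustrate the "pull out such identical 'goto out'" remark above -
roughly this shape, not meant to be taken verbatim:

            if ( flags & _PAGE_PRESENT )
            {
                printk("%s: Unable to map level %u\n", __func__, arch_level);
                rc = -ENOENT;
            }
            else
                rc = 0;

            goto out;

which also lets the "else" braces go away.)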
> +    xen_unmap_table(table);
> +
> +    return rc;
> +}
> +
> +static DEFINE_SPINLOCK(xen_pt_lock);
> +
> +/* Return the level where mapping should be done */
> +static int xen_pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned long nr,
> +                                unsigned int flags)
> +{
> +    unsigned int level = 0;
> +    unsigned long mask;
> +    int i;

unsigned int?

> +    /*
> +     * Don't take into account the MFN when removing mapping (i.e
> +     * MFN_INVALID) to calculate the correct target order.
> +     *
> +     * `vfn` and `mfn` must be both superpage aligned.
> +     * They are or-ed together and then checked against the size of
> +     * each level.
> +     *
> +     * `left` is not included and checked separately to allow
> +     * superpage mapping even if it is not properly aligned (the
> +     * user may have asked to map 2MB + 4k).
> +     */
> +    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
> +    mask |= vfn;
> +
> +    /*
> +     * Always use level 0 ( 4k mapping ) mapping unless the caller request
> +     * block mapping.
> +     */
> +    if ( likely(!(flags & _PAGE_BLOCK)) )
> +        return level;

This is independent of the calculation of "mask" and hence can move up.

> +    for ( i = HYP_PT_ROOT_LEVEL; i > 0; i-- )
> +    {
> +        if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) &&
> +            (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) )

Nit: Indentation again.

> +        {
> +            level = i;
> +            break;
> +        }
> +    }
> +
> +    return level;
> +}
> +
> +static int xen_pt_update(unsigned long virt,
> +                         mfn_t mfn,
> +                         /* const on purpose as it is used for TLB flush */
> +                         const unsigned long nr_mfns,

I'm afraid I don't see what the comment is actually trying to tell me.
If this is about you wanting to make sure the function arguments aren't
altered prematurely, then why would the same not apply to virt, mfn, and
flags (all of which matter at the time the TLB flush is initiated)?

> +                         unsigned int flags)
> +{
> +    int rc = 0;
> +    unsigned long vfn = virt >> PAGE_SHIFT;
> +    unsigned long left = nr_mfns;
> +
> +    const mfn_t root = get_root_page();
> +
> +    /*
> +     * It is bad idea to have mapping both writeable and
> +     * executable.
> +     * When modifying/creating mapping (i.e _PAGE_PRESENT is set),
> +     * prevent any update if this happen.
> +     */
> +    if ( (flags & _PAGE_PRESENT) && !PAGE_RO_MASK(flags) &&
> +         !PAGE_XN_MASK(flags) )
> +    {
> +        printk("Mappings should not be both Writeable and Executable.\n");
> +        return -EINVAL;
> +    }
> +
> +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> +    {
> +        printk("The virtual address is not aligned to the page-size.\n");
> +        return -EINVAL;
> +    }
> +
> +    spin_lock(&xen_pt_lock);
> +
> +    while ( left )
> +    {
> +        unsigned int order, level;
> +
> +        level = xen_pt_mapping_level(vfn, mfn, left, flags);
> +        order = XEN_PT_LEVEL_ORDER(level);
> +
> +        ASSERT(left >= BIT(order, UL));
> +
> +        rc = xen_pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
> +                                 flags);
> +        if ( rc )
> +            break;
> +
> +        vfn += 1U << order;
> +        if ( !mfn_eq(mfn, INVALID_MFN) )
> +            mfn = mfn_add(mfn, 1U << order);
> +
> +        left -= (1U << order);
> +
> +        if ( rc )
> +            break;
> +    }
> +
> +    /*
> +     * The TLBs flush can be safely skipped when a mapping is inserted
> +     * as we don't allow mapping replacement (see xen_pt_check_entry()).
> +     *
> +     * For all the other cases, the TLBs will be flushed unconditionally
> +     * even if the mapping has failed. This is because we may have
> +     * partially modified the PT. This will prevent any unexpected
> +     * behavior afterwards.
> +     */
> +    if ( !((flags & _PAGE_PRESENT) && !mfn_eq(mfn, INVALID_MFN)) )
> +        flush_xen_tlb_range_va(virt, PAGE_SIZE * nr_mfns);

See my comments elsewhere towards TLB behavior on RISC-V. The indicated
skipping of a flush is safe only if not-present entries cannot be put
in the TLB.

> +    spin_unlock(&xen_pt_lock);
> +
> +    return rc;
> +}
> +
> +int map_pages_to_xen(unsigned long virt,
> +                     mfn_t mfn,
> +                     unsigned long nr_mfns,
> +                     unsigned int flags)
> +{
> +    return xen_pt_update(virt, mfn, nr_mfns, flags);
> +}

Why this wrapping of two functions taking identical arguments?

Jan