[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 8/9] xen/riscv: page table handling



On Tue, 2024-07-30 at 16:22 +0200, Jan Beulich wrote:
> On 24.07.2024 17:31, Oleksii Kurochko wrote:
> > Introduces an implementation of map_pages_to_xen() which requires
> > multiple
> > functions to work with page tables/entries:
> > - xen_pt_update()
> > - xen_pt_mapping_level()
> > - xen_pt_update_entry()
> > - xen_pt_next_level()
> > - xen_pt_check_entry()
> 
> I question the need for xen_ prefixes here.
xen_ prefixes could be dropped, there is no big meaning in them.

> 
> > --- a/xen/arch/riscv/Kconfig
> > +++ b/xen/arch/riscv/Kconfig
> > @@ -2,6 +2,7 @@ config RISCV
> >   def_bool y
> >   select FUNCTION_ALIGNMENT_16B
> >   select GENERIC_BUG_FRAME
> > + select GENERIC_PT
> >   select HAS_DEVICE_TREE
> >   select HAS_PMAP
> 
> Stray leftover?
Missed to drop after I tried to make these changes as a part of common
code. Thanks.

> > 
> 
> > --- a/xen/arch/riscv/include/asm/page-bits.h
> > +++ b/xen/arch/riscv/include/asm/page-bits.h
> > @@ -3,6 +3,42 @@
> >  #ifndef __RISCV_PAGE_BITS_H__
> >  #define __RISCV_PAGE_BITS_H__
> >  
> > +/*
> > + * PTE format:
> > + * | XLEN-1  10 | 9             8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
> > + *       PFN      reserved for SW   D   A   G   U   X   W   R   V
> > + */
> > +
> > +#define _PAGE_PRESENT   BIT(0, UL)
> 
> The acronym being V, would _PAGE_VALID maybe be a better name? Just
> like it's PTE_VALID? Speaking of which: Why do you need both PTE_*
> and _PAGE_*? How will one determine which one to use where? Plus imo
> page-bits.h is the wrong header to put these in. In fact I notice
> abuse of this header has already proliferated: Both PPC and RISC-V
> include this header explicitly, when it was introduced for just
> xen/page-size.h to include directly (and the definitions to put there
> are solely ones related to what xen/page-size.h needs / provides).
> The underlying idea what to simplify header dependencies. With what
> PPC and RISC-V do, that's being undermined.
I introduced them to be aligned with ARM code as I tried to use it as a
common code but after 'page table handling' was reworked to be arch-
specific _PAGE_* defintions could be dropped and only PTE_* ones could
be used.

> 
> > +#define _PAGE_READ      BIT(1, UL)    /* Readable */
> > +#define _PAGE_WRITE     BIT(2, UL)    /* Writable */
> > +#define _PAGE_EXEC      BIT(3, UL)    /* Executable */
> > +#define _PAGE_USER      BIT(4, UL)    /* User */
> > +#define _PAGE_GLOBAL    BIT(5, UL)    /* Global */
> > +#define _PAGE_ACCESSED  BIT(6, UL)    /* Set by hardware on any
> > access */
> > +#define _PAGE_DIRTY     BIT(7, UL)    /* Set by hardware on any
> > write */
> > +#define _PAGE_SOFT      BIT(8, UL)    /* Reserved for software */
> 
> The diagram says that's two bits.
I meant here as a start bit index for _PAGE_SOFT.

> 
> > +/*
> > + * There is no such bits in PTE format for RISC-V.
> > + *
> > + * _PAGE_BLOCK was introduced to have ability to tell that
> > superpage
> > + * should be allocated.
> > + */
> > +#define _PAGE_BLOCK         BIT(9, UL)
> 
> Imo, like on x86, super-page mappings should be the default whenever
> possible.
> Callers _not_ wanting such can request so - see x86'es
> MAP_SMALL_PAGES.
> 
> > +#define _PAGE_W_BIT     2
> > +#define _PAGE_XN_BIT    3
> > +#define _PAGE_RO_BIT    1
> > +
> > +/* TODO: move to somewhere generic part/header ? */
> > +#define _PAGE_XN        (1U << _PAGE_XN_BIT)
> > +#define _PAGE_RO        (1U << _PAGE_RO_BIT)
> > +#define _PAGE_W         (1U << _PAGE_W_BIT)
> > +#define PAGE_XN_MASK(x) (((x) >> _PAGE_XN_BIT) & 0x1U)
> > +#define PAGE_RO_MASK(x) (((x) >> _PAGE_RO_BIT) & 0x1U)
> > +#define PAGE_W_MASK(x)  (((x) >> _PAGE_W_BIT) & 0x1U)
> 
> What are all of these about? Why PAGE_XN when you have a positive X
> bit in the
> PTEs? And why 0x1U when plain 1 would do? All the PAGE_*_MASK ones
> are also
> badly named. Constructs of that name should extract the bit in its
> position
> (i.e. not shifted to bit 0), or be the name of a constant to use to
> do so.
The same reason as above initially I tried to have generic code with
ARM but it at the end it was a bad idea. So just need to provide proper
names. Thanks.


> 
> > --- a/xen/arch/riscv/include/asm/page.h
> > +++ b/xen/arch/riscv/include/asm/page.h
> > @@ -34,6 +34,7 @@
> >  #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> >  #define PTE_TABLE                   (PTE_VALID)
> >  
> > +#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
> >  #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> 
> No PAGE_HYPERVISOR_RX?
I haven't used it at the moment, so I haven't provided it.

> 
> > @@ -43,13 +44,68 @@
> >  
> >  #define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) &
> > VPN_MASK)
> >  
> > -/* Page Table entry */
> > +#define FIRST_SIZE (XEN_PT_LEVEL_SIZE(2))
> > +
> > +#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & ((_AC(1, U)
> > << PAGETABLE_ORDER) - 1))
> > +
> > +#if RV_STAGE1_MODE > SATP_MODE_SV48
> > +#error "need to to update DECLARE_OFFSETS macros"
> > +#else
> > +
> > +#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
> > +#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
> > +#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
> > +#define l3_table_offset(va) TABLE_OFFSET(pt_linear_offset(3, va))
> > +
> > +/* Generate an array @var containing the offset for each level
> > from @addr */
> > +#define DECLARE_OFFSETS(var, addr)          \
> > +    const unsigned int var[4] = {           \
> > +        l0_table_offset(addr),              \
> > +        l1_table_offset(addr),              \
> > +        l2_table_offset(addr),              \
> > +        l3_table_offset(addr)               \
> > +    }
> 
> Why would this need to have 4 entries in SV39 mode?
I don't need for Sv39. I just wasn't sure with the better way to have
universal DECLARE_OFFSETS() definition. I will add #ifdef Sv48 around
l3_table_offset and drop 4 in const unsigned int var[4].

> 
> > +#endif
> > +
> >  typedef struct {
> > +    unsigned long v:1;
> > +    unsigned long r:1;
> > +    unsigned long w:1;
> > +    unsigned long x:1;
> > +    unsigned long u:1;
> > +    unsigned long g:1;
> > +    unsigned long a:1;
> > +    unsigned long d:1;
> 
> bool for all of these?
> 
> > +    unsigned long rsw:2;
> > +#if RV_STAGE1_MODE == SATP_MODE_SV39
> > +    unsigned long ppn0:9;
> > +    unsigned long ppn1:9;
> > +    unsigned long ppn2:26;
> > +    unsigned long rsw2:7;
> 
> "rsw" above appear to mean "reserved for software use". What does
> "rsw" here
> mean? Should this field be "rsv"?
rsw2 means:
   Bits 60–54 are reserved for future standard
   use and, until their use is defined by some standard extension, must be
   zeroed by software for
   forward compatibility. If any of these bits are set, a page-fault
   exception is raised.
It would be better to use "rsv" Thanks.

> 
> > +    unsigned long pbmt:2;
> > +    unsigned long n:1;
> > +#elif RV_STAGE1_MODE == SATP_MODE_SV48
> > +    unsigned long ppn0:9;
> > +    unsigned long ppn1:9;
> > +    unsigned long ppn2:9;
> > +    unsigned long ppn3:17;
> > +    unsigned long rsw2:7;
> > +    unsigned long pbmt:2;
> > +    unsigned long n:1;
> > +#else
> > +#error "Add proper bits for SATP_MODE"
> > +#endif
> > +} pt_t;
> 
> I can't spot a use anywhere of e.g. ppn0. Would be nice to understand
> in
> what contexts these bitfields are going to be used. I notice you
> specifically
> don't use them in e.g. pte_is_table().
I don't use them at the moment. I just introduced them for the possible
future using. I can re-check what of them I am using in my private
branches and come up here only with that one which are really used.

> 
> 
> > +}
> > +
> > +static inline bool pte_is_mapping(const pte_t pte, unsigned int
> > level)
> > +{
> > +    return !pte_is_table(pte, level);
> > +}
> 
> Don't you mean V=1 and (R=1 or X=1) here?
Agree, (R=1 or X=1) should be checked here too.

> 
> > +/* Sanity check of the entry */
> > +static bool xen_pt_check_entry(pte_t entry, mfn_t mfn, unsigned
> > int level,
> > +                               unsigned int flags)
> 
> The comment wants extending to indicate what the parameters mean wrt
> what
> is going to be checked. For example, ...
> 
> > +{
> > +    /* Sanity check when modifying an entry. */
> > +    if ( mfn_eq(mfn, INVALID_MFN) )
> 
> ... it's unclear to me why incoming INVALID_MFN would indicate
> modification
> of an entry, whereas further down _PAGE_PRESENT supposedly indicates
> insertion.
The statement inside if isn't correct. It should be:
   if ( (flags & _PAGE_PRESENT) && mfn_eq(mfn, INVALID_MFN) )

INVALID_MFN indicates modification because of how xen_pt_update() is
used:
   int map_pages_to_xen(unsigned long virt,
                        mfn_t mfn,
                        unsigned long nr_mfns,
                        unsigned int flags)
   {
       return xen_pt_update(virt, mfn, nr_mfns, flags);
   }
   
   int __init populate_pt_range(unsigned long virt, unsigned long nr_mfns)
   {
       return xen_pt_update(virt, INVALID_MFN, nr_mfns, _PAGE_PRESENT);
   }
   
   int destroy_xen_mappings(unsigned long s, unsigned long e)
   {
       ASSERT(IS_ALIGNED(s, PAGE_SIZE));
       ASSERT(IS_ALIGNED(e, PAGE_SIZE));
       ASSERT(s <= e);
       return xen_pt_update(s, INVALID_MFN, (e - s) >> PAGE_SHIFT, 0);
   }
   
   int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int
   nf)
   {
       ASSERT(IS_ALIGNED(s, PAGE_SIZE));
       ASSERT(IS_ALIGNED(e, PAGE_SIZE));
       ASSERT(s <= e);
       return xen_pt_update(s, INVALID_MFN, (e - s) >> PAGE_SHIFT, nf);
   }

The idea is the following:
  the MFN is not valid and we are not populating page table. This means
we either modify entry or remove an entry.

> 
> > +    {
> > +        /* We don't allow modifying an invalid entry. */
> > +        if ( !pte_is_valid(entry) )
> > +        {
> > +            printk("Modifying invalid entry is not allowed.\n");
> > +            return false;
> > +        }
> > +
> > +        /* We don't allow modifying a table entry */
> > +        if ( !pte_is_mapping(entry, level) )
> 
> With what the comment say, why not pte_is_table()?

It should be pte_is_table(). I will double check, thanks.

> 
> > +        {
> > +            printk("Modifying a table entry is not allowed.\n");
> > +            return false;
> > +        }
> > +    }
> > +    /* Sanity check when inserting a mapping */
> > +    else if ( flags & _PAGE_PRESENT )
> > +    {
> > +        /* We should be here with a valid MFN. */
> > +        ASSERT(!mfn_eq(mfn, INVALID_MFN));
> > +
> > +        /*
> > +         * We don't allow replacing any valid entry.
> > +         *
> > +         * Note that the function xen_pt_update() relies on this
> > +         * assumption and will skip the TLB flush. The function
> > will need
> > +         * to be updated if the check is relaxed.
> > +         */
> > +        if ( pte_is_valid(entry) )
> > +        {
> > +            if ( pte_is_mapping(entry, level) )
> > +                printk("Changing MFN for a valid entry is not
> > allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
> > +                          mfn_x(pte_get_mfn(entry)), mfn_x(mfn));
> 
> Nit: Indentation. (I'm once again worried by all of these printk()s
> anyway.)
> 
> > +#define XEN_TABLE_MAP_FAILED 0
> > +#define XEN_TABLE_SUPER_PAGE 1
> > +#define XEN_TABLE_NORMAL_PAGE 2
> > +
> > +/*
> > + * Take the currently mapped table, find the corresponding entry,
> > + * and map the next table, if available.
> > + *
> > + * The read_only parameters indicates whether intermediate tables
> > should
> > + * be allocated when not present.
> 
> "read_only" would have a different meaning to me. Maybe use inverted
> sense
> with a name like "alloc"?
Sure, alloc_only would be better correspond to the comment.

> 
> > + * Return values:
> > + *  XEN_TABLE_MAP_FAILED: Either read_only was set and the entry
> > + *  was empty, or allocating a new page failed.
> > + *  XEN_TABLE_NORMAL_PAGE: next level mapped normally
> > + *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
> > + */
> > +static int xen_pt_next_level(bool read_only, unsigned int level,
> > +                             pte_t **table, unsigned int offset)
> > +{
> > +    pte_t *entry;
> > +    int ret;
> > +    mfn_t mfn;
> > +
> > +    entry = *table + offset;
> > +
> > +    if ( !pte_is_valid(*entry) )
> > +    {
> > +        if ( read_only )
> > +            return XEN_TABLE_MAP_FAILED;
> > +
> > +        ret = create_xen_table(entry);
> > +        if ( ret )
> > +            return XEN_TABLE_MAP_FAILED;
> > +    }
> > +
> > +    if ( pte_is_mapping(*entry, level) )
> > +    {
> > +        return XEN_TABLE_SUPER_PAGE;
> > +    }
> 
> Please be consistent with braces around single statements.
> 
> > +    mfn = pte_get_mfn(*entry);
> > +
> > +    xen_unmap_table(*table);
> > +    *table = xen_map_table(mfn);
> > +
> > +    return XEN_TABLE_NORMAL_PAGE;
> > +}
> 
> Normal page? Not normal table?
It just mean that this points not to super_page so potenially the in
the next one table will have an entry that would be normal page.

> 
> > +/* Update an entry at the level @target. */
> > +static int xen_pt_update_entry(mfn_t root, unsigned long virt,
> > +                               mfn_t mfn, unsigned int
> > arch_target,
> > +                               unsigned int flags)
> > +{
> > +    int rc;
> > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > +    unsigned int arch_level = level;
> 
> Why two level variables?
It is not needed anymore, I will drop it.

> 
> > +    unsigned int target = arch_target;
> > +    pte_t *table;
> > +    /*
> > +     * The intermediate page tables are read-only when the MFN is
> > not valid
> > +     * This means we either modify permissions or remove an entry.
> > +     */
> > +    bool read_only = mfn_eq(mfn, INVALID_MFN);
> 
> I'm afraid I can't make a connection between the incoming MFN being
> INVALID_MFN and intermediate tables being intended to remain
> unaltered.

It is becuase of xen_pt_update() is used:
   int __init populate_pt_range(unsigned long virt, unsigned long
   nr_mfns)
   {
       return xen_pt_update(virt, INVALID_MFN, nr_mfns, _PAGE_PRESENT);
   }
So if pt is only populated then they are read_only and so they shouldn't
be allocated what means ptes are only or being removed or modified.

> > +
> > +static int xen_pt_update(unsigned long virt,
> > +                         mfn_t mfn,
> > +                         /* const on purpose as it is used for TLB
> > flush */
> > +                         const unsigned long nr_mfns,
> 
> I'm afraid I don't see what the comment is actually trying to tell
> me.
> If this is about you wanting to make sure the function arguments
> aren't
> altered prematurely, then why would the same not apply to virt, mfn,
> and flags (all of which matter at the time the TLB flush is
> initiated)?
> 
> > +                         unsigned int flags)
> > +{
> > +    int rc = 0;
> > +    unsigned long vfn = virt >> PAGE_SHIFT;
> > +    unsigned long left = nr_mfns;
> > +
> > +    const mfn_t root = get_root_page();
> > +
> > +    /*
> > +     * It is bad idea to have mapping both writeable and
> > +     * executable.
> > +     * When modifying/creating mapping (i.e _PAGE_PRESENT is set),
> > +     * prevent any update if this happen.
> > +     */
> > +    if ( (flags & _PAGE_PRESENT) && !PAGE_RO_MASK(flags) &&
> > +         !PAGE_XN_MASK(flags) )
> > +    {
> > +        printk("Mappings should not be both Writeable and
> > Executable.\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> > +    {
> > +        printk("The virtual address is not aligned to the page-
> > size.\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    spin_lock(&xen_pt_lock);
> > +
> > +    while ( left )
> > +    {
> > +        unsigned int order, level;
> > +
> > +        level = xen_pt_mapping_level(vfn, mfn, left, flags);
> > +        order = XEN_PT_LEVEL_ORDER(level);
> > +
> > +        ASSERT(left >= BIT(order, UL));
> > +
> > +        rc = xen_pt_update_entry(root, vfn << PAGE_SHIFT, mfn,
> > level,
> > +                                    flags);
> > +        if ( rc )
> > +            break;
> > +
> > +        vfn += 1U << order;
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            mfn = mfn_add(mfn, 1U << order);
> > +
> > +        left -= (1U << order);
> > +
> > +        if ( rc )
> > +            break;
> > +    }
> > +
> > +    /*
> > +     * The TLBs flush can be safely skipped when a mapping is
> > inserted
> > +     * as we don't allow mapping replacement (see
> > xen_pt_check_entry()).
> > +     *
> > +     * For all the other cases, the TLBs will be flushed
> > unconditionally
> > +     * even if the mapping has failed. This is because we may have
> > +     * partially modified the PT. This will prevent any unexpected
> > +     * behavior afterwards.
> > +     */
> > +    if ( !((flags & _PAGE_PRESENT) && !mfn_eq(mfn, INVALID_MFN)) )
> > +        flush_xen_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
> 
> See my comments elsewhere towards TLB behavior on RISC-V. The
> indicated
> skipping of a flush is safe only if not-present entries cannot be put
> in the TLB.
> 
> > +    spin_unlock(&xen_pt_lock);
> > +
> > +    return rc;
> > +}
> > +
> > +int map_pages_to_xen(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    return xen_pt_update(virt, mfn, nr_mfns, flags);
> > +}
> 
> Why this wrapping of two functions taking identical arguments?
map_pages_to_xen() sounds more clear regarding the way how it should be
used.

xen_pt_update() will be also used inside other functions. Look at the
example above.

~ Oleksii



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.