[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 6/7] xen/riscv: page table handling



On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > Implement map_pages_to_xen() which requires several
> > functions to manage page tables and entries:
> > - pt_update()
> > - pt_mapping_level()
> > - pt_update_entry()
> > - pt_next_level()
> > - pt_check_entry()
> > 
> > To support these operations, add functions for creating,
> > mapping, and unmapping Xen tables:
> > - create_xen_table()
> > - xen_map_table()
> > - xen_unmap_table()
> 
> I think I commented on this before: Everything is "Xen" in hypervisor
> code. What I think you mean is to map/unmap Xen's own page tables.
> Naming-wise that would be {,un}map_xen_table(), though. Since they
> are static, just {,un}map_table() ought to be unambiguous, too.
I thought that your comment was about pt_*() functions but thanks for
explanation again.

> 
> > Introduce internal macros starting with PTE_* for convenience.
> > These macros closely resemble PTE bits, with the exception of
> > PTE_BLOCK, which indicates that a page larger than 4KB is
> > needed.
> 
> I did comment on this too, iirc: Is there going to be any case where
> large pages are going to be "needed", i.e. not just preferred? If
> not,
> giving the caller control over things the other way around
> (requesting
> 4k mappings are needed, as we have it in x86) may be preferable.
Yes, you did the comment but I thought that it will be enough to
mention that shattering isn't supported now and  also since
pt_update_entry()is used to unmap as well, there could be a need to
unmap e.g. 4K page from 2M block mapping what will a little bit harder
then just having 4k by default.

> 
> Hmm, but then ...
> 
> > RISC-V detects superpages using `pte.x` and `pte.r`, as there
> > is no specific bit in the PTE for this purpose. From the RISC-V
> > spec:
> > ```
> >   ...
> >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
> > step 5.
> >      Otherwise, this PTE is a pointer to the next level of the page
> > table.
> >      ... .
> >   5. A leaf PTE has been found.
> >      ...
> >   ...
> > ```
> > 
> > The code doesn’t support super page shattering so 4KB pages are
> > used as
> > default.
> 
> ... you have this. Yet still callers expecting re-mapping in the
> (large)
> range they map can request small-page mappings right away.
I am not sure that I fully understand what do you mean by "expcting re-
mapping".

> 
> > --- a/xen/arch/riscv/include/asm/flushtlb.h
> > +++ b/xen/arch/riscv/include/asm/flushtlb.h
> > @@ -5,12 +5,25 @@
> >  #include <xen/bug.h>
> >  #include <xen/cpumask.h>
> >  
> > +#include <asm/sbi.h>
> > +
> >  /* Flush TLB of local processor for address va. */
> >  static inline void flush_tlb_one_local(vaddr_t va)
> >  {
> >      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
> >  }
> >  
> > +/*
> > + * Flush a range of VA's hypervisor mappings from the TLB of all
> > + * processors in the inner-shareable domain.
> > + */
> > +static inline void flush_tlb_range_va(vaddr_t va,
> > +                                      size_t size)
> 
> No need for line wrapping here?
What is line wrapping here? Do you mean that size_t size should be on
the previous line?

> 
> > @@ -33,15 +38,72 @@
> >  #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> >  #define PTE_TABLE                   (PTE_VALID)
> >  
> > +#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
> >  #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> > +#define PAGE_HYPERVISOR_RX          (PTE_VALID | PTE_READABLE |
> > PTE_EXECUTABLE)
> >  
> >  #define PAGE_HYPERVISOR             PAGE_HYPERVISOR_RW
> >  
> > +
> > +/*
> 
> Nit: As before, no double blank lines please.
> 
> > + * There are no such bits in PTE format for RISC-V.
> 
> This is an odd way to start a comment: There's nothing for "such" to
> refer
> to.
> 
> > + * The code doesn’t support super page shattering so at the moment
> > superpages
> > + * can't be used as a default so PTE_BLOCK is introduced to have
> > ability to
> > + * tell that superpage should be allocated.
> > + * Additionaly as mentioed in RISC-V priviliged spec:
> > + * ```
> > + *  After much deliberation, we have settled on a conventional
> > page size of
> > + *  4 KiB for both RV32 and RV64. We expect this decision to ease
> > the porting
> > + *  of low-level runtime software and device drivers.
> > + *
> > + *  The TLB reach problem is ameliorated by transparent superpage
> > support in
> > + *  modern operating systems [2]. Additionally, multi-level TLB
> > hierarchies
> > + *  are quite inexpensive relative to the multi-level cache
> > hierarchies whose
> > + *  address space they map.
> > + *
> > + *  [2] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox.
> > Practical,
> > + *      transparent operating system support for superpages.
> > + *      SIGOPS Oper. Syst. Rev., 36(SI):89–104, December 2002.
> > + * ```
> > + *
> > + * PTE_POPULATE is introduced to have ability to tell that page
> > tables
> > + * shoud be populated.
> > + */
> > +#define PTE_BLOCK       BIT(10, UL)
> > +#define PTE_POPULATE    BIT(11, UL)
> > +
> > +#define PTE_R_MASK(x)   ((x) & PTE_READABLE)
> > +#define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
> > +#define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
> > +
> > +#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
> > PTE_EXECUTABLE))
> > +
> >  /* Calculate the offsets into the pagetables for a given VA */
> >  #define pt_linear_offset(lvl, va)   ((va) >>
> > XEN_PT_LEVEL_SHIFT(lvl))
> >  
> >  #define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) &
> > VPN_MASK)
> >  
> > +#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & ((_AC(1, U)
> > << PAGETABLE_ORDER) - 1))
> 
> Not: Too long line.
> 
> > +#if RV_STAGE1_MODE > SATP_MODE_SV48
> 
> SV48? Isn't ...
> 
> > +#error "need to to update DECLARE_OFFSETS macros"
> > +#else
> > +
> > +#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
> > +#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
> > +#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
> > +
> > +/* Generate an array @var containing the offset for each level
> > from @addr */
> > +#define DECLARE_OFFSETS(var, addr)          \
> > +    const unsigned int var[] = {            \
> > +        l0_table_offset(addr),              \
> > +        l1_table_offset(addr),              \
> > +        l2_table_offset(addr),              \
> > +    }
> 
> ... this for SV39?
Agree, the check above isn't correct. It should be "RV_STAGE1_MODE >
SATP_MODE_SV39".


> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/pt.c
> > @@ -0,0 +1,408 @@
> > +#include <xen/bug.h>
> > +#include <xen/domain_page.h>
> > +#include <xen/errno.h>
> > +#include <xen/mm.h>
> > +#include <xen/mm-frame.h>
> > +#include <xen/pmap.h>
> > +#include <xen/spinlock.h>
> > +
> > +#include <asm/flushtlb.h>
> > +#include <asm/page.h>
> > +
> > +static inline const mfn_t get_root_page(void)
> > +{
> > +    unsigned long root_maddr =
> 
> maddr_t or paddr_t?
> 
> > +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> > +
> > +    return maddr_to_mfn(root_maddr);
> > +}
> > +
> > +/*
> > + * Sanity check of the entry
> > + * mfn is not valid and we are not populating page table. This
> > means
> 
> How does this fit with ...
> 
> > + * we either modify entry or remove an entry.
> > + */
> > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
> > flags)
> > +{
> > +    /* Sanity check when modifying an entry. */
> > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> 
> ... the MFN check here?
The comment is incorrect and should be corrected:
" if mfn is valid or ... "

>  And why is (valid,INVALID_MFN) an indication
> of a modification? But then ...
the explanation is in the comment to pt_update():
   /*
    * If `mfn` equals `INVALID_MFN`, it indicates that the following page
   table
    * update operation might be related to either populating the table,
    * destroying a mapping, or modifying an existing mapping.
    */
   static int pt_update(unsigned long virt,

> 
> > +    {
> > +        /* We don't allow modifying an invalid entry. */
> > +        if ( !pte_is_valid(entry) )
> > +        {
> > +            printk("Modifying invalid entry is not allowed.\n");
> > +            return false;
> > +        }
> 
> ... I also don't understand what this is about. IOW I'm afraid I'm
> still confused about the purpose of this function as well as the
> transitions you want to permit / reject. I wonder whether the
> flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.
> 
> > +/* Update an entry at the level @target. */
> > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > +                           mfn_t mfn, unsigned int target,
> > +                           unsigned int flags)
> > +{
> > +    int rc;
> > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > +    pte_t *table;
> > +    /*
> > +     * The intermediate page tables are read-only when the MFN is
> > not valid
> > +     * and we are not populating page table.
> 
> The way flags are handled in PTEs, how can page tables be read-only?
I started to be confused. Probably I have to re-write some code and
also drop almost the whole function xen_pt_check_entry().

> 
> > +     * This means we either modify permissions or remove an entry.
> 
> From all I can determine we also get here when making brand new
> entries.
> What am I overlooking?
Nothing. then it means intermidiate page table won't be read-only.

> 
> > +     */
> > +    bool alloc_only = mfn_eq(mfn, INVALID_MFN) && !(flags &
> > PTE_POPULATE);
> > +    pte_t pte, *entry;
> > +
> > +    /* convenience aliases */
> > +    DECLARE_OFFSETS(offsets, virt);
> > +
> > +    table = xen_map_table(root);
> > +    for ( ; level > target; level-- )
> > +    {
> > +        rc = pt_next_level(alloc_only, &table, offsets[level]);
> > +        if ( rc == XEN_TABLE_MAP_FAILED )
> > +        {
> > +            rc = 0;
> > +
> > +            /*
> > +             * We are here because pt_next_level has failed to map
> > +             * the intermediate page table (e.g the table does not
> > exist
> > +             * and the pt is read-only). It is a valid case when
> > +             * removing a mapping as it may not exist in the page
> > table.
> > +             * In this case, just ignore it.
> > +             */
> > +            if ( flags & PTE_VALID )
> > +            {
> > +                printk("%s: Unable to map level %u\n", __func__,
> > level);
> > +                rc = -ENOENT;
> > +            }
> > +
> > +            goto out;
> > +        }
> > +        else if ( rc != XEN_TABLE_NORMAL )
> > +            break;
> > +    }
> > +
> > +    if ( level != target )
> > +    {
> > +        printk("%s: Shattering superpage is not supported\n",
> > __func__);
> > +        rc = -EOPNOTSUPP;
> > +        goto out;
> > +    }
> > +
> > +    entry = table + offsets[level];
> > +
> > +    rc = -EINVAL;
> > +    if ( !pt_check_entry(*entry, mfn, flags) )
> > +        goto out;
> > +
> > +    /* We are removing the page */
> > +    if ( !(flags & PTE_VALID) )
> > +        memset(&pte, 0x00, sizeof(pte));
> > +    else
> > +    {
> > +        /* We are inserting a mapping => Create new pte. */
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            pte = pte_from_mfn(mfn, PTE_VALID);
> > +        else /* We are updating the permission => Copy the current
> > pte. */
> > +            pte = *entry;
> > +
> > +        /* update permission according to the flags */
> > +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
> > +    }
> > +
> > +    write_pte(entry, pte);
> > +
> > +    rc = 0;
> > +
> > + out:
> > +    xen_unmap_table(table);
> > +
> > +    return rc;
> > +}
> > +
> > +static DEFINE_SPINLOCK(xen_pt_lock);
> 
> If you put this in the middle of the file (which is fine), I think it
> wants putting immediately ahead of the (first) function using it, not
> at some seemingly random place.
> 
> > +/*
> > + * If `mfn` equals `INVALID_MFN`, it indicates that the following
> > page table
> > + * update operation might be related to either populating the
> > table,
> > + * destroying a mapping, or modifying an existing mapping.
> > + */
> > +static int pt_update(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    int rc = 0;
> > +    unsigned long vfn = virt >> PAGE_SHIFT;
> > +    unsigned long left = nr_mfns;
> > +
> > +    const mfn_t root = get_root_page();
> > +
> > +    /*
> > +     * It is bad idea to have mapping both writeable and
> > +     * executable.
> > +     * When modifying/creating mapping (i.e PTE_VALID is set),
> > +     * prevent any update if this happen.
> > +     */
> > +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
> > PTE_X_MASK(flags) )
> > +    {
> > +        printk("Mappings should not be both Writeable and
> > Executable.\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> > +    {
> > +        printk("The virtual address is not aligned to the page-
> > size.\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    spin_lock(&xen_pt_lock);
> > +
> > +    while ( left )
> > +    {
> > +        unsigned int order, level;
> > +
> > +        level = pt_mapping_level(vfn, mfn, left, flags);
> > +        order = XEN_PT_LEVEL_ORDER(level);
> > +
> > +        ASSERT(left >= BIT(order, UL));
> > +
> > +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
> > +                                    flags);
> 
> Indentation.
> 
> > +        if ( rc )
> > +            break;
> > +
> > +        vfn += 1U << order;
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            mfn = mfn_add(mfn, 1U << order);
> > +
> > +        left -= (1U << order);
> 
> To be on thje safe side, 1UL everywhere?
> 
> > +        if ( rc )
> > +            break;
> 
> There was such a check already a few lines up from here.
> 
> > +    }
> > +
> > +
> > +    /* ensure that PTEs are all updated before flushing */
> 
> Again: No double blank lines please.
> 
> > +    RISCV_FENCE(rw, rw);
> > +
> > +    /*
> > +     * always flush TLB at the end of the function as non-present
> > entries
> > +     * can be put in the TLB
> > +     */
> > +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
> 
> Coming back to "negative" TLB entries: Assuming RISC-V, just like
> other
> architectures, also permits intermediate page table entries to be
> cached,
> the affected VA range may be much larger than the original request,
> if
> intermediate page tables needed creating.
It could be an issue. Could we some how  to calculate the proper range
or the only option we have is to flush all.
   And for some reason it isn't an issue for Arm:
   
       /*
        * The TLBs flush can be safely skipped when a mapping is
   inserted
        * as we don't allow mapping replacement (see
   xen_pt_check_entry()).
        * Although we still need an ISB to ensure any DSB in
        * write_pte() will complete because the mapping may be used
   soon
        * after.
        *
        * For all the other cases, the TLBs will be flushed
   unconditionally
        * even if the mapping has failed. This is because we may have
        * partially modified the PT. This will prevent any unexpected
        * behavior afterwards.
        */
       if ( !((flags & _PAGE_PRESENT) && !mfn_eq(mfn, INVALID_MFN)) )
           flush_xen_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
       else
           isb();
   

> 
> > +    spin_unlock(&xen_pt_lock);
> 
> Does this really need to come after fence and flush?
I think yes, as page table should be updated only by 1 CPU at the same
time. And before give ability to other CPU to update page table we have
to finish a work on current CPU.

> 
> > +    return rc;
> > +}
> > +
> > +int map_pages_to_xen(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    /*
> > +     * Ensure that we have a valid MFN before proceeding.
> > +     *
> > +     * If the MFN is invalid, pt_update() might misinterpret the
> > operation,
> > +     * treating it as either a population, a mapping destruction,
> > +     * or a mapping modification.
> > +     */
> > +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
> 
> But flags with PTE_VALID not set are fine to come into here?
It is fine for pt_update() but I don't know if it is fine for
map_pages_to_xen(). I see that other architectures don't check that.

~ Oleksii

> 
> > +    return pt_update(virt, mfn, nr_mfns, flags);
> > +}
> 
> Jan




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.