
Re: [PATCH v4 6/7] xen/riscv: page table handling



On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
> On 14.08.2024 18:50, oleksii.kurochko@xxxxxxxxx wrote:
> > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > Introduce internal macros starting with PTE_* for convenience.
> > > > These macros closely resemble PTE bits, with the exception of
> > > > PTE_BLOCK, which indicates that a page larger than 4KB is
> > > > needed.
> > > 
> > > I did comment on this too, iirc: Is there going to be any case
> > > where large pages are going to be "needed", i.e. not just
> > > preferred? If not, giving the caller control over things the other
> > > way around (requesting 4k mappings are needed, as we have it in
> > > x86) may be preferable.
> > Yes, you did comment on this, but I thought it would be enough to
> > mention that shattering isn't supported now. Also, since
> > pt_update_entry() is used to unmap as well, there could be a need to
> > unmap e.g. a 4K page from a 2M block mapping, which would be a little
> > harder than just having 4K by default.
> 
> Shattering isn't supported now, but that's going to change at some
> point, I suppose. Where possible the long-term behavior wants taking
> into account right away, to avoid having to e.g. touch all callers
> again later on.
Arm still lives without shattering support for Xen pages:
https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/mmu/pt.c?ref_type=heads#L454

So it can be a pretty long-term behaviour.

> 
> > > Hmm, but then ...
> > > 
> > > > RISC-V detects superpages using `pte.x` and `pte.r`, as there
> > > > is no specific bit in the PTE for this purpose. From the RISC-V
> > > > spec:
> > > > ```
> > > >   ...
> > > >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1,
> > > >      go to step 5. Otherwise, this PTE is a pointer to the next
> > > >      level of the page table.
> > > >      ...
> > > >   5. A leaf PTE has been found.
> > > >      ...
> > > >   ...
> > > > ```
> > > > 
> > > > The code doesn't support superpage shattering, so 4KB pages are
> > > > used by default.
> > > 
> > > ... you have this. Yet still callers expecting re-mapping in the
> > > (large) range they map can request small-page mappings right away.
> > I am not sure that I fully understand what you mean by "expecting
> > re-mapping".
> 
> Right now you have callers pass PTE_BLOCK when they know that no small
> page re-mappings are going to occur for an area. What I'm suggesting
> is that you invert this logic: Have callers pass PTE_SMALL when there
> is a possibility that re-mapping requests may be issued later. Then,
> later, by simply grep-ing for PTE_SMALL you'll be able to easily find
> all candidates that possibly can be relaxed when super-page shattering
> starts being supported. That's going to be easier than finding all
> instances where PTE_BLOCK is _not_ used.
So if I understand correctly, nothing actually changes in the algorithm
of pt_update(); only PTE_SMALL should be introduced instead of
PTE_BLOCK. And if I know that some memory would better be mapped as
PTE_SMALL, to avoid shattering in case of e.g. an unmap, I can just map
that memory as PTE_SMALL and that is it?
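
Just to be sure we mean the same thing, a minimal sketch of how
pt_mapping_level() could then look (PTE_SMALL's value, the Sv39 level
bound and the alignment check are my assumptions here, not the actual
patch):
```
/* Hypothetical flag: the caller may want to re-map sub-ranges later. */
#define PTE_SMALL       BIT(10, UL)

static unsigned int pt_mapping_level(unsigned long vfn, mfn_t mfn,
                                     unsigned long nr, unsigned int flags)
{
    unsigned int level;

    /* The caller asked for 4K mappings: no superpage may be used here. */
    if ( flags & PTE_SMALL )
        return 0;

    /* Otherwise pick the largest level that alignment and size allow. */
    for ( level = 2; level > 0; level-- )       /* assuming Sv39 */
    {
        unsigned long mask = BIT(XEN_PT_LEVEL_ORDER(level), UL) - 1;

        if ( !(vfn & mask) && !(mfn_x(mfn) & mask) && (nr > mask) )
            break;
    }

    return level;
}
```
That way the grep for PTE_SMALL you describe works, and the default
stays "largest mapping that fits".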

> > > > --- /dev/null
> > > > +++ b/xen/arch/riscv/pt.c
> > > > @@ -0,0 +1,408 @@
> > > > +#include <xen/bug.h>
> > > > +#include <xen/domain_page.h>
> > > > +#include <xen/errno.h>
> > > > +#include <xen/mm.h>
> > > > +#include <xen/mm-frame.h>
> > > > +#include <xen/pmap.h>
> > > > +#include <xen/spinlock.h>
> > > > +
> > > > +#include <asm/flushtlb.h>
> > > > +#include <asm/page.h>
> > > > +
> > > > +static inline const mfn_t get_root_page(void)
> > > > +{
> > > > +    unsigned long root_maddr =
> > > 
> > > maddr_t or paddr_t?
> > > 
> > > > +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> > > > +
> > > > +    return maddr_to_mfn(root_maddr);
> > > > +}
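
Right, paddr_t seems the better fit here (and the const on a by-value
return type has no effect anyway), i.e. something like:
```
static inline mfn_t get_root_page(void)
{
    paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;

    return maddr_to_mfn(root_maddr);
}
```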
> > > > +
> > > > +/*
> > > > + * Sanity check of the entry
> > > > + * mfn is not valid and we are not populating page table. This means
> > > 
> > > How does this fit with ...
> > > 
> > > > + * we either modify entry or remove an entry.
> > > > + */
> > > > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags)
> > > > +{
> > > > +    /* Sanity check when modifying an entry. */
> > > > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> > > 
> > > ... the MFN check here?
> > The comment is incorrect and should be corrected:
> > " if mfn is valid or ... "
> > 
> > >  And why is (valid,INVALID_MFN) an indication
> > > of a modification? But then ...
> > the explanation is in the comment to pt_update():
> >    /*
> >     * If `mfn` equals `INVALID_MFN`, it indicates that the following
> >     * page table update operation might be related to either
> >     * populating the table, destroying a mapping, or modifying an
> >     * existing mapping.
> >     */
> >    static int pt_update(unsigned long virt,
> 
> And how do readers know that comments in pt_update() are crucial for
> understanding what pt_check_entry() does? You certainly don't need to
> have the same comment in two places, but you at least want to refer
> to a relevant comment when that lives elsewhere.
Sure, I will update the comment in pt_check_entry() properly if this
function still makes any sense.
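
E.g. something along these lines (just a sketch of the direction,
assuming pte_is_valid() from earlier in the series):
```
/*
 * Sanity check of the entry.
 *
 * See the comment ahead of pt_update() for the meaning of an
 * INVALID_MFN argument: together with PTE_VALID it indicates that an
 * existing entry is being modified rather than populated or removed.
 */
static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags)
{
    /* Sanity check when modifying an entry. */
    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
    {
        /* We don't allow modifying an invalid entry. */
        if ( !pte_is_valid(entry) )
        {
            printk("Modifying invalid entry is not allowed.\n");
            return false;
        }
    }

    return true;
}
```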

> 
> > > > +static int pt_update(unsigned long virt,
> > > > +                     mfn_t mfn,
> > > > +                     unsigned long nr_mfns,
> > > > +                     unsigned int flags)
> > > > +{
> > > > +    int rc = 0;
> > > > +    unsigned long vfn = virt >> PAGE_SHIFT;
> > > > +    unsigned long left = nr_mfns;
> > > > +
> > > > +    const mfn_t root = get_root_page();
> > > > +
> > > > +    /*
> > > > +     * It is bad idea to have mapping both writeable and
> > > > +     * executable.
> > > > +     * When modifying/creating mapping (i.e PTE_VALID is set),
> > > > +     * prevent any update if this happen.
> > > > +     */
> > > > +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && PTE_X_MASK(flags) )
> > > > +    {
> > > > +        printk("Mappings should not be both Writeable and
> > > > Executable.\n");
> > > > +        return -EINVAL;
> > > > +    }
> > > > +
> > > > +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> > > > +    {
> > > > +        printk("The virtual address is not aligned to the
> > > > page-
> > > > size.\n");
> > > > +        return -EINVAL;
> > > > +    }
> > > > +
> > > > +    spin_lock(&xen_pt_lock);
> > > > +
> > > > +    while ( left )
> > > > +    {
> > > > +        unsigned int order, level;
> > > > +
> > > > +        level = pt_mapping_level(vfn, mfn, left, flags);
> > > > +        order = XEN_PT_LEVEL_ORDER(level);
> > > > +
> > > > +        ASSERT(left >= BIT(order, UL));
> > > > +
> > > > +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
> > > > +                                    flags);
> > > 
> > > Indentation.
> > > 
> > > > +        if ( rc )
> > > > +            break;
> > > > +
> > > > +        vfn += 1U << order;
> > > > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > > > +            mfn = mfn_add(mfn, 1U << order);
> > > > +
> > > > +        left -= (1U << order);
> > > 
> > > To be on the safe side, 1UL everywhere?
> > > 
> > > > +        if ( rc )
> > > > +            break;
> > > 
> > > There was such a check already a few lines up from here.
> > > 
> > > > +    }
> > > > +
> > > > +
> > > > +    /* ensure that PTEs are all updated before flushing */
> > > 
> > > Again: No double blank lines please.
> > > 
> > > > +    RISCV_FENCE(rw, rw);
> > > > +
> > > > +    /*
> > > > +     * always flush TLB at the end of the function as non-present
> > > > +     * entries can be put in the TLB
> > > > +     */
> > > > +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
> > > 
> > > Coming back to "negative" TLB entries: Assuming RISC-V, just like
> > > other architectures, also permits intermediate page table entries
> > > to be cached, the affected VA range may be much larger than the
> > > original request, if intermediate page tables needed creating.
> > It could be an issue. Could we somehow calculate the proper range,
> > or is the only option we have to flush all?
> 
> Right - either you maintain state to know the biggest possible range
> that can be affected, or you flush all when a new intermediate page
> table needed inserting.
I think the second option will be easier to fit into the current
implementation. It is not an issue for now, as the fixmap, FDT and Xen
are in the same slot, so no new intermediate page tables are needed.
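
For later, a sketch of what the flush-all variant at the end of
pt_update() could look like (table_created and flush_tlb_all() are
placeholders I made up, not existing helpers):
```
    /* ensure that PTEs are all updated before flushing */
    RISCV_FENCE(rw, rw);

    /*
     * A newly inserted intermediate page table can invalidate cached
     * "negative" translations well outside the requested range, so
     * flush everything in that case.
     */
    if ( table_created )        /* set when pt_next_level() allocates */
        flush_tlb_all();        /* hypothetical helper */
    else
        flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);

    spin_unlock(&xen_pt_lock);
```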

> 
> > > > +    spin_unlock(&xen_pt_lock);
> > > 
> > > Does this really need to come after fence and flush?
> > I think yes, as the page table should be updated by only one CPU at
> > a time. And before giving another CPU the ability to update the page
> > table, we have to finish the work on the current CPU.
> 
> Can you then explain to me, perhaps by way of an example, what will go
> wrong if the unlock is ahead of the flush? (I'm less certain about the
> fence, and that's also less expensive.)
pt_update() will be called for interleaved regions, for example, by
different CPUs:

                     pt_update():
CPU1:                                CPU2:
 ...                                 spin_lock(&xen_pt_lock); /* waits */
RISCV_FENCE(rw, rw);                  ...
 ...
spin_unlock(&xen_pt_lock);  ------>  acquires the lock and starts to
 ...                                 update page table entries which
 ...                                 were partially created by CPU1,
 ...                                 but which it doesn't know about
flush_tlb_range_va();                yet, because the flush
                                     ( sfence.vma ) wasn't done yet.

And it can be an issue if I understand correctly.
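
So I'd keep the order as in the patch, i.e. fence, then flush, then
unlock, so that the next holder of xen_pt_lock always observes a fully
flushed state:
```
    RISCV_FENCE(rw, rw);                            /* publish PTE writes */
    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);  /* sfence.vma */
    spin_unlock(&xen_pt_lock);                      /* only now admit others */
```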
> 
> > > > +int map_pages_to_xen(unsigned long virt,
> > > > +                     mfn_t mfn,
> > > > +                     unsigned long nr_mfns,
> > > > +                     unsigned int flags)
> > > > +{
> > > > +    /*
> > > > +     * Ensure that we have a valid MFN before proceeding.
> > > > +     *
> > > > +     * If the MFN is invalid, pt_update() might misinterpret the
> > > > +     * operation, treating it as either a population, a mapping
> > > > +     * destruction, or a mapping modification.
> > > > +     */
> > > > +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
> > > 
> > > But flags with PTE_VALID not set are fine to come into here?
> > It is fine for pt_update(), but I don't know if it is fine for
> > map_pages_to_xen(). I see that other architectures don't check that.
> 
> That's not my point here. It's rather along the lines of an earlier
> remark that I gave here: Since pt_update() is a pretty generic
> function, will flags not having PTE_VALID set perhaps result in
> pt_update() doing something that's not what the caller might expect?
I think that everything will be okay: if PTE_VALID is set, it means
that pt_update() should update ( modify/remove/insert ) a page table
entry, and all the cases which aren't expected by the logic should be
covered by pt_check_entry().

And there is the case when a page table couldn't be mapped:
```
        rc = pt_next_level(alloc_only, &table, offsets[level]);
        if ( rc == XEN_TABLE_MAP_FAILED )
        {
            rc = 0;

            /*
             * We are here because pt_next_level has failed to map
             * the intermediate page table (e.g. the table does not
             * exist and the pt is read-only). It is a valid case when
             * removing a mapping as it may not exist in the page
             * table. In this case, just ignore it.
             */
            if ( flags & PTE_VALID )
            {
                printk("%s: Unable to map level %u\n", __func__, level);
                rc = -ENOENT;
            }
        }
```
> 
> And yes, there's a special case of map_pages_to_xen() being called
> with _PAGE_NONE (if an arch defines such). That special case plays
> into here: If an arch doesn't define it, unmap requests ought to
> exclusively come through destroy_xen_mappings().
I thought that it should always be done through destroy_xen_mappings().

Arm doesn't introduce _PAGE_NONE, and pt_update() is based on Arm's
version of xen_pt_update(), so this special case should be covered
properly.

And it seems to me (if I am not confusing something) that when pages
mapped by map_pages_to_xen() need to be unmapped, destroy_xen_mappings()
is used, which, like modify_xen_mappings() shown here, is defined in
terms of xen_pt_update():

    int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
    {
        ASSERT(IS_ALIGNED(s, PAGE_SIZE));
        ASSERT(IS_ALIGNED(e, PAGE_SIZE));
        ASSERT(s <= e);
        return xen_pt_update(s, INVALID_MFN, (e - s) >> PAGE_SHIFT, nf);
    }
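
And destroy_xen_mappings() on RISC-V would presumably take the same
shape (a sketch mirroring Arm, with the flags argument being 0, i.e.
PTE_VALID clear, for an unmap):
```
int destroy_xen_mappings(unsigned long s, unsigned long e)
{
    ASSERT(IS_ALIGNED(s, PAGE_SIZE));
    ASSERT(IS_ALIGNED(e, PAGE_SIZE));
    ASSERT(s <= e);
    return pt_update(s, INVALID_MFN, (e - s) >> PAGE_SHIFT, 0);
}
```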

~ Oleksii