[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v5 6/7] xen/riscv: page table handling



On Thu, 2024-08-29 at 09:01 +0200, Jan Beulich wrote:
> On 28.08.2024 18:11, oleksii.kurochko@xxxxxxxxx wrote:
> > On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote:
> > > On 21.08.2024 18:06, Oleksii Kurochko wrote:
> > > > Implement map_pages_to_xen() which requires several
> > > > functions to manage page tables and entries:
> > > > - pt_update()
> > > > - pt_mapping_level()
> > > > - pt_update_entry()
> > > > - pt_next_level()
> > > > - pt_check_entry()
> > > > 
> > > > To support these operations, add functions for creating,
> > > > mapping, and unmapping Xen tables:
> > > > - create_table()
> > > > - map_table()
> > > > - unmap_table()
> > > > 
> > > > Introduce internal macros starting with PTE_* for convenience.
> > > > These macros closely resemble PTE bits, with the exception of
> > > > PTE_SMALL, which indicates that 4KB is needed.
> > > 
> > > What macros are you talking about here? Is this partially stale,
> > > as
> > > only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being
> > > added?
> > I am speaking about macros connected to masks:
> >      #define PTE_R_MASK(x)   ((x) & PTE_READABLE)
> >      #define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
> >      #define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
> >    
> >      #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
> >    PTE_EXECUTABLE))
> 
> Some of which is did question further down in my reply. But what's
> worse - by saying "closely resemble PTE bits, with the exception of
> PTE_SMALL" you pretty clearly _do not_ refer to the macros above, but
> to PTE_VALID etc.
Agree, it should be corrected.

> 
> > > > @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
> > > >      return p.pte & PTE_VALID;
> > > >  }
> > > >  
> > > > +inline bool pte_is_table(const pte_t p)
> > > > +{
> > > > +    return ((p.pte & (PTE_VALID |
> > > > +                      PTE_READABLE |
> > > > +                      PTE_WRITABLE |
> > > > +                      PTE_EXECUTABLE)) == PTE_VALID);
> > > > +}
> > > 
> > > In how far is the READABLE check valid here? You (imo correctly)
> > > ...
> 
> Oh, I wrongly picked on READABLE when it should have been the
> WRITABLE
> bit.
> 
> > > > +static inline bool pte_is_mapping(const pte_t p)
> > > > +{
> > > > +    return (p.pte & PTE_VALID) &&
> > > > +           (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
> > > > +}
> > > 
> > > ... don't consider this bit here.
> > pte_is_mapping() seems to me is correct as according to the RISC-V
> > privileged spec:
> >    4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
> > step 
> >    5. Otherwise, this PTE is a pointer to the next level of the
> > page   
> >    table.
> >    5. A leaf PTE has been found. ...
> 
> Right. And then why do you check all three of r, x, and w, when the
> doc
> mentions only r and x? There may be reasons, but such reasons then
> need
> clearly stating in a code comment, for people to understand why the
> code
> is not following the spec.
So I remembered why R, W, and X are checked. There is contradictory
information about these bits
(https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1317C64-L1321C10
):
```
The permission bits, R, W, and X, indicate whether the page is
readable, writable, and executable, respectively. When all three are
zero, the PTE is a pointer to the next level of the page table;
otherwise, it is a leaf PTE.
```

However, it is also written here
(https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1539
) that only pte.r and pte.x should be checked.

I can assume that the interpretation that R=W=X=0 indicates a pointer
to the next level of the page table could come from this statement
(https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1538
):
```
If _pte_._v_ = 0, or if _pte_._r_ = 0 and _pte_._w_ = 1, or if any bits
or encodings that are reserved for future standard use are set within
_pte_, stop and raise a page-fault exception corresponding to the
original access type.
```
>From this, I can assume that when pte.r = 0, pte.w should also always
be zero; otherwise, a page-fault exception will be raised. ( but it is
no obviously connected to if the PTE is a pointer to the next page
table or not... ).




> 
> > and regarding pte_is_table() READABLE check is valid as we have to
> > check only that pte.r = pte.x = 0. WRITABLE check should be
> > dropped. Or
> > just use define pte_is_table() as:
> >    inline bool pte_is_table(const pte_t p)
> >    {
> >     return !pte_is_mapping(p);
> >    }
> 
> You had it like this earlier on, didn't you? That's wrong, because
> for a
> PTE to describe another page table level PTE_VALID needs to be set.
Agree, it's wrong, missed that.

> > > > +#define XEN_TABLE_MAP_FAILED 0
> > > > +#define XEN_TABLE_SUPER_PAGE 1
> > > > +#define XEN_TABLE_NORMAL 2
> > > > +
> > > > +/*
> > > > + * Take the currently mapped table, find the corresponding
> > > > entry,
> > > > + * and map the next table, if available.
> > > > + *
> > > > + * The alloc_tbl parameters indicates whether intermediate
> > > > tables
> > > > should
> > > > + * be allocated when not present.
> > > > + *
> > > > + * Return values:
> > > > + *  XEN_TABLE_MAP_FAILED: Either alloc_only was set and the
> > > > entry
> > > > + *  was empty, or allocating a new page failed.
> > > > + *  XEN_TABLE_NORMAL: next level or leaf mapped normally
> > > > + *  XEN_TABLE_SUPER_PAGE: The next entry points to a
> > > > superpage.
> > > > + */
> > > > +static int pt_next_level(bool alloc_tbl, pte_t **table,
> > > > unsigned
> > > > int offset)
> > > 
> > > Having the boolean first is unusual, but well - it's your choice.
> > > 
> > > > +{
> > > > +    pte_t *entry;
> > > > +    int ret;
> > > > +    mfn_t mfn;
> > > > +
> > > > +    entry = *table + offset;
> > > > +
> > > > +    if ( !pte_is_valid(*entry) )
> > > > +    {
> > > > +        if ( alloc_tbl )
> > > > +            return XEN_TABLE_MAP_FAILED;
> > > 
> > > Is this condition meant to be inverted?
> > if alloc_tbl = true we shouldn't allocatetable as:
> >      * The intermediate page table shouldn't be allocated when MFN
> > isn't
> >      * valid and we are not populating page table.
> > ...
> >     */
> 
> Well, no. The variable name really shouldn't be the opposite of what
> is
> meant. "alloc_tbl" can only possibly mean "allocate a table if none
> is
> there". I can't think of a sensible interpretation in the inverted
> sense.
> I'm curious how you mean to interpret that variable name.
My interpretation was that alloc_tbl = true means that algorithm is
trying to allocate the table what is forbidden at the moment but I
agree that your interpretation sounds more understandable based on the
variable name.

> 
> >     bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags &
> > PTE_POPULATE);
> > 
> > So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this
> > table shouldn't be allocated and thereby pt_next_level() should
> > return
> > XEN_TABLE_MAP_FAILED.
> > 
> > Or to invert if ( alloc_tbl )it will be needed to invert defintion
> > of
> > alloc_tbl:
> >  bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags &
> > PTE_POPULATE);
> 
> Yes, as I did comment further down.
> 
> > > > +    if ( level != target )
> > > > +    {
> > > > +        printk("%s: Shattering superpage is not supported\n",
> > > > __func__);
> > > > +        rc = -EOPNOTSUPP;
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    entry = table + offsets[level];
> > > > +
> > > > +    rc = -EINVAL;
> > > > +    if ( !pt_check_entry(*entry, mfn, flags) )
> > > > +        goto out;
> > > > +
> > > > +    /* We are removing the page */
> > > > +    if ( !(flags & PTE_VALID) )
> > > > +        memset(&pte, 0x00, sizeof(pte));
> > > > +    else
> > > > +    {
> > > > +        /* We are inserting a mapping => Create new pte. */
> > > > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > > > +            pte = pte_from_mfn(mfn, PTE_VALID);
> > > > +        else /* We are updating the permission => Copy the
> > > > current
> > > > pte. */
> > > > +            pte = *entry;
> > > > +
> > > > +        /* update permission according to the flags */
> > > > +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED |
> > > > PTE_DIRTY;
> > > 
> > > When updating an entry, don't you also need to clear (some of)
> > > the
> > > flags?
> > I am not sure why some flags should be cleared. Here we are taking
> > only
> > necessary for pte flags such as R, W, X or other bits in flags are
> > ignored.
> 
> Consider what happens to a PTE with R and X set when a request comes
> in
> to change to R/W. You'll end up with R, X, and W all set if you don't
> first clear the bits that are meant to be changeable in a "modify"
> operation.
That's definitely going to be a problem. I'll update the code then.

> 
> > > > +/* Return the level where mapping should be done */
> > > > +static int pt_mapping_level(unsigned long vfn, mfn_t mfn,
> > > > unsigned
> > > > long nr,
> > > > +                            unsigned int flags)
> > > > +{
> > > > +    unsigned int level = 0;
> > > > +    unsigned long mask;
> > > > +    unsigned int i;
> > > > +
> > > > +    /* Use blocking mapping unless the caller requests 4K
> > > > mapping
> > > > */
> > > > +    if ( unlikely(flags & PTE_SMALL) )
> > > > +        return level;
> > > > +
> > > > +    /*
> > > > +     * Don't take into account the MFN when removing mapping
> > > > (i.e
> > > > +     * MFN_INVALID) to calculate the correct target order.
> > > > +     *
> > > > +     * `vfn` and `mfn` must be both superpage aligned.
> > > > +     * They are or-ed together and then checked against the
> > > > size
> > > > of
> > > > +     * each level.
> > > > +     *
> > > > +     * `left` is not included and checked separately to allow
> > > > +     * superpage mapping even if it is not properly aligned
> > > > (the
> > > > +     * user may have asked to map 2MB + 4k).
> > > 
> > > What is this about? There's nothing named "left" here.
> > It refer to "remaining" pages or "leftover" space after trying to
> > align
> > a mapping to a superpage boundary.
> 
> What what is the quoted "left" here? Such a variable appears to exist
> in
> the caller, but using the name here is lacking context.
Then I will update the comment and tell from where 'left' is coming.

> 
> 
> > > > + * If `mfn` is valid and flags has PTE_VALID bit set then it
> > > > means
> > > > that
> > > > + * inserting will be done.
> > > > + */
> > > 
> > > What about mfn != INVALID_MFN and PTE_VALID clear?
> > PTE_VALID=0 will be always considered as destroying and no matter
> > what
> > is mfn value as in this case the removing is done in the way where
> > mfn
> > isn't used:
> 
> Right, yet elsewhere you're restrictive as to MFN values valid to
> use.
> Not requiring INVALID_MFN here looks inconsistent to me.
but actually if we will leave ASSERT in pt_check_entry() we will be
sure that we are here with mfn = INVALID_MFN:
       /* Sanity check when removing a mapping. */
       else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
       {
           /* We should be here with an invalid MFN. */
           ASSERT(mfn_eq(mfn, INVALID_MFN));
> 
> >         memset(&pte, 0x00, sizeof(pte));
> 
> Just to mention it: I don't think memset() is a very good way of
> clearing
> a PTE, even if right here it's not a live one.
Just direct assigning would be better? 

> 
> > >  Also note that "`mfn` is
> > > valid" isn't the same as "mfn != INVALID_MFN". You want to be
> > > precise
> > > here,
> > > to avoid confusion later on. (I say that knowing that we're still
> > > fighting
> > > especially shadow paging code on x86 not having those properly
> > > separated.)
> > If it is needed to be precise and mfn is valid isn't the same as
> > "mfn
> > != INVALID_MFN" only for the case of shadow paging?
> 
> No, I used shadow paging only as an example of where we have similar
> issues. I'd like to avoid that a new port starts out with introducing
> more instances of that. You want to properly separate INVALID_MFN
> from
> "invalid MFN", where the latter means any MFN where either nothing
> exists at all, or (see mfn_valid()) where no struct page_info exists.
Well, now I think I understand the difference between "INVALID_MFN" and
"invalid MFN."

Referring back to your original reply, I need to update the comment
above pt_update():
```
   ...
     * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set then it
   means that inserting will be done.
```
Would this be correct and more precise?

Based on the code for mfn_valid(), the separation is currently done
using the max_page value, which can't be initialized at the moment as
it requires reading the device tree file to obtain the RAM end.

We could use a placeholder for the RAM end (for example, a very high
value like -1UL) and then add __mfn_valid() within pt_update().
However, I'm not sure if this approach aligns with what you consider by
proper separation between INVALID_MFN and "invalid MFN."

~ Oleksii



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.