[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v5 6/7] xen/riscv: page table handling


  • To: oleksii.kurochko@xxxxxxxxx
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 29 Aug 2024 09:01:24 +0200
  • Autocrypt: addr=jbeulich@xxxxxxxx; keydata= xsDiBFk3nEQRBADAEaSw6zC/EJkiwGPXbWtPxl2xCdSoeepS07jW8UgcHNurfHvUzogEq5xk hu507c3BarVjyWCJOylMNR98Yd8VqD9UfmX0Hb8/BrA+Hl6/DB/eqGptrf4BSRwcZQM32aZK 7Pj2XbGWIUrZrd70x1eAP9QE3P79Y2oLrsCgbZJfEwCgvz9JjGmQqQkRiTVzlZVCJYcyGGsD /0tbFCzD2h20ahe8rC1gbb3K3qk+LpBtvjBu1RY9drYk0NymiGbJWZgab6t1jM7sk2vuf0Py O9Hf9XBmK0uE9IgMaiCpc32XV9oASz6UJebwkX+zF2jG5I1BfnO9g7KlotcA/v5ClMjgo6Gl MDY4HxoSRu3i1cqqSDtVlt+AOVBJBACrZcnHAUSuCXBPy0jOlBhxPqRWv6ND4c9PH1xjQ3NP nxJuMBS8rnNg22uyfAgmBKNLpLgAGVRMZGaGoJObGf72s6TeIqKJo/LtggAS9qAUiuKVnygo 3wjfkS9A3DRO+SpU7JqWdsveeIQyeyEJ/8PTowmSQLakF+3fote9ybzd880fSmFuIEJldWxp Y2ggPGpiZXVsaWNoQHN1c2UuY29tPsJgBBMRAgAgBQJZN5xEAhsDBgsJCAcDAgQVAggDBBYC AwECHgECF4AACgkQoDSui/t3IH4J+wCfQ5jHdEjCRHj23O/5ttg9r9OIruwAn3103WUITZee e7Sbg12UgcQ5lv7SzsFNBFk3nEQQCACCuTjCjFOUdi5Nm244F+78kLghRcin/awv+IrTcIWF hUpSs1Y91iQQ7KItirz5uwCPlwejSJDQJLIS+QtJHaXDXeV6NI0Uef1hP20+y8qydDiVkv6l IreXjTb7DvksRgJNvCkWtYnlS3mYvQ9NzS9PhyALWbXnH6sIJd2O9lKS1Mrfq+y0IXCP10eS FFGg+Av3IQeFatkJAyju0PPthyTqxSI4lZYuJVPknzgaeuJv/2NccrPvmeDg6Coe7ZIeQ8Yj t0ARxu2xytAkkLCel1Lz1WLmwLstV30g80nkgZf/wr+/BXJW/oIvRlonUkxv+IbBM3dX2OV8 AmRv1ySWPTP7AAMFB/9PQK/VtlNUJvg8GXj9ootzrteGfVZVVT4XBJkfwBcpC/XcPzldjv+3 HYudvpdNK3lLujXeA5fLOH+Z/G9WBc5pFVSMocI71I8bT8lIAzreg0WvkWg5V2WZsUMlnDL9 mpwIGFhlbM3gfDMs7MPMu8YQRFVdUvtSpaAs8OFfGQ0ia3LGZcjA6Ik2+xcqscEJzNH+qh8V m5jjp28yZgaqTaRbg3M/+MTbMpicpZuqF4rnB0AQD12/3BNWDR6bmh+EkYSMcEIpQmBM51qM EKYTQGybRCjpnKHGOxG0rfFY1085mBDZCH5Kx0cl0HVJuQKC+dV2ZY5AqjcKwAxpE75MLFkr wkkEGBECAAkFAlk3nEQCGwwACgkQoDSui/t3IH7nnwCfcJWUDUFKdCsBH/E5d+0ZnMQi+G0A nAuWpQkjM1ASeQwSHEeAWPgskBQL
  • Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Thu, 29 Aug 2024 07:01:34 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 28.08.2024 18:11, oleksii.kurochko@xxxxxxxxx wrote:
> On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote:
>> On 21.08.2024 18:06, Oleksii Kurochko wrote:
>>> Implement map_pages_to_xen() which requires several
>>> functions to manage page tables and entries:
>>> - pt_update()
>>> - pt_mapping_level()
>>> - pt_update_entry()
>>> - pt_next_level()
>>> - pt_check_entry()
>>>
>>> To support these operations, add functions for creating,
>>> mapping, and unmapping Xen tables:
>>> - create_table()
>>> - map_table()
>>> - unmap_table()
>>>
>>> Introduce internal macros starting with PTE_* for convenience.
>>> These macros closely resemble PTE bits, with the exception of
>>> PTE_SMALL, which indicates that 4KB is needed.
>>
>> What macros are you talking about here? Is this partially stale, as
>> only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being
>> added?
> I am speaking about macros connected to masks:
>      #define PTE_R_MASK(x)   ((x) & PTE_READABLE)
>      #define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
>      #define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
>    
>      #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
>    PTE_EXECUTABLE))

Some of which is did question further down in my reply. But what's
worse - by saying "closely resemble PTE bits, with the exception of
PTE_SMALL" you pretty clearly _do not_ refer to the macros above, but
to PTE_VALID etc.

>>> @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
>>>      return p.pte & PTE_VALID;
>>>  }
>>>  
>>> +inline bool pte_is_table(const pte_t p)
>>> +{
>>> +    return ((p.pte & (PTE_VALID |
>>> +                      PTE_READABLE |
>>> +                      PTE_WRITABLE |
>>> +                      PTE_EXECUTABLE)) == PTE_VALID);
>>> +}
>>
>> In how far is the READABLE check valid here? You (imo correctly) ...

Oh, I wrongly picked on READABLE when it should have been the WRITABLE
bit.

>>> +static inline bool pte_is_mapping(const pte_t p)
>>> +{
>>> +    return (p.pte & PTE_VALID) &&
>>> +           (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
>>> +}
>>
>> ... don't consider this bit here.
> pte_is_mapping() seems to me is correct as according to the RISC-V
> privileged spec:
>    4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 
>    5. Otherwise, this PTE is a pointer to the next level of the page   
>    table.
>    5. A leaf PTE has been found. ...

Right. And then why do you check all three of r, x, and w, when the doc
mentions only r and x? There may be reasons, but such reasons then need
clearly stating in a code comment, for people to understand why the code
is not following the spec.

> and regarding pte_is_table() READABLE check is valid as we have to
> check only that pte.r = pte.x = 0. WRITABLE check should be dropped. Or
> just use define pte_is_table() as:
>    inline bool pte_is_table(const pte_t p)
>    {
>       return !pte_is_mapping(p);
>    }

You had it like this earlier on, didn't you? That's wrong, because for a
PTE to describe another page table level PTE_VALID needs to be set.

>>> --- /dev/null
>>> +++ b/xen/arch/riscv/pt.c
>>> @@ -0,0 +1,420 @@
>>> +#include <xen/bug.h>
>>> +#include <xen/domain_page.h>
>>> +#include <xen/errno.h>
>>> +#include <xen/mm.h>
>>> +#include <xen/mm-frame.h>
>>> +#include <xen/pmap.h>
>>> +#include <xen/spinlock.h>
>>> +
>>> +#include <asm/flushtlb.h>
>>> +#include <asm/page.h>
>>> +
>>> +static inline const mfn_t get_root_page(void)
>>> +{
>>> +    paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) <<
>>> PAGE_SHIFT;
>>> +
>>> +    return maddr_to_mfn(root_maddr);
>>> +}
>>> +
>>> +/* Sanity check of the entry. */
>>> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
>>> flags)
>>> +{
>>> +    /*
>>> +     * See the comment about the possible combination of (mfn,
>>> flags) in
>>> +     * the comment above pt_update().
>>> +     */
>>> +
>>> +    /* Sanity check when modifying an entry. */
>>> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
>>> +    {
>>> +        /* We don't allow modifying an invalid entry. */
>>> +        if ( !pte_is_valid(entry) )
>>> +        {
>>> +            printk("Modifying invalid entry is not allowed.\n");
>>
>> Perhaps all of these printk()s should be dprintk()?
> It could be dprintk() but at the same time I don't see any issue if it
> will be printed once.

What guarantees that it wouldn't be logged over and over? It's simply
bad practice to accompany all error returns with log messages, even
in release builds. Even if right now you're only in the bring-up phase,
you still want to have security in mind. If any such log message ended
up reachable from a guest-invoked path, an XSA would be needed.

>>  And not have a full
>> stop?
> By "full stop," do you mean something like panic() or BUG_ON()?

No. "Full stop" is the period at the end of a sentence (which shouldn't
normally be there at the end of log messages).

>>> +        /*
>>> +         * We don't allow replacing any valid entry.
>>> +         *
>>> +         * Note that the function pt_update() relies on this
>>> +         * assumption and will skip the TLB flush (when Svvptc
>>> +         * extension will be ratified). The function will need
>>> +         * to be updated if the check is relaxed.
>>> +         */
>>> +        if ( pte_is_valid(entry) )
>>> +        {
>>> +            if ( pte_is_mapping(entry) )
>>> +                printk("Changing MFN for a valid entry is not
>>> allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
>>> +                       mfn_x(mfn_from_pte(entry)), mfn_x(mfn));
>>> +            else
>>> +                printk("Trying to replace a table with a
>>> mapping.\n");
>>> +            return false;
>>> +        }
>>> +    }
>>> +    /* Sanity check when removing a mapping. */
>>> +    else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
>>
>> The PTE_VALID part of the check is pointless considering the earlier
>> if(). I guess you may want to have it for doc purposes ...
> Yes, it just helps to read the code and understand "confusing" if's()
> above.

Well, since you mention "confusing": I for one consider such redundant
checks confusing. It raises the question whether this check is wrong or
the earlier one is. Therefore, if you want to keep the redundancy, it
may help if you extend the comment to mention it's actually redundant
(e.g. by saying "for completeness" or some such).

>>> +#define XEN_TABLE_MAP_FAILED 0
>>> +#define XEN_TABLE_SUPER_PAGE 1
>>> +#define XEN_TABLE_NORMAL 2
>>> +
>>> +/*
>>> + * Take the currently mapped table, find the corresponding entry,
>>> + * and map the next table, if available.
>>> + *
>>> + * The alloc_tbl parameters indicates whether intermediate tables
>>> should
>>> + * be allocated when not present.
>>> + *
>>> + * Return values:
>>> + *  XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry
>>> + *  was empty, or allocating a new page failed.
>>> + *  XEN_TABLE_NORMAL: next level or leaf mapped normally
>>> + *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
>>> + */
>>> +static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned
>>> int offset)
>>
>> Having the boolean first is unusual, but well - it's your choice.
>>
>>> +{
>>> +    pte_t *entry;
>>> +    int ret;
>>> +    mfn_t mfn;
>>> +
>>> +    entry = *table + offset;
>>> +
>>> +    if ( !pte_is_valid(*entry) )
>>> +    {
>>> +        if ( alloc_tbl )
>>> +            return XEN_TABLE_MAP_FAILED;
>>
>> Is this condition meant to be inverted?
> if alloc_tbl = true we shouldn't allocatetable as:
>      * The intermediate page table shouldn't be allocated when MFN
> isn't
>      * valid and we are not populating page table.
> ...
>     */

Well, no. The variable name really shouldn't be the opposite of what is
meant. "alloc_tbl" can only possibly mean "allocate a table if none is
there". I can't think of a sensible interpretation in the inverted sense.
I'm curious how you mean to interpret that variable name.

>     bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags &
> PTE_POPULATE);
> 
> So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this
> table shouldn't be allocated and thereby pt_next_level() should return
> XEN_TABLE_MAP_FAILED.
> 
> Or to invert if ( alloc_tbl )it will be needed to invert defintion of
> alloc_tbl:
>  bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags & PTE_POPULATE);

Yes, as I did comment further down.

>>> +    if ( level != target )
>>> +    {
>>> +        printk("%s: Shattering superpage is not supported\n",
>>> __func__);
>>> +        rc = -EOPNOTSUPP;
>>> +        goto out;
>>> +    }
>>> +
>>> +    entry = table + offsets[level];
>>> +
>>> +    rc = -EINVAL;
>>> +    if ( !pt_check_entry(*entry, mfn, flags) )
>>> +        goto out;
>>> +
>>> +    /* We are removing the page */
>>> +    if ( !(flags & PTE_VALID) )
>>> +        memset(&pte, 0x00, sizeof(pte));
>>> +    else
>>> +    {
>>> +        /* We are inserting a mapping => Create new pte. */
>>> +        if ( !mfn_eq(mfn, INVALID_MFN) )
>>> +            pte = pte_from_mfn(mfn, PTE_VALID);
>>> +        else /* We are updating the permission => Copy the current
>>> pte. */
>>> +            pte = *entry;
>>> +
>>> +        /* update permission according to the flags */
>>> +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
>>
>> When updating an entry, don't you also need to clear (some of) the
>> flags?
> I am not sure why some flags should be cleared. Here we are taking only
> necessary for pte flags such as R, W, X or other bits in flags are
> ignored.

Consider what happens to a PTE with R and X set when a request comes in
to change to R/W. You'll end up with R, X, and W all set if you don't
first clear the bits that are meant to be changeable in a "modify"
operation.

>>> +/* Return the level where mapping should be done */
>>> +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned
>>> long nr,
>>> +                            unsigned int flags)
>>> +{
>>> +    unsigned int level = 0;
>>> +    unsigned long mask;
>>> +    unsigned int i;
>>> +
>>> +    /* Use blocking mapping unless the caller requests 4K mapping
>>> */
>>> +    if ( unlikely(flags & PTE_SMALL) )
>>> +        return level;
>>> +
>>> +    /*
>>> +     * Don't take into account the MFN when removing mapping (i.e
>>> +     * MFN_INVALID) to calculate the correct target order.
>>> +     *
>>> +     * `vfn` and `mfn` must be both superpage aligned.
>>> +     * They are or-ed together and then checked against the size
>>> of
>>> +     * each level.
>>> +     *
>>> +     * `left` is not included and checked separately to allow
>>> +     * superpage mapping even if it is not properly aligned (the
>>> +     * user may have asked to map 2MB + 4k).
>>
>> What is this about? There's nothing named "left" here.
> It refer to "remaining" pages or "leftover" space after trying to align
> a mapping to a superpage boundary.

What what is the quoted "left" here? Such a variable appears to exist in
the caller, but using the name here is lacking context.

>>> +/*
>>> + * If `mfn` equals `INVALID_MFN`, it indicates that the following
>>> page table
>>> + * update operation might be related to either populating the
>>> table (
>>> + * PTE_POPULATE will be set additionaly), destroying a mapping, or
>>> modifying
>>> + * an existing mapping.
>>
>> And the latter two are distinguished by? PTE_VALID?
> inserting -> (PTE_VALID=1 + (mfn=something valid))
> destroying-> ( PTE_VALID=0 )

Which then needs saying in the comment.

>>> + * If `mfn` is valid and flags has PTE_VALID bit set then it means
>>> that
>>> + * inserting will be done.
>>> + */
>>
>> What about mfn != INVALID_MFN and PTE_VALID clear?
> PTE_VALID=0 will be always considered as destroying and no matter what
> is mfn value as in this case the removing is done in the way where mfn
> isn't used:

Right, yet elsewhere you're restrictive as to MFN values valid to use.
Not requiring INVALID_MFN here looks inconsistent to me.

>         memset(&pte, 0x00, sizeof(pte));

Just to mention it: I don't think memset() is a very good way of clearing
a PTE, even if right here it's not a live one.

>>  Also note that "`mfn` is
>> valid" isn't the same as "mfn != INVALID_MFN". You want to be precise
>> here,
>> to avoid confusion later on. (I say that knowing that we're still
>> fighting
>> especially shadow paging code on x86 not having those properly
>> separated.)
> If it is needed to be precise and mfn is valid isn't the same as "mfn
> != INVALID_MFN" only for the case of shadow paging?

No, I used shadow paging only as an example of where we have similar
issues. I'd like to avoid that a new port starts out with introducing
more instances of that. You want to properly separate INVALID_MFN from
"invalid MFN", where the latter means any MFN where either nothing
exists at all, or (see mfn_valid()) where no struct page_info exists.

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.