
Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()




On 11/10/25 3:53 PM, Jan Beulich wrote:
On 20.10.2025 17:57, Oleksii Kurochko wrote:
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -8,12 +8,45 @@
 #include <xen/rwlock.h>
 #include <xen/types.h>
 
+#include <asm/page.h>
 #include <asm/page-bits.h>
 
 extern unsigned char gstage_mode;
+extern unsigned int gstage_root_level;
 
 #define P2M_ROOT_ORDER  (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
 #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
+#define P2M_ROOT_LEVEL  gstage_root_level
+
+/*
+ * According to the RISC-V spec:
+ *   When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
+ *   or Sv57x4, G-stage address translation is a variation on the usual
+ *   page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
+ *   Sv57, respectively. In each case, the size of the incoming address is
+ *   widened by 2 bits (to 34, 41, 50, or 59 bits).
+ *
+ * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
+ * the index for this level of the P2M page table starts. The extra 2
+ * bits added by the "x4" schemes only affect the root page table width.
+ *
+ * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
+ * levels: the extra 2 bits do not change the indices of lower levels.
+ *
+ * The extra 2 bits are only relevant if one tried to address beyond the
+ * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
+ * invalid.
+ */
+#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
Is the last paragraph of the comment really needed? It talks about something
absurd / impossible only.
Agreed, it isn't really needed; let's drop it.


+#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
+
+#define P2M_PAGETABLE_ENTRIES(lvl) \
+    (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
+
+#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
I'm not sure I fully understand you here. With the current implementation,
it returns a bitmask that corresponds to the number of index bits used
at each level. So, if P2M_ROOT_LEVEL = 2, then:
  GFN_MASK(0) = 0x1ff (9-bit index width for level 0)
  GFN_MASK(1) = 0x1ff (9-bit index width for level 1)
  GFN_MASK(2) = 0x7ff (11-bit index width for level 2, due to the extra 2 root bits)

Or do you mean that GFN_MASK(lvl) should return something like this:
  GFN_MASK_(0) = 0x1FF000 (0x1ff << 12)
  GFN_MASK_(1) = 0x3FE00000 (GFN_MASK_(0) << 9)
  GFN_MASK_(2) = 0x1FFC0000000 (GFN_MASK_(1) << 9, plus the extra 2 bits)
And then here ...

That's not really something you can apply to a GFN, unlike the name
suggests.
That is why the address should be shifted appropriately first, as is done
in calc_offset():
  (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);

...
 (va & GFN_MASK_(lvl)) >> P2M_LEVEL_SHIFT(lvl) ?
In this option more shifts will be needed.
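For what it's worth, the two conventions pick the same index either way. A
standalone sketch, using illustrative Sv39x4-like constants rather than the
actual Xen macros:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants assuming an Sv39x4-like layout (root level 2). */
#define PAGE_SHIFT        12
#define PAGETABLE_ORDER   9
#define P2M_ROOT_LEVEL    2

#define P2M_LEVEL_ORDER(lvl)      ((lvl) * PAGETABLE_ORDER)
#define P2M_ROOT_EXTRA_BITS(lvl)  (2 * ((lvl) == P2M_ROOT_LEVEL))
#define GFN_MASK(lvl) \
    ((UINT64_C(1) << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1)
#define P2M_LEVEL_SHIFT(lvl)      (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)

/* Variant 1: shift first, then apply the narrow mask (as in calc_offset()). */
static inline unsigned int index_shift_then_mask(unsigned int lvl, uint64_t va)
{
    return (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
}

/* Variant 2: apply a pre-shifted (wide) mask, then shift down. */
static inline unsigned int index_mask_then_shift(unsigned int lvl, uint64_t va)
{
    uint64_t shifted_mask = GFN_MASK(lvl) << P2M_LEVEL_SHIFT(lvl);

    return (va & shifted_mask) >> P2M_LEVEL_SHIFT(lvl);
}
```

Both variants extract the same bits; the second merely needs the wider,
per-level shifted mask constants.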

Would it be better to just rename GFN_MASK() to P2M_PT_INDEX_MASK()? Or
maybe even just P2M_INDEX_MASK()?


+#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
Whereas here the macro name doesn't make clear what is shifted: An
address or a GFN. (It's the former, aiui.)
Yes, it is expected to be used to shift a GFN.

Similarly to the above, would it be better to rename P2M_LEVEL_SHIFT() to
P2M_GFN_LEVEL_SHIFT()?


--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -9,6 +9,7 @@
 #include <xen/rwlock.h>
 #include <xen/sched.h>
 #include <xen/sections.h>
+#include <xen/xvmalloc.h>
 
 #include <asm/csr.h>
 #include <asm/flushtlb.h>
@@ -17,6 +18,43 @@
 #include <asm/vmid.h>
 
 unsigned char __ro_after_init gstage_mode;
+unsigned int __ro_after_init gstage_root_level;
Like for mode, I'm unconvinced of this being a global (and not per-P2M /
per-domain).
The question then is whether we will really have (or want) cases where the
G-stage mode differs per domain / per P2M.


+/*
+ * The P2M root page table is extended by 2 bits, making its size 16KB
+ * (instead of 4KB for non-root page tables). Therefore, P2M root page
+ * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+ * only allocates 4KB pages).
+ */
+#define ENTRIES_PER_ROOT_PAGE \
+    (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_PAGES)
+
+static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
Where would a vaddr_t come from here? Your input are guest-physical addresses,
if I'm not mistaken.
You are right. Would it be right to use 'paddr_t gpa' here? Or is paddr_t
supposed to be used only for machine physical addresses?



+{
+    unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
+
+    /*
+     * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
+     * page table spans 4 consecutive 4KB pages.
+     * We want to return an index within one of these 4 pages.
+     * The specific page to use is determined by `p2m_get_root_pointer()`.
+     *
+     * Example: if `offset == 512`:
+     *  - A single 4KB page holds 512 entries.
+     *  - Therefore, entry 512 corresponds to index 0 of the second page.
+     *
+     * At all other levels, only one page is allocated, and `offset` is
+     * always in the range 0 to 511, since the VPN is 9 bits long.
+     */
+    return offset % ENTRIES_PER_ROOT_PAGE;
Seeing something "root" used here (when this is for all levels) is pretty odd,
despite all the commentary. Given all the commentary, why not simply

    return offset & ((1U << PAGETABLE_ORDER) - 1);

?
It works for all levels where lvl < P2M_ROOT_LEVEL, because in those cases
the index width equals PAGETABLE_ORDER. However, at the root level the index
is 2 bits wider, so something like the following is needed:
  offset & ((1U << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1);
This still returns an offset within the full 16KB root table, but the P2M
root actually consists of four consecutive 4KB pages, so the intention was
to return an offset inside the specific 4KB page.

While writing the above, I started thinking whether calc_offset() could be implemented
much more simply. Since the root page table consists of four consecutive pages, it seems
acceptable to have the offset in the range [0, 2^11) instead of doing all the extra
manipulation to determine which of the four pages is used and the offset within that
specific page:

  static inline unsigned int calc_offset(unsigned int lvl, paddr_t gpa)
  {
     return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
  }

  static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
  {
    return __map_domain_page(p2m->root);
  }
It probably still makes sense for p2m_get_root_pointer() to check that the
root index is below 2^11.
Am I missing something?
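As a quick sanity check of the simplified scheme, a standalone model
(illustrative Sv39x4-like constants and hypothetical helper names; the real
code would use Xen's types and __map_domain_page()):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative Sv39x4-like constants; not the actual Xen definitions. */
#define PAGE_SHIFT        12
#define PAGETABLE_ORDER   9
#define P2M_ROOT_LEVEL    2

#define P2M_LEVEL_ORDER(lvl)      ((lvl) * PAGETABLE_ORDER)
#define P2M_ROOT_EXTRA_BITS(lvl)  (2 * ((lvl) == P2M_ROOT_LEVEL))
#define GFN_MASK(lvl) \
    ((UINT64_C(1) << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1)
#define P2M_LEVEL_SHIFT(lvl)      (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)

/* Simplified calc_offset(): root-level results span [0, 2^11). */
static inline unsigned int calc_offset(unsigned int lvl, uint64_t gpa)
{
    return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
}

/*
 * The bounds check p2m_get_root_pointer() would still perform: the 11-bit
 * root index must stay below 2^11, i.e. within the 16KB root table.
 */
static inline int root_index_valid(uint64_t gfn)
{
    return (gfn >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL)) < (UINT64_C(1) << 11);
}
```

With this split, calc_offset() needs no special-casing at all, and the only
root-specific logic left is the range check.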

+}
+
+#define P2M_MAX_ROOT_LEVEL 4
+
+#define P2M_DECLARE_OFFSETS(var, addr) \
+    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
+    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
+        var[i] = calc_offset(i, addr);
This surely is more than just "declare", and it's dealing with all levels no
matter whether you actually will use all offsets.
I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().

But how can I know which offset I will actually need to use?
If we take the following loop as an example:
  for ( level = P2M_ROOT_LEVEL; level > target; level-- )
  {
    /*
     * Don't try to allocate intermediate page tables if the mapping
     * is about to be removed.
     */
     rc = p2m_next_level(p2m, !removing_mapping,
                         level, &table, offsets[level]);
     ...
  }
It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.

If you mean that, for example, when the G-stage mode is Sv39 there is no need
to allocate an array with 4 entries (or 5 entries if we consider Sv57, in
which case P2M_MAX_ROOT_LEVEL would need updating), because Sv39 only uses
3 page-table levels, then yes, in theory it could be smaller. But I don't
think it is a real issue if the offsets[] array on the stack has a few extra
unused entries.

If preferred, I could allocate the array dynamically based on gstage_root_level.
Would that be better?
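For illustration, a standalone sketch of what the renamed macro could look
like (illustrative constants; the real code would iterate up to
gstage_root_level, and the slot-fill policy for unused levels is my
assumption):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative Sv39x4-like constants; the real code uses gstage_root_level. */
#define PAGE_SHIFT          12
#define PAGETABLE_ORDER     9
#define P2M_ROOT_LEVEL      2
#define P2M_MAX_ROOT_LEVEL  4   /* Sv57x4 has levels 0..4 */

#define P2M_ROOT_EXTRA_BITS(lvl)  (2 * ((lvl) == P2M_ROOT_LEVEL))
#define GFN_MASK(lvl) \
    ((UINT64_C(1) << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1)
#define P2M_LEVEL_SHIFT(lvl)      ((lvl) * PAGETABLE_ORDER + PAGE_SHIFT)

static inline unsigned int calc_offset(unsigned int lvl, uint64_t gpa)
{
    return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
}

/*
 * One slot per level up to the root; slots above the root level are left
 * at -1 so a stray access is easy to spot.
 */
#define P2M_BUILD_LEVEL_OFFSETS(var, addr) \
    unsigned int var[P2M_MAX_ROOT_LEVEL + 1]; \
    for ( unsigned int i_ = 0; i_ <= P2M_MAX_ROOT_LEVEL; i_++ ) \
        var[i_] = (i_ <= P2M_ROOT_LEVEL) ? calc_offset(i_, addr) : -1U;
```

The few unused slots cost a handful of bytes of stack, which seems cheaper
than a dynamic allocation sized from gstage_root_level.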



@@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
     return rc;
 }
 
+/*
+ * Map one of the four root pages of the P2M root page table.
+ *
+ * The P2M root page table is larger than normal (16KB instead of 4KB),
+ * so it is allocated as four consecutive 4KB pages. This function selects
+ * the appropriate 4KB page based on the given GFN and returns a mapping
+ * to it.
+ *
+ * The caller is responsible for unmapping the page after use.
+ *
+ * Returns NULL if the calculated offset into the root table is invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+    unsigned long root_table_indx;
+
+    root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
With the variable name shortened (to e.g. idx) this could be its initializer
without ending up with too long a line. The root_table_ prefix isn't really
adding much value in the context of this function.

+    if ( root_table_indx >= P2M_ROOT_PAGES )
+        return NULL;
+
+    /*
+     * The P2M root page table is extended by 2 bits, making its size 16KB
+     * (instead of 4KB for non-root page tables). Therefore, p2m->root is
+     * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+     * only allocates 4KB pages).
+     *
+     * Initially, `root_table_indx` is derived directly from `va`.
There's no 'va' here.
Right, it should be `gfn`.
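For reference, the page-selection arithmetic described above can be modeled
standalone (hypothetical helper names, assuming 512 PTEs per 4KB page):

```c
#include <assert.h>

#define PAGETABLE_ORDER   9
#define ENTRIES_PER_PAGE  (1U << PAGETABLE_ORDER)  /* 512 PTEs per 4KB page */

/* Which of the four consecutive 4KB root pages holds entry `idx`. */
static inline unsigned int root_page(unsigned int idx)
{
    return idx / ENTRIES_PER_PAGE;
}

/* The entry's slot within that 4KB page. */
static inline unsigned int root_slot(unsigned int idx)
{
    return idx % ENTRIES_PER_PAGE;
}
```

This mirrors the example in the calc_offset() comment: entry 512 of the
16KB root table is slot 0 of the second 4KB page.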


+static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
"clean_pte" as a parameter name of a function of this name is, well, odd.
clean_cache would be better:
  p2m_clean_pte(pte_t *p, bool clean_cache)

I suppose it would be nice to rename clean_pte to clean_cache everywhere.

Thanks.

~ Oleksii

 

