
Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()




On 11/10/25 3:53 PM, Jan Beulich wrote:
On 20.10.2025 17:57, Oleksii Kurochko wrote:
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -8,12 +8,45 @@
 #include <xen/rwlock.h>
 #include <xen/types.h>
 
+#include <asm/page.h>
 #include <asm/page-bits.h>
 
 extern unsigned char gstage_mode;
+extern unsigned int gstage_root_level;
 
 #define P2M_ROOT_ORDER  (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
 #define P2M_ROOT_PAGES  BIT(P2M_ROOT_ORDER, U)
+#define P2M_ROOT_LEVEL  gstage_root_level
+
+/*
+ * According to the RISC-V spec:
+ *   When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
+ *   or Sv57x4, G-stage address translation is a variation on the usual
+ *   page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
+ *   Sv57, respectively. In each case, the size of the incoming address is
+ *   widened by 2 bits (to 34, 41, 50, or 59 bits).
+ *
+ * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
+ * the index for this level of the P2M page table starts. The extra 2
+ * bits added by the "x4" schemes only affect the root page table width.
+ *
+ * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
+ * levels: the extra 2 bits do not change the indices of lower levels.
+ *
+ * The extra 2 bits are only relevant if one tried to address beyond the
+ * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
+ * invalid.
+ */
+#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
Is the last paragraph of the comment really needed? It talks about something
absurd / impossible only.
Agreed, it isn't really needed; let's drop it.


+#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
+
+#define P2M_PAGETABLE_ENTRIES(lvl) \
+    (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
+
+#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
I'm not sure I fully understand you here. With the current implementation,
it returns a bitmask that corresponds to the number of index bits used
at each level. So, if P2M_ROOT_LEVEL = 2, then:
  GFN_MASK(0) = 0x1ff (9-bit index width for level 0)
  GFN_MASK(1) = 0x1ff (9-bit index width for level 1)
  GFN_MASK(2) = 0x7ff (11-bit index width for level 2, due to the extra 2 root bits)

Or do you mean that GFN_MASK(lvl) should return something like this:
  GFN_MASK_(0) = 0x1FF000 (0x1ff << 12)
  GFN_MASK_(1) = 0x3FE00000 (GFN_MASK_(0) << 9)
  GFN_MASK_(2) = 0x1FFC0000000 (GFN_MASK_(1) << 9, plus the extra 2 bits)
And then here ...

That's not really something you can apply to a GFN, unlike the name
suggests.
That is why the address should be shifted appropriately first, as is done
in calc_offset():
  (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);

...
 (va & GFN_MASK_(lvl)) >> P2M_LEVEL_SHIFT(lvl) ?
In this option more shifts will be needed.
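For what it's worth, the two conventions pick the same index either way. A
standalone sketch, using illustrative Sv39x4-like constants rather than the
actual Xen macros:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants assuming an Sv39x4-like layout (root level 2). */
#define PAGE_SHIFT        12
#define PAGETABLE_ORDER   9
#define P2M_ROOT_LEVEL    2

#define P2M_LEVEL_ORDER(lvl)      ((lvl) * PAGETABLE_ORDER)
#define P2M_ROOT_EXTRA_BITS(lvl)  (2 * ((lvl) == P2M_ROOT_LEVEL))
#define GFN_MASK(lvl) \
    ((UINT64_C(1) << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1)
#define P2M_LEVEL_SHIFT(lvl)      (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)

/* Variant 1: shift first, then apply the narrow mask (as in calc_offset()). */
static inline unsigned int index_shift_then_mask(unsigned int lvl, uint64_t va)
{
    return (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
}

/* Variant 2: apply a pre-shifted (wide) mask, then shift down. */
static inline unsigned int index_mask_then_shift(unsigned int lvl, uint64_t va)
{
    uint64_t shifted_mask = GFN_MASK(lvl) << P2M_LEVEL_SHIFT(lvl);

    return (va & shifted_mask) >> P2M_LEVEL_SHIFT(lvl);
}
```

Both variants extract the same bits; the second merely needs the wider,
per-level shifted mask constants.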

Would it be better to just rename GFN_MASK() to P2M_PT_INDEX_MASK()? Or
maybe even just P2M_INDEX_MASK()?


+#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
Whereas here the macro name doesn't make clear what is shifted: An
address or a GFN. (It's the former, aiui.)
Yes, it is expected to be used to shift a GFN.

Similarly to the above, would it be better to rename P2M_LEVEL_SHIFT() to
P2M_GFN_LEVEL_SHIFT()?


--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -9,6 +9,7 @@
 #include <xen/rwlock.h>
 #include <xen/sched.h>
 #include <xen/sections.h>
+#include <xen/xvmalloc.h>
 
 #include <asm/csr.h>
 #include <asm/flushtlb.h>
@@ -17,6 +18,43 @@
 #include <asm/vmid.h>
 
 unsigned char __ro_after_init gstage_mode;
+unsigned int __ro_after_init gstage_root_level;
Like for mode, I'm unconvinced of this being a global (and not per-P2M /
per-domain).
The question then is whether we will really have (or want) cases where the
G-stage mode differs per domain / per P2M.


+/*
+ * The P2M root page table is extended by 2 bits, making its size 16KB
+ * (instead of 4KB for non-root page tables). Therefore, P2M root page
+ * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+ * only allocates 4KB pages).
+ */
+#define ENTRIES_PER_ROOT_PAGE \
+    (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_PAGES)
+
+static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
Where would a vaddr_t come from here? Your input are guest-physical addresses,
if I'm not mistaken.
You are right. Would it be right to use 'paddr_t gpa' here? Or is paddr_t
supposed to be used only for machine physical addresses?



+{
+    unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
+
+    /*
+     * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
+     * page table spans 4 consecutive 4KB pages.
+     * We want to return an index within one of these 4 pages.
+     * The specific page to use is determined by `p2m_get_root_pointer()`.
+     *
+     * Example: if `offset == 512`:
+     *  - A single 4KB page holds 512 entries.
+     *  - Therefore, entry 512 corresponds to index 0 of the second page.
+     *
+     * At all other levels, only one page is allocated, and `offset` is
+     * always in the range 0 to 511, since the VPN is 9 bits long.
+     */
+    return offset % ENTRIES_PER_ROOT_PAGE;
Seeing something "root" used here (when this is for all levels) is pretty odd,
despite all the commentary. Given all the commentary, why not simply

    return offset & ((1U << PAGETABLE_ORDER) - 1);

?
It works for all levels where lvl < P2M_ROOT_LEVEL, because in those cases
the index width equals PAGETABLE_ORDER. However, at the root level the index
is 2 bits wider, so something like the following is needed:
  offset & ((1U << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1);
This still returns an offset within the full 16KB root table, but the P2M
root actually consists of four consecutive 4KB pages, so the intention was
to return an offset inside the specific 4KB page.

While writing the above, I started thinking whether calc_offset() could be implemented
much more simply. Since the root page table consists of four consecutive pages, it seems
acceptable to have the offset in the range [0, 2^11) instead of doing all the extra
manipulation to determine which of the four pages is used and the offset within that
specific page:

  static inline unsigned int calc_offset(unsigned int lvl, paddr_t gpa)
  {
     return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
  }

  static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
  {
    return __map_domain_page(p2m->root);
  }
It probably still makes sense for p2m_get_root_pointer() to check that the
root index is below 2^11.
Am I missing something?
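As a quick sanity check of the simplified scheme, a standalone model
(illustrative Sv39x4-like constants and hypothetical helper names; the real
code would use Xen's types and __map_domain_page()):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative Sv39x4-like constants; not the actual Xen definitions. */
#define PAGE_SHIFT        12
#define PAGETABLE_ORDER   9
#define P2M_ROOT_LEVEL    2

#define P2M_LEVEL_ORDER(lvl)      ((lvl) * PAGETABLE_ORDER)
#define P2M_ROOT_EXTRA_BITS(lvl)  (2 * ((lvl) == P2M_ROOT_LEVEL))
#define GFN_MASK(lvl) \
    ((UINT64_C(1) << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1)
#define P2M_LEVEL_SHIFT(lvl)      (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)

/* Simplified calc_offset(): root-level results span [0, 2^11). */
static inline unsigned int calc_offset(unsigned int lvl, uint64_t gpa)
{
    return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
}

/*
 * The bounds check p2m_get_root_pointer() would still perform: the 11-bit
 * root index must stay below 2^11, i.e. within the 16KB root table.
 */
static inline int root_index_valid(uint64_t gfn)
{
    return (gfn >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL)) < (UINT64_C(1) << 11);
}
```

With this split, calc_offset() needs no special-casing at all, and the only
root-specific logic left is the range check.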

+}
+
+#define P2M_MAX_ROOT_LEVEL 4
+
+#define P2M_DECLARE_OFFSETS(var, addr) \
+    unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
+    for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
+        var[i] = calc_offset(i, addr);
This surely is more than just "declare", and it's dealing with all levels no
matter whether you actually will use all offsets.
I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().

But how can I know which offset I will actually need to use?
If we take the following loop as an example:
  for ( level = P2M_ROOT_LEVEL; level > target; level-- )
  {
    /*
     * Don't try to allocate intermediate page tables if the mapping
     * is about to be removed.
     */
     rc = p2m_next_level(p2m, !removing_mapping,
                         level, &table, offsets[level]);
     ...
  }
It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.

If you mean that, for example, when the G-stage mode is Sv39 there is no need
to allocate an array with 4 entries (or 5 entries if we consider Sv57, in
which case P2M_MAX_ROOT_LEVEL would need updating), because Sv39 only uses
3 page-table levels, then yes, in theory it could be smaller. But I don't
think it is a real issue if the offsets[] array on the stack has a few extra
unused entries.

If preferred, I could allocate the array dynamically based on gstage_root_level.
Would that be better?
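For illustration, a standalone sketch of what the renamed macro could look
like (illustrative constants; the real code would iterate up to
gstage_root_level, and the slot-fill policy for unused levels is my
assumption):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative Sv39x4-like constants; the real code uses gstage_root_level. */
#define PAGE_SHIFT          12
#define PAGETABLE_ORDER     9
#define P2M_ROOT_LEVEL      2
#define P2M_MAX_ROOT_LEVEL  4   /* Sv57x4 has levels 0..4 */

#define P2M_ROOT_EXTRA_BITS(lvl)  (2 * ((lvl) == P2M_ROOT_LEVEL))
#define GFN_MASK(lvl) \
    ((UINT64_C(1) << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1)
#define P2M_LEVEL_SHIFT(lvl)      ((lvl) * PAGETABLE_ORDER + PAGE_SHIFT)

static inline unsigned int calc_offset(unsigned int lvl, uint64_t gpa)
{
    return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
}

/*
 * One slot per level up to the root; slots above the root level are left
 * at -1 so a stray access is easy to spot.
 */
#define P2M_BUILD_LEVEL_OFFSETS(var, addr) \
    unsigned int var[P2M_MAX_ROOT_LEVEL + 1]; \
    for ( unsigned int i_ = 0; i_ <= P2M_MAX_ROOT_LEVEL; i_++ ) \
        var[i_] = (i_ <= P2M_ROOT_LEVEL) ? calc_offset(i_, addr) : -1U;
```

The few unused slots cost a handful of bytes of stack, which seems cheaper
than a dynamic allocation sized from gstage_root_level.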



@@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
     return rc;
 }
 
+/*
+ * Map one of the four root pages of the P2M root page table.
+ *
+ * The P2M root page table is larger than normal (16KB instead of 4KB),
+ * so it is allocated as four consecutive 4KB pages. This function selects
+ * the appropriate 4KB page based on the given GFN and returns a mapping
+ * to it.
+ *
+ * The caller is responsible for unmapping the page after use.
+ *
+ * Returns NULL if the calculated offset into the root table is invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+    unsigned long root_table_indx;
+
+    root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
With the variable name shortened (to e.g. idx) this could be its initializer
without ending up with too long a line. The root_table_ prefix isn't really
adding much value in the context of this function.

+    if ( root_table_indx >= P2M_ROOT_PAGES )
+        return NULL;
+
+    /*
+     * The P2M root page table is extended by 2 bits, making its size 16KB
+     * (instead of 4KB for non-root page tables). Therefore, p2m->root is
+     * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+     * only allocates 4KB pages).
+     *
+     * Initially, `root_table_indx` is derived directly from `va`.
There's no 'va' here.
Right, it should be `gfn`.
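For reference, the page-selection arithmetic described above can be modeled
standalone (hypothetical helper names, assuming 512 PTEs per 4KB page):

```c
#include <assert.h>

#define PAGETABLE_ORDER   9
#define ENTRIES_PER_PAGE  (1U << PAGETABLE_ORDER)  /* 512 PTEs per 4KB page */

/* Which of the four consecutive 4KB root pages holds entry `idx`. */
static inline unsigned int root_page(unsigned int idx)
{
    return idx / ENTRIES_PER_PAGE;
}

/* The entry's slot within that 4KB page. */
static inline unsigned int root_slot(unsigned int idx)
{
    return idx % ENTRIES_PER_PAGE;
}
```

This mirrors the example in the calc_offset() comment: entry 512 of the
16KB root table is slot 0 of the second 4KB page.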


+static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
"clean_pte" as a parameter name of a function of this name is, well, odd.
clean_cache would be better:
  p2m_clean_pte(pte_t *p, bool clean_cache)

I suppose it would be nice to rename clean_pte to clean_cache everywhere.

Thanks.

~ Oleksii

 

