Re: [for 4.22 v5 10/18] xen/riscv: implement p2m_set_range()
- To: Jan Beulich <jbeulich@xxxxxxxx>
- From: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
- Date: Fri, 14 Nov 2025 18:04:10 +0100
- Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
- Delivery-date: Fri, 14 Nov 2025 17:04:24 +0000
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
On 11/10/25 3:53 PM, Jan Beulich wrote:
On 20.10.2025 17:57, Oleksii Kurochko wrote:
--- a/xen/arch/riscv/include/asm/p2m.h
+++ b/xen/arch/riscv/include/asm/p2m.h
@@ -8,12 +8,45 @@
#include <xen/rwlock.h>
#include <xen/types.h>
+#include <asm/page.h>
#include <asm/page-bits.h>
extern unsigned char gstage_mode;
+extern unsigned int gstage_root_level;
#define P2M_ROOT_ORDER (ilog2(GSTAGE_ROOT_PAGE_TABLE_SIZE) - PAGE_SHIFT)
#define P2M_ROOT_PAGES BIT(P2M_ROOT_ORDER, U)
+#define P2M_ROOT_LEVEL gstage_root_level
+
+/*
+ * According to the RISC-V spec:
+ * When hgatp.MODE specifies a translation scheme of Sv32x4, Sv39x4, Sv48x4,
+ * or Sv57x4, G-stage address translation is a variation on the usual
+ * page-based virtual address translation scheme of Sv32, Sv39, Sv48, or
+ * Sv57, respectively. In each case, the size of the incoming address is
+ * widened by 2 bits (to 34, 41, 50, or 59 bits).
+ *
+ * P2M_LEVEL_ORDER(lvl) defines the bit position in the GFN from which
+ * the index for this level of the P2M page table starts. The extra 2
+ * bits added by the "x4" schemes only affect the root page table width.
+ *
+ * Therefore, this macro can safely reuse XEN_PT_LEVEL_ORDER() for all
+ * levels: the extra 2 bits do not change the indices of lower levels.
+ *
+ * The extra 2 bits are only relevant if one tried to address beyond the
+ * root level (i.e., P2M_LEVEL_ORDER(P2M_ROOT_LEVEL + 1)), which is
+ * invalid.
+ */
+#define P2M_LEVEL_ORDER(lvl) XEN_PT_LEVEL_ORDER(lvl)
Is the last paragraph of the comment really needed? It talks about something
absurd / impossible only.
Agreed, it isn't really needed; let's drop it.
+#define P2M_ROOT_EXTRA_BITS(lvl) (2 * ((lvl) == P2M_ROOT_LEVEL))
+
+#define P2M_PAGETABLE_ENTRIES(lvl) \
+ (BIT(PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl), UL))
+
+#define GFN_MASK(lvl) (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)
If I'm not mistaken, this is a mask with the low 10 or 12 bits set.
I'm not sure I fully understand you here. With the current implementation,
it returns a bitmask that corresponds to the number of index bits used
at each level. So, if P2M_ROOT_LEVEL = 2, then:
GFN_MASK(0) = 0x1ff (9-bit index width for level 0)
GFN_MASK(1) = 0x1ff (9-bit index width for level 1)
GFN_MASK(2) = 0x7ff (11-bit index width for level 2)
Or do you mean that GFN_MASK(lvl) should return something like this:
GFN_MASK_(0) = 0x1FF000 (0x1ff << 0xc)
GFN_MASK_(1) = 0x3FE00000 (GFN_MASK_(0)<<9)
GFN_MASK_(2) = 0x1FFC0000000 (GFN_MASK_(1)<<9 + extra 2 bits)
And then here ...
That's not really something you can apply to a GFN, unlike the name
suggests.
That is why the address has to be shifted appropriately first, as is done
in calc_offset():
(va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
...
(va & GFN_MASK_(lvl)) >> P2M_LEVEL_SHIFT(lvl) ?
With this option, more shifts would be needed.
Would it be better to just rename GFN_MASK() to P2M_PT_INDEX_MASK()? Or
maybe even just P2M_INDEX_MASK()?
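To make the renaming concrete, something along these lines is what I have in
mind (purely a sketch; the helper name p2m_index() is made up for illustration):
#define P2M_INDEX_MASK(lvl)   (P2M_PAGETABLE_ENTRIES(lvl) - 1UL)

/* Extract the page-table index for level `lvl` from a guest-physical address. */
static inline unsigned int p2m_index(unsigned int lvl, paddr_t gpa)
{
    /* Shift first, then mask: the mask stays a plain index-width mask. */
    return (gpa >> P2M_LEVEL_SHIFT(lvl)) & P2M_INDEX_MASK(lvl);
}
The GFN_MASK_() variant would instead need the mask pre-shifted into place and
the result shifted back down, hence the extra shifts mentioned above.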
+#define P2M_LEVEL_SHIFT(lvl) (P2M_LEVEL_ORDER(lvl) + PAGE_SHIFT)
Whereas here the macro name doesn't make clear what is shifted: An
address or a GFN. (It's the former, aiui.)
Yes, it is expected to be used for shifting a GFN.
Similarly to the above, would it be better to rename P2M_LEVEL_SHIFT() to
P2M_GFN_LEVEL_SHIFT()?
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -9,6 +9,7 @@
#include <xen/rwlock.h>
#include <xen/sched.h>
#include <xen/sections.h>
+#include <xen/xvmalloc.h>
#include <asm/csr.h>
#include <asm/flushtlb.h>
@@ -17,6 +18,43 @@
#include <asm/vmid.h>
unsigned char __ro_after_init gstage_mode;
+unsigned int __ro_after_init gstage_root_level;
Like for mode, I'm unconvinced of this being a global (and not per-P2M /
per-domain).
The question then is whether we really will have (or want to have) cases where
the G-stage mode differs per domain / per P2M?
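Just to make the alternative concrete (the field names below are my invention,
not something taken from the patch), a per-P2M variant would presumably keep
the mode and root level in struct p2m_domain:
struct p2m_domain {
    /* ... existing members elided ... */

    /* hgatp.MODE to be used for this P2M. */
    unsigned char gstage_mode;

    /* Root page table level, derived from gstage_mode. */
    unsigned int root_level;
};
As long as all domains use the same G-stage mode, a global would behave
identically.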
+/*
+ * The P2M root page table is extended by 2 bits, making its size 16KB
+ * (instead of 4KB for non-root page tables). Therefore, P2M root page
+ * is allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+ * only allocates 4KB pages).
+ */
+#define ENTRIES_PER_ROOT_PAGE \
+ (P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) / P2M_ROOT_ORDER)
+
+static inline unsigned int calc_offset(unsigned int lvl, vaddr_t va)
Where would a vaddr_t come from here? Your inputs are guest-physical addresses,
if I'm not mistaken.
You are right. Would it be correct to use 'paddr_t gpa' here? Or is paddr_t
supposed to be used only for machine physical addresses?
+{
+ unsigned int offset = (va >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
+
+ /*
+ * For P2M_ROOT_LEVEL, `offset` ranges from 0 to 2047, since the root
+ * page table spans 4 consecutive 4KB pages.
+ * We want to return an index within one of these 4 pages.
+ * The specific page to use is determined by `p2m_get_root_pointer()`.
+ *
+ * Example: if `offset == 512`:
+ * - A single 4KB page holds 512 entries.
+ * - Therefore, entry 512 corresponds to index 0 of the second page.
+ *
+ * At all other levels, only one page is allocated, and `offset` is
+ * always in the range 0 to 511, since the VPN is 9 bits long.
+ */
+ return offset % ENTRIES_PER_ROOT_PAGE;
Seeing something "root" used here (when this is for all levels) is pretty odd,
despite all the commentary. Given all the commentary, why not simply
return offset & ((1U << PAGETABLE_ORDER) - 1);
?
It works for all levels where lvl < P2M_ROOT_LEVEL, because in those cases the
index width is equal to PAGETABLE_ORDER. However, at the root level the index
width is 2 bits larger. So something like the following is needed:
offset & ((1U << (PAGETABLE_ORDER + P2M_ROOT_EXTRA_BITS(lvl))) - 1);
This still returns an offset within the whole 16 KB root table, but in the case
of the P2M root we actually have four consecutive 4 KB pages, so the intention
was to return an offset inside one of those four 4 KB pages.
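In other words (just a sketch; the helper name is made up), the 11-bit
root-level offset would be split like this:
static inline void root_offset_split(unsigned int offset,
                                     unsigned int *page,
                                     unsigned int *index)
{
    *page  = offset >> PAGETABLE_ORDER;              /* which 4 KB page: 0..3 */
    *index = offset & (BIT(PAGETABLE_ORDER, U) - 1); /* index inside it: 0..511 */
}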
While writing the above, I started wondering whether calc_offset() could be
implemented much more simply. Since the root page table consists of four
consecutive pages, it seems acceptable to keep the offset in the range
[0, 2^11) instead of doing the extra work of determining which of the four
pages is used and the offset within that specific page:
static inline unsigned int calc_offset(unsigned int lvl, paddr_t gpa)
{
    return (gpa >> P2M_LEVEL_SHIFT(lvl)) & GFN_MASK(lvl);
}

static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
{
    return __map_domain_page(p2m->root);
}
It probably still makes sense for p2m_get_root_pointer() to check that the
root-level index stays below 2^11 (i.e. below
P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL)).
Am I missing something?
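Something like the following is what I would add for that check (again only a
sketch on top of the simplified variant above, not a final implementation):
static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
{
    unsigned long idx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);

    /* The root table only has P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) slots. */
    if ( idx >= P2M_PAGETABLE_ENTRIES(P2M_ROOT_LEVEL) )
        return NULL;

    return __map_domain_page(p2m->root);
}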
+}
+
+#define P2M_MAX_ROOT_LEVEL 4
+
+#define P2M_DECLARE_OFFSETS(var, addr) \
+ unsigned int var[P2M_MAX_ROOT_LEVEL] = {-1};\
+ for ( unsigned int i = 0; i <= gstage_root_level; i++ ) \
+ var[i] = calc_offset(i, addr);
This surely is more than just "declare", and it's dealing with all levels no
matter whether you actually will use all offsets.
I will rename P2M_DECLARE_OFFSETS to P2M_BUILD_LEVEL_OFFSETS().
But how can I know which offset I will actually need to use?
If we take the following loop as an example:
for ( level = P2M_ROOT_LEVEL; level > target; level-- )
{
    rc = p2m_next_level(p2m, !removing_mapping,
                        level, &table, offsets[level]);
    ...
}
It walks from P2M_ROOT_LEVEL down to target, where target is determined at runtime.
If you mean that, for example, when the G-stage mode is Sv39, there is no need to allocate
an array with 4 entries (or 5 entries if we consider Sv57, so P2M_MAX_ROOT_LEVEL should be
updated), because Sv39 only uses 3 page table levels — then yes, in theory it could be
smaller. But I don't think it is a real issue if the offsets[] array on the stack has a
few extra unused entries.
If preferred, I could allocate the array dynamically based on gstage_root_level.
Would that be better?
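For reference, the rename I have in mind would look roughly like this (the
array size of 5 assumes Sv57x4 as the deepest mode; that part is only my
assumption):
#define P2M_MAX_LEVELS 5    /* Sv57x4: levels 0 .. 4 */

#define P2M_BUILD_LEVEL_OFFSETS(var, gpa)                       \
    unsigned int var[P2M_MAX_LEVELS] = { 0 };                   \
    for ( unsigned int i_ = 0; i_ <= P2M_ROOT_LEVEL; i_++ )     \
        var[i_] = calc_offset(i_, gpa);
Entries above the root level are simply left at zero and never read.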
@@ -259,13 +308,293 @@ int p2m_set_allocation(struct domain *d, unsigned long pages, bool *preempted)
return rc;
}
+/*
+ * Map one of the four root pages of the P2M root page table.
+ *
+ * The P2M root page table is larger than normal (16KB instead of 4KB),
+ * so it is allocated as four consecutive 4KB pages. This function selects
+ * the appropriate 4KB page based on the given GFN and returns a mapping
+ * to it.
+ *
+ * The caller is responsible for unmapping the page after use.
+ *
+ * Returns NULL if the calculated offset into the root table is invalid.
+ */
+static pte_t *p2m_get_root_pointer(struct p2m_domain *p2m, gfn_t gfn)
+{
+ unsigned long root_table_indx;
+
+ root_table_indx = gfn_x(gfn) >> P2M_LEVEL_ORDER(P2M_ROOT_LEVEL);
With the variable name shortened (to e.g. idx) this could be its initializer
without ending up with too long a line. The root_table_ prefix isn't really
adding much value in the context of this function.
+ if ( root_table_indx >= P2M_ROOT_PAGES )
+ return NULL;
+
+ /*
+ * The P2M root page table is extended by 2 bits, making its size 16KB
+ * (instead of 4KB for non-root page tables). Therefore, p2m->root is
+ * allocated as four consecutive 4KB pages (since alloc_domheap_pages()
+ * only allocates 4KB pages).
+ *
+ * Initially, `root_table_indx` is derived directly from `va`.
There's no 'va' here.
Should be gfn.
+static inline void p2m_clean_pte(pte_t *p, bool clean_pte)
"clean_pte" as a parameter name of a function of this name is, well, odd.
clean_cache would be better:
p2m_clean_pte(pte_t *p, bool clean_cache)
I suppose it would be nice to rename clean_pte to clean_cache everywhere.
Thanks.
~ Oleksii