[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 11/17] xen/riscv: implement p2m_set_entry() and __p2m_set_entry()




On 7/7/25 2:53 PM, Jan Beulich wrote:
On 07.07.2025 13:46, Oleksii Kurochko wrote:
On 7/7/25 9:20 AM, Jan Beulich wrote:
On 04.07.2025 17:01, Oleksii Kurochko wrote:
On 7/1/25 3:49 PM, Jan Beulich wrote:
On 10.06.2025 15:05, Oleksii Kurochko wrote:
This patch introduces p2m_set_entry() and its core helper __p2m_set_entry() for
RISC-V, based loosely on the Arm implementation, with several RISC-V-specific
modifications.

Key differences include:
- TLB Flushing: RISC-V allows caching of invalid PTEs and does not require
    break-before-make (BBM). As a result, the flushing logic is simplified.
    TLB invalidation can be deferred until p2m_write_unlock() is called.
    Consequently, the p2m->need_flush flag is always considered true and is
    removed.
- Page Table Traversal: The order of walking the page tables differs from Arm,
    and this implementation reflects that reversed traversal.
- Macro Adjustments: The macros P2M_ROOT_LEVEL, P2M_ROOT_ORDER, and
    P2M_ROOT_PAGES are updated to align with the new RISC-V implementation.

The main functionality is in __p2m_set_entry(), which handles mappings aligned
to page table block entries (e.g., 1GB, 2MB, or 4KB with 4KB granularity).

p2m_set_entry() breaks a region down into block-aligned mappings and calls
__p2m_set_entry() accordingly.

Stub implementations (to be completed later) include:
- p2m_free_entry()
What would a function of this name do?
Recursively visiting all leaf PTE's for sub-tree behind an entry, then calls
put_page() (which will free if there is no any reference to this page),
freeing intermediate page table (after all entries were freed) by removing
it from d->arch.paging.freelist, and removes correspondent page of intermediate page
table from p2m->pages list.

You can clear entries, but you can't
free them, can you?
Is is a question regarding terminology?
Yes. If one sees a call to a function, it should be possible to at least
roughly know what it does without needing to go look at the implementation.

I can't free entry itself, but a page table or
a page (if it is a leaf entry) on which it points could free.
Then e.g. pte_free_subtree() or some such?
It sounds fine to me. I'll use suggested name.

Just want to notice that other arches also have the same function
for the same purpose with the same name.
As to x86, it's not general P2M code which uses this odd (for the purpose)
name, but only p2m-pt.c.

Does it make sense then to change a name for all arches?
I think so.

+{
+    panic("%s: isn't implemented for now\n", __func__);
+
+    return false;
+}
For this function in particular, though: Besides the "p2me" in the name
being somewhat odd (supposedly page table entries here are simply pte_t),
how is this going to be different from pte_is_valid()?
pte_is_valid() is checking a real bit of PTE, but p2me_is_valid() is checking
what is a type stored in the radix tree (p2m->p2m_types):
    /*
     * In the case of the P2M, the valid bit is used for other purpose. Use
     * the type to check whether an entry is valid.
     */
    static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
    {
        return p2m_type_radix_get(p2m, pte) != p2m_invalid;
    }

It is done to track which page was modified by a guest.
But then (again) the name doesn't convey what the function does.
Then probably p2me_type_is_valid(struct p2m_domain *p2m, pte_t pte) would better.
For P2M type checks please don't invent new naming, but use what both x86
and Arm are already using. Note how we already have p2m_is_valid() in that
set. Just that it's not doing what you want here.
Hm, why not doing what I want? p2m_is_valid() verifies if P2M entry is valid.
And in here it is checked if P2M pte is valid from P2M point of view by checking
the type in radix tree and/or in reserved PTEs bits (just to remind we have only 2
free bits for type).


  Plus
can't a guest also arrange for an entry's type to move to p2m_invalid?
That's then still an entry that was modified by the guest.
I am not really sure that I fully understand the question.
Do you ask if a guest can do something which will lead to a call of p2m_set_entry()
with p2m_invalid argument?
That I'm not asking, but rather stating. I.e. I expect such is possible.

If yes, then it seems like it will be done only in case of p2m_remove_mapping() what
will mean that alongside with p2m_invalid INVALID_MFN will be also passed, what means
this entry isn't expected to be used anymore.
Right. But such an entry would still have been "modified" by the guest.
Yes, but nothing then is needed to do with it. For example, if it is already invalid there
is not any sense to flush page to RAM (as in this case PTE's bit will be checked),
something like Arm does:
  https://elixir.bootlin.com/xen/v4.20.0/source/xen/arch/arm/p2m.c#L375


      
+        /*
+         * Don't take into account the MFN when removing mapping (i.e
+         * MFN_INVALID) to calculate the correct target order.
+         *
+         * XXX: Support superpage mappings if nr is not aligned to a
+         * superpage size.
+         */
Does this really need leaving as a to-do?
I think so, yes. It won’t break the current workflow if|nr| isn’t aligned,
a smaller order will simply be chosen.
Well, my question was more like "Isn't it simple enough to cover the case
right away?"

+        mask = !mfn_eq(smfn, INVALID_MFN) ? mfn_x(smfn) : 0;
+        mask |= gfn_x(sgfn) | nr;
+
+        for ( ; i != 0; i-- )
+        {
+            if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) )
+            {
+                    order = XEN_PT_LEVEL_ORDER(i);
+                    break;
Nit: Style.

+            }
+        }
+
+        rc = __p2m_set_entry(p2m, sgfn, order, smfn, t, a);
+        if ( rc )
+            break;
+
+        sgfn = gfn_add(sgfn, (1 << order));
+        if ( !mfn_eq(smfn, INVALID_MFN) )
+           smfn = mfn_add(smfn, (1 << order));
+
+        nr -= (1 << order);
Throughout maybe better be safe right away and use 1UL?

+    }
+
+    return rc;
   }
How's the caller going to know how much of the range was successfully
mapped?
There is no such option. Do other arches do that? I mean returns somehow
the number of successfully mapped (sgfn,smfn).
On x86 we had to introduce some not very nice code to cover for the absence
of proper handling there. For a new port I think it wants at least seriously
considering not to repeat such a potentially unhelpful pattern.

That part may need undoing (if not here, then in the caller),
or a caller may want to retry.
So the caller in the case if rc != 0, can just undoing the full range
(by using the same sgfn, nr, smfn).
Can it? How would it know what the original state was?
You're right — blindly unmapping the range assumes that no entries were valid
beforehand and I missed that it could be that something valid was mapped before
p2m_set_entry(sgfn,...,smfn) was called.
But then I am not really understand why it won't be an issue if will know
how many GFNs were successfully mapped.
The caller may know what that range's state was. But what I really wanted to
convey is: Updating multiple entries in one go is complicated in some of the
corner cases. You will want to think this through now, in order to avoid the
need to re-write everything later again.
I can add one more argument to return the number of successfully mapped GFNs.
Fortunately, that's very easy to do.
The problem for me is that I don’t really understand what the caller is supposed
to do with that information. The only use case I can think of is that the caller
might try to map the remaining GFNs again. But that doesn’t seem very useful,
if p2m_set_entry() wasn’t able to map the full range, it likely indicates a serious
issue, and retrying would probably result in the same error.
The same applies to rolling back the state. It wouldn’t be difficult to add a local
array to track all modified PTEs and then use it to revert the state if needed.
But again, what would the caller do after the rollback? At this point, it still seems
like the best option is simply to panic().

Basically, I don’t see or understand the cases where knowing how many GFNs were
successfully mapped, or whether a rollback was performed, would really help — because
in most cases, I don’t have a better option than just calling panic() at the end.
For example, if I call map_regions_p2mt() for an MMIO region described in a device
tree node, and the mapping fails partway through, I’m left with two options: either
ignore the device (if it's not essential for Xen or guest functionality) and continue
 booting; in which case I’d need to perform a rollback, and simply knowing the number
of successfully mapped GFNs may not be enough or, more likely, just panic.
Are there any realistic use cases where knowing the number of mapped GFNs or having
rollback support would actually allow us to avoid a panic?
Even more so, how would that information be used in the current call chain?
We have the following chain:
 map_regions_p2mt()p2m_insert()p2m_set_entry()

If p2m_set_entry() returns the number of successfully mapped GFNs, what should
p2m_insert() do with it — process it further, or just pass it along to
map_regions_p2mt()?
Thanks in advance for clarifications.

~ Oleksii

    

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.