Xen project Mailing List

Re: [PATCH v5 6/7] xen/riscv: page table handling

Date: Thu, 29 Aug 2024 14:04:04 +0200

Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Thu, 29 Aug 2024 12:04:13 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Thu, 2024-08-29 at 09:01 +0200, Jan Beulich wrote: > On 28.08.2024 18:11, oleksii.kurochko@xxxxxxxxx wrote: > > On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote: > > > On 21.08.2024 18:06, Oleksii Kurochko wrote: > > > > Implement map_pages_to_xen() which requires several > > > > functions to manage page tables and entries: > > > > - pt_update() > > > > - pt_mapping_level() > > > > - pt_update_entry() > > > > - pt_next_level() > > > > - pt_check_entry() > > > > > > > > To support these operations, add functions for creating, > > > > mapping, and unmapping Xen tables: > > > > - create_table() > > > > - map_table() > > > > - unmap_table() > > > > > > > > Introduce internal macros starting with PTE_* for convenience. > > > > These macros closely resemble PTE bits, with the exception of > > > > PTE_SMALL, which indicates that 4KB is needed. > > > > > > What macros are you talking about here? Is this partially stale, > > > as > > > only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being > > > added? > > I am speaking about macros connected to masks: > > #define PTE_R_MASK(x) ((x) & PTE_READABLE) > > #define PTE_W_MASK(x) ((x) & PTE_WRITABLE) > > #define PTE_X_MASK(x) ((x) & PTE_EXECUTABLE) > > > > #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | > > PTE_EXECUTABLE)) > > Some of which is did question further down in my reply. But what's > worse - by saying "closely resemble PTE bits, with the exception of > PTE_SMALL" you pretty clearly _do not_ refer to the macros above, but > to PTE_VALID etc. Agree, it should be corrected. > > > > > @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p) > > > > return p.pte & PTE_VALID; > > > > } > > > > > > > > +inline bool pte_is_table(const pte_t p) > > > > +{ > > > > + return ((p.pte & (PTE_VALID | > > > > + PTE_READABLE | > > > > + PTE_WRITABLE | > > > > + PTE_EXECUTABLE)) == PTE_VALID); > > > > +} > > > > > > In how far is the READABLE check valid here? You (imo correctly) > > > ... > > Oh, I wrongly picked on READABLE when it should have been the > WRITABLE > bit. > > > > > +static inline bool pte_is_mapping(const pte_t p) > > > > +{ > > > > + return (p.pte & PTE_VALID) && > > > > + (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE)); > > > > +} > > > > > > ... don't consider this bit here. > > pte_is_mapping() seems to me is correct as according to the RISC-V > > privileged spec: > > 4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to > > step > > 5. Otherwise, this PTE is a pointer to the next level of the > > page > > table. > > 5. A leaf PTE has been found. ... > > Right. And then why do you check all three of r, x, and w, when the > doc > mentions only r and x? There may be reasons, but such reasons then > need > clearly stating in a code comment, for people to understand why the > code > is not following the spec. So I remembered why R, W, and X are checked. There is contradictory information about these bits (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1317C64-L1321C10 ): ``` The permission bits, R, W, and X, indicate whether the page is readable, writable, and executable, respectively. When all three are zero, the PTE is a pointer to the next level of the page table; otherwise, it is a leaf PTE. ``` However, it is also written here (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1539 ) that only pte.r and pte.x should be checked. I can assume that the interpretation that R=W=X=0 indicates a pointer to the next level of the page table could come from this statement (https://github.com/riscv/riscv-isa-manual/blob/main/src/supervisor.adoc?plain=1#L1538 ): ``` If _pte_._v_ = 0, or if _pte_._r_ = 0 and _pte_._w_ = 1, or if any bits or encodings that are reserved for future standard use are set within _pte_, stop and raise a page-fault exception corresponding to the original access type. ``` >From this, I can assume that when pte.r = 0, pte.w should also always be zero; otherwise, a page-fault exception will be raised. ( but it is no obviously connected to if the PTE is a pointer to the next page table or not... ). > > > and regarding pte_is_table() READABLE check is valid as we have to > > check only that pte.r = pte.x = 0. WRITABLE check should be > > dropped. Or > > just use define pte_is_table() as: > > inline bool pte_is_table(const pte_t p) > > { > > return !pte_is_mapping(p); > > } > > You had it like this earlier on, didn't you? That's wrong, because > for a > PTE to describe another page table level PTE_VALID needs to be set. Agree, it's wrong, missed that. > > > > +#define XEN_TABLE_MAP_FAILED 0 > > > > +#define XEN_TABLE_SUPER_PAGE 1 > > > > +#define XEN_TABLE_NORMAL 2 > > > > + > > > > +/* > > > > + * Take the currently mapped table, find the corresponding > > > > entry, > > > > + * and map the next table, if available. > > > > + * > > > > + * The alloc_tbl parameters indicates whether intermediate > > > > tables > > > > should > > > > + * be allocated when not present. > > > > + * > > > > + * Return values: > > > > + * XEN_TABLE_MAP_FAILED: Either alloc_only was set and the > > > > entry > > > > + * was empty, or allocating a new page failed. > > > > + * XEN_TABLE_NORMAL: next level or leaf mapped normally > > > > + * XEN_TABLE_SUPER_PAGE: The next entry points to a > > > > superpage. > > > > + */ > > > > +static int pt_next_level(bool alloc_tbl, pte_t **table, > > > > unsigned > > > > int offset) > > > > > > Having the boolean first is unusual, but well - it's your choice. > > > > > > > +{ > > > > + pte_t *entry; > > > > + int ret; > > > > + mfn_t mfn; > > > > + > > > > + entry = *table + offset; > > > > + > > > > + if ( !pte_is_valid(*entry) ) > > > > + { > > > > + if ( alloc_tbl ) > > > > + return XEN_TABLE_MAP_FAILED; > > > > > > Is this condition meant to be inverted? > > if alloc_tbl = true we shouldn't allocatetable as: > > * The intermediate page table shouldn't be allocated when MFN > > isn't > > * valid and we are not populating page table. > > ... > > */ > > Well, no. The variable name really shouldn't be the opposite of what > is > meant. "alloc_tbl" can only possibly mean "allocate a table if none > is > there". I can't think of a sensible interpretation in the inverted > sense. > I'm curious how you mean to interpret that variable name. My interpretation was that alloc_tbl = true means that algorithm is trying to allocate the table what is forbidden at the moment but I agree that your interpretation sounds more understandable based on the variable name. > > > bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags & > > PTE_POPULATE); > > > > So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this > > table shouldn't be allocated and thereby pt_next_level() should > > return > > XEN_TABLE_MAP_FAILED. > > > > Or to invert if ( alloc_tbl )it will be needed to invert defintion > > of > > alloc_tbl: > > bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags & > > PTE_POPULATE); > > Yes, as I did comment further down. > > > > > + if ( level != target ) > > > > + { > > > > + printk("%s: Shattering superpage is not supported\n", > > > > __func__); > > > > + rc = -EOPNOTSUPP; > > > > + goto out; > > > > + } > > > > + > > > > + entry = table + offsets[level]; > > > > + > > > > + rc = -EINVAL; > > > > + if ( !pt_check_entry(*entry, mfn, flags) ) > > > > + goto out; > > > > + > > > > + /* We are removing the page */ > > > > + if ( !(flags & PTE_VALID) ) > > > > + memset(&pte, 0x00, sizeof(pte)); > > > > + else > > > > + { > > > > + /* We are inserting a mapping => Create new pte. */ > > > > + if ( !mfn_eq(mfn, INVALID_MFN) ) > > > > + pte = pte_from_mfn(mfn, PTE_VALID); > > > > + else /* We are updating the permission => Copy the > > > > current > > > > pte. */ > > > > + pte = *entry; > > > > + > > > > + /* update permission according to the flags */ > > > > + pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | > > > > PTE_DIRTY; > > > > > > When updating an entry, don't you also need to clear (some of) > > > the > > > flags? > > I am not sure why some flags should be cleared. Here we are taking > > only > > necessary for pte flags such as R, W, X or other bits in flags are > > ignored. > > Consider what happens to a PTE with R and X set when a request comes > in > to change to R/W. You'll end up with R, X, and W all set if you don't > first clear the bits that are meant to be changeable in a "modify" > operation. That's definitely going to be a problem. I'll update the code then. > > > > > +/* Return the level where mapping should be done */ > > > > +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, > > > > unsigned > > > > long nr, > > > > + unsigned int flags) > > > > +{ > > > > + unsigned int level = 0; > > > > + unsigned long mask; > > > > + unsigned int i; > > > > + > > > > + /* Use blocking mapping unless the caller requests 4K > > > > mapping > > > > */ > > > > + if ( unlikely(flags & PTE_SMALL) ) > > > > + return level; > > > > + > > > > + /* > > > > + * Don't take into account the MFN when removing mapping > > > > (i.e > > > > + * MFN_INVALID) to calculate the correct target order. > > > > + * > > > > + * `vfn` and `mfn` must be both superpage aligned. > > > > + * They are or-ed together and then checked against the > > > > size > > > > of > > > > + * each level. > > > > + * > > > > + * `left` is not included and checked separately to allow > > > > + * superpage mapping even if it is not properly aligned > > > > (the > > > > + * user may have asked to map 2MB + 4k). > > > > > > What is this about? There's nothing named "left" here. > > It refer to "remaining" pages or "leftover" space after trying to > > align > > a mapping to a superpage boundary. > > What what is the quoted "left" here? Such a variable appears to exist > in > the caller, but using the name here is lacking context. Then I will update the comment and tell from where 'left' is coming. > > > > > > + * If `mfn` is valid and flags has PTE_VALID bit set then it > > > > means > > > > that > > > > + * inserting will be done. > > > > + */ > > > > > > What about mfn != INVALID_MFN and PTE_VALID clear? > > PTE_VALID=0 will be always considered as destroying and no matter > > what > > is mfn value as in this case the removing is done in the way where > > mfn > > isn't used: > > Right, yet elsewhere you're restrictive as to MFN values valid to > use. > Not requiring INVALID_MFN here looks inconsistent to me. but actually if we will leave ASSERT in pt_check_entry() we will be sure that we are here with mfn = INVALID_MFN: /* Sanity check when removing a mapping. */ else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 ) { /* We should be here with an invalid MFN. */ ASSERT(mfn_eq(mfn, INVALID_MFN)); > > > memset(&pte, 0x00, sizeof(pte)); > > Just to mention it: I don't think memset() is a very good way of > clearing > a PTE, even if right here it's not a live one. Just direct assigning would be better? > > > > Also note that "`mfn` is > > > valid" isn't the same as "mfn != INVALID_MFN". You want to be > > > precise > > > here, > > > to avoid confusion later on. (I say that knowing that we're still > > > fighting > > > especially shadow paging code on x86 not having those properly > > > separated.) > > If it is needed to be precise and mfn is valid isn't the same as > > "mfn > > != INVALID_MFN" only for the case of shadow paging? > > No, I used shadow paging only as an example of where we have similar > issues. I'd like to avoid that a new port starts out with introducing > more instances of that. You want to properly separate INVALID_MFN > from > "invalid MFN", where the latter means any MFN where either nothing > exists at all, or (see mfn_valid()) where no struct page_info exists. Well, now I think I understand the difference between "INVALID_MFN" and "invalid MFN." Referring back to your original reply, I need to update the comment above pt_update(): ``` ... * If `mfn` is valid ( exist ) and flags has PTE_VALID bit set then it means that inserting will be done. ``` Would this be correct and more precise? Based on the code for mfn_valid(), the separation is currently done using the max_page value, which can't be initialized at the moment as it requires reading the device tree file to obtain the RAM end. We could use a placeholder for the RAM end (for example, a very high value like -1UL) and then add __mfn_valid() within pt_update(). However, I'm not sure if this approach aligns with what you consider by proper separation between INVALID_MFN and "invalid MFN." ~ Oleksii

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.