[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] mm/pdx: Add comments throughout the codebase for pdx


  • To: Alejandro Vallejo <alejandro.vallejo@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 19 Jun 2023 17:30:20 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=lvHfrrnoyVSjGKh7R1x9wFrGs2e6/bnWsglZs5o6XjU=; b=DW03wr7tbgMHk9/K4MszwytFu5Xk06z/7tasmjRIUUjweH42bnyrhUy4eqno9kbt3vZkF8vul/rK8sMfdOLPbqfbe78tkNfZNxFOF7t8noUzpume6ICgfd/tQuClJWf9mpqMXVK1WgDhX+cPauFBLFlWYIAs/BEkK9rr9t/tc0F2fHamUcEDxDhBwyy/GgIN/xiUWb1oFckn058IKDktkxKlOuYn7PGuQIrR8nKBHJrGBLbADmWLWf9e/lsILUQY7+iXSyaKTQjCgJpNE2o0FxQmSeFfq5/5/4A0I+tkckSSsG5an5DEif0quDSfPB8ZQY3MJ2NfYusTMzkACU+8yg==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=GDY0+stxNt+Hj9ieu9w4IS4dafkFuFMbrkoqqRBn+DU6i3cgzjKvvuX6lYfC4kwwVlUHiZ+FQ3QbmwM6X61nsex62HtyQdKprxyLZpUbAuEl49/hTekDeJ1zMrSM4UYvHl4E+z/bKzzwyleQL7Fhwtso/wgiDcG3U4iq9cPxgoDW1VxtJ4eOyKnyIiP3gnNUGkUtX9Qy7VM5e50I6nq9r+QMFCJNYKtW6IAYKcLm9p/GYmmjI1Yfhl+eojHVK0rLyWrhNUJgxGbjn4UsRVDwGYPRkUDbIPZ8nOroAbhUtqzFdaacgZapdq1Jn9aTVyCnyzB4fXc44sSh80pLsRXLhA==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Mon, 19 Jun 2023 15:30:43 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15.06.2023 18:27, Alejandro Vallejo wrote:
> --- a/xen/common/pdx.c
> +++ b/xen/common/pdx.c
> @@ -20,13 +20,55 @@
>  #include <xen/bitops.h>
>  #include <xen/nospec.h>
>  
> -/* Parameters for PFN/MADDR compression. */
> +/*
> + * Diagram to make sense of the following variables. The masks and shifts
> + * are done on mfn values in order to convert to/from pdx:
> + *
> + *                      pfn_hole_mask
> + *                      pfn_pdx_hole_shift (mask bitsize)
> + *                      |
> + *                 |---------|
> + *                 |         |
> + *                 V         V
> + *         --------------------------
> + *         |HHHHHHH|000000000|LLLLLL| <--- mfn
> + *         --------------------------
> + *         ^       ^         ^      ^
> + *         |       |         |------|
> + *         |       |             |
> + *         |       |             pfn_pdx_bottom_mask
> + *         |       |
> + *         |-------|
> + *             |
> + *             pfn_top_mask
> + *
> + * ma_{top,bottom}_mask is simply a shifted pfn_{top,pdx_bottom}_mask where 
> the
> + * bottom one shifts in 1s rather than 0s.
> + */

Nit: That 2nd bottom variable is ma_va_bottom_mask.

> @@ -57,9 +99,25 @@ uint64_t __init pdx_init_mask(uint64_t base_addr)
>                           (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
>  }
>  
> -u64 __init pdx_region_mask(u64 base, u64 len)
> +uint64_t __init pdx_region_mask(uint64_t base, uint64_t len)
>  {
> -    return fill_mask(base ^ (base + len - 1));
> +    uint64_t last = base + len - 1;
> +    /*
> +     * The only bit that matters in base^last is the MSB. There are 2 cases.
> +     *
> +     * case msb(base) < msb(last):
> +     *     then fill_mask(base^last) == fill_mask(last). This is non
> +     *     compressible.
> +     * case msb(base) == msb(last):
> +     *     This means that there _may_ be a sequence of compressible zeroes
> +     *     for all addresses between `base` and `last` iff `base` has enough
> +     *     trailing zeroes. That is, it's compressible when
> +     *     fill_mask(base^last) < fill_mask(last)
> +     *
> +     * The resulting mask is effectively the moving bits between `base` and
> +     * `last`
> +     */
> +    return fill_mask(base ^ last);
>  }

I don't see a need for you to actually change the code here. You can
as well introduce "last" as shorthand just for the comment. What I
dislike in your way of putting it is the use of fill_mask(last) when
such a call never occurs in code. Starting from the first sentence,
can't you explain things just in terms of said MSB (where the two
cases are "set" and "clear")?

> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -31,6 +31,22 @@
>   *   (i.e. all devices assigned to) a guest share a single DMA address space
>   *   and, by default, Xen will ensure dfn == pfn.
>   *
> + * pdx: Page InDeX
> + *   Indices into the frame table holding the per-page's book-keeping
> + *   metadata. A compression scheme is used and there's a non-identity
> + *   mapping between valid(mfn) <-> valid(pdx) See the comments in pdx.c
> + *   for an in-depth explanation of that mapping.

The mapping may very well be (and on x86 typically is) an identity
one. IOW you want to describe not only the compression case, but also
the "no compression possible" one.

PDXes also aren't just indexes to the frame table, but also to the
direct mapping.

> + * maddr: Machine Address
> + *   The physical address that corresponds to an mfn
> + *
> + * vaddr: Xen Virtual Address
> + *   A virtual address of memory accesible to Xen. It is typically either
> + *   an address into the direct map or to Xen's own code/data. The direct
> + *   map implements several compression tricks to save memory, so an offset
> + *   into it does _not_ necessarily correspond to an maddr due to pdx
> + *   compression.

We need to be careful here: If I'm not mistaken at least Arm uses vaddr
also for guest addresses. In fact I'm not sure vaddr (and perhaps even
maddr) need explaining here, the more that nothing in this header uses
either term.

> --- a/xen/include/xen/pdx.h
> +++ b/xen/include/xen/pdx.h
> @@ -1,6 +1,62 @@
>  #ifndef __XEN_PDX_H__
>  #define __XEN_PDX_H__
>  
> +/*
> + * PDX (Page inDeX)
> + *
> + * This file deals with optimisations pertaining frame table indexing,
> + * A pdx is an index into the frame table. However, having an identity
> + * relationship between mfn and pdx could waste copious amounts of memory
> + * in empty frame table entries. There are some techniques to bring memory
> + * wastage down.
> + *
> + * ## PDX grouping
> + *
> + * The frame table may have some sparsity even on systems where the memory
> + * banks are tightly packed. This is due to system quirks (like the PCI
> + * hole) which might introduce several GiB of unused page frame numbers
> + * that uselessly waste memory in the frame table. PDX grouping addresses
> + * this by keeping a bitmap of the ranges in the frame table containing
> + * invalid entries and not allocating backing memory for them.
> + *
> + * ## PDX compression
> + *
> + * This is a technique to avoid wasting memory on machines known to have
> + * split their machine address space in two big discontinuous and highly
> + * disjoint chunks.

Why two? There can be any number, and in fact on the system I originally
had data from for reference (when first writing this code) iirc there
were 8 nodes, each with a chunk of memory far away from the other chunks.
The compression scheme used merely requires that some "inner" bits are
unused (always zero) in all of those ranges.

> + * In its uncompressed form the frame table must have book-keeping metadata
> + * structures for every page between [0, max_mfn) (whether they exist or

s/they exist/there is RAM/ ?

> + * not), and a similar condition exists for the direct map. We know some
> + * architectures, however, that have some sparsity in their address space,
> + * leading to a lot of wastage in the form of unused frame table entries.

Hmm, "architectures" suggests e.g. Arm might have such, but x86 won't.
Perhaps "systems", "designs", or "system designs"?

> @@ -13,22 +69,77 @@ extern unsigned long pfn_top_mask, ma_top_mask;
>                           (sizeof(*frame_table) & -sizeof(*frame_table)))
>  extern unsigned long pdx_group_valid[];
>  
> -extern uint64_t pdx_init_mask(u64 base_addr);
> -extern u64 pdx_region_mask(u64 base, u64 len);
> +/**
> + * Calculates a mask covering "moving" bits of all addresses of a region
> + *
> + * e.g:
> + *       base=0x1B00000000
> + *   len+base=0x1B0008200;
> + *
> + *   ought to return 0x00000FFFFF;
> + *
> + * @param base Base address of the region
> + * @param len  Size in octets of the region
> + * @return Mask of moving bits at the bottom of all the region addresses
> + */

This looks to be a copy-and-paste of pdx_region_mask()'s comment, when
the function has neither a "base" parameter, nor a and one at all.

> +uint64_t pdx_init_mask(u64 base_addr);

No u64 -> uint64_t here?

> -extern void set_pdx_range(unsigned long smfn, unsigned long emfn);
> +/**
> + * Calculates a mask covering "moving" bits of all addresses of a region
> + *
> + * e.g:
> + *       base=0x1B00000000
> + *   len+base=0x1B0008200;
> + *
> + *   ought to return 0x00000FFFFF;

I think it would help if you actually said how the return value actually
derives. The term "moving" may be understood differently be different
people, and hence such an explanation actually would also clarify what
"moving" means.

I also thing there's a 0 missing in the len+base value, without which
the result would be quite a bit different.

> + * @param base Base address of the region
> + * @param len  Size in octets of the region
> + * @return Mask of moving bits at the bottom of all the region addresses
> + */
> +uint64_t pdx_region_mask(uint64_t base, uint64_t len);
> +
> +/**
> + * Mark range between smfn and emfn is allocatable in the frame table
> + *
> + * @param smfn Start mfn
> + * @param emfn End mfn
> + */
> +void set_pdx_range(unsigned long smfn, unsigned long emfn);

This could do with mathematically expressing the range in the comment,
such that (in|ex)clusiveness of, in particular, emfn is clarified.

>  #define page_to_pdx(pg)  ((pg) - frame_table)
>  #define pdx_to_page(pdx) gcc11_wrap(frame_table + (pdx))
>  
> +/**
> + * Invoked to determine if an mfn maps to a legal pdx

I wouldn't use "pdx" here, but refer to frame_table[] instead.

> + * In order for it to be legal it must pass bounds, grouping and
> + * compression sanity checks.
> + *
> + * @param smfn Start mfn
> + * @param emfn End mfn
> + * @return True iff all checks pass
> + */
>  bool __mfn_valid(unsigned long mfn);

Comment again mentions inapplicable parameters.

> @@ -38,7 +149,16 @@ static inline unsigned long pdx_to_pfn(unsigned long pdx)
>  #define mfn_to_pdx(mfn) pfn_to_pdx(mfn_x(mfn))
>  #define pdx_to_mfn(pdx) _mfn(pdx_to_pfn(pdx))
>  
> -extern void pfn_pdx_hole_setup(unsigned long);
> +/**
> + * Initializes global variables with information about the compressible
> + * range of the current memory regions.
> + *
> + * @param mask This mask is the biggest pdx_mask of every region in the
> + *             system ORed with all base addresses of every region in the
> + *             system. The result is a mask where every sequence of zeroes
> + *             surrounded by ones is compressible.
> + */
> +void pfn_pdx_hole_setup(unsigned long mask);

With the function returning void, I find "The result" problematic. How about
"This results in ..."?

Btw, "surrounded by ones" isn't really necessary. We could compress shorter
sequences of zeros, so this may want re-wording a little to be as precise
as possible.

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.