[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] mm/pdx: Add comments throughout the codebase for pdx


  • To: Alejandro Vallejo <alejandro.vallejo@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 6 Jul 2023 11:50:58 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=qU2zuEB2aWoDPFyG6rz9hcymQi25LlvFvg+n83waVrc=; b=a1+YVTZDnD+wJL/Lk1DbvtWHBPut15blUEG5eR2xCINEkj0LCPAE+ntC3dqcSd8Ew89FzhYfeJB4rOa7ChZPijt1V3IrVDZ5UV0RPVcLrgcJo+iQVEPSKTtQtb6dVkMbrHIYYZnEUNCXhFLMOfQV0oI8nv+d9FpDIKoj5vVgORl6IWw7M/T4Y3uBUv0DSTGq0BiW8/m7WFe4zq9yqgWNthgF31nqlgrq/lbtnUSc3XAduFLyyeqEYxIQ/36oPeis71A1n6LTvZFPrgktv3AqWYbth3Bw3OQ6t5AHA9f8t2QB3ZPepPWquzmVg2NfeUFZruoqRz0rJtCm+uSItGk5SA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=TWl01o09j+mH5zmoiSRsAMLaipay7z6zSX5+/rVPUTLWRwsZVr3KFB9dyNnyLs6GWwvmFBWyLcr1j21dO3eapGMwNwc7G+SWZGg96o8Q1d2/GORxoxCr/ocJOpscwJ8kC1qgXrS+l7H9BTW0cBfsk4Gw7f/K7iNcxYl45Ajz9p8HZa6LMRQxpA0ID/QbIixVXeHESdt18BuMoguWwod1m6iCiwTpIxLoSkzt2nXmYHHpq7AxYRyK5Ih/x/SPS+MQTyqaT2m9C53OOui8KapXNhwboEPSuBd35EYzwGp5wM26/cpKj9kBke+UCvo3vp8HwpDmuZwMpGVk0e0uUG4aMQ==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 06 Jul 2023 09:51:28 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 22.06.2023 16:02, Alejandro Vallejo wrote:
> @@ -57,9 +100,25 @@ uint64_t __init pdx_init_mask(uint64_t base_addr)
>                           (uint64_t)1 << (MAX_ORDER + PAGE_SHIFT)) - 1);
>  }
>  
> -u64 __init pdx_region_mask(u64 base, u64 len)
> +uint64_t __init pdx_region_mask(uint64_t base, uint64_t len)
>  {
> -    return fill_mask(base ^ (base + len - 1));
> +    uint64_t last = base + len - 1;
> +    /*
> +     * The only bit that matters in base^last is the MSB. There are 2 cases.
> +     *
> +     * case msb(base) < msb(last):
> +     *     then msb(fill_mask(base^last)) == msb(last). This is non
> +     *     compressible.
> +     * case msb(base) == msb(last):
> +     *     This means that there _may_ be a sequence of compressible zeroes
> +     *     for all addresses between `base` and `last` iff `base` has enough
> +     *     trailing zeroes. That is, it's compressible when

Why trailing zeros? [100000f000,10ffffffff] has compressible bits
32-35, but the low bits of base don't matter at all.

> +     *     msb(fill_mask(base^last)) < msb(last)

No caller uses the result this way, so I'm unconvinced it is helpful
to explain it here this way. This is also why I'm still not convinced
of the introduction of "last" (as a real variable and in the comment).
It's only the invariant bits in the range that we're after, as you
say ...

> +     * The resulting mask is effectively the moving bits between `base` and
> +     * `last`.

... here (where things could be expressed without "last").

In any event, nit: If you introduce a local variable, then it wants
following by a blank line.

> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -31,6 +31,16 @@
>   *   (i.e. all devices assigned to) a guest share a single DMA address space
>   *   and, by default, Xen will ensure dfn == pfn.
>   *
> + * pdx: Page InDeX
> + *   Indices into the frame table holding the per-page's book-keeping
> + *   metadata. A compression scheme is used and there's a possibly non

s/is/may be/ ?

Also as said earlier at least on x86 pdx-es are also used as direct map
indices. I think this wants mentioning irrespective of what Arm does.

> --- a/xen/include/xen/pdx.h
> +++ b/xen/include/xen/pdx.h
> @@ -1,6 +1,67 @@
>  #ifndef __XEN_PDX_H__
>  #define __XEN_PDX_H__
>  
> +/*
> + * PDX (Page inDeX)
> + *
> + * This file deals with optimisations pertaining frame table indexing,

Nit: Missing "to"?

> + * A pdx is an index into the frame table. However, having an identity
> + * relationship between mfn and pdx could waste copious amounts of memory
> + * in empty frame table entries. There are some techniques to bring memory
> + * wastage down.

Like above the direct map wants mentioning here as well, I think.

> + * ## PDX grouping
> + *
> + * The frame table may have some sparsity even on systems where the memory
> + * banks are tightly packed. This is due to system quirks (like the PCI
> + * hole) which might introduce several GiB of unused page frame numbers
> + * that uselessly waste memory in the frame table. PDX grouping addresses
> + * this by keeping a bitmap of the ranges in the frame table containing
> + * invalid entries and not allocating backing memory for them.
> + *
> + * ## PDX compression
> + *
> + * This is a technique to avoid wasting memory on machines known to have
> + * split their machine address space in several big discontinuous and highly
> + * disjoint chunks.
> + *
> + * In its uncompressed form the frame table must have book-keeping metadata
> + * structures for every page between [0, max_mfn) (whether they are backed
> + * by RAM or not), and a similar condition exists for the direct map. We
> + * know some systems, however, that have some sparsity in their address
> + * space, leading to a lot of wastage in the form of unused frame table
> + * entries.
> + *
> + * This is where compression becomes useful. The idea is to note that if
> + * you have several big chunks of memory sufficiently far apart you can
> + * ignore the middle part of the address because it will always contain
> + * zeroes as long as the base address is sufficiently well aligned and the
> + * length of the region is much smaller than the base address.

As per above alignment of the base address doesn't really matter.

> + * i.e:
> + *   Consider 2 regions of memory. One starts at 0 while the other starts
> + *   at offset 2^off_h. Furthermore, let's assume both regions are smaller
> + *   than 2^off_l. This means that all addresses between [2^off_l, 2^off_h)
> + *   are invalid and we can assume them to be zero on all valid addresses.
> + *
> + *                 off_h     off_l
> + *                 |         |
> + *                 V         V
> + *         --------------------------
> + *         |HHHHHHH|000000000|LLLLLL| <--- mfn
> + *         --------------------------
> + *           ^ |
> + *           | | (de)compression by adding/removing "useless" zeroes
> + *           | V
> + *         ---------------
> + *         |HHHHHHHLLLLLL| <--- pdx
> + *         ---------------
> + *
> + * This scheme also holds for multiple regions, where HHHHHHH acts as
> + * the region identifier and LLLLLL fully contains the span of every
> + * region involved. Consider the following 3 regions.
> + */
> +
>  #ifdef CONFIG_HAS_PDX

Stray last sentence in the comment?

> @@ -13,22 +74,81 @@ extern unsigned long pfn_top_mask, ma_top_mask;
>                           (sizeof(*frame_table) & -sizeof(*frame_table)))
>  extern unsigned long pdx_group_valid[];
>  
> -extern uint64_t pdx_init_mask(u64 base_addr);
> -extern u64 pdx_region_mask(u64 base, u64 len);
> +/**
> + * Calculates a mask covering "moving" bits of all addresses of a region
> + *
> + * The i-th bit of the mask must be set if there's 2 different addresses
> + * in the region that have different j-th bits. where j >= i.
> + *
> + * e.g:
> + *       base=0x1B00000000
> + *   len+base=0x1B00082000
> + *
> + *   ought to return 0x00000FFFFF, which implies that every bit position
> + *   with a zero in the mask remains unchanged in every address of the
> + *   region.

Maybe the example would look "more generic" if the resulting mask didn't
consist of just 0s and Fs, e.g.

 *       base=0x1B00000000
 *   len+base=0x1B00042000
 *
 *   ought to return 0x000007FFFF, ...

> + * @param base Base address of the region
> + * @param len  Size in octets of the region
> + * @return Mask of moving bits at the bottom of all the region addresses
> + */
> +uint64_t pdx_region_mask(uint64_t base, uint64_t len);
>  
> -extern void set_pdx_range(unsigned long smfn, unsigned long emfn);
> +/**
> + * Creates a basic pdx mask

What is "basic" trying to describe here? "The mask to start from" would
look more to the point to me, plus saying that this is the (up front)
companion to pdx_region_mask().

> + * This mask is used to determine non-compressible bits. No address in the
> + * system shall have compressible bits covered by this initial mask.
> + *
> + * It's the larger of:
> + *   - A mask of MAX_ORDER + PAGESHIFT 1s
> + *   - base_addr rounded up to the nearest `2^n - 1`

I'm having trouble with this mathematically: Rounding always means
going to some proper multiple of some base number (granularity). This
doesn't fit with the value being 2^n-1. Maybe "padded"?

> + * @param base_addr Address of the first maddr in the system
> + * @return An integer of the form 2^n - 1
> + */
> +uint64_t pdx_init_mask(uint64_t base_addr);
> +
> +/**
> + * Mark [smfn, emfn) as allocatable in the frame table

What does "allocatable" mean here?

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.