[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] xen/mm: avoid watchdog timeout in dump_numa() on large domains


  • To: Bernhard Kaindl <bernhard.kaindl@xxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 2 Jun 2026 13:40:20 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=SlNgACLTZDlzRA7+/Fey+BnzeUCXoDk1XPlPyUiZvp0=; b=mg+CmoMLiDvMAERzy7Sxk+Z+jb/9q2IT3wZvJnge9OPI655A2xFuEGT9PGXv/K7sWvJkH75izatVOX3g/HTwFDMuLaRLfNIaDHQMokvDuGJ+zrvRzL2YgWP1FFLvOoDnykg8tBqvcMXqw007zhPsOuHnHS9QHC2SavO4bvprKaKp7aharoiF+74/IXY8xSyYkqM4/LouDMrd/ro6bLyWBdaLJxXVDmDg1m3N40stVyOI6gXFYDnAhgbGAE7abcA2K8bnaI5KAHMg6EY5AD9tBp9cAelaGPAIClKwim7gbnWMUrEYIFyg53xFTJGhCuvFFJdsydlJLNsufwf6nWe2pQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=U5Lg0ZmVHTRhOxgAf7kD2u6y10PXM1uDanwJlFQVkXIF+yU51w1atnELciXW7IyI9RW2gFz4m7h6APYyT79/TCuLw9z9ROwRxA+iUBMWMUD3xPFRxiU2IIY7zWsxuoe3MUvstl1/DIARY48YqeB8lEGSbRsyIPXhad835aYVcOY/v9ri57HlY7p1JhhvmWyA8lSJgUrW92FCTC5VdE+S2ZndNWd3oNN5fEbyRUh00mILJ5ToBlyyJyWs3E5RoX0sKLAmbK/CjJajUdAKdS6vWb6YDkLW4TmxkGaWeERbdWeiSEUWgzuieSU4OkdSb8pryszjUAEgLHK3RxX12hmUow==
  • Authentication-results: eu.smtp.expurgate.cloud; dkim=pass header.s=selector1 header.d=citrix.com header.i="@citrix.com" header.h="From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck"
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Teddy Astie <teddy.astie@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Tamas K Lengyel <tamas@xxxxxxxxxxxxx>, Alejandro Vallejo <alejandro.garciavallejo@xxxxxxx>, Marcus Granado <marcus.granado@xxxxxxxxxx>
  • Delivery-date: Tue, 02 Jun 2026 11:40:33 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hello,

This is missing a "From:" line that matches the first SoB in the
command message, if nit doesn't match the sender (which is the case
here).

On Tue, Jun 02, 2026 at 09:49:50AM +0100, Bernhard Kaindl wrote:
> Using the 'u' debug key invokes dump_numa(), which walks each domain's
> page list under page_alloc_lock to compute per-NUMA-node counts. On
> domains with many pages, this O(pages) operation can hold the lock long
> enough to trigger a watchdog timeout.
> 
> Replace the page-list walk with node_tot_pages[], a per-node counter
> maintained in struct domain. This reduces dump_numa()'s per-domain work
> from O(pages) to O(nodes).
> 
> Accounting via domain_adjust_tot_pages()
> 
> Most page-count updates flow through domain_adjust_tot_pages(). The
> helper takes the affected NUMA node and updates both tot_pages and
> node_tot_pages[node] in the same locked critical section.
> 
> A debug-only helper, assert_numa_page_count(), checks after each locked
> update that the node totals sum to tot_pages.
> 
> Accounting in memory_exchange()
> 
> memory_exchange() cannot use the same helper. It is preemptible, may
> fail and roll back part way through a chunk, and does not hold
> page_alloc_lock for the whole operation. The final success-path value of
> tot_pages is unchanged: by construction, the number of input pages stolen
> equals the number of output pages assigned. domain_commit_page_deltas()
> still applies the accumulated deltas to tot_pages, but their sum is zero.
> 
> Instead, it accumulates deferred per-node deltas in
> node_tot_pages_adjustments[] for the current chunk:
> 
>   When steal_page() is called with MEMF_no_refcount, tot_pages is
>   intentionally not decremented. A decrement of 1 is recorded for the
>   input page's node.
> 
>   When assign_page() is called with MEMF_no_refcount, tot_pages is
>   intentionally not incremented. An increment equal to the number of
>   pages in the output extent is recorded.
> 
> At the end of a successful chunk, these deltas are committed under
> page_alloc_lock. The net tot_pages delta is zero, while node_tot_pages[]
> is updated to reflect the NUMA-node movement.
> 
> Correctness on failure paths:
> 
>   Invariant 1: Pages allocated with MEMF_no_owner are not counted in
>   tot_pages until assign_page() succeeds. Freeing such pages with
>   free_domheap_pages() is therefore accounting-neutral.
> 
>   Invariant 2: Pages stolen with MEMF_no_refcount remain counted in
>   tot_pages until the deferred adjustment is committed. No window
>   exists in which a stolen page is absent from d->page_list and
>   already subtracted from tot_pages.
> 
> These invariants cover the failure cases:
> 
>   Input-side failure before output allocation: any input pages already
>   stolen are still on in_chunk_list. The fail path attempts to assign them
>   back to the domain. A successful reassign cancels the earlier negative
>   delta; if the reassign fails because the domain is dying, the negative
>   delta remains and is committed, reflecting that the page has been freed.
> 
>   OOM during output allocation (alloc_domheap_pages() returns NULL):
>   output pages already allocated before the failure are on out_chunk_list,
>   have no owner, and have never been counted in tot_pages. Freeing them
>   does not change accounting. Stolen input pages are handled as above:
>   reassignment cancels their negative deltas, while unreassigned pages are
>   deducted because the domain is dying.
> 
>   assign_page() failure because the domain is dying: output pages
>   assigned before the failure remain accounted to the domain and are
>   reclaimed later by domain_relinquish_resources(). The accumulated
>   deltas therefore contain positive entries for the successful output
>   assignments and negative entries for all stolen input pages.
>   Committing the net delta to both tot_pages and node_tot_pages[] leaves
>   the accounting consistent with the pages the domain still owns.
> 
>   Post-assignment failure in guest_physmap_add_page() or while copying the
>   new frame number back to the guest: at this point the output extent has
>   already been assigned to the domain and the input pages for the chunk
>   have already been freed. Input and output page counts are therefore
>   equal, so the committed update only redistributes per-node counts and
>   leaves tot_pages unchanged.

This commit message is way, way, too long and dense.  We need to
exercise some restrain and summarize the important bits.

> A functionally equivalent version of this change is included in the
> XenServer 9 pre-release and is widely exercised there. Debug Xen
> builds with the added consistency checks are also used in internal
> end-to-end regression testing (XenRT). For upstream submission, the
> change was additionally updated to support non-NUMA Xen builds and
> reviewed again for correctness, including the failure paths.

IMO the mentioning of XenServer 9 or XenRT should be a post-commit
message.

> Fixes: 4dff228603ba ("Walking the page lists needs the page_alloc lock")
> Signed-off-by: Alejandro Vallejo <alejandro.garciavallejo@xxxxxxx>

It's not acceptable to change the original SoB, Alejandro SoB on this
patch is:

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@xxxxxxxxx>

And it must be kept this way.

> Signed-off-by: Marcus Granado <marcus.granado@xxxxxxxxxx>
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Signed-off-by: Bernhard Kaindl <bernhard.kaindl@xxxxxxxxxx>
> ---
> Brief history of this code base: Work was started by Marcus Granado,
> adding per-node accounting to domain_adjust_tot_pages() using work
> by Alejandro Vallejo <alejandro.garciavallejo@xxxxxxx> as basis.
> 
> In a 2nd phase, Bernhard Kaindl and Roger Pau Monné implemented
> accumulating of per-node deltas in memory_exchange(), including
> handling across failure paths and consistency checks of the page
> counters after each update with CONFIG_DEBUG enabled. Bernhard
> updated dump_numa() to replace walking d->page_list under lock
> and Roger fixed handling of hypercall preemption. Andrew Cooper
> participated in this phase as well and Joash Robinson tested using
> end-to-end tests exercising debug-key 'u' and checking its output.
> 
> In a 3rd phase, Bernhard Kaindl added explanatory comments, prepared
> this commit message for review, fixed a failure path that can occur
> when memory_exchange() fails mid-exchange due to OOM, fixed the code
> for non-NUMA builds and consolidated loops that apply accumulated
> deltas into a single helper.

Why do you speak of yourself in 3rd person on the post-commit remark?

> ---
>  xen/arch/x86/mm.c             |  3 +-
>  xen/arch/x86/mm/mem_sharing.c |  4 +-
>  xen/common/domain.c           |  9 ++++
>  xen/common/grant_table.c      |  4 +-
>  xen/common/memory.c           | 79 +++++++++++++++++++++++++----------
>  xen/common/numa.c             | 14 +------
>  xen/common/page_alloc.c       | 40 ++++++++++++++++--
>  xen/include/xen/mm.h          | 12 +++++-
>  xen/include/xen/sched.h       | 24 +++++++++++
>  9 files changed, 143 insertions(+), 46 deletions(-)
> 
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index a158379e7734..a723a2c50a2f 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4445,7 +4445,8 @@ int steal_page(
>      page_list_del(page, &d->page_list);
>  
>      /* Unlink from original owner. */
> -    if ( !(memflags & MEMF_no_refcount) && !domain_adjust_tot_pages(d, -1) )
> +    if ( !(memflags & MEMF_no_refcount) &&
> +         !domain_adjust_tot_pages(d, page_to_nid(page), -1) )
>          drop_dom_ref = true;
>  
>      nrspin_unlock(&d->page_alloc_lock);
> diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
> index 5c7a0ff30e8b..89ae418be0ae 100644
> --- a/xen/arch/x86/mm/mem_sharing.c
> +++ b/xen/arch/x86/mm/mem_sharing.c
> @@ -723,7 +723,7 @@ static int page_make_sharable(struct domain *d,
>      if ( !validate_only )
>      {
>          page_set_owner(page, dom_cow);
> -        drop_dom_ref = !domain_adjust_tot_pages(d, -1);
> +        drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(page), -1);
>          page_list_del(page, &d->page_list);
>      }
>  
> @@ -769,7 +769,7 @@ static int page_make_private(struct domain *d, struct 
> page_info *page)
>      ASSERT(page_get_owner(page) == dom_cow);
>      page_set_owner(page, d);
>  
> -    if ( domain_adjust_tot_pages(d, 1) == 1 )
> +    if ( domain_adjust_tot_pages(d, page_to_nid(page), 1) == 1 )
>          get_knownalive_domain(d);
>      page_list_add_tail(page, &d->page_list);
>      nrspin_unlock(&d->page_alloc_lock);
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 8cb4241b0511..0b6afba2acdb 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1558,6 +1558,15 @@ void domain_destroy(struct domain *d)
>      /* Remove from the domlist/hash. */
>      domlist_remove(d);
>  
> +    /*
> +     * Final invariant check: all pages still owned by the dying domain must
> +     * be accounted for per-node before complete_domain_destroy() reclaims
> +     * them.  Debug-only; expands to a no-op in production builds.
> +     */

This should be a single line comment if anything:

/* Ensure per-node page counts match global one. */

> +    nrspin_lock(&d->page_alloc_lock);
> +    assert_numa_page_count(d);
> +    nrspin_unlock(&d->page_alloc_lock);
> +
>      /* Schedule RCU asynchronous completion of domain destroy. */
>      call_rcu(&d->rcu, complete_domain_destroy);
>  }
> diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
> index ac9fed600101..298662c3d69e 100644
> --- a/xen/common/grant_table.c
> +++ b/xen/common/grant_table.c
> @@ -2404,7 +2404,7 @@ gnttab_transfer(
>          }
>  
>          /* Okay, add the page to 'e'. */
> -        if ( unlikely(domain_adjust_tot_pages(e, 1) == 1) )
> +        if ( unlikely(domain_adjust_tot_pages(e, page_to_nid(page), 1) == 1) 
> )
>              get_knownalive_domain(e);
>  
>          /*
> @@ -2430,7 +2430,7 @@ gnttab_transfer(
>               * page in the page total
>               */
>              nrspin_lock(&e->page_alloc_lock);
> -            drop_dom_ref = !domain_adjust_tot_pages(e, -1);
> +            drop_dom_ref = !domain_adjust_tot_pages(e, page_to_nid(page), 
> -1);
>              nrspin_unlock(&e->page_alloc_lock);
>  
>              if ( okay /* i.e. e->is_dying due to the surrounding if() */ )
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 9e4899f9fc63..00eeaf33aefc 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -673,6 +673,14 @@ static long 
> memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>      struct domain *d;
>      struct page_info *page;
>  
> +    /*
> +     * Deferred accounting for the current exchange chunk.  Keep it at
> +     * function scope because the fail: path also needs to commit or cancel
> +     * partial work.  domain_commit_page_deltas() clears applied entries, so
> +     * each new chunk starts with a zeroed accumulator.

This is way to verbose: you don't need to mention it must be kept at
function scope, as it's obvious by it's usage.  Regarding the zeroing,
you might also zeor at the start of each chunk, and avoid doing it in
domain_commit_page_deltas() if you think that would be clearer (and
forego the need of a comment to clarify).

> +     */
> +    long node_tot_pages_adjustments[MAX_NUMNODES] = {};
> +
>      if ( copy_from_guest(&exch, arg, 1) )
>          return -EFAULT;
>  
> @@ -822,6 +830,12 @@ static long 
> memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>                  }
>  
>                  page_list_add(page, &in_chunk_list);
> +                /*
> +                 * steal_page() with MEMF_no_refcount removes ownership but
> +                 * leaves the domain page counts unchanged; record the
> +                 * deferred -1 for this page's node.
> +                 */
> +                node_tot_pages_adjustments[page_to_nid(page)]--;
>  #ifdef CONFIG_X86
>                  put_gfn(d, gmfn + k);
>  #endif
> @@ -870,32 +884,31 @@ static long 
> memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>              if ( assign_page(page, exch.out.extent_order, d,
>                               MEMF_no_refcount) )
>              {
> -                unsigned long dec_count;
> -                bool drop_dom_ref;
> -
>                  /*
> -                 * Pages in in_chunk_list is stolen without
> -                 * decreasing the tot_pages. If the domain is dying when
> -                 * assign pages, we need decrease the count. For those pages
> -                 * that has been assigned, it should be covered by
> -                 * domain_relinquish_resources().
> +                 * Input pages for this chunk were stolen and freed without
> +                 * decreasing tot_pages.  Output extents already assigned
> +                 * before this failure will be released later by
> +                 * domain_relinquish_resources().  Commit the relative deltas
> +                 * now so concurrent reservation changes made under
> +                 * page_alloc_lock are preserved.  The net delta is strictly
> +                 * negative on this branch, so a zero post-commit tot_pages
> +                 * means the last reference must be dropped.
>                   */
> -                dec_count = (((1UL << exch.in.extent_order) *
> -                              (1UL << in_chunk_order)) -
> -                             (j * (1UL << exch.out.extent_order)));
> -
> -                nrspin_lock(&d->page_alloc_lock);
> -                drop_dom_ref = (dec_count &&
> -                                !domain_adjust_tot_pages(d, -dec_count));
> -                nrspin_unlock(&d->page_alloc_lock);
> -
> -                if ( drop_dom_ref )
> +                if ( !domain_commit_page_deltas(d, 
> node_tot_pages_adjustments) )
>                      put_domain(d);
>  
>                  free_domheap_pages(page, exch.out.extent_order);
>                  goto dying;
>              }
>  
> +            /*
> +             * assign_page() with MEMF_no_refcount gives ownership without
> +             * updating the domain page counts; record the deferred extent
> +             * size for this output node.
> +             */
> +            node_tot_pages_adjustments[page_to_nid(page)] +=
> +                1UL << exch.out.extent_order;
> +
>              if ( __copy_from_guest_offset(&gpfn, exch.out.extent_start,
>                                            (i << out_chunk_order) + j, 1) )
>              {
> @@ -915,8 +928,19 @@ static long 
> memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>          }
>          BUG_ON( !(d->is_dying) && (j != (1UL << out_chunk_order)) );
>  
> +        /*
> +         * If publishing output GFNs failed after assignment, fail: still
> +         * needs to commit the zero-net per-node redistribution.
> +         */
>          if ( rc )
>              goto fail;
> +
> +        /*
> +         * Success: commit the per-node redistribution.  The helper also
> +         * applies the deltas to tot_pages; the final value is unchanged
> +         * because the exchange is size-neutral.
> +         */
> +        domain_commit_page_deltas(d, node_tot_pages_adjustments);
>      }
>  
>      exch.nr_exchanged = exch.in.nr_extents;
> @@ -925,22 +949,31 @@ static long 
> memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>      rcu_unlock_domain(d);
>      return rc;
>  
> -    /*
> -     * Failed a chunk! Free any partial chunk work. Tell caller how many
> -     * chunks succeeded.
> -     */
> +    /* Failed a chunk.  Free partial work and report completed chunks. */
>   fail:
>      /*
>       * Reassign any input pages we managed to steal.  NB that if the assign
>       * fails again, we're on the hook for freeing the page, since we've 
> already
> -     * cleared PGC_allocated.
> +     * cleared PGC_allocated.  A successful reassignment cancels the -1 that
> +     * was recorded when the page was stolen.
>       */
>      while ( (page = page_list_remove_head(&in_chunk_list)) )
> +    {
>          if ( assign_pages(page, 1, d, MEMF_no_refcount) )
>          {
>              BUG_ON(!d->is_dying);
>              free_domheap_page(page);
>          }
> +        else
> +            node_tot_pages_adjustments[page_to_nid(page)] += 1;
> +    }
> +
> +    /*
> +     * Commit remaining deltas.  If all stolen pages were reassigned, this is
> +     * a zero-net update.  Otherwise the domain is dying and unreassigned
> +     * pages are deducted.
> +     */
> +    domain_commit_page_deltas(d, node_tot_pages_adjustments);
>  
>   dying:
>      rcu_unlock_domain(d);
> diff --git a/xen/common/numa.c b/xen/common/numa.c
> index ad75955a1622..9f38145579e0 100644
> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -737,26 +737,14 @@ static void cf_check dump_numa(unsigned char key)
>      printk("Memory location of each domain:\n");
>      for_each_domain ( d )
>      {
> -        const struct page_info *page;
> -        unsigned int page_num_node[MAX_NUMNODES];
>          const struct vnuma_info *vnuma;
>  
>          process_pending_softirqs();
>  
>          printk("%pd (total: %u):\n", d, domain_tot_pages(d));
>  
> -        memset(page_num_node, 0, sizeof(page_num_node));
> -
> -        nrspin_lock(&d->page_alloc_lock);
> -        page_list_for_each ( page, &d->page_list )
> -        {
> -            i = page_to_nid(page);
> -            page_num_node[i]++;
> -        }
> -        nrspin_unlock(&d->page_alloc_lock);
> -
>          for_each_online_node ( i )
> -            printk("    Node %u: %u\n", i, page_num_node[i]);
> +            printk("    Node %u: %u\n", i, d->node_tot_pages[i]);

You are not holding any lock here (which I guess it's fine, because
this is just debug info), and hence you could call
process_pending_softirqs() while iteration over the nodes if you think
this might be an issue, ie:

nodeid_t nr = 0;
[...]
for_each_online_node ( i )
{
    printk("    Node %u: %u\n", i, d->node_tot_pages[i]);
    if ( !(++nr % 0x3f) )
        process_pending_softirqs();
}

>  
>          if ( !read_trylock(&d->vnuma_rwlock) )
>              continue;
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 1801afc96a0a..ab6f44a3f993 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -520,14 +520,45 @@ static unsigned long avail_heap_pages(
>      return free_pages;
>  }
>  
> -unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> +unsigned long domain_adjust_tot_pages(struct domain *d, nodeid_t node,
> +                                      long pages)
>  {
>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>      d->tot_pages += pages;
>  
> +#ifdef CONFIG_NUMA
> +    ASSERT(node != NUMA_NO_NODE);
> +    ASSERT(node_online(node));
> +    d->node_tot_pages[node] += pages;
> +    assert_numa_page_count(d);
> +#endif
> +
>      return d->tot_pages;
>  }
>  
> +unsigned long domain_commit_page_deltas(struct domain *d,
> +                                        long adjustments[MAX_NUMNODES])
> +{
> +    unsigned long tot_pages;
> +    nodeid_t node;
> +
> +    nrspin_lock(&d->page_alloc_lock);
> +    for_each_online_node ( node )
> +        if ( adjustments[node] )
> +        {
> +#ifdef CONFIG_NUMA
> +            d->node_tot_pages[node] += adjustments[node];
> +#endif
> +            d->tot_pages += adjustments[node];
> +            adjustments[node] = 0;
> +        }
> +    assert_numa_page_count(d);
> +    tot_pages = d->tot_pages;
> +    nrspin_unlock(&d->page_alloc_lock);
> +
> +    return tot_pages;
> +}
> +
>  #ifdef CONFIG_SYSCTL
>  void get_outstanding_claims(uint64_t *free_pages, uint64_t 
> *outstanding_pages)
>  {
> @@ -2924,7 +2955,7 @@ int assign_pages(
>              goto out;
>          }
>  
> -        if ( unlikely(domain_adjust_tot_pages(d, nr) == nr) )
> +        if ( unlikely(domain_adjust_tot_pages(d, page_to_nid(pg), nr) == nr) 
> )
>              get_knownalive_domain(d);
>      }
>  
> @@ -3066,7 +3097,8 @@ void free_domheap_pages(struct page_info *pg, unsigned 
> int order)
>                  }
>              }
>  
> -            drop_dom_ref = !domain_adjust_tot_pages(d, -(1 << order));
> +            drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(pg),
> +                                                    -(1 << order));
>  
>              rspin_unlock(&d->page_alloc_lock);
>  
> @@ -3274,7 +3306,7 @@ void free_domstatic_page(struct page_info *page)
>  
>      arch_free_heap_page(d, page);
>  
> -    drop_dom_ref = !domain_adjust_tot_pages(d, -1);
> +    drop_dom_ref = !domain_adjust_tot_pages(d, page_to_nid(page), -1);
>  
>      unprepare_staticmem_pages(page, 1, scrub_debug);
>  
> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> index b3a35c4bc8d6..0737df02eda2 100644
> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -68,6 +68,7 @@
>  #include <xen/types.h>
>  #include <xen/list.h>
>  #include <xen/spinlock.h>
> +#include <xen/numa.h>
>  #include <xen/perfc.h>
>  #include <public/memory.h>
>  
> @@ -131,7 +132,16 @@ mfn_t xen_map_to_mfn(unsigned long va);
>  int populate_pt_range(unsigned long virt, unsigned long nr_mfns);
>  /* Claim handling */
>  unsigned long __must_check domain_adjust_tot_pages(struct domain *d,
> -    long pages);
> +                                                   nodeid_t node, long 
> pages);
> +/*
> + * Commit accumulated per-node page deltas to d->tot_pages and (under NUMA)
> + * d->node_tot_pages[] under page_alloc_lock.  Applied entries are cleared
> + * so the same accumulator can be reused for the next batch.  Returns the
> + * post-commit value of d->tot_pages; callers that may be releasing the last
> + * reference on the domain should drop it when the return value is zero.
> + */
> +unsigned long domain_commit_page_deltas(struct domain *d,
> +                                        long adjustments[MAX_NUMNODES]);

Given the (current) usage of domain_commit_page_deltas() it might be
better to simply return bool to signal whether the domain reference
should be released?  Callers don't care about the exact amount of
pages.

>  int domain_set_claim_entries(struct domain *d, uint32_t nr_entries,
>                               const struct xen_memory_claim *claim_set);
>  int domain_get_claim_entries(struct domain *d, uint32_t *nr_entries,
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index f671e0c4c7b3..ae50f523e9f5 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -620,6 +620,11 @@ struct domain
>      unsigned int last_alloc_node;
>      spinlock_t node_affinity_lock;
>  
> +#ifdef CONFIG_NUMA
> +    /* Distribution of tot_pages across NUMA nodes. */

It would be helpful to note this array is protected by the
page_alloc_lock lock.

> +    unsigned int node_tot_pages[MAX_NUMNODES];
> +#endif
> +
>      /* vNUMA topology accesses are protected by rwlock. */
>      rwlock_t vnuma_rwlock;
>      struct vnuma_info *vnuma;
> @@ -698,6 +703,25 @@ static inline unsigned int domain_tot_pages(const struct 
> domain *d)
>      return d->tot_pages - d->extra_pages;
>  }
>  
> +/*
> + * Debug-only consistency check: the per-node page counts must sum to
> + * d->tot_pages.  Compiled out unless both NUMA and debug builds are
> + * configured; caller must hold page_alloc_lock.

This is way too verbose, and not very helpful.  It's obvious from the
guards below that the code is compiled out unless NUMA and DEBUG are
enabled.  I would just write:

/* Consistency check: ensure tot_pages == sum(node_tot_pages[]) */

> + */
> +static inline void assert_numa_page_count(const struct domain *d)
> +{
> +#if defined(CONFIG_NUMA) && defined(CONFIG_DEBUG)
> +    unsigned int i, node_total = 0;

i might be nodeid_t if we want to be type-accurate.

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.