Re: [Xen-devel] [PATCH] Xen: Spread boot time page scrubbing across all available CPU's (v6)
On 13/06/14 19:05, Konrad Rzeszutek Wilk wrote:
> From: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>
>
> The page scrubbing is done in 128MB chunks in lockstep across all the
> non-SMT CPUs. This allows the boot CPU to hold the heap_lock whilst each
> chunk is being scrubbed and then release the heap_lock when the CPUs are
> finished scrubbing their individual chunks. This allows the heap_lock to
> not be held continuously and for pending softirqs to be serviced
> periodically across the CPUs.
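For anyone following along, the lockstep described above amounts to a loop of
roughly this shape (an illustrative sketch only, not the patch's code;
'bootscrub_count' is a name invented here and the worker wakeup is simplified):

    for ( offset = 0; offset < max_per_cpu_sz; offset += chunk_size )
    {
        spin_lock(&heap_lock);

        /* Tell each worker CPU that its next chunk is ready. */
        atomic_set(&bootscrub_count, cpumask_weight(&all_worker_cpus));

        /* ... each worker scrubs [start + offset, start + offset +
         * chunk_size) of its share, then decrements bootscrub_count ... */

        /* Wait for all workers to finish the current chunk. */
        while ( atomic_read(&bootscrub_count) > 0 )
            cpu_relax();

        spin_unlock(&heap_lock);

        /* Lock dropped between chunks so pending softirqs get serviced. */
        process_pending_softirqs();
    }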
>
> The page scrub memory chunks are allocated to the CPUs in a NUMA-aware
> fashion to reduce socket interconnect overhead and improve performance.
> Specifically in the first phase we scrub at the same time on all the
> NUMA nodes that have CPUs - we also weed out the SMT threads so that
> we only use cores (that gives a 50% boost). The second phase is for NUMA
> nodes that have no CPUs - for that we use the closest NUMA node's CPUs
> (non-SMT again) to do the job.
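The SMT weeding just keeps one online CPU per core on each node. A minimal
sketch of such a helper (hypothetical name; the patch's find_non_smt() below
is the authoritative version, and this assumes Xen's node_to_cpumask() and
per_cpu(cpu_sibling_mask, cpu)):

    /* Build in 'dest' one online CPU per core on 'node'; return the count. */
    static unsigned int __init pick_one_cpu_per_core(unsigned int node,
                                                     cpumask_t *dest)
    {
        cpumask_t node_cpus;
        unsigned int i, cpus = 0;

        cpumask_and(&node_cpus, &node_to_cpumask(node), &cpu_online_map);
        cpumask_clear(dest);

        for_each_cpu ( i, &node_cpus )
        {
            /* Skip 'i' if one of its SMT siblings was already chosen. */
            if ( cpumask_intersects(dest, per_cpu(cpu_sibling_mask, i)) )
                continue;
            cpumask_set_cpu(i, dest);
            cpus++;
        }

        return cpus;
    }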
>
> This patch reduces the boot page scrub time on a 128GB, 64-core AMD Opteron
> 6386 machine from 49 seconds to 3 seconds.
> On an IvyBridge-EX 8-socket box with 1.5TB it cuts it down from 15 minutes
> to 63 seconds.
>
> Signed-off-by: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> Reviewed-by: Tim Deegan <tim@xxxxxxx>
Functionally, Reviewed-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
A few minor nits...
>
> ---
>
> v2:
> - Reduced default chunk size to 128MB
> - Added code to scrub NUMA nodes with no active CPU linked to them
> - Be robust to boot CPU not being linked to a NUMA node
>
> v3:
> - Don't use SMT threads
> - Take care of remainder if the number of CPUs (or memory) is odd
> - Restructure the worker thread
> - s/u64/unsigned long/
>
> v4:
> - Don't use all CPUs on non-CPU NUMA nodes, just closest one
> - Syntax, and docs updates
> - Compile on ARM
> - Fix bug when NUMA node has 0 pages
>
> v5:
> - Properly figure out best NUMA node.
> - Fix comments to be proper style.
>
> v6:
> - Missing page shift on default values, optimize cpumask usage.
> - Fix case of NODE having one page and below first_valid_mfn
> - Add Ack
> ---
>  docs/misc/xen-command-line.markdown |  10 ++
>  xen/common/page_alloc.c             | 211 ++++++++++++++++++++++++++++++++---
>  xen/include/asm-arm/numa.h          |   1 +
>  3 files changed, 204 insertions(+), 18 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 25829fe..509f462 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -198,6 +198,16 @@ Scrub free RAM during boot. This is a safety feature to prevent
>  accidentally leaking sensitive VM data into other VMs if Xen crashes
>  and reboots.
>
> +### bootscrub_chunk
The `_` needs escaping with a backslash.
> +> `= <size>`
> +
> +> Default: `134217728`
Due to the implicit 'k' suffix on sizes, this is not a useful hint as to
how to change the default.
Furthermore, `128M` is substantially clearer.
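i.e. something like:

    ### bootscrub\_chunk
    > `= <size>`

    > Default: `128M`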
> +
> +Maximum RAM block size to be scrubbed in one chunk whilst holding the page
> +heap lock and not running softirqs. Reduce this if softirqs are not being
> +run frequently enough. Setting this to a high value may cause boot failure,
> +particularly if the NMI watchdog is also enabled.
> +
> ### cachesize
> > `= <size>`
>
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> ...
> - * Scrub all unallocated pages in all heap zones. This function is more
> - * convoluted than appears necessary because we do not want to continuously
> - * hold the lock while scrubbing very large memory areas.
> + * Scrub all unallocated pages in all heap zones. This function uses all
> + * online CPUs to scrub the memory in parallel.
> */
> void __init scrub_heap_pages(void)
> {
> - unsigned long mfn;
> - struct page_info *pg;
> + cpumask_t node_cpus, all_worker_cpus;
> + unsigned int i, j;
> + unsigned long offset, max_per_cpu_sz = 0;
> + unsigned long start, end;
> + unsigned long rem = 0;
> + int last_distance, best_node;
> + int cpus;
>
> if ( !opt_bootscrub )
> return;
>
> - printk("Scrubbing Free RAM: ");
> + cpumask_clear(&all_worker_cpus);
> + /* Scrub block size. */
> + chunk_size = opt_bootscrub_chunk >> PAGE_SHIFT;
> + if ( chunk_size == 0 )
> + chunk_size = MB(128) >> PAGE_SHIFT;
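(For reference: with the default 128MB chunk and 4k pages, that works out to
chunk_size = 134217728 >> 12 = 32768 pages per chunk.)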
> +
> + /* Round #0 - figure out amounts and which CPUs to use. */
> + for_each_online_node ( i )
> + {
> + if ( !node_spanned_pages(i) )
> + continue;
> + /* Calculate Node memory start and end address. */
> + start = max(node_start_pfn(i), first_valid_mfn);
> + end = min(node_start_pfn(i) + node_spanned_pages(i), max_page);
> + /* Just in case NODE has 1 page and starts below first_valid_mfn. */
> + end = max(end, start);
> + /* CPUs that are online and on this node (if none, that is OK). */
> + cpus = find_non_smt(i, &node_cpus);
> + cpumask_or(&all_worker_cpus, &all_worker_cpus, &node_cpus);
> + if ( cpus <= 0 )
> + {
> + /* No CPUs on this node. Round #2 will take care of it. */
> + rem = 0;
> + region[i].per_cpu_sz = (end - start);
> + }
> + else
> + {
> + rem = (end - start) % cpumask_weight(&node_cpus);
> + region[i].per_cpu_sz = (end - start) / cpumask_weight(&node_cpus);
The 'cpus' variable still holds cpumask_weight(&node_cpus), and is
rather more efficient to use.
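i.e.:

    rem = (end - start) % cpus;
    region[i].per_cpu_sz = (end - start) / cpus;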
> + if ( region[i].per_cpu_sz > max_per_cpu_sz )
> + max_per_cpu_sz = region[i].per_cpu_sz;
> + }
> + region[i].start = start;
> + region[i].rem = rem;
> + cpumask_copy(&region[i].cpus, &node_cpus);
> + }
> +
> + printk("Scrubbing Free RAM on %u nodes using %u CPUs\n",
> +        num_online_nodes(), cpumask_weight(&all_worker_cpus));
Both of these are signed quantities, rather than unsigned.
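E.g. (taking both as int):

    printk("Scrubbing Free RAM on %d nodes using %d CPUs\n",
           num_online_nodes(), cpumask_weight(&all_worker_cpus));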
~Andrew