[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 4/4] xen: use idle vcpus to scrub pages



>>> On 17.06.14 at 13:49, <lliubbo@xxxxxxxxx> wrote:
> In case of heavy lock contention, use a percpu list.
>  - Delist a batch of pages to a percpu list from "scrub" free page list.
>  - Scrub pages on this percpu list.
>  - Add those clean pages to normal "head" free page list, merge with other

Did you mean "heap" here?

> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -587,12 +587,14 @@ int domain_kill(struct domain *d)
>          d->tmem_client = NULL;
>          /* fallthrough */
>      case DOMDYING_dying:
> +        enable_idle_scrub = 0;
>          rc = domain_relinquish_resources(d);
>          if ( rc != 0 )
>          {
>              BUG_ON(rc != -EAGAIN);
>              break;
>          }
> +        enable_idle_scrub = 1;
>          for_each_vcpu ( d, v )
>              unmap_vcpu_info(v);
>          d->is_dying = DOMDYING_dead;

???

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -79,6 +79,9 @@ PAGE_LIST_HEAD(page_offlined_list);
>  /* Broken page list, protected by heap_lock. */
>  PAGE_LIST_HEAD(page_broken_list);
>  
> +volatile bool_t enable_idle_scrub;
> +DEFINE_PER_CPU(struct page_list_head, scrub_list_cpu);

I'm pretty sure I pointed out to you before that variables used only in
a single file should be static.

> @@ -1387,7 +1390,86 @@ void __init scrub_heap_pages(void)
>      setup_low_mem_virq();
>  }
>  
> +#define SCRUB_BATCH 1024
> +void scrub_free_pages(void)
> +{
> +    struct page_info *pg;
> +    unsigned int i, j, node_empty = 0, nr_delisted = 0;
> +    int order;
> +    unsigned int cpu = smp_processor_id();
> +    unsigned int node = cpu_to_node(cpu);
> +    struct page_list_head *temp_list = &this_cpu(scrub_list_cpu);
> +
> +    if ( !enable_idle_scrub )
> +        return;
> +
> +    do
> +    {
> +        if ( page_list_empty(temp_list) )
> +        {
> +            /* Delist a batch of pages from global scrub list */
> +            spin_lock(&heap_lock);
> +            for ( j = 0; j < NR_ZONES; j++ )
> +            {
> +                for ( order = MAX_ORDER; order >= 0; order-- )
> +                {
> +                    if ( (pg = page_list_remove_head(&scrub(node, j, 
> order))) )
> +                    {
> +                        for ( i = 0; i < (1 << order); i++)
> +                            mark_page_offline(&pg[i], 0);

A page previously having caused #MC now suddenly became healthy
again?

> +
> +                        page_list_add_tail(pg, temp_list);
> +                        nr_delisted += (1 << order);
> +                        if ( nr_delisted > SCRUB_BATCH )
> +                        {
> +                            nr_delisted = 0;
> +                            spin_unlock(&heap_lock);
> +                            goto start_scrub;
> +                        }
> +                    }
> +                }
> +            }
> +
> +            node_empty = 1;
> +            spin_unlock(&heap_lock);
> +        }
> +        else
> +        {
> +start_scrub:

Labels need to be indented by at least one space.

> +            /* Scrub percpu list */
> +            while ( !page_list_empty(temp_list) )
> +            {
> +                pg = page_list_remove_head(temp_list);
> +                ASSERT(pg);
> +                order = PFN_ORDER(pg);
> +                for ( i = 0; i < (1 << order); i++ )
> +                {

Order here can again be almost arbitrarily high. You aren't allowed
to scrub e.g. 4Tb in one go.

> +                    ASSERT( test_bit(_PGC_need_scrub, &(pg[i].count_info)) );
> +                    scrub_one_page(&pg[i]);
> +                    pg[i].count_info &= ~(PGC_need_scrub);
> +                }
>  
> +                /* Add pages to free heap list */
> +                spin_lock(&heap_lock);

Between the dropping of the lock above and the re-acquiring of it
here the pages "stolen" can lead to allocation failures despite there
being available memory. This needs to be avoided if at all possible.

> +                for ( i = 0; i < (1 << order); i++ )
> +                {
> +                    ASSERT ( !test_bit(_PGC_need_scrub, &(pg[i].count_info)) 
> );
> +                    pg[i].count_info |= PGC_state_free;
> +                }
> +                ASSERT (node == phys_to_nid(page_to_maddr(pg)));
> +                merge_free_trunks(pg, order, 0);
> +                spin_unlock(&heap_lock);
> +
> +                if ( softirq_pending(cpu) )
> +                    return;
> +            }
> +        }
> +
> +        /* Scrub list of this node is empty */
> +        if ( node_empty )
> +            return;
> +    } while ( !softirq_pending(cpu) );
> +}

And all the logic needs to be made NUMA-aware, i.e. a CPU should
prefer to scrub local memory, and only resort to scrubbing distant
memory if no CPU on the correct node is idle. Plus (as we have seen
with Konrad's boot time scrubbing changes) you should avoid having
two hyperthreads within the same core doing scrubbing at the same
time.

In the end I'm getting the impression that this all wasn't properly
thought through yet. Convoys on the lock inside the idle processing
should e.g. also be avoided (or reduced to a minimum) - there's no
point preventing the entering of C states if all you're otherwise going
to do is spin on a lock.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.