
Re: [Xen-devel] [PATCH v2 2/2] sched_credit2.c: runqueue_per_core code



On 03/18/2015 08:53 AM, Dario Faggioli wrote:
>>> @@ -1986,9 +1982,13 @@ static void init_pcpu(const struct scheduler *ops, 
>>> int cpu)
>>>  static void *
>>>  csched2_alloc_pdata(const struct scheduler *ops, int cpu)
>>>  {
>>> -    /* Check to see if the cpu is online yet */
>>> -    /* Note: cpu 0 doesn't get a STARTING callback */
>>> -    if ( cpu == 0 || cpu_to_socket(cpu) >= 0 )
>>> +    /*
>>> +     * Actual initialization is deferred to when the pCPU will be
>>> +     * online, via a STARTING callback. The only exception is
>>> +     * the boot cpu, which does not get such a notification, and
>>> +     * hence needs to be taken care of here.
>>> +     */
>>> +    if ( system_state == SYS_STATE_boot )
>>>          init_pcpu(ops, cpu);

I think this will break adding a cpu to a cpupool after boot, won't it?

Don't we want effectively, "if ( is_boot_cpu() ||
is_cpu_topology_set_up() )"?

is_cpu_topology_set_up() could be "system_state >= SYS_STATE_smp_boot" I
suppose?

Is "cpu == 0" the best test for "is_boot_cpu()", or is there another test?

>>
>> Same here, plus the new condition at first glance doesn't match
>> the old one, but perhaps that's intentional.
>>
> It is intentional, and that is why we're changing it! :-) Let me try to
> explain:
> 
> The idea, in both old and new code, is to call init_pcpu() either:
>  a) on the boot cpu, during boot
>  b) on non-boot cpu, *only* if they are online.
> 
> The issue is that, for assessing b), it checks whether
> cpu_to_socket(cpu) returns >=0 and, if it does, it assumes the cpu is
> online. That is not true, as cpu_data.phys_proc_id (which is what
> cpu_to_socket() goes and reads) is 0 even before any onlining and
> topology identification has happened (it's a global).
> 
> Therefore, all the pcpus are initialized right away, and all are
> assigned to runqueue 0, as returned by cpu_to_socket() at this stage,
> which is clearly wrong.
> 
> In fact, this is the reason why previous versions of this patch took
> the approach of initializing things such that cpu_to_socket() returned
> -1 before topology identification.
> 
> In the new version, as you suggested, I'm using system_state to figure
> out whether we are dealing with the boot cpu, and that's it. :-)
> 
> Hope this clarifies things... If yes, I'll make sure to put a similar
> explanation in the changelog, when sending the patch officially.

So let's step back for a minute, and let me see if I can describe the
situation.

Unlike the other schedulers, credit2 needs to know the topology of a
cpu when setting it up, so that we can place it in the proper runqueue.

Setting up the per-cpu structures would normally happen in alloc_pdata.
This is called in three different ways:

* During boot, on the boot processor, alloc_pdata is called from
scheduler_init().

* During boot, on the secondary processors, alloc_pdata is called from
a CPU_UP_PREPARE callback, which happens just before the cpu is actually
brought up.

* When switching a cpu to a new cpupool after boot, alloc_pdata is also
called from cpu_schedule_switch().

The "normal" place the cpu topology information can be found is in the
global cpu_data[] array, typically accessed via the cpu_to_socket()
or cpu_to_core() macros.

This topology information is written to cpu_data[] by
smp_store_cpu_info().  For the boot cpu, this happens in
smp_prepare_cpus(); for secondary cpus, it happens in
start_secondary(), which is the code run on the cpu itself as it's being
brought up.

Unfortunately, at the moment, both of these places come after the
respective places where alloc_pdata is called for those cpus.
Flattening the entire x86 boot sequence, at the moment you'd see:
  alloc_pdata for boot cpu
  smp_store_cpu_info for boot cpu
  for each secondary cpu
    alloc_pdata for secondary cpu
    smp_store_cpu_info for secondary cpu

scheduler_init() is called before smp_prepare_cpus() in part because of
a dependency chain elsewhere: we cannot set up the idle domain until
scheduler_init() is called; and other places further on in the
initialization but before setting up cpu topology information assume
that the idle domain has been set up.

I haven't actually tracked down this dependency yet; nor have I explored
the possibility of populating cpu_data[] for the boot cpu earlier than
it's done now.

I'm not sure exactly why we call alloc_pdata on CPU_UP_PREPARE rather
than CPU_STARTING -- whether that's just what seemed most sensible when
cpupools were created, or whether there's a dependency somewhere between
those two points.

I *think* that there probably cannot be a dependency: no cpu boot code
can (or should) depend on the cpu having been initialized, because it
doesn't happen when you're booting credit2; and no scheduler code in
alloc_pdata can (or should) depend on being in a pre-initialized state,
because it also happens when assigning an already-initialized cpu to a
cpupool.

At the moment we deal with the fact that the topology information for a
CPU is not available on CPU_UP_PREPARE by setting a callback that will
call init_pcpu() on the CPU_STARTING action.

The boot cpu will not get such a callback; but information about the
topology of the boot cpu *is* available in boot_cpu_data[].

We can use system_state to detect whether we've been called from
scheduler_init (system_state < SYS_STATE_smp_boot), or whether we've
been called after the whole thing has been done (system_state >=
SYS_STATE_active).  But while system_state == SYS_STATE_smp_boot (which
is where both of the normal schedule callback and the csched2-specific
callback happen at the moment), we can't determine whether a given cpu's
cpu_data[] has been initialized yet.

So one thing we could do is this:

* In csched2_alloc_pdata(), call init_pcpu() if system_state <
SYS_STATE_smp_boot (to catch scheduler_init) or system_state >=
SYS_STATE_active (to catch cpupool callbacks).

* Keep the credit2-specific callback on CPU_STARTING.  In
csched2_cpu_starting(), call init_pcpu().  (Not sure if this handles cpu
offlining / onlining properly or not.)

* In init_pcpu(), if system_state < SYS_STATE_smp_boot, assume that
we've been called from scheduler_init and use boot_cpu_data[].
Otherwise, assume that the calls from the CPU_UP_PREPARE callback have
been filtered out, and that we're being called either from CPU_STARTING
or from cpu_schedule_switch(), and use cpu_data[].

Another thing we could explore is this:

* Change scheduler.c to call alloc_pdata on CPU_STARTING rather than
CPU_UP_PREPARE, and get rid of the csched2-specific callback.

* In csched2_alloc_pdata, always call init_pcpu().  (Or perhaps just
rename init_pcpu() to csched2_alloc_pdata().)

* In init_pcpu, if system_state < SYS_STATE_smp_boot, assume that we've
been called from scheduler_init and use boot_cpu_data[]; otherwise,
assume that we're being called either from CPU_STARTING, or from
cpu_schedule_switch(), and use cpu_data[].

If this would work, I think it would be a lot better.  It would remove
the credit2-specific callback, and it would mean we wouldn't need an
additional test to filter out the CPU_UP_PREPARE calls.

In both cases there's a slight risk in using system_state to determine
whether to rely on cpu_data[] or not, because there's actually a window
for each processor after system_state == SYS_STATE_smp_boot where
cpu_data[] is *not* initialized, but it's not obvious from looking at
the data itself.  If the callback mechanisms ever change order with the
cpu_data[] initializations in the future, we risk a situation where
credit2 silently regresses to using a single massive runqueue.

So I think that, at least for debug builds, we should provide some way
to determine whether a given cpu_data[] entry has been initialized, so
that we can at least ASSERT() that it has.  Setting the values to BAD_APICID
before system_state == SYS_STATE_smp_boot would be one way to do that;
having an "initialized" flag in the structure (which defaults to zero)
would be another.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
