Re: [Xen-devel] Notes on stubdoms and latency on ARM
Hi George,

On 17/07/17 12:28, George Dunlap wrote:
> On 07/17/2017 11:04 AM, Julien Grall wrote:
>> Hi,
>>
>> On 17/07/17 10:25, George Dunlap wrote:
>>> On 07/12/2017 07:14 AM, Dario Faggioli wrote:
>>>> On Fri, 2017-07-07 at 14:12 -0700, Stefano Stabellini wrote:
>>>>> On Fri, 7 Jul 2017, Volodymyr Babchuk wrote:
>>>>>>> Since you are using Credit, can you try to disable context
>>>>>>> switch rate limiting?
>>>>>> Yep. You are right. In the environment described above (Case 2) I
>>>>>> now get much better results:
>>>>>>
>>>>>>     real    1.85
>>>>>>     user    0.00
>>>>>>     sys     1.85
>>>>>
>>>>> From 113 to 1.85 -- WOW! Obviously I am no scheduler expert, but
>>>>> shouldn't we do a better job of advertising a scheduler
>>>>> configuration option that makes things _one hundred times
>>>>> faster_?!
>>>>
>>>> So, to be fair, so far we've been bitten hard by this only in
>>>> artificially constructed test cases, where either some extreme
>>>> assumptions were made (e.g., that all the vCPUs except one always
>>>> run at 100% load) or pinning was used in a weird and suboptimal
>>>> way. And there are workloads where it has been verified that it
>>>> helps make performance better (poor SpecVIRT results without it
>>>> were the main motivation for having it upstream, and on by
>>>> default).
>>>>
>>>> That being said, I personally have never liked rate-limiting; it
>>>> always looked to me like the wrong solution.
>>>
>>> In fact, I *think* the only reason it may have been introduced is
>>> that there was a bug in the credit2 code at the time such that it
>>> always had a single runqueue no matter what your actual pcpu
>>> topology was.
>>
>> FWIW, we don't yet parse the pCPU topology on ARM. AFAIU, we always
>> tell Xen each CPU is in its own core. Will it have any implications
>> for the scheduler?
>
> Just checking -- you do mean its own core, as opposed to its own
> socket? (Or NUMA node?)

I don't know much about the scheduler, so I might say something stupid
here :). Below is the code we have for ARM:

/* XXX these seem awfully x86ish... */
/* representing HT siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_mask);
/* representing HT and core siblings of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_mask);

static void setup_cpu_sibling_map(int cpu)
{
    if ( !zalloc_cpumask_var(&per_cpu(cpu_sibling_mask, cpu)) ||
         !zalloc_cpumask_var(&per_cpu(cpu_core_mask, cpu)) )
        panic("No memory for CPU sibling/core maps");

    /* A CPU is a sibling with itself and is always on its own core. */
    cpumask_set_cpu(cpu, per_cpu(cpu_sibling_mask, cpu));
    cpumask_set_cpu(cpu, per_cpu(cpu_core_mask, cpu));
}

#define cpu_to_socket(_cpu) (0)

After calling setup_cpu_sibling_map(), we never touch cpu_sibling_mask
and cpu_core_mask for a given pCPU. So I would say that each logical
CPU is in its own core, but they are all in the same socket at the
moment.

> On any system without hyperthreading (or with HT disabled), that's
> what an x86 system will see as well.
>
> Most schedulers have one runqueue per logical cpu. Credit2 has the
> option of having one runqueue per logical cpu, one per core (i.e.,
> hyperthreads share a runqueue), one per socket (i.e., all cores on
> the same socket share a runqueue), or one runqueue across the whole
> system. I *think* we made one runqueue per core the default a while
> back to deal with multithreading, but I may not be remembering
> correctly. In any case, if you don't have threads, then reporting
> each logical cpu as its own core is the right thing to do.

The architecture doesn't disallow HT on ARM, though I am not aware of
any cores using it today.
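Both knobs discussed above can be tried without rebuilding Xen. As a
minimal sketch -- assuming the documented sched_ratelimit_us and
credit2_runqueue parameters and the xl toolstack; the values shown are
illustrative, not recommendations:

    # Disable Credit's context-switch rate limiting at runtime
    # (0 turns it off; the default is 1000us):
    xl sched-credit -s -r 0

    # Or disable it at boot for any scheduler that honours it, via
    # the Xen command line:
    sched_ratelimit_us=0

    # Choose Credit2's runqueue granularity on the Xen command line
    # (core, socket, node, or all for one system-wide runqueue):
    credit2_runqueue=socket

Setting the rate limit to 0 is what produced the 113s -> 1.85s
improvement reported at the top of this thread.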
> If you're mis-reporting sockets, then the scheduler will be unable to
> take that into account. But that's not usually going to be a major
> issue, mainly because the scheduler is not actually in a position to
> determine, most of the time, which is the optimal configuration. If
> two vcpus are communicating a lot, then the optimal configuration is
> to put them on different cores of the same socket (so they can share
> an L3 cache); if two vcpus are computing independently, then the
> optimal configuration is to put them on different sockets, so they
> can each have their own L3 cache. Xen isn't in a position to know
> which one is more important, so it just assumes each vcpu is
> independent.
>
> All that to say: it shouldn't be a major issue if you are
> mis-reporting sockets. :-)

Good to know, thank you for the explanation! We might want to parse the
bindings correctly to get a bit of improvement. I will add a task on
jira.

Cheers,

--
Julien Grall
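To make the mask discussion concrete, here is a standalone sketch (not
Xen code -- the 4-CPU, two-socket topology and all names are invented)
of how per-CPU core/socket IDs, as a device-tree parse might provide
them, would translate into the sibling/core masks that
setup_cpu_sibling_map() currently leaves as singletons:

#include <stdio.h>

#define NR_CPUS 4

/* Invented topology: no hyperthreading, CPUs 0-1 on socket 0,
 * CPUs 2-3 on socket 1. */
static const int core_id[NR_CPUS]   = { 0, 1, 2, 3 };
static const int socket_id[NR_CPUS] = { 0, 0, 1, 1 };

int main(void)
{
    unsigned int sibling_mask[NR_CPUS] = { 0 }; /* same core (HT) */
    unsigned int core_mask[NR_CPUS]    = { 0 }; /* same socket    */

    for ( int i = 0; i < NR_CPUS; i++ )
        for ( int j = 0; j < NR_CPUS; j++ )
            if ( socket_id[i] == socket_id[j] )
            {
                core_mask[i] |= 1u << j;
                if ( core_id[i] == core_id[j] )
                    sibling_mask[i] |= 1u << j;
            }

    /* Without threads each sibling mask stays a singleton, but the
     * core masks now group CPUs by socket, which is what a correct
     * cpu_to_socket() would let the scheduler see. */
    for ( int i = 0; i < NR_CPUS; i++ )
        printf("CPU%d: sibling_mask=0x%x core_mask=0x%x socket=%d\n",
               i, sibling_mask[i], core_mask[i], socket_id[i]);

    return 0;
}

With that information, Credit2's per-socket runqueues would group CPUs
0-1 and CPUs 2-3 separately, instead of treating all four CPUs as one
socket.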
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel