
Re: [Xen-devel] [BUG] Linux process vruntime accounting in Xen

On Mon, May 16, 2016 at 5:37 AM, Dario Faggioli
<dario.faggioli@xxxxxxxxxx> wrote:
> [Adding George again, and a few Linux/Xen folks]
> On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
>> In virtualized environments, we sometimes need to limit the CPU
>> resources available to a virtual machine (VM). For example, in Xen we use
>> $ xl sched-credit -d 1 -c 50
>> to cap the CPU resource of dom 1 to half of one physical CPU core.
>> When a VM's CPU resource is capped, processes inside the VM can
>> suffer from a vruntime accounting problem. Here, I report my
>> findings about the Linux process scheduler under this scenario.
> Thanks for this other report as well. :-)
> All you say makes sense to me, and I will think about it. I'm not sure
> about one thing, though...

Hi Dario,

Thank you for your reply.

>> ------------Description------------
>> Linux CFS relies on delta_exec to charge vruntime to processes.
>> delta_exec is the time elapsed between when a process starts and
>> stops running on a CPU. This works well on a physical machine.
>> However, in a virtual machine with capped resources, some processes
>> can be charged a badly inaccurate vruntime.
>> For example, suppose we have a VM with one vCPU that is capped at
>> 50% of a physical CPU. When process A inside the VM is running and
>> the VM's CPU credit runs out, the VM is paused. In the next round,
>> when the VM receives new CPU credit and runs again, process A stops
>> running and is put back on the runqueue. The delta_exec of process A
>> is then accounted as its real execution time plus the time its VM
>> was paused. That makes the vruntime of process A much larger than it
>> should be, and process A will not be scheduled again for a long
>> time, until the vruntimes of the other processes catch up.
>> ---------------------------------------
>> ------------Analysis----------------
>> When a process stops running and is about to be put back on the
>> runqueue, update_curr() will be executed:
>> [src/kernel/sched/fair.c]
>> static void update_curr(struct cfs_rq *cfs_rq)
>> {
>>     ...
>>     delta_exec = now - curr->exec_start;
>>     ...
>>     curr->exec_start = now;
>>     ...
>>     curr->sum_exec_runtime += delta_exec;
>>     schedstat_add(cfs_rq, exec_clock, delta_exec);
>>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>>     update_min_vruntime(cfs_rq);
>>     ...
>> }
>> "now" --> the right now time
>> "exec_start" --> the time when the current process is put on the CPU
>> "delta_exec" --> the time difference of a process between it starts
>> and stops running on the CPU
>> When a process starts running before its VM is paused and the process
>> stops running after its VM is unpaused, the delta_exec will include
>> the VM suspend time which is pretty large compared to the real
>> execution time of a process.
> ... but would that also apply to a VM that is not scheduled --just
> because of pCPU contention, not because it was paused-- for a while?

Thanks for your suggestion. I have tested whether this issue also
exists with plain pCPU sharing. Unfortunately, it does: the problem
shows up not only in the capping case but also in the pCPU-sharing
case.

In both cases, process vruntime accounting in the guest OS suffers a
"vruntime jump", which can leave the victim process with poor and
unpredictable performance.

In the cloud, from my point of view, a VM runs in one of three scenarios:
1. dedicated hardware (in this case, VM = physical machine);
2. part of dedicated hardware (using capping, like an Amazon EC2 t2.small instance);
3. sharing hardware with other VMs.

Both case #2 and case #3 are affected by the issue I described.

> Isn't there anything in place in Xen or Linux (the latter being better
> suitable for something like this, IMHO) to compensate for that?

No, I do not think so. I believe this is a bug in the Linux kernel
under virtualization (with Xen as the VMM platform).

> I have to admit I haven't really ever checked myself, maybe either
> George or our Linux people do know more?

The underlying issue is that the process execution time calculation
(i.e., delta_exec) in a virtualized environment should not be done the
same way as in a physical environment.

Here are two possible ways to fix it:

1) Based on the vcpu->runstate.time (running/runnable/blocked/offline)
changes, determine how much time the process on this vCPU actually
ran, instead of just "delta_exec = now - exec_start";

2) Build another clock inside the guest OS that records the exact time
the vCPU runs. All vruntime calculation would then be based on this
clock instead of the hypervisor clock/time (real clock).


>> This issue can severely harm the performance of the victim process.
>> If the process runs an I/O-bound workload, its throughput and
>> latency suffer. If the process runs a CPU-bound workload, the issue
>> makes its vruntime "unfair" compared to other processes under CFS.
>> Because the CPU resources of some VM types in the cloud are limited
>> as described above (like the Amazon EC2 t2.small instance), I
>> suspect this also harms the performance of public cloud instances.
>> ---------------------------------------
>> My test environment is as follows: hypervisor (Xen 4.5.0), Dom 0
>> (Linux 3.18.21), Dom U (Linux 3.18.21). I have also tested the
>> long-term version Linux 3.18.30 and the latest long-term version,
>> Linux 4.4.7. All of these kernels have this issue.
>> Please confirm this bug. Thanks.
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Tony. S
Ph. D student of University of Colorado, Colorado Springs

Xen-devel mailing list