
Re: [Xen-devel] [BUG] Linux process vruntime accounting in Xen

On Mon, May 16, 2016 at 3:38 PM, Tony S <suokunstar@xxxxxxxxx> wrote:
> On Mon, May 16, 2016 at 5:37 AM, Dario Faggioli
> <dario.faggioli@xxxxxxxxxx> wrote:
>> [Adding George again, and a few Linux/Xen folks]
>> On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
>>> In virtualized environments, we sometimes need to limit the CPU
>>> resources available to a virtual machine (VM). For example, in Xen we use
>>> $ xl sched-credit -d 1 -c 50
>>> to cap dom 1 at half of one physical CPU core. If the VM's CPU
>>> resource is capped, processes inside the VM will have a vruntime
>>> accounting problem. Here, I report my findings about the Linux process
>>> scheduler under the above scenario.
>> Thanks for this other report as well. :-)
>> All you say makes sense to me, and I will think about it. I'm not sure
>> about one thing, though...
> Hi Dario,
> Thank you for your reply.
>>> ------------Description------------
>>> Linux CFS relies on delta_exec to charge the vruntime of processes.
>>> The variable delta_exec is the time between the point where a process
>>> starts running on a CPU and the point where it stops. This works well
>>> on a physical machine. However, in a virtual machine under capped
>>> resources, some processes might be charged an inaccurate vruntime.
>>> For example, suppose we have a VM which has one vCPU and is capped at
>>> 50% of a physical CPU. When process A inside the VM starts running and
>>> the CPU resource of that VM runs out, the VM will be paused. In the
>>> next round, when the VM is allocated new CPU resource and starts
>>> running again, process A stops running and is put back on the
>>> runqueue. The delta_exec of process A is accounted as its "real
>>> execution time" plus the paused time of its VM. That makes the
>>> vruntime of process A much larger than it should be, and process A
>>> will not be scheduled again for a long time, until the vruntimes of
>>> the other processes catch up with it.
>>> ---------------------------------------
>>> ------------Analysis----------------
>>> When a process stops running and is about to be put back on the
>>> runqueue, update_curr() will be executed.
>>> [src/kernel/sched/fair.c]
>>> static void update_curr(struct cfs_rq *cfs_rq)
>>> {
>>>     ... ...
>>>     delta_exec = now - curr->exec_start;
>>>     ... ...
>>>     curr->exec_start = now;
>>>     ... ...
>>>     curr->sum_exec_runtime += delta_exec;
>>>     schedstat_add(cfs_rq, exec_clock, delta_exec);
>>>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>>>     update_min_vruntime(cfs_rq);
>>>     ... ...
>>> }
>>> "now" --> the current time
>>> "exec_start" --> the time when the current process was put on the CPU
>>> "delta_exec" --> the time between the process starting and stopping
>>> running on the CPU
>>> When a process starts running before its VM is paused and stops
>>> running after its VM is unpaused, delta_exec will include the time
>>> the VM was paused, which is large compared to the real execution time
>>> of the process.
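
To make the arithmetic above concrete, here is a minimal user-space sketch of
the delta_exec step. It is not kernel code: the struct and function names
(task_model, charge_wallclock) are made up for illustration, and all times
are plain nanosecond counters supplied by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-task accounting state; times are in ns. */
struct task_model {
    uint64_t exec_start; /* wall-clock time when the task was put on the vCPU */
    uint64_t sum_exec;   /* total runtime charged to the task so far */
};

/*
 * Mirrors the "delta_exec = now - exec_start" step of update_curr().
 * The guest only sees wall-clock time, so any interval during which the
 * hypervisor did not run the vCPU is charged to the task as well.
 * For a nice-0 task, vruntime would advance by the same amount.
 */
static uint64_t charge_wallclock(struct task_model *t, uint64_t now)
{
    assert(now >= t->exec_start);

    uint64_t delta_exec = now - t->exec_start;
    t->exec_start = now;
    t->sum_exec += delta_exec;
    return delta_exec;
}
```

Suppose process A starts at t = 0, runs for 5 ms, the VM is then paused for
50 ms, and A runs for another 1 ms before it is descheduled. Calling
charge_wallclock() with now = 56000000 charges 56 ms, although A only
executed for 6 ms.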
>> ... but would that also apply to a VM that is not scheduled for some
>> time just because of pCPU contention, not because it was paused?
> Thanks for your suggestion. Today I checked whether this issue also
> exists with pCPU sharing. Unfortunately, it does: the issue is there
> not only in the capping case, but also in the pCPU sharing case.
> In both of the above cases, the process vruntime accounting in the
> guest OS shows a "vruntime jump", which might cause the victim process
> to have poor and unpredictable performance.
> In the cloud, from my point of view, a VM exists in one of three scenarios:
> 1, dedicated hardware (in this case, VM = Physical Machine);
> 2, part of dedicated hardware (using capping, like an Amazon EC2 t2.small
> instance);
> 3, sharing with other VMs on the same hardware;
> Both case#2 and case#3 are affected by the issue I mentioned.
>> Isn't there anything in place in Xen or Linux (the latter being better
>> suitable for something like this, IMHO) to compensate for that?
> No, I do not think so. I think this is a bug in the Linux kernel under
> virtualization (the VMM platform is Xen).
>> I have to admit I haven't really ever checked myself, maybe either
>> George or our Linux people do know more?
> The issue behind it is that the process execution time (e.g.,
> delta_exec) in a virtualized environment should not be calculated the
> way it is in a physical environment.
> Here are two possible solutions:
> 1) Use the changes in vcpu->runstate.time (running/runnable/blocked/offline)
> to determine how long the process on this VCPU was actually running,
> instead of just "delta_exec = now - exec_start";
> 2) Build another clock inside the guest OS which records the exact
> time that the VCPU runs. All vruntime calculation is then based on this
> clock, instead of the hypervisor clock/time (real clock).
> Thanks.
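
A minimal user-space sketch of solution 1, under stated assumptions: the
guest can read a cumulative counter of the time its vCPU spent not running
(for example, one derived from the runnable/offline runstate times). The
names steal_aware_task and charge_minus_steal are hypothetical, and this is
not how any existing kernel necessarily implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task accounting state; all times are in ns. */
struct steal_aware_task {
    uint64_t exec_start;  /* wall-clock time at the last accounting point */
    uint64_t steal_start; /* cumulative vCPU not-running time at that point */
    uint64_t sum_exec;    /* total runtime charged to the task so far */
};

/*
 * Charge only the part of the wall-clock interval during which the vCPU
 * was actually running. steal_now is assumed to be a monotonically
 * increasing counter of time the vCPU was descheduled by the hypervisor.
 */
static uint64_t charge_minus_steal(struct steal_aware_task *t,
                                   uint64_t now, uint64_t steal_now)
{
    assert(now >= t->exec_start);
    assert(steal_now >= t->steal_start);

    uint64_t wall = now - t->exec_start;
    uint64_t stolen = steal_now - t->steal_start;
    if (stolen > wall) /* clamp so the charge never underflows */
        stolen = wall;

    uint64_t delta_exec = wall - stolen;
    t->exec_start = now;
    t->steal_start = steal_now;
    t->sum_exec += delta_exec;
    return delta_exec;
}
```

With the numbers from the capped-VM example (56 ms of wall-clock time, of
which the vCPU was descheduled for 50 ms), this charges 6 ms instead of
56 ms. The clamp only guards against the two counters being sampled at
slightly different instants.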


Here is what Red Hat did in KVM to fix the steal time accounting issue
in the guest OS. I hope Xen can fix this issue in the future.

>>> This issue will badly hurt the performance of the victim process.
>>> If the process is an I/O-bound workload, its throughput and latency
>>> will be affected. If the process is a CPU-bound workload, this issue
>>> will make its vruntime "unfair" compared to other processes under CFS.
>>> Because the CPU resources of some types of VMs in the cloud are
>>> limited as described above (like the Amazon EC2 t2.small instance), I
>>> suspect this also harms the performance of public cloud instances.
>>> ---------------------------------------
>>> My test environment is as follows: Hypervisor (Xen 4.5.0), Dom 0 (Linux
>>> 3.18.21), Dom U (Linux 3.18.21). I also tested the longterm version
>>> Linux 3.18.30 and the latest longterm version, Linux 4.4.7. Those
>>> kernels all have this issue.
>>> Please confirm this bug. Thanks.
>> --
>> <<This happens because I choose it to happen!>> (Raistlin Majere)
>> -----------------------------------------------------------------
>> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> --
> Tony. S
> Ph. D student of University of Colorado, Colorado Springs

Tony. S
Ph. D student of University of Colorado, Colorado Springs
