Xen project Mailing List

Re: [Xen-devel] [BUG] Linux process vruntime accounting in Xen

To: Dario Faggioli <dario.faggioli@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Date: Mon, 16 May 2016 16:33:11 -0600

Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, David Vrabel <david.vrabel@xxxxxxxxxx>, Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>

Delivery-date: Mon, 16 May 2016 22:33:45 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Mon, May 16, 2016 at 3:38 PM, Tony S <suokunstar@xxxxxxxxx> wrote:
> On Mon, May 16, 2016 at 5:37 AM, Dario Faggioli
> <dario.faggioli@xxxxxxxxxx> wrote:
>> [Adding George again, and a few Linux/Xen folks]
>>
>> On Sat, 2016-05-14 at 18:25 -0600, Tony S wrote:
>>> In virtualized environments, sometimes we need to limit the CPU
>>> resources to a virtual machine(VM). For example in Xen, we use
>>> $ xl sched-credit -d 1 -c 50
>>>
>>> to limit the CPU resource of dom 1 as half of
>>> one physical CPU core. If the VM CPU resource is capped, the process
>>> inside the VM will have a vruntime accounting problem. Here, I report
>>> my findings about Linux process scheduler under the above scenario.
>>>
>> Thanks for this other report as well. :-)
>>
>> All you say makes sense to me, and I will think about it. I'm not sure
>> about one thing, though...
>>
>
> Hi Dario,
>
> Thank you for your reply.
>
>
>>> ------------Description------------
>>> Linux CFS relies on delta_exec to charge the vruntime of processes.
>>> The variable delta_exec is the difference of a process starts and
>>> stops running on a CPU. This works well in physical machine. However,
>>> in virtual machine under capped resources, some processes might be
>>> accounted with inaccurate vruntime.
>>>
>>> For example, suppose we have a VM which has one vCPU and is capped to
>>> have as much as 50% of a physical CPU. When process A inside the VM
>>> starts running and the CPU resource of that VM runs out, the VM will
>>> be paused. Next round when the VM is allocated new CPU resource and
>>> starts running again, process A stops running and is put back to the
>>> runqueue. The delta_exec of process A is accounted as its "real
>>> execution time" plus the paused time of its VM. That will make the
>>> vruntime of process A much larger than it should be and process A
>>> would not be scheduled again for a long time until the vruntimes of
>>> other processes catch it.
>>> ---------------------------------------
>>>
>>>
>>> ------------Analysis----------------
>>> When a process stops running and is going to put back to the
>>> runqueue, update_curr() will be executed.
>>> [src/kernel/sched/fair.c]
>>>
>>> static void update_curr(struct cfs_rq *cfs_rq)
>>> {
>>>     ... ...
>>>     delta_exec = now - curr->exec_start;
>>>     ... ...
>>>     curr->exec_start = now;
>>>     ... ...
>>>     curr->sum_exec_runtime += delta_exec;
>>>     schedstat_add(cfs_rq, exec_clock, delta_exec);
>>>     curr->vruntime += calc_delta_fair(delta_exec, curr);
>>>     update_min_vruntime(cfs_rq);
>>>     ... ...
>>> }
>>>
>>> "now" --> the right now time
>>> "exec_start" --> the time when the current process is put on the CPU
>>> "delta_exec" --> the time difference of a process between it starts
>>> and stops running on the CPU
>>>
>>> When a process starts running before its VM is paused and the process
>>> stops running after its VM is unpaused, the delta_exec will include
>>> the VM suspend time which is pretty large compared to the real
>>> execution time of a process.
>>>
>> ... but would that also apply to a VM that is not scheduled --just
>> because of pCPU contention, not because it was paused-- for a few time?
>>
>
> Thanks for your suggestion. I have tried to see whether this issue
> exists on pCPU sharing today. Unfortunately, I found this issue was
> there, not only for capping case, but also for pCPU sharing case.
>
> In the above both cases, the process vruntime accounting in guest OS
> has "vruntime jump", which might cause that victim process to have
> poor and unpredictable performance.
>
> In the cloud, from my point of view, the VM exists in three scenarios:
> 1, dedicated hardware(in this case, VM = Physical Machine);
> 2, part of dedicated hardware(using capping, like Amazon EC2 T2.small
> instance);
> 3, sharing with other VMs on the same hardware;
>
> Both case#2 and case#3 will be influenced due to the issue I mentioned.
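
[Editorial illustration of the "vruntime jump" described above. This is
a toy model under stated assumptions, not kernel code: struct task_model
and charge_wallclock() are hypothetical names, task weight is ignored
(so vruntime advances 1:1 with delta_exec), and all timestamps are plain
wall-clock nanoseconds. It only mirrors the
"delta_exec = now - curr->exec_start" step of the quoted update_curr().]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one task's accounting state (not a kernel struct). */
struct task_model {
    uint64_t exec_start_ns; /* wall-clock time the task was put on the vCPU */
    uint64_t vruntime_ns;   /* accumulated virtual runtime, weight ignored */
};

/*
 * Charge the task using wall-clock time only, as in
 * "delta_exec = now - curr->exec_start". Any interval during which the
 * vCPU itself was not running is silently included in delta_exec.
 */
static uint64_t charge_wallclock(struct task_model *t, uint64_t now_ns)
{
    uint64_t delta_exec;

    assert(now_ns >= t->exec_start_ns); /* the clock must not go backwards */

    delta_exec = now_ns - t->exec_start_ns;
    t->exec_start_ns = now_ns;
    t->vruntime_ns += delta_exec;
    return delta_exec;
}
```

For instance, if a task is put on the vCPU at t = 0, executes for 5 ms,
the vCPU is then descheduled for 50 ms, and the task runs 1 ms more
before the next update, charge_wallclock() at t = 56 ms returns 56 ms,
although the task only executed for about 6 ms.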
>
>
>> Isn't there anything in place in Xen or Linux (the latter being better
>> suitable for something like this, IMHO) to compensate for that?
>>
>
> No. I do not think so. I think this is a bug in Linux kernel under
> virtualization(vmm platform is Xen).
>
>> I have to admit I haven't really ever checked myself, maybe either
>> George or our Linux people do know more?
>
> The issue behind it is that the process execution calculation(e.g.,
> delta_exec) in virtualized environment should not be calculated as it
> did in physical environment.
>
> Here are two solutions to fix it:
>
> 1) Based on the vcpu->runstate.time(running/runnable/block/offline)
> changes, to determine how much time the process on this VCPU is
> running, instead of just "delta_exec = now - exec_start";
>
> 2) Build another clock inside the guest OS which records the exact
> time that the VCPU runs. All vruntime calculation is based on this
> clock, instead of hypervisor clock/time(real clock).
>
> Thanks.
>

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-KVM_guest_timing_management-Steal_time_accounting.html

Here is what Red Hat did in KVM to fix the steal time accounting issue
in the guest OS. Hoping Xen can fix this issue in the future.

>>
>>> This issue will make a great performance harm to the victim process.
>>> If the process is an I/O-bound workload, its throughput and latency
>>> will be influenced. If the process is a CPU-bound workload, this
>>> issue will make its vruntime "unfair" compared to other processes
>>> under CFS.
>>>
>>> Because the CPU resource of some type VMs in the cloud are limited as
>>> the above describes(like Amazon EC2 t2.small instance), I doubt that
>>> will also harm the performance of public cloud instances.
>>> ---------------------------------------
>>>
>>>
>>> My test environment is as follows: Hypervisor(Xen 4.5.0), Dom 0(Linux
>>> 3.18.21), Dom U(Linux 3.18.21).
>>> I also test longterm version Linux 3.18.30 and the latest longterm
>>> version, Linux 4.4.7. Those kernels all have this issue.
>>>
>>> Please confirm this bug. Thanks.
>>>
>>>
>> --
>> <<This happens because I choose it to happen!>> (Raistlin Majere)
>> -----------------------------------------------------------------
>> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>>
>
> --
> Tony. S
> Ph. D student of University of Colorado, Colorado Springs

--
Tony. S
Ph. D student of University of Colorado, Colorado Springs

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
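
[Editorial illustration of the two proposed fixes, which both amount to
charging a task only for time its vCPU actually ran. This is a minimal
sketch under stated assumptions, not a patch: struct vcpu_model and
delta_exec_minus_steal() are hypothetical names, and steal_total_ns
stands in for whatever cumulative "runnable but not running" counter
the hypervisor exposes to the guest. How that counter is obtained is
deliberately left out.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vCPU bookkeeping (not a Xen or Linux struct). */
struct vcpu_model {
    uint64_t steal_total_ns; /* assumed: cumulative time the vCPU was not running */
    uint64_t steal_seen_ns;  /* part of steal_total_ns already subtracted */
};

/*
 * Compute delta_exec for the interval [exec_start_ns, now_ns] and
 * subtract steal time not yet accounted for. The steal amount is
 * clamped to the interval length so the result never underflows;
 * any excess stays pending and is subtracted from later intervals.
 */
static uint64_t delta_exec_minus_steal(struct vcpu_model *v,
                                       uint64_t exec_start_ns,
                                       uint64_t now_ns)
{
    uint64_t delta, steal;

    assert(now_ns >= exec_start_ns);
    assert(v->steal_total_ns >= v->steal_seen_ns);

    delta = now_ns - exec_start_ns;
    steal = v->steal_total_ns - v->steal_seen_ns;
    if (steal > delta)
        steal = delta;

    v->steal_seen_ns += steal;
    return delta - steal;
}
```

With 50 ms of unaccounted steal time and a wall-clock interval of 56 ms,
the function returns 6 ms, which is the time the task really executed in
the capped-VM example from the original report.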
