Re: [Xen-devel] [BUG] Bugs existing Xen's credit scheduler cause long tail latency issues
On Mon, May 16, 2016 at 5:30 AM, Dario Faggioli <dario.faggioli@xxxxxxxxxx> wrote:
> [Adding George, and avoiding trimming, for his benefit]
>
> On Sat, 2016-05-14 at 22:11 -0600, Tony S wrote:
>> Hi all,
>>
> Hi Tony,
>
>> When I was running latency-sensitive applications in VMs on Xen, I
>> found some bugs in the credit scheduler which will cause long tail
>> latency in I/O-intensive VMs.
>>
> Ok, first of all, thanks for looking into and reporting this.
>
> This is certainly something we need to think about... For now, just a
> couple of questions.

Hi Dario,

Thank you for your reply. :-)

>
>> (1) Problem description
>>
>> ------------Description------------
>> My test environment is as follows: Hypervisor (Xen 4.5.0), Dom 0
>> (Linux 3.18.21), Dom U (Linux 3.18.21).
>>
>> Environment setup:
>> We created two 1-vCPU, 4GB-memory VMs and pinned them onto one
>> physical CPU core. One VM (denoted as the I/O-VM) ran the Sockperf
>> server program; the other VM (denoted as the CPU-VM) ran a
>> compute-bound task, e.g., SPECCPU 2006 or simply a loop. A client on
>> another physical machine sent UDP requests to the I/O-VM.
>>
> So, just to be sure I've understood, you have 2 VMs, each with 1 vCPU,
> *both* pinned on the *same* pCPU, is this the case?
>

Yes.

>> Here are my tail latency results (microseconds):
>>
>> Case    Avg     90%     99%     99.9%   99.99%
>> #1      108     114     128     129     130
>> #2      7811    13892   14874   15315   16383
>> #3      943     131     21755   26453   26553
>> #4      116     96      105     8217    13472
>> #5      116     117     129     131     132
>>
>> Bugs 1, 2, and 3 are discussed below.
>>
>> Case #1:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> idling (no processes running).
>>
>> Case #2:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0.
>>
>> Case #3:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bug 1 fixed.
>>
>> Case #4:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bugs 1 & 2 fixed.
>>
>> Case #5:
>> I/O-VM was processing Sockperf requests from clients; CPU-VM was
>> running a compute-bound task.
>> Hypervisor is the native Xen 4.5.0 with bugs 1, 2 & 3 fixed.
>>
>> ---------------------------------------
>>
>>
>> (2) Problem analysis
>>
>> ------------Analysis----------------
>>
>> [Bug 1]: The VCPU that runs a CPU-intensive workload can be
>> mistakenly boosted due to CPU affinity.
>>
>> http://lists.xenproject.org/archives/html/xen-devel/2015-10/msg02853.html
>>
>> We have already discussed this bug and a potential patch in the above
>> link. Although the discussed patch improved the tail latency, i.e.,
>> reducing the 90th percentile latency, the long tail latency is still
>> not bounded. Below, we describe two new bugs that inflict latency
>> hikes at the very far end of the tail.
>>
> Right, and there is a fix upstream for this. It's not the patch you
> proposed in the thread linked above, but it should have had the same
> effect.
>
> Can you perhaps try something more recent than 4.5 (4.7-rc would be
> great) and confirm that the numbers still look similar?

I have tried the latest stable version, Xen 4.6, today.
Here are my results (microseconds):

Case    Avg     90%     99%     99.9%   99.99%
#1      91      93      101     105     110
#2      22506   43011   231946  259501  265561
#3      917     95      25257   30048   30756
#4      110     95      102     12448   13255
#5      114     118     130     134     136

It seems that case #2 is much worse; the other cases are similar. For Xen
4.7-rc, I have some installation issues on my machine, so I have not been
able to try it yet.

My raw latency data is pasted below. I hope it helps you understand the
issues better. :-)

# case 1:
sockperf: ====> avg-lat= 91.688 (std-dev=2.950)
sockperf: ---> <MAX> observation = 110.647
sockperf: ---> percentile 99.99 = 110.647
sockperf: ---> percentile 99.90 = 105.242
sockperf: ---> percentile 99.50 = 101.531
sockperf: ---> percentile 99.00 = 101.066
sockperf: ---> percentile 95.00 = 97.016
sockperf: ---> percentile 90.00 = 93.294
sockperf: ---> percentile 75.00 = 92.157
sockperf: ---> percentile 50.00 = 91.437
sockperf: ---> percentile 25.00 = 90.786
sockperf: ---> <MIN> observation = 73.071

# case 2:
sockperf: ====> avg-lat=90019.931 (std-dev=136620.722)
sockperf: ---> <MAX> observation = 637712.152
sockperf: ---> percentile 99.99 = 637712.152
sockperf: ---> percentile 99.90 = 632901.547
sockperf: ---> percentile 99.50 = 615972.778
sockperf: ---> percentile 99.00 = 599698.318
sockperf: ---> percentile 95.00 = 428857.020
sockperf: ---> percentile 90.00 = 259316.760
sockperf: ---> percentile 75.00 = 114029.044
sockperf: ---> percentile 50.00 = 24629.429
sockperf: ---> percentile 25.00 = 10368.731
sockperf: ---> <MIN> observation = 81.046

# case 3:
sockperf: ====> avg-lat=917.394 (std-dev=3943.142)
sockperf: ---> <MAX> observation = 30756.289
sockperf: ---> percentile 99.99 = 30756.289
sockperf: ---> percentile 99.90 = 30048.372
sockperf: ---> percentile 99.50 = 25962.687
sockperf: ---> percentile 99.00 = 25257.746
sockperf: ---> percentile 95.00 = 5615.028
sockperf: ---> percentile 90.00 = 95.726
sockperf: ---> percentile 75.00 = 92.916
sockperf: ---> percentile 50.00 = 90.387
sockperf: ---> percentile 25.00 = 89.162
sockperf: ---> <MIN> observation = 67.762

# case 4:
sockperf: ====> avg-lat=110.159 (std-dev=555.153)
sockperf: ---> <MAX> observation = 13255.732
sockperf: ---> percentile 99.99 = 13255.732
sockperf: ---> percentile 99.90 = 12448.629
sockperf: ---> percentile 99.50 = 104.799
sockperf: ---> percentile 99.00 = 101.954
sockperf: ---> percentile 95.00 = 97.295
sockperf: ---> percentile 90.00 = 95.995
sockperf: ---> percentile 75.00 = 91.866
sockperf: ---> percentile 50.00 = 88.803
sockperf: ---> percentile 25.00 = 71.088
sockperf: ---> <MIN> observation = 65.826

# case 5:
sockperf: ====> avg-lat=114.984 (std-dev=3.782)
sockperf: ---> <MAX> observation = 136.748
sockperf: ---> percentile 99.99 = 136.748
sockperf: ---> percentile 99.90 = 134.192
sockperf: ---> percentile 99.50 = 131.467
sockperf: ---> percentile 99.00 = 130.200
sockperf: ---> percentile 95.00 = 121.575
sockperf: ---> percentile 90.00 = 118.518
sockperf: ---> percentile 75.00 = 116.343
sockperf: ---> percentile 50.00 = 114.356
sockperf: ---> percentile 25.00 = 112.479
sockperf: ---> <MIN> observation = 94.932

>
> About this below here, I'll read carefully and think about it. Thanks
> again.

Thank you, Dario. As for bug 2 and bug 3: although they do not affect
throughput, they cause a serious latency problem, especially at the far
end of the tail.
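
To make the proposed fixes for these two bugs a bit more concrete before
re-quoting the full descriptions below, here is a minimal toy sketch of the
bug 2 idea: put every inactive but runnable VCPU back onto the active list
before the next credit refill, instead of only the VCPU that happens to be
current at tick time. This is deliberately not real sched_credit.c code;
the names (toy_vcpu, toy_acct_start(), toy_reactivate_all()) are made up
for illustration, and an actual patch would have to walk Xen's active-VCPU
and active-domain lists under the scheduler lock.

    /*
     * Toy model of the bug-2 fix idea.  This is NOT actual sched_credit.c
     * code: the structures and names below are simplified assumptions.
     * Instead of re-activating only the currently running VCPU at each
     * tick, walk all VCPUs and put every inactive, runnable one back on
     * the active list before csched_acct() hands out the next credits.
     */
    #include <stdbool.h>
    #include <stddef.h>

    struct toy_vcpu {
        bool active;      /* on the "active" accounting list?        */
        bool runnable;    /* has work to do (e.g. pending I/O reply)? */
        int  credit;
    };

    /* Hypothetical stand-in for __csched_vcpu_acct_start(): mark one
     * VCPU as active so it earns credits in the next accounting pass. */
    static void toy_acct_start(struct toy_vcpu *v)
    {
        v->active = true;
    }

    /*
     * Proposed behaviour, run just before the (30ms) credit refill:
     * every runnable-but-inactive VCPU is re-activated, so an I/O-bound
     * VCPU that was not "current" at tick time no longer misses the
     * refill and no longer drops into the OVER state.
     */
    static void toy_reactivate_all(struct toy_vcpu *vcpus, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (vcpus[i].runnable && !vcpus[i].active)
                toy_acct_start(&vcpus[i]);
    }

    int main(void)
    {
        struct toy_vcpu vcpus[2] = {
            { .active = true,  .runnable = true, .credit = 0 }, /* CPU-VM */
            { .active = false, .runnable = true, .credit = 0 }, /* I/O-VM */
        };

        /* Before the refill: the I/O-VM's VCPU gets back on the list. */
        toy_reactivate_all(vcpus, 2);
        return vcpus[1].active ? 0 : 1;  /* exit 0 if the fix "worked" */
    }

The point is simply that re-activation is driven by a scan over all VCPUs,
rather than by whichever VCPU happens to be running when the tick fires.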
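
And here is a similar toy sketch of the bug 3 idea of keeping the boost
separate from the UNDER/OVER credit classes, so that an accounting pass
running between csched_vcpu_wake() and __runq_tickle() cannot silently
cancel a pending boost. Again, the types and helpers below (toy_wake(),
toy_acct(), toy_should_preempt()) are purely illustrative and are not
Xen's actual interfaces.

    /*
     * Toy model of the bug-3 fix idea (separate BOOST from UNDER/OVER).
     * Again these are simplified assumptions, not actual sched_credit.c
     * code.  With a single priority field, csched_acct() running between
     * the wakeup and __runq_tickle() can overwrite BOOST with UNDER and
     * the preemption is lost; with a separate flag, accounting can update
     * the credit class without cancelling a pending boost.
     */
    #include <stdbool.h>
    #include <stdio.h>

    enum credit_class { CLASS_UNDER, CLASS_OVER };

    struct toy_vcpu {
        enum credit_class cls; /* maintained by the accounting pass      */
        bool boosted;          /* set at wakeup (a real scheduler would
                                  clear it once the VCPU gets to run)    */
        int  credit;
    };

    /* Wakeup path: boost a VCPU waking from sleep so it can preempt. */
    static void toy_wake(struct toy_vcpu *v)
    {
        v->boosted = true;
    }

    /* Accounting path: refresh the credit class.  It deliberately does
     * not touch v->boosted, so an in-flight boost survives. */
    static void toy_acct(struct toy_vcpu *v)
    {
        v->cls = (v->credit > 0) ? CLASS_UNDER : CLASS_OVER;
    }

    /* Tickle path: does the woken VCPU preempt the running one? */
    static bool toy_should_preempt(const struct toy_vcpu *woken,
                                   const struct toy_vcpu *running)
    {
        if (woken->boosted && !running->boosted)
            return true;                  /* a pending boost always wins */
        return woken->cls < running->cls; /* otherwise UNDER beats OVER  */
    }

    int main(void)
    {
        struct toy_vcpu io  = { .cls = CLASS_UNDER, .credit = 50 };
        struct toy_vcpu cpu = { .cls = CLASS_UNDER, .credit = 10 };

        toy_wake(&io);   /* I/O VCPU wakes up and is boosted...     */
        toy_acct(&io);   /* ...accounting runs before the tickle... */
        toy_acct(&cpu);

        /* ...but the boost is still visible, so preemption happens. */
        printf("preempt: %s\n", toy_should_preempt(&io, &cpu) ? "yes" : "no");
        return 0;
    }

The locking approach (option 1 in the [Possible Solution] quoted below)
should close the race as well, but separating the state avoids adding
serialization between the wakeup and accounting paths.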
>
>> [Bug 2]: In csched_acct() (by default, every 30ms), a VCPU stops
>> earning credits and is removed from the active CPU list (in
>> __csched_vcpu_acct_stop_locked()) if its credit is larger than the
>> upper bound. Because the domain has only one VCPU, the domain is also
>> removed from the active domain list.
>>
>> Every 10ms, csched_tick() --> csched_vcpu_acct() -->
>> __csched_vcpu_acct_start() is executed and tries to put inactive
>> VCPUs back onto the active list. However, __csched_vcpu_acct_start()
>> only puts the current VCPU back onto the active list. If an I/O-bound
>> VCPU is not the current VCPU at the csched_tick(), it will not be put
>> back onto the active VCPU list. If so, the I/O-bound VCPU will likely
>> miss the next credit refill in csched_acct() and can easily enter the
>> OVER state. As a result, the I/O-bound VM cannot be boosted and
>> suffers very long latency: it takes at least one time slice (e.g.,
>> 30ms) before the I/O VM is activated again and starts to receive
>> credits.
>>
>> [Possible Solution] Try to put all inactive VCPUs back onto the
>> active list before the next credit refill, instead of just the
>> current VCPU.
>>
>>
>> [Bug 3]: The BOOST priority might be changed back to UNDER before the
>> boosted VCPU preempts the currently running VCPU. If so, the boost
>> has no effect.
>>
>> If a VCPU is in the UNDER state and wakes up from sleep, it is
>> boosted in csched_vcpu_wake(). However, the boost only takes effect
>> if __runq_tickle() preempts the current VCPU. It is possible for
>> csched_acct() to run between csched_vcpu_wake() and __runq_tickle(),
>> and that will sometimes change the BOOST state back to UNDER if the
>> credit is positive. If so, __runq_tickle() can fail, because a VCPU
>> in UNDER cannot preempt another UNDER VCPU. This also contributes to
>> the far end of the long tail latency.
>>
>> [Possible Solution]
>> 1. Add a lock to prevent csched_acct() from interleaving with
>> csched_vcpu_wake();
>> 2. Separate the BOOST state from the UNDER and OVER states.
>> ---------------------------------------
>>
>>
>> Please confirm these bugs.
>> Thanks.
>>
>> --
>> Tony S.
>> Ph.D. student, University of Colorado, Colorado Springs
>>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

--
Tony S.
Ph.D. student, University of Colorado, Colorado Springs

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel