
[Xen-devel] RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.



Hey George and Dario,

I am looking at a particular issue for which I would appreciate
your expert knowledge of the scheduler's inner workings.

I have two questions which I hope will steer me in the right
debugging direction. But before I go down that route, let me explain
the scenario.

There are seven guests (eight if you include the initial domain).
Each guest is partitioned to have exclusive usage of a range
of CPUs. That is, none of the guests (except the initial domain)
overlap with the others. This is enforced by having 'cpu=X-Y'
in the guest config.

As such, the runqueue for each physical CPU contains the
idle domain's vcpu, the guest's vcpu, and occasionally a vcpu of
the initial domain.
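
(For reference when reading the traces further down: the idle domain
shows up under the reserved domain id DOMID_IDLE, i.e. 0x7fff. Roughly,
from xen/include/public/xen.h:

    #define DOMID_FIRST_RESERVED (0x7FF0U)
    #define DOMID_SELF           (0x7FF0U) /* the calling domain itself         */
    #define DOMID_IDLE           (0x7FFFU) /* domain of the per-pCPU idle vcpus */
)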

What we are observing is that if a domain is idle, its steal
time* goes up. My first thought was that it was the initial
domain taking the time, but after looking at the trace I
did not see the initial domain being scheduled at all.

(*steal time: RUNSTATE_runnable + RUNSTATE_offline).
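
(That is, the guest derives steal time from the per-vcpu runstate
information Xen exports, summing the 'runnable' and 'offline' buckets.
Roughly, from xen/include/public/vcpu.h:

    struct vcpu_runstate_info {
        int      state;            /* current RUNSTATE_* of this vcpu     */
        uint64_t state_entry_time; /* when the current state was entered  */
        uint64_t time[4];          /* cumulative ns spent in running,
                                      runnable, blocked and offline       */
    };
)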

The issue I see is that the guest ends up hitting
'HLT' and the scheduler picks the idle domain. It
kicks it off and .... 

        Question 1: Following the code path, schedule_tail
        for the idle domain would call idle_loop.

        How do we get from idle_loop to vcpu_wake?

        Is that because the HPET (on another CPU)
        has raised the softirq(TIMER_SOFTIRQ) because the
        timer has expired?
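
        (For context on why the idle vcpu services that softirq at all:
        idle_loop is essentially a tight loop around do_softirq(),
        something like this, paraphrasing xen/arch/x86/domain.c:

        static void idle_loop(void)
        {
            for ( ; ; )
            {
                if ( cpu_is_offline(smp_processor_id()) )
                    play_dead();
                (*pm_idle)();     /* halt/mwait until an interrupt fires   */
                do_tasklet();
                do_softirq();     /* ...which is where TIMER_SOFTIRQ runs  */
            }
        }
        )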

        It certainly looks like it could happen:

        evt_do_broadcast (on some other CPU), sets TIMER_SOFTIRQ 

        idle_loop
          ->do_softirq
               timer_softirq_action
                  vcpu_singleshot_timer_fn
                     vcpu_periodic_timer_work
                        send_timer_event->
                            send_guest_vcpu_virq->
                                 evtchn_set_pending->
                                      vcpu_mark_events_pending->
                                        vcpu_kick->
                                                vcpu_unblock->
                                                        vcpu_wake
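
        (To connect the dots: the TIMER_SOFTIRQ handler simply runs the
        callbacks of whatever timers on this CPU have expired, which is
        how vcpu_singleshot_timer_fn ends up running in the idle vcpu's
        context. A rough sketch only - pop_expired_timer is a placeholder
        name, not the real helper in xen/common/timer.c:

        static void timer_softirq_action(void)
        {
            s_time_t now = NOW();
            struct timer *t;

            /* Run the handler of each expired timer on this CPU; for a
             * guest vcpu that handler is vcpu_singleshot_timer_fn or the
             * periodic timer's handler, leading to send_timer_event(). */
            while ( (t = pop_expired_timer(now)) != NULL ) /* placeholder */
                t->function(t->data);

            /* ...then reprogram the hardware timer (LAPIC/HPET) for the
             * next deadline. */
        }
        )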

        And then on the next iteration, somebody has to
        raise SCHEDULE_SOFTIRQ??

        Anyhow,

.. and ends up calling vcpu_wake. vcpu_wake
changes the guest's runstate from RUNSTATE_blocked
to RUNSTATE_runnable. Then we call schedule().

        Question 2:

        Who would trigger the SCHEDULE_SOFTIRQ for that?
        I was initially thinking of 'do_block', but I think that
        triggers the first call to 'schedule', which
        sets the idle domain to run. Help? It could be
        'vcpu_kick', but by then 'v->running == 0' (set by schedule->context_saved).
        Help!? Who could it be?
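
        (My tentative reading of the wake path, in case it helps frame
        the question: vcpu_wake does the runstate change itself and then
        hands off to the scheduler's wake hook, and for the credit
        scheduler it is that hook's "tickle" which raises SCHEDULE_SOFTIRQ
        on the target pCPU. Paraphrasing xen/common/schedule.c and
        xen/common/sched_credit.c, details elided:

        void vcpu_wake(struct vcpu *v)
        {
            ...
            if ( likely(vcpu_runnable(v)) )
            {
                if ( v->runstate.state >= RUNSTATE_blocked )
                    vcpu_runstate_change(v, RUNSTATE_runnable, NOW());
                SCHED_OP(VCPU2OP(v), wake, v);  /* credit: csched_vcpu_wake */
            }
            ...
        }

        /* csched_vcpu_wake -> __runq_insert + __runq_tickle, and the
         * tickle is what ends up raising the softirq:                 */
        static inline void __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
        {
            ...
            if ( !cpumask_empty(&mask) )
                cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
        }
        )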

Then 'schedule' is called, where 'prev' is the idle
domain and 'next' is the guest. However, because 'next' got
labelled as RUNSTATE_runnable, we account _all of the time
that the idle domain had been running as belonging to the guest_.

That is what I think is happening, and it certainly explains
why the guest has such a large steal time. It looks like a bug
to me, but perhaps that is how it is supposed to be accounted for?
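
(For reference, the accounting mechanism itself: every runstate
transition charges the elapsed time to the state being *left*, so once
vcpu_wake has flipped the guest to RUNSTATE_runnable, the whole stretch
until schedule() actually switches to it lands in the 'runnable' bucket,
i.e. steal time. Paraphrasing vcpu_runstate_change in
xen/common/schedule.c:

    static inline void vcpu_runstate_change(
        struct vcpu *v, int new_state, s_time_t new_entry_time)
    {
        s_time_t delta = new_entry_time - v->runstate.state_entry_time;

        /* Charge the time since the last transition to the state we are
         * leaving - here RUNSTATE_runnable, even though the idle vcpu was
         * the one actually on the CPU for most of that interval. */
        if ( delta > 0 )
        {
            v->runstate.time[v->runstate.state] += delta;
            v->runstate.state_entry_time = new_entry_time;
        }

        v->runstate.state = new_state;
    }
)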

Here is the trace (I am also attaching a patch, as the 'formats'
file had a bug in it).

Upon closer inspection we have something like this:

 (+    1104)  do_yield          [ domid = 0x00000005, edomid = 0x00000004 ]
 (+  353598)  continue_running    [ dom:vcpu = 0x00050004 ]
 (+   41577)  VMENTRY
 (+   77715)  VMEXIT      [ exitcode = 0x0000000c, rIP  = 0xffffffff810402ea ]
 (+   47451)  do_block          [ domid = 0x00000005, vcpuid = 0x00000004 ]
 (+       0)  HLT         [ intpending = 0 ]
 (+  190338)  switch_infprev    [ old_domid = 0x00000005, runtime = 24803749 ]
 (+     264)  switch_infnext    [ new_domid = 0x00007fff, time = 24803749, 
r_time = 4294967295 ]
 (+     309)  __enter_scheduler [ prev<domid:vcpu> = 0x00000005 : 0x00000004, 
next<domid:vcpu> = 0x00007fff : 0x00000081 ]
 (+     318)  running_to_blocked  [ dom:vcpu = 0x00050004 ]

[VM05 is now in RUNSTATE_blocked]
 (+   30516)  runnable_to_running [ dom:vcpu = 0x7fff0081 ]

=> Here schedule_tail is called for the idle domain, which presumably
runs 'idle_loop'.

... some time later...


vcpu_wake:
 (+  794301)  blocked_to_runnable [ dom:vcpu = 0x00050004 ]

=> Moves VM05 from RUNSTATE_blocked to RUNSTATE_runnable.

 (+   88113)  domain_wake       [ domid = 0x00000005, vcpu = 0x00000004 ]


 (+15749685)  switch_infprev    [ old_domid = 0x00007fff, runtime = 6863168 ]

(IDLE domain ran for [6.57msec] !! )

 (+     288)  switch_infnext    [ new_domid = 0x00000005, time = 6458163, 
r_time = 30000000 ]

And we switch to VM05:

 TRACE_3D(TRC_SCHED_SWITCH_INFNEXT,
          next->domain->domain_id,                      /* VM05          */
          (next->runstate.state == RUNSTATE_runnable) ? /* YES?!?! - yes,
                                                         * because vcpu_wake
                                                         * changed it    */
              (now - next->runstate.state_entry_time) : 0,
          next_slice.time);

 And so we account the time that the idle domain ran to the VCPU of
VM05.

 (+   27717)  __enter_scheduler [ prev<domid:vcpu> = 0x00007fff : 0x00000081, 
next<domid:vcpu> = 0x00000005 : 0x00000004 ]
 (+     363)  running_to_runnable [ dom:vcpu = 0x7fff0081 ]
 (+   63174)  runnable_to_running [ dom:vcpu = 0x00050004 ]
 (+       0)  INJ_VIRQ    [ vector = 0xf3, fake = 0 ]
 (+       0)  INTR_WINDOW [ value = 0x000000f3 ]
 (+  164763)  VMENTRY

[Start VM05, injecting the callback vector, presumably with
the VIRQ_TIMER event.]

Attachment: 0001-xentrace-formats-Fix-TRC_SCHED_-outputs.patch
Description: Text document

