
[Xen-devel] RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.



Hey George and Dario,

I am looking at a particular issue for which I would appreciate
your expert knowledge of the scheduler's inner workings.

I have two questions which I hope will steer me in the right
debugging direction. But before I go down that route, let me explain
the scenario.

There are seven guests (eight if you include the initial domain).
Each guest is partitioned to have exclusive usage of a range
of CPUs. That is, none of the guests (except the initial domain)
overlap with the others. This is enforced by having 'cpu=X-Y'
in the guest config.

As such, the runqueue for each physical CPU contains the
idle domain's vcpu, the guest's vcpu, and occasionally a vcpu of
the initial domain.
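
(For reference when reading the traces further down: the idle domain
shows up under the reserved domain id DOMID_IDLE, i.e. 0x7fff. Roughly,
from xen/include/public/xen.h:

    #define DOMID_FIRST_RESERVED (0x7FF0U)
    #define DOMID_SELF           (0x7FF0U) /* the calling domain itself         */
    #define DOMID_IDLE           (0x7FFFU) /* domain of the per-pCPU idle vcpus */
)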

What we are observing is that if a domain is idle, its steal
time* goes up. My first thought was that it was the initial
domain taking the time, but after looking at the trace I
did not see the initial domain being scheduled at all.

(*steal time: RUNSTATE_runnable + RUNSTATE_offline).
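
(That is, the guest derives steal time from the per-vcpu runstate
information Xen exports, summing the 'runnable' and 'offline' buckets.
Roughly, from xen/include/public/vcpu.h:

    struct vcpu_runstate_info {
        int      state;            /* current RUNSTATE_* of this vcpu     */
        uint64_t state_entry_time; /* when the current state was entered  */
        uint64_t time[4];          /* cumulative ns spent in running,
                                      runnable, blocked and offline       */
    };
)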

The issue I see is that the guest ends up hitting
'HLT' and the scheduler picks the idle domain. It
kicks it off and .... 

        Question 1: Following the code path, schedule_tail
        for the idle domain would call idle_loop.

        How do we get from idle_loop to vcpu_wake?

        Is that because the HPET (on another CPU)
        has raised the softirq(TIMER_SOFTIRQ) because the
        timer has expired?
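
        (For context on why the idle vcpu services that softirq at all:
        idle_loop is essentially a tight loop around do_softirq(),
        something like this, paraphrasing xen/arch/x86/domain.c:

        static void idle_loop(void)
        {
            for ( ; ; )
            {
                if ( cpu_is_offline(smp_processor_id()) )
                    play_dead();
                (*pm_idle)();     /* halt/mwait until an interrupt fires   */
                do_tasklet();
                do_softirq();     /* ...which is where TIMER_SOFTIRQ runs  */
            }
        }
        )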

        It certainly looks like it could happen:

        evt_do_broadcast (on some other CPU), sets TIMER_SOFTIRQ 

        idle_loop
          ->do_softirq
               timer_softirq_action
                  vcpu_singleshot_timer_fn
                     vcpu_periodic_timer_work
                        send_timer_event->
                            send_guest_vcpu_virq->
                                 evtchn_set_pending->
                                      vcpu_mark_events_pending->
                                        vcpu_kick->
                                                vcpu_unblock->
                                                        vcpu_wake
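
        (To connect the dots: the TIMER_SOFTIRQ handler simply runs the
        callbacks of whatever timers on this CPU have expired, which is
        how vcpu_singleshot_timer_fn ends up running in the idle vcpu's
        context. A rough sketch only - pop_expired_timer is a placeholder
        name, not the real helper in xen/common/timer.c:

        static void timer_softirq_action(void)
        {
            s_time_t now = NOW();
            struct timer *t;

            /* Run the handler of each expired timer on this CPU; for a
             * guest vcpu that handler is vcpu_singleshot_timer_fn or the
             * periodic timer's handler, leading to send_timer_event(). */
            while ( (t = pop_expired_timer(now)) != NULL ) /* placeholder */
                t->function(t->data);

            /* ...then reprogram the hardware timer (LAPIC/HPET) for the
             * next deadline. */
        }
        )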

        And then on the next iteration, somebody has to
        raise SCHEDULE_SOFTIRQ??

        Anyhow,

.. and ends up calling vcpu_wake. vcpu_wake
changes the guest's runstate from RUNSTATE_blocked
to RUNSTATE_runnable. Then we call schedule().

        Question 2:

        Who would trigger the SCHEDULE_SOFTIRQ for that?
        I was initially thinking of 'do_block', but I think that
        triggers the first call to 'schedule', which
        sets the idle domain to run. Help? It could be
        'vcpu_kick', but by then 'v->running == 0' (set by schedule->context_saved).
        Help!? Who could it be?
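
        (My tentative reading of the wake path, in case it helps frame
        the question: vcpu_wake does the runstate change itself and then
        hands off to the scheduler's wake hook, and for the credit
        scheduler it is that hook's "tickle" which raises SCHEDULE_SOFTIRQ
        on the target pCPU. Paraphrasing xen/common/schedule.c and
        xen/common/sched_credit.c, details elided:

        void vcpu_wake(struct vcpu *v)
        {
            ...
            if ( likely(vcpu_runnable(v)) )
            {
                if ( v->runstate.state >= RUNSTATE_blocked )
                    vcpu_runstate_change(v, RUNSTATE_runnable, NOW());
                SCHED_OP(VCPU2OP(v), wake, v);  /* credit: csched_vcpu_wake */
            }
            ...
        }

        /* csched_vcpu_wake -> __runq_insert + __runq_tickle, and the
         * tickle is what ends up raising the softirq:                 */
        static inline void __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
        {
            ...
            if ( !cpumask_empty(&mask) )
                cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
        }
        )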

Then 'schedule' is called, where 'prev' is the idle
domain and 'next' is the guest. However, because 'next' got
labelled as RUNSTATE_runnable, we account _all of the time
that the idle domain had been running as belonging to the guest_.

That is what I think is happening, and it certainly explains
why the guest has such a large steal time. It looks like a bug
to me, but perhaps that is how it is supposed to be accounted for?
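
(For reference, the accounting mechanism itself: every runstate
transition charges the elapsed time to the state being *left*, so once
vcpu_wake has flipped the guest to RUNSTATE_runnable, the whole stretch
until schedule() actually switches to it lands in the 'runnable' bucket,
i.e. steal time. Paraphrasing vcpu_runstate_change in
xen/common/schedule.c:

    static inline void vcpu_runstate_change(
        struct vcpu *v, int new_state, s_time_t new_entry_time)
    {
        s_time_t delta = new_entry_time - v->runstate.state_entry_time;

        /* Charge the time since the last transition to the state we are
         * leaving - here RUNSTATE_runnable, even though the idle vcpu was
         * the one actually on the CPU for most of that interval. */
        if ( delta > 0 )
        {
            v->runstate.time[v->runstate.state] += delta;
            v->runstate.state_entry_time = new_entry_time;
        }

        v->runstate.state = new_state;
    }
)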

Here is the trace (I am also attaching a patch, as the 'formats'
file had a bug in it).

Upon closer inspection we have something like this:

 (+    1104)  do_yield          [ domid = 0x00000005, edomid = 0x00000004 ]
 (+  353598)  continue_running    [ dom:vcpu = 0x00050004 ]
 (+   41577)  VMENTRY
 (+   77715)  VMEXIT      [ exitcode = 0x0000000c, rIP  = 0xffffffff810402ea ]
 (+   47451)  do_block          [ domid = 0x00000005, vcpuid = 0x00000004 ]
 (+       0)  HLT         [ intpending = 0 ]
 (+  190338)  switch_infprev    [ old_domid = 0x00000005, runtime = 24803749 ]
 (+     264)  switch_infnext    [ new_domid = 0x00007fff, time = 24803749, 
r_time = 4294967295 ]
 (+     309)  __enter_scheduler [ prev<domid:vcpu> = 0x00000005 : 0x00000004, 
next<domid:vcpu> = 0x00007fff : 0x00000081 ]
 (+     318)  running_to_blocked  [ dom:vcpu = 0x00050004 ]

[VM05 is now in RUNSTATE_blocked]
 (+   30516)  runnable_to_running [ dom:vcpu = 0x7fff0081 ]

=> Here schedule_tail is called for the idle domain, which presumably
runs 'idle_loop'.

... some time later...


vcpu_wake:
 (+  794301)  blocked_to_runnable [ dom:vcpu = 0x00050004 ]

=> Moves VM05 from RUNSTATE_blocked to RUNSTATE_runnable.

 (+   88113)  domain_wake       [ domid = 0x00000005, vcpu = 0x00000004 ]


 (+15749685)  switch_infprev    [ old_domid = 0x00007fff, runtime = 6863168 ]

(IDLE domain ran for [6.57msec] !! )

 (+     288)  switch_infnext    [ new_domid = 0x00000005, time = 6458163, 
r_time = 30000000 ]

And we switch to VM05:

 TRACE_3D(TRC_SCHED_SWITCH_INFNEXT,
          next->domain->domain_id,                      /* VM05          */
          (next->runstate.state == RUNSTATE_runnable) ? /* YES?!?! - yes,
                                                         * because vcpu_wake
                                                         * changed it    */
              (now - next->runstate.state_entry_time) : 0,
          next_slice.time);

 And so we account the time that the idle domain ran to the VCPU of
VM05.

 (+   27717)  __enter_scheduler [ prev<domid:vcpu> = 0x00007fff : 0x00000081, 
next<domid:vcpu> = 0x00000005 : 0x00000004 ]
 (+     363)  running_to_runnable [ dom:vcpu = 0x7fff0081 ]
 (+   63174)  runnable_to_running [ dom:vcpu = 0x00050004 ]
 (+       0)  INJ_VIRQ    [ vector = 0xf3, fake = 0 ]
 (+       0)  INTR_WINDOW [ value = 0x000000f3 ]
 (+  164763)  VMENTRY

[Start VM05, injecting the callback vector, presumably with
the VIRQ_TIMER event.]

Attachment: 0001-xentrace-formats-Fix-TRC_SCHED_-outputs.patch
Description: Text document

