
Re: [Xen-devel] BUG: NOW() seems to (sometimes) go backwards!



On 07/06/16 16:54, Dario Faggioli wrote:
> Hi,
> 
> I've been fighting with this during the past few days, while testing
> and benchmarking some of the modifications I'm making to Credit2.
> 
> In summary, what I see is NOW() going backwards. The issue shows up
> when I call NOW() on different pCPUs and compare the results (which is
> something the Credit2 scheduler does a lot, and I would like it to be
> able to continue to do so. :-D).
> 
> I can elaborate on why this is really really bad for scheduling in
> general, and even more so for Credit2. For now, let me just say it's
> pretty bad. Feel free to ask for details if interested.
> 
> In any case, IMO, we need to fix this.
> 
> It does not happen only cross-socket: it happens even if the two
> pCPUs are part of the same socket (in the example shown below, they're
> two SMT threads of the same core!!).
> 
> I'm able to trigger this on more than one box, all with invariant
> TSC support. (All Intel, though; I don't have any AMD test hardware
> available right now.)
> 
> In some more detail:
> 
> The scheduler (both in scheduler.c and in sched_credit{1,2}.c) uses
> NOW() for all the timing and time sampling. In the trace below, at time
> T1 (equal to 1.264582607), we do this:
> 
>  d0v9->start_time = NOW()
> 
> and we are on pCPU 3.
> 
> Then, at time T2, on pCPU2, there is this:
> 
>  if ( d0v9->start_time > NOW() )
>  {
>    /* OMG!! */
>    __trace("csched2:time_backwards ...")
>  }
> 
> and we indeed hit the __trace().
> 
> What the trace point says is that NOW() was 1972180296557 at time T1,
> on pCPU 3, and it is 1972180270044 at time T2 on pCPU2 (i.e., ~26.5
> microsecs in the past).
> 
> I speak of NOW() going backwards because, looking at both the code
> and the trace, I am positive that T2 happens later than T1 (see [1],
> if interested)... so why is NOW() smaller at T2?
> :-(
> 
>    1.264577304 ||x||||||||||||| d0v10 runstate_change d0v10 running->runnable
>    1.264577611 ||x||||||||||||| d?v? runstate_change d0v5 runnable->running
>    1.264580025 |||x||||||-||||| d0v0 vcpu_block d0v0
>    1.264580795 |||x||||||-||||| d0v0 csched2:schedule current = d0v0, now = 
> 1972180296557
>    1.264581014 |||x||||||-||||| d0v0 csched2:burn_credits d0v0, credit = 
> 9091956, delta = 39433, start_time = 1972180296557
>    1.264581660 |||x||||||-||||| d0v0 sched_switch prev d0, run for 112.503us
>    1.264581809 |||x||||||-||||| d0v0 sched_switch next d0, was runnable for 
> 54.532us, next slice 2000.0us
>    1.264582033 |||x||||||-||||| d0v0 sched_switch prev d0v0 next d0v9
>    1.264582269 |||x||||||-||||| d0v0 runstate_change d0v0 running->blocked
> T1 1.264582607 |||x||||||-||||| d?v? runstate_change d0v9 runnable->running
>    1.264583109 ||x|||||||-||||| d0v5 csched2:runq_insert d0v10, position 0
>    1.264583341 ||x|||||||-||||| d0v5 csched2:runq_tickle_new d0v10, credit = 
> 5691963
>    1.264583907 ||x|||||||-||||| d0v5 csched2:burn_credits d0v5, credit = 
> 8575685, delta = 8241, start_time = 1972180270044
>    1.264584236 ||x|||||||-||||| d0v5 csched2:burn_credits d1v1, credit = 
> 8696613, delta = 12920, start_time = 1972180270044
>    1.264584431 ||x|||||||-||||| d0v5 csched2:tickle_check d1v1, credit = 
> 8696613
>    1.264584745 ||x|||||||-||||| d0v5 csched2:burn_credits d1v3, credit = 
> 10182550, delta = 12920, start_time = 1972180270044
>    1.264584966 ||x|||||||-||||| d0v5 csched2:tickle_check d1v3, credit = 
> 10182550
> T2 1.264585207 ||x|||||||-||||| d0v5 csched2:time_backwards d0v9: cur_time = 
> 1972180270044, cur_cpu = 2, old_time = 1972180296557, old_cpu = 3
>    1.264585364 ||x|||||||-||||| d0v5 csched2:burn_credits d0v9, credit = 
> 5759499, delta = -26513, start_time = 1972180270044
>    1.264585537 ||x|||||||-||||| d0v5 csched2:tickle_check d0v9, credit = 
> 5759499
>    1.264585876 ||x|||||||-||||| d0v5 csched2:burn_credits d0v7, credit = 
> 8835314, delta = 12920, start_time = 1972180270044
>    1.264586041 ||x|||||||-||||| d0v5 csched2:tickle_check d0v7, credit = 
> 8835314
>    1.264586491 ||x|||||||-||||| d0v5 csched2:burn_credits d0v15, credit = 
> 5822635, delta = 12920, start_time = 1972180270044
>    1.264586655 ||x|||||||-||||| d0v5 csched2:tickle_check d0v15, credit = 
> 5822635
>    1.264586952 ||x|||||||-||||| d0v5 csched2:burn_credits d0v13, credit = 
> 7956174, delta = 12920, start_time = 1972180270044
>    1.264587115 ||x|||||||-||||| d0v5 csched2:tickle_check d0v13, credit = 
> 7956174
>    1.264587385 ||x|||||||-||||| d0v5 csched2:burn_credits d0v3, credit = 
> 7870338, delta = 12920, start_time = 1972180270044
>    1.264587545 ||x|||||||-||||| d0v5 csched2:tickle_check d0v3, credit = 
> 7870338
>    1.264595859 |||x||||||-||||| d0v9 csched2:schedule current = d0v9, now = 
> 1972180311345
>    1.264596313 |||x||||||-||||| d0v9 csched2:burn_credits d0v9, credit = 
> 5718198, delta = 4877, start_time = 1972180311345
> 
> Interestingly enough, if we look at (more or less) the same set of
> events with just xentrace_format, after sorting the records by their
> timestamps, we get this:
> 
>    CPU2  5613124777311 (+     507)  running_to_runnable [ dom:vcpu = 
> 0x0000000a ]
>    CPU2  5613124778046 (+     735)  runnable_to_running [ dom:vcpu = 
> 0x00000005 ]
> 
>    CPU3  5613124788657 (+     537)  __enter_scheduler [ prev<dom:vcpu> = 
> 0x00000000, next<dom:vcpu> = 0x00000009 ]
>    CPU3  5613124789224 (+     567)  running_to_blocked  [ dom:vcpu = 
> 0x00000000 ]
> T1 CPU3  5613124790034 (+     810)  runnable_to_running [ dom:vcpu = 
> 0x00000009 ]
> 
>    CPU2  5613124791240 (+   13194)  csched2:runq_pos       [ dom:vcpu = 
> 0x0000000a, pos = 0]
>    CPU2  5613124791795 (+     555)  csched2:tickle_new     [ dom:vcpu = 
> 0x0000000a, credit = 0x5691963 ]
>    CPU2  5613124793154 (+    1359)  csched2:credit burn    [ dom:vcpu = 
> 0x2f1abbdc, credit = 459, delta = 5 ]
>    CPU2  5613124793943 (+     789)  csched2:credit burn    [ dom:vcpu = 
> 0x2f1abbdc, credit = 459, delta = 65537 ]
>    CPU2  5613124794411 (+     468)  csched2:tickle_check   [ dom:vcpu = 
> 0x00010001, credit = 8696613 ]
>    CPU2  5613124795164 (+     753)  csched2:credit burn    [ dom:vcpu = 
> 0x2f1abbdc, credit = 459, delta = 65539 ]
>    CPU2  5613124795695 (+     531)  csched2:tickle_check   [ dom:vcpu = 
> 0x00010003, credit = 10182550 ]
> T2 CPU2  5613124796274 (+     579)  unknown (0x000000000002221e)  [ 
> 0x2f1abbdc 0x000001cb 0x2f1b236d 0x000001cb 0x00000009 0x00030002 0x00000000 ]
>    CPU2  5613124796649 (+     375)  csched2:credit burn    [ dom:vcpu = 
> 0x2f1abbdc, credit = 459, delta = 9 ]
>    CPU2  5613124797066 (+     417)  csched2:tickle_check   [ dom:vcpu = 
> 0x00000009, credit = 5759499 ]
>    CPU2  5613124797879 (+     813)  csched2:credit burn    [ dom:vcpu = 
> 0x2f1abbdc, credit = 459, delta = 7 ]
>    CPU2  5613124798275 (+     396)  csched2:tickle_check   [ dom:vcpu = 
> 0x00000007, credit = 8835314 ]
>    CPU2  5613124799355 (+    1080)  csched2:credit burn    [ dom:vcpu = 
> 0x2f1abbdc, credit = 459, delta = 15 ]
>    CPU2  5613124799748 (+     393)  csched2:tickle_check   [ dom:vcpu = 
> 0x0000000f, credit = 5822635 ]
>    CPU2  5613124800459 (+     711)  csched2:credit burn    [ dom:vcpu = 
> 0x2f1abbdc, credit = 459, delta = 13 ]
>    CPU2  5613124800852 (+     393)  csched2:tickle_check   [ dom:vcpu = 
> 0x0000000d, credit = 7956174 ]
>    CPU2  5613124801500 (+     648)  csched2:credit burn    [ dom:vcpu = 
> 0x2f1abbdc, credit = 459, delta = 3 ]
>    CPU2  5613124801884 (+     384)  csched2:tickle_check   [ dom:vcpu = 
> 0x00000003, credit = 7870338 ]
> 
>    CPU3  5613124803177 (+   13143)  csched2:update_load
>    CPU3  5613124803786 (+     609)  unknown (0x0000000000022220)  [ 
> 0x0000d61e 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
>    CPU3  5613124804560 (+     774)  csched2:updt_runq_load [ 
> rq_load[16]:rq_id[8]:wshift[8] = 0x12000009, rq_avgload = 0x0000000000216566, 
> b_avgload = 0x0000000000216566,
>    CPU3  5613124804929 (+     369)  unknown (0x0000000000022220)  [ 
> 0x0001e8de 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
>    CPU3  5613124805256 (+     327)  csched2:updt_vcpu_load [ dom:vcpu = 
> 0x00000000, vcpuload = 0x0000000000026040, wshift = 18 ]
> 
>    CPU2  5613124975596 (+  173712)  domain_wake       [ dom:vcpu = 0x00010005 
> ]
> 
> Interestingly, the timestamp at T1 is 5613124790034, which looks
> _correctly_ smaller than the timestamp at T2 (5613124796274).
> 
> NOW(), as it is called from the scheduling code, does this:
>  s_time_t NOW(void)  /* actually get_s_time_fixed(), with at_tsc == 0 */
>  {
>      struct cpu_time *t = &this_cpu(cpu_time);
>      u64 tsc, delta;
>      s_time_t now;
> 
>      tsc = rdtsc();
>      delta = tsc - t->local_tsc_stamp;
>      now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
> 
>      return now;
>  }
> 
> While the timestamps in the trace are the result of get_cycles():
>  static inline cycles_t get_cycles(void)
>  {
>      return rdtsc();
>  }
> 
> So, it looks to me that the TSC itself is actually ok, and it could be
> the local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
> (which, AFAIUI, is there to convert cycles to time) that does not
> guarantee the result to be monotonic, even if the input is... Thoughts?

What about adding a trace point inside NOW() to record the various
inputs into the calculation?

Adding some sort of global ASSERT() that time doesn't go backwards by
too much would be nice, but I think it would require spinlocks -- OTOH,
just for debugging purposes that might not be too bad.

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

