Re: [Xen-devel] BUG: NOW() seems to (sometimes) go backwards!
On 07/06/16 16:54, Dario Faggioli wrote:
> Hi,
>
> I've been fighting with this during the past few days, while testing
> and benchmarking some of the modifications I'm doing to Credit2.
>
> In summary, what I see is NOW() going backwards. The issue shows up
> when I call NOW() on different pCPUs and compare the results (which is
> something the Credit2 scheduler does a lot, and I would like it to be
> able to continue to do so. :-D).
>
> I can elaborate on why this is really really bad for scheduling in
> general, and even more so for Credit2. For now, let me just say it's
> pretty bad. Feel free to ask for details if interested.
>
> In any case, IMO, we need to fix this.
>
> It does not happen only cross-socket, i.e., it happens even if the two
> pCPUs are part of the same socket (in the example shown below, they're
> two SMT threads of the same core!!).
>
> I'm able to trigger this on more than just one box, all with invariant
> TSC support. (All Intel, though; I don't have any AMD test hardware
> available right now.)
>
> In some more detail:
>
> The scheduler (both in schedule.c and in sched_credit{1,2}.c) uses
> NOW() for all the timing and time sampling. In the trace below, at time
> T1 (equal to 1.264582607), we do this:
>
>   d0v9->start_time = NOW()
>
> and we are on pCPU 3.
>
> Then, at time T2, on pCPU 2, there is this:
>
>   if ( d0v9->start_time > NOW() )
>   {
>       /* OMG!! */
>       __trace("csched2:time_backwards ...")
>   }
>
> and we indeed hit the __trace().
>
> What the trace point says is that NOW() was 1972180296557 at time T1,
> on pCPU 3, and it is 1972180270044 at time T2 on pCPU 2 (i.e., ~26.5
> microsecs in the past).
>
> I speak about NOW() going backwards because, by looking at both the
> code and the trace, I am positive about the fact that T2 happens later
> than T1 (go to [1], if interested)... So, why is NOW() smaller at T2?
> :-(
>
>    1.264577304 ||x||||||||||||| d0v10 runstate_change d0v10 running->runnable
>    1.264577611 ||x||||||||||||| d?v?  runstate_change d0v5 runnable->running
>    1.264580025 |||x||||||-||||| d0v0  vcpu_block d0v0
>    1.264580795 |||x||||||-||||| d0v0  csched2:schedule current = d0v0, now = 1972180296557
>    1.264581014 |||x||||||-||||| d0v0  csched2:burn_credits d0v0, credit = 9091956, delta = 39433, start_time = 1972180296557
>    1.264581660 |||x||||||-||||| d0v0  sched_switch prev d0, run for 112.503us
>    1.264581809 |||x||||||-||||| d0v0  sched_switch next d0, was runnable for 54.532us, next slice 2000.0us
>    1.264582033 |||x||||||-||||| d0v0  sched_switch prev d0v0 next d0v9
>    1.264582269 |||x||||||-||||| d0v0  runstate_change d0v0 running->blocked
> T1 1.264582607 |||x||||||-||||| d?v?  runstate_change d0v9 runnable->running
>    1.264583109 ||x|||||||-||||| d0v5  csched2:runq_insert d0v10, position 0
>    1.264583341 ||x|||||||-||||| d0v5  csched2:runq_tickle_new d0v10, credit = 5691963
>    1.264583907 ||x|||||||-||||| d0v5  csched2:burn_credits d0v5, credit = 8575685, delta = 8241, start_time = 1972180270044
>    1.264584236 ||x|||||||-||||| d0v5  csched2:burn_credits d1v1, credit = 8696613, delta = 12920, start_time = 1972180270044
>    1.264584431 ||x|||||||-||||| d0v5  csched2:tickle_check d1v1, credit = 8696613
>    1.264584745 ||x|||||||-||||| d0v5  csched2:burn_credits d1v3, credit = 10182550, delta = 12920, start_time = 1972180270044
>    1.264584966 ||x|||||||-||||| d0v5  csched2:tickle_check d1v3, credit = 10182550
> T2 1.264585207 ||x|||||||-||||| d0v5  csched2:time_backwards d0v9: cur_time = 1972180270044, cur_cpu = 2, old_time = 1972180296557, old_cpu = 3
>    1.264585364 ||x|||||||-||||| d0v5  csched2:burn_credits d0v9, credit = 5759499, delta = -26513, start_time = 1972180270044
>    1.264585537 ||x|||||||-||||| d0v5  csched2:tickle_check d0v9, credit = 5759499
>    1.264585876 ||x|||||||-||||| d0v5  csched2:burn_credits d0v7, credit = 8835314, delta = 12920, start_time = 1972180270044
>    1.264586041 ||x|||||||-||||| d0v5  csched2:tickle_check d0v7, credit = 8835314
>    1.264586491 ||x|||||||-||||| d0v5  csched2:burn_credits d0v15, credit = 5822635, delta = 12920, start_time = 1972180270044
>    1.264586655 ||x|||||||-||||| d0v5  csched2:tickle_check d0v15, credit = 5822635
>    1.264586952 ||x|||||||-||||| d0v5  csched2:burn_credits d0v13, credit = 7956174, delta = 12920, start_time = 1972180270044
>    1.264587115 ||x|||||||-||||| d0v5  csched2:tickle_check d0v13, credit = 7956174
>    1.264587385 ||x|||||||-||||| d0v5  csched2:burn_credits d0v3, credit = 7870338, delta = 12920, start_time = 1972180270044
>    1.264587545 ||x|||||||-||||| d0v5  csched2:tickle_check d0v3, credit = 7870338
>    1.264595859 |||x||||||-||||| d0v9  csched2:schedule current = d0v9, now = 1972180311345
>    1.264596313 |||x||||||-||||| d0v9  csched2:burn_credits d0v9, credit = 5718198, delta = 4877, start_time = 1972180311345
>
> Interestingly enough, if we look at the same (well, sort of) set of
> events with just xentrace_format and sort the records by their
> timestamps, we get this:
>
>    CPU2  5613124777311 (+     507)  running_to_runnable  [ dom:vcpu = 0x0000000a ]
>    CPU2  5613124778046 (+     735)  runnable_to_running  [ dom:vcpu = 0x00000005 ]
>
>    CPU3  5613124788657 (+     537)  __enter_scheduler    [ prev<dom:vcpu> = 0x00000000, next<dom:vcpu> = 0x00000009 ]
>    CPU3  5613124789224 (+     567)  running_to_blocked   [ dom:vcpu = 0x00000000 ]
> T1 CPU3  5613124790034 (+     810)  runnable_to_running  [ dom:vcpu = 0x00000009 ]
>
>    CPU2  5613124791240 (+   13194)  csched2:runq_pos     [ dom:vcpu = 0x0000000a, pos = 0 ]
>    CPU2  5613124791795 (+     555)  csched2:tickle_new   [ dom:vcpu = 0x0000000a, credit = 0x5691963 ]
>    CPU2  5613124793154 (+    1359)  csched2:credit burn  [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 5 ]
>    CPU2  5613124793943 (+     789)  csched2:credit burn  [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65537 ]
>    CPU2  5613124794411 (+     468)  csched2:tickle_check [ dom:vcpu = 0x00010001, credit = 8696613 ]
>    CPU2  5613124795164 (+     753)  csched2:credit burn  [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65539 ]
>    CPU2  5613124795695 (+     531)  csched2:tickle_check [ dom:vcpu = 0x00010003, credit = 10182550 ]
> T2 CPU2  5613124796274 (+     579)  unknown (0x000000000002221e) [ 0x2f1abbdc 0x000001cb 0x2f1b236d 0x000001cb 0x00000009 0x00030002 0x00000000 ]
>    CPU2  5613124796649 (+     375)  csched2:credit burn  [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 9 ]
>    CPU2  5613124797066 (+     417)  csched2:tickle_check [ dom:vcpu = 0x00000009, credit = 5759499 ]
>    CPU2  5613124797879 (+     813)  csched2:credit burn  [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 7 ]
>    CPU2  5613124798275 (+     396)  csched2:tickle_check [ dom:vcpu = 0x00000007, credit = 8835314 ]
>    CPU2  5613124799355 (+    1080)  csched2:credit burn  [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 15 ]
>    CPU2  5613124799748 (+     393)  csched2:tickle_check [ dom:vcpu = 0x0000000f, credit = 5822635 ]
>    CPU2  5613124800459 (+     711)  csched2:credit burn  [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 13 ]
>    CPU2  5613124800852 (+     393)  csched2:tickle_check [ dom:vcpu = 0x0000000d, credit = 7956174 ]
>    CPU2  5613124801500 (+     648)  csched2:credit burn  [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 3 ]
>    CPU2  5613124801884 (+     384)  csched2:tickle_check [ dom:vcpu = 0x00000003, credit = 7870338 ]
>
>    CPU3  5613124803177 (+   13143)  csched2:update_load
>    CPU3  5613124803786 (+     609)  unknown (0x0000000000022220) [ 0x0000d61e 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
>    CPU3  5613124804560 (+     774)  csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x12000009, rq_avgload = 0x0000000000216566, b_avgload = 0x0000000000216566 ]
>    CPU3  5613124804929 (+     369)  unknown (0x0000000000022220) [ 0x0001e8de 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
>    CPU3  5613124805256 (+     327)  csched2:updt_vcpu_load [ dom:vcpu = 0x00000000, vcpuload = 0x0000000000026040, wshift = 18 ]
>
>    CPU2  5613124975596 (+  173712)  domain_wake          [ dom:vcpu = 0x00010005 ]
>
> Interestingly, the timestamp at T1 is 5613124790034, which looks
> _correctly_ smaller than the timestamp at T2 (5613124796274).
>
> NOW(), as it is called from the scheduling code, does this:
>
>   s_time_t NOW(u64 at_tsc) //it's actually get_s_time_fixed()
>   {
>       struct cpu_time *t = &this_cpu(cpu_time);
>       u64 tsc, delta;
>       s_time_t now;
>
>       tsc = rdtsc();
>       delta = tsc - t->local_tsc_stamp;
>       now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
>
>       return now;
>   }
>
> While the timestamps in the trace are the result of get_cycles():
>
>   static inline cycles_t get_cycles(void)
>   {
>       return rdtsc();
>   }
>
> So, it looks to me that the TSC is actually ok, and it could be the
> local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
> (which, AFAIUI, is there to convert cycles to time) that does not
> guarantee the result to be monotonic, even if the input is...
> Thoughts?

What about adding a trace point inside NOW() to record the various
inputs into the calculation?

Adding some sort of global ASSERT() that time doesn't go backwards by
too much would be nice, but I think it would require spinlocks -- OTOH,
just for debugging purposes it might not be that bad.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel