
Re: [Xen-devel] [xen-unstable test] 58330: regressions - FAIL



On Wed, 2015-06-10 at 19:15 +0000, osstest service user wrote:
> flight 58330 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/58330/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-armhf-armhf-xl-credit2  15 guest-start/debian.repeat fail REGR. vs. 
> 58264
> 
OK, let's look at the logs.

This should be the one:
http://logs.test-lab.xenproject.org/osstest/logs/58330/test-armhf-armhf-xl-credit2/serial-cubietruck-braque.log


Jun 10 14:17:12.292428 (XEN) NOW=0x00000147D949B2B5
Jun 10 14:17:12.293377 (XEN) Online Cpus: 0-1
Jun 10 14:17:12.368382 (XEN) Cpupool 0:
Jun 10 14:17:12.369884 (XEN) Cpus: 0-1
Jun 10 14:17:12.371388 (XEN) Scheduler: SMP Credit Scheduler rev2 (credit2)
Jun 10 14:17:12.376023 (XEN) Active queues: 1
Jun 10 14:17:12.378132 (XEN)    default-weight     = 256
Jun 10 14:17:12.380970 (XEN) Runqueue 0:
Jun 10 14:17:12.382630 (XEN)    ncpus              = 2
Jun 10 14:17:12.385260 (XEN)    cpus               = 0-1
Jun 10 14:17:12.388131 (XEN)    max_weight         = 256
Jun 10 14:17:12.391003 (XEN)    instload           = 1
Jun 10 14:17:12.393765 (XEN)    aveload            = 100

100% average load. This could mean that one of the two pCPUs in the
runqueue (i.e., one of the two pCPUs in the system, since there is only
one runqueue) is fully loaded.

In theory, it could also mean that each pCPU is 50% loaded but, in this
case, I think it is as described above.

Jun 10 14:17:12.396629 (XEN)    idlers: 00000000,00000000,00000000,00000001
Jun 10 14:17:12.401129 (XEN)    tickled: 00000000,00000000,00000000,00000000

And it would be pCPU #1 that is running at 100%, as pCPU #0 is idle.
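For reference, a minimal sketch (a hypothetical helper, not Xen code) of
how I'm reading those comma-separated masks: the rightmost 32-bit word
holds CPUs 0-31, so bit 0 set in 'idlers' means pCPU #0 is idle, while
'tickled' being all zeroes means nobody has been poked to pick up work:

    #include <stdbool.h>
    #include <stdint.h>

    /* Words as printed, left to right: most significant word first. */
    static bool cpu_in_mask(const uint32_t *words, unsigned int nwords,
                            unsigned int cpu)
    {
        unsigned int idx = nwords - 1 - cpu / 32;

        return (words[idx] >> (cpu % 32)) & 1;
    }

    /* idlers = { 0x0, 0x0, 0x0, 0x1 }:
     *   cpu_in_mask(idlers, 4, 0) -> true   (pCPU #0 idle)
     *   cpu_in_mask(idlers, 4, 1) -> false  (pCPU #1 busy)       */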

Jun 10 14:17:12.405635 (XEN) Domain info:
Jun 10 14:17:12.407379 (XEN)    Domain: 0 w 256 v 2
Jun 10 14:17:12.409888 (XEN)              1: [0.0] flags=0 cpu=0 
credit=10500000 [w=256]
Jun 10 14:17:12.414999 (XEN)      2: [0.1] flags=0 cpu=1 credit=10500000 [w=256]
Jun 10 14:17:12.419761 (XEN)    Domain: 10 w 256 v 4
Jun 10 14:17:12.422381 (XEN)              3: [10.0] flags=2 cpu=1 
credit=9097256 [w=256]
Jun 10 14:17:12.427387 (XEN)      4: [10.1] flags=0 cpu=0 credit=10500000 
[w=256]
Jun 10 14:17:12.432388 (XEN)      5: [10.2] flags=0 cpu=1 credit=10500000 
[w=256]
Jun 10 14:17:12.437393 (XEN)      6: [10.3] flags=0 cpu=0 credit=10500000 
[w=256]
Jun 10 14:17:12.442434 (XEN) CPU[00]  
sibling=00000000,00000000,00000000,00000001, 
core=00000000,00000000,00000000,00000001
Jun 10 14:17:12.451247 (XEN)    run: [32767.0] flags=0 cpu=0 credit=-1073741824 
[w=0]
Jun 10 14:17:12.456634 (XEN) CPU[01]  
sibling=00000000,00000000,00000000,00000002, 
core=00000000,00000000,00000000,00000002
Jun 10 14:17:12.465504 (XEN)    run: [10.0] flags=2 cpu=1 credit=9097256 [w=256]

Here it is, d10v0, caught running on pCPU #1. flags=2 is
__CSFLAG_scheduled, which confirms the above.

That flag is set when the vcpu is scheduled, and cleared in the context
switch tail, i.e., in csched2_context_saved().
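
From memory (so, a sketch of sched_credit2.c rather than verbatim code;
the set/clear call sites and the exact bit value are my recollection):

    #define __CSFLAG_scheduled 1
    #define CSFLAG_scheduled (1 << __CSFLAG_scheduled)  /* == 2, i.e. "flags=2" */

    /* csched2_schedule(): the flag is set when the vcpu is picked to run. */
    set_bit(__CSFLAG_scheduled, &snext->flags);

    /* csched2_context_saved(): ... and cleared in the context switch tail. */
    clear_bit(__CSFLAG_scheduled, &svc->flags);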

If it's a Xen issue, it seems to be preventing context_saved() from ever
being called which, on ARM, should happen in schedule_tail().
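
For the non-ARM people following along, the call chain I have in mind is
roughly this (paraphrased from memory, not copy-pasted from the sources):

    /* xen/arch/arm/domain.c, sketch: */
    static void schedule_tail(struct vcpu *prev)
    {
        ctxt_switch_from(prev);
        ctxt_switch_to(current);
        local_irq_enable();
        context_saved(prev);   /* <-- this is what seems to never happen */
    }

    /* context_saved() (common code) then invokes the scheduler's
     * context_saved hook, i.e. csched2_context_saved() for Credit2,
     * which is where CSFLAG_scheduled gets cleared. */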

I guess it could also be an issue in the guest kernel. Some kind of
livelock comes to mind, but it feels weird that it would involve only
one vcpu... Maybe someone with more ARM experience than me (i.e., more
than 0 ARM experience!) can help decipher this, to try to figure out what
the vCPU is actually doing:

Jun 10 14:17:00.215131 (XEN) *** Dumping CPU0 guest state (d10v0): ***
Jun 10 14:17:00.219052 (XEN) ----[ Xen-4.6-unstable  arm32  debug=y  Not 
tainted ]----
Jun 10 14:17:00.224663 (XEN) CPU:    0
Jun 10 14:17:00.226157 (XEN) PC:     c0413bb8
Jun 10 14:17:00.228279 (XEN) CPSR:   20000153 MODE:32-bit Guest SVC
Jun 10 14:17:00.232288 (XEN)      R0: 00000022 R1: ffffffff R2: 00000004 R3: 
c0413b88
Jun 10 14:17:00.237796 (XEN)      R4: 07a36158 R5: 00000001 R6: 0000002f R7: 
c0bbb708
Jun 10 14:17:00.243401 (XEN)      R8: 07a36090 R9: c0b68000 R10:c0c49870 
R11:600001d3 R12:00000000
Jun 10 14:17:00.250028 (XEN) USR: SP: 00000000 LR: 00000000
Jun 10 14:17:00.253292 (XEN) SVC: SP: c0b69cb8 LR: c080c6e4 SPSR:000001d3
Jun 10 14:17:00.257806 (XEN) ABT: SP: c0c4700c LR: c02126a0 SPSR:800001d3
Jun 10 14:17:00.262281 (XEN) UND: SP: c0c47018 LR: c0c47018 SPSR:00000000
Jun 10 14:17:00.266800 (XEN) IRQ: SP: c0c47000 LR: c0c47000 SPSR:00000000
Jun 10 14:17:00.271426 (XEN) FIQ: SP: 00000000 LR: 00000000 SPSR:00000000
Jun 10 14:17:00.275907 (XEN) FIQ: R8: 00000000 R9: 00000000 R10:00000000 
R11:00000000 R12:00000000
Jun 10 14:17:00.282522 (XEN) 
Jun 10 14:17:00.283175 (XEN)      SCTLR: 10c5387d
Jun 10 14:17:00.285659 (XEN)        TCR: 00000000
Jun 10 14:17:00.288046 (XEN)      TTBR0: 000000004020406a
Jun 10 14:17:00.291174 (XEN)      TTBR1: 000000004020406a
Jun 10 14:17:00.294318 (XEN)       IFAR: 00000000, IFSR: 00000000
Jun 10 14:17:00.298194 (XEN)       DFAR: 9bfd9000, DFSR: 00002805
Jun 10 14:17:00.301903 (XEN) 
Jun 10 14:17:00.302648 (XEN)   VTCR_EL2: 80003558
Jun 10 14:17:00.305035 (XEN)  VTTBR_EL2: 00020000bfe7e000
Jun 10 14:17:00.308153 (XEN) 
Jun 10 14:17:00.308901 (XEN)  SCTLR_EL2: 30cd187f
Jun 10 14:17:00.311276 (XEN)    HCR_EL2: 000000000038643f
Jun 10 14:17:00.314407 (XEN)  TTBR0_EL2: 00000000bfef0000
Jun 10 14:17:00.317527 (XEN) 
Jun 10 14:17:00.318276 (XEN)    ESR_EL2: 07e00000
Jun 10 14:17:00.320651 (XEN)  HPFAR_EL2: 000000000001c810
Jun 10 14:17:00.323780 (XEN)      HDFAR: e0800f00
Jun 10 14:17:00.326153 (XEN)      HIFAR: f265a116
Jun 10 14:17:00.328650 (XEN) 
Jun 10 14:17:00.329280 (XEN) Guest stack trace from sp=c0b69cb8:
Jun 10 14:17:00.333027 (XEN)    00000000 c0b69ce4 00000008 c0b73a60 0000000b 
c0b68000 c0b74b30 0000000b
Jun 10 14:17:00.340168 (XEN)    00000000 c024bf68 c09b4cb4 c0b69cf0 00200000 
00000000 00000001 00000000
Jun 10 14:17:00.347281 (XEN)    600001d3 c080cb50 c09b4a60 c0b69d1c c0b69ee8 
c0c47244 c0b69ee8 c0b68000
Jun 10 14:17:00.354404 (XEN)    c0b74b30 0000000b 00000001 00000000 600001d3 
c0211ff8 c0b68248 0000000b
Jun 10 14:17:00.361534 (XEN)    c0c4cd2a c0b69d53 c021ed58 c0b69ee8 00000000 
00000008 653b36fb 31303830
Jun 10 14:17:00.368530 (XEN)    20313030 32343265 31303033 63316520 30303030 
35662033 30666637 28206634
Jun 10 14:17:00.375654 (XEN)    37306565 65336630 c0002029 c080cb50 c0a21b74 
00000000 9bfd9000 00002805
Jun 10 14:17:00.382780 (XEN)    c0b69ee8 c0b73a60 00002805 c0200000 c0c00000 
c080bf40 9bfd9000 c021c1e8
Jun 10 14:17:00.389906 (XEN)    00000000 00000001 00000000 5bbd7800 c0c75070 
c0bb5d18 00000000 5bbd9000
Jun 10 14:17:00.397032 (XEN)    00000006 c0c7507c 00000003 c02e2ce4 5bfff000 
5bbd9000 c0bb5d18 00000003
Jun 10 14:17:00.404158 (XEN)    00000000 c02e3af8 00000000 00000000 ffffffc0 
00000000 00000040 5bbd7800
Jun 10 14:17:00.411160 (XEN)    00001800 00001800 ffffffff 00002805 c0b758a0 
9bfd9000 c0b69ee8 c0bb5d00
Jun 10 14:17:00.418285 (XEN)    00040200 c0200000 c0c00000 c0208430 00000040 
00000000 5bbd7800 c0af1de4
Jun 10 14:17:00.425419 (XEN)    00000000 00000000 c0c44880 00001800 c0c44880 
c0b51f00 00020000 00020000
Jun 10 14:17:00.432531 (XEN)    00000007 dc7c9000 00000007 00000000 00060000 
c0b24c50 00040000 c0c44880
Jun 10 14:17:00.439657 (XEN)    00000000 0001fc00 c0b51f00 00020000 600001d3 
c02c8dd8 c0c45480 c0aef3c0
Jun 10 14:17:00.446787 (XEN)    00000000 00001000 c0b69f08 c0b69ef8 00060000 
00000000 5bfd9000 c021ed54
Jun 10 14:17:00.453907 (XEN)    800001d3 ffffffff c0b69f1c c02126d8 9bfd9000 
9bfda000 00000040 0000003f
Jun 10 14:17:00.460908 (XEN)    00000000 dbfd9000 c0c473f4 c0bb5b68 c0bb5d00 
00040200 c0200000 c0c00000
Jun 10 14:17:00.468036 (XEN)    c0b69f08 c0b69f30 c0acbd9c c021ed54 800001d3 
ffffffff 00040c00 c0b26fa8

I've looked at the rest of the debug output and found nothing relevant,
or at least nothing that looks relevant to me (again, I'm very unfamiliar
with ARM traces).

The timer status, perhaps, is worth a look, but it does not add much to
the analysis sketched above:

Jun 10 14:16:50.246933 (XEN) Dumping timer queues:
Jun 10 14:16:56.157817 (XEN) CPU00:
Jun 10 14:16:56.160771 (XEN)   ex=        1983us timer=002f8378 
cb=s_timer_fn(00000000)
Jun 10 14:16:56.164744 (XEN)   ex=      488252us timer=40025958 
cb=virt_timer_expired(40025950)
Jun 10 14:16:56.171105 (XEN) CPU01:
Jun 10 14:16:56.172333 (XEN)   ex=         185us timer=40014078 
cb=s_timer_fn(00000000)
Jun 10 14:16:56.178054 (XEN)   ex=      598248us timer=40027958 
cb=virt_timer_expired(40027950)

The timeout of s_timer_fn on pCPU #0 is ~2ms, which basically means the
pCPU is idle and is calling the scheduler at the CSCHED2_MAX_TIMER
interval. OTOH, on pCPU #1 it's smaller than 500us, which tells us that,
the last time the timer was armed, it was set to fire after
CSCHED2_MIN_TIMER, the smallest interval between two invocations of the
scheduler that Credit2 allows.
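
For reference, the two constants and the clamping in csched2_runtime()
look roughly like this (values as I remember them, do double check in
the tree):

    /* sched_credit2.c, sketch from memory: */
    #define CSCHED2_MIN_TIMER        MICROSECS(500)
    #define CSCHED2_MAX_TIMER        MILLISECS(2)

    /* csched2_runtime(): whatever slice is computed for the next vcpu
     * is clamped into [CSCHED2_MIN_TIMER, CSCHED2_MAX_TIMER] before
     * s_timer is (re)armed with it. */
    if ( time < CSCHED2_MIN_TIMER )
        time = CSCHED2_MIN_TIMER;
    else if ( time > CSCHED2_MAX_TIMER )
        time = CSCHED2_MAX_TIMER;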

I'll keep looking into this tomorrow... Help from ARM people, much
appreciated. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

