[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] RE: XenVif div by zero on Tx path after resume.
> Looking in xenbus's sync.c and suspend.c, I can't see any issue... but there > could be something subtle going on (like Windows deciding all DPCs will be > threaded under the hood). Yeah, OK. Either that, or some code lowers the IRQL somewhere. Wouldn't be surprised if some code has been added to one of the resume callbacks which tries to block, lowers the IRQL (which is not caught because annotations / system doesn't know we're always meant to be at dispatch), and then it all falls apart. I do have automated bugcheck analysis on our test code, so if code examination doesn't work, I can change the automated WinDbg commands to give us some idea of what's going on. Do feel free to have a butchers at my working notes (included), which indicate quite clearly that the required atomicity is not there: ---------------------------------------------------------- Paul says: The whole suspend/resume cycle with or without migrate is supposed to be transparent to the rest of the system; it's not any sort of known power transition, it's very much Xen-specific and so needs to be handled entirely within the PV drivers. Even the late suspend callback runs on a single vCPU at DISPATCH, with all other vCPUs spinning at DISPATCH. Thus the only thing that should be able to pre-empt it is an interrupt. Hence there *should* be no scope for the the network stack to send any packets until the callback has completed its work. However, we need to check thru logs, because what is happening appears to be at some variance with this. Suite runs 167912 - 167919. 167912, No fails. 167913, OS/Install failed - unrelated. 167914, No win fails, some skipped. 167916, Job 3497156. OxDEAD in XenVif.sys - daemon.log to be downloaded. 167917, No fails. 167918, Job 3497168. OxDEAD in XenVif.sys - daemon.log to be downloaded. 167919, No fails. OK. 3497156. Good candidate for the bug. Filtered stuff. PdoSuspendCallbackLate: runs 167912 - 167919\167916_3497156_daemon.log: 45338: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539037:xen_platform_log xen platform: xenvif|PdoSuspendCallbackLate: (PdoSuspendCallbackLate) MCHTRACE ====> 45339: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539141:xen_platform_log xen platform: xenvif|__PdoD0ToD3: (__PdoD0ToD3) MCHTRACE ====> 45340: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539267:xen_platform_log xen platform: xenvif|FrontendSetState: FrontendSetState:device/vif/0: MCHTRACE ====> 'CLOSED' -> 'CLOSED' 45341: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539383:xen_platform_log xen platform: xenvif|FrontendSetState: FrontendSetState:device/vif/0: MCHTRACE <===== 45342: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539486:xen_platform_log xen platform: xenvif|__PdoD0ToD3: (__PdoD0ToD3) MCHTRACE <==== 45343: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539591:xen_platform_log xen platform: xenvif|__PdoD3ToD0: (__PdoD3ToD0) MCHTRACE ====> 45344: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539717:xen_platform_log xen platform: xenvif|FrontendSetState: FrontendSetState:device/vif/0: MCHTRACE ====> 'CLOSED' -> 'CONNECTED' 45820: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.610462:xen_platform_log xen platform: xen|BUGCHECK: ====> 45821: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.610662:xen_platform_log xen platform: xen|BUGCHECK: ASSERTION_FAILURE: FFFFF80619FA7C50 FFFFF80619FA7C70 000000000000144E 0000000000000000 45822: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.610868:xen_platform_log xen platform: xen|BUGCHECK: FILE: E:\jenkins\workspace\nvif_private_martinhar_CA-355670\local\src\xenvif\transmitter.c LINE: 5198 45823: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.610988:xen_platform_log xen platform: xen|BUGCHECK: TEXT: !NT_SUCCESS(status) 45824: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.611094:xen_platform_log xen platform: xen|BUGCHECK: CONTEXT (FFFF958817CA0A30): 45825: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.611185:xen_platform_log xen platform: xen|BUGCHECK: - GS = 002B 45826: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.611287:xen_platform_log xen platform: xen|BUGCHECK: - FS = 0053 So we are definitely in PdoSuspendCallbackLate, and we don't exit it, and we don't get round to the VifCallbackLate. Even a cursory examination of the logs shows that xeniface is executing at a lower Irql, amongst other things. Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539037:xen_platform_log xen platform: xenvif|PdoSuspendCallbackLate: (PdoSuspendCallbackLate) MCHTRACE ====> Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539141:xen_platform_log xen platform: xenvif|__PdoD0ToD3: (__PdoD0ToD3) MCHTRACE ====> Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539267:xen_platform_log xen platform: xenvif|FrontendSetState: FrontendSetState:device/vif/0: MCHTRACE ====> 'CLOSED' -> 'CLOSED' Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539383:xen_platform_log xen platform: xenvif|FrontendSetState: FrontendSetState:device/vif/0: MCHTRACE <===== Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539486:xen_platform_log xen platform: xenvif|__PdoD0ToD3: (__PdoD0ToD3) MCHTRACE <==== Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539591:xen_platform_log xen platform: xenvif|__PdoD3ToD0: (__PdoD3ToD0) MCHTRACE ====> Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539717:xen_platform_log xen platform: xenvif|FrontendSetState: FrontendSetState:device/vif/0: MCHTRACE ====> 'CLOSED' -> 'CONNECTED' Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539817:xen_platform_log xen platform: xeniface|IoctlLog: USER: RTC is in local time Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.539933:xen_platform_log xen platform: xeniface|FdoInitialiseXSRegistryEntries: About to convert unicode string Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.540125:xen_platform_log xen platform: xeniface|FdoInitialiseXSRegistryEntries: About to write unicode string Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.540241:xen_platform_log xen platform: xeniface|IoctlLog: USER: Time Now = 2022/4/20 13:26:07.483 Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.540345:xen_platform_log xen platform: xeniface|IoctlLog: USER: New Time = 2022/4/20 13:31:39.14 Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 29574@1650461664.540449:xen_platform_log xen platform: xenvif|TransmitterQueuePacket: BUG: !NT_SUCCESS(status) I'll start looking at IRQL violations (blocking from DISPATCH), as well as how on earth the corral is supposed to work.
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |