[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: XenVif div by zero on Tx path after resume.


  • To: "paul@xxxxxxx" <paul@xxxxxxx>, "win-pv-devel@xxxxxxxxxxxxxxxxxxxx" <win-pv-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Martin Harvey <martin.harvey@xxxxxxxxxx>
  • Date: Fri, 22 Apr 2022 12:21:33 +0000
  • Accept-language: en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=5vy7+jVBdz7OecmHMucgTG93VvKxR98sP46IVrSqxdk=; b=UKep2BoWs3NgcFODZUhQQU/B1AxwsfFOxaQf/6de75NUgCHq1QqTbw+l70dbrfU6oSDVTcxh4VMbpnymuoTk7WFk20BHYB27TdHqjZNaLIQpZarbWWOJ15gTdl/QwJOr+v4ThxDBIL0lWy2PLMGfvqifXU4sySQsAxqe7ozJcdRy/xPkNGYD8gRaI6jo7WPHCBa8Rz8yXxErWaTnjxe1hsVZyK1zsRaZX8IlwNaMj0GKGQeZj6F2+MvwdgE5P+fHrOmBmcoPAjcTgu30GnWno9575PxFxxqeLVBhrqpqXiu8rOCe9Tn7awDKrzJ48kpt35oPz5QHRUVNFDePN/BTDQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=Rnko5fugpIAjosBZL/bjmPSG2WuZjwrlKvZ8VHA+ZIoDwQzvP4jcZhE4SgzQ1ZDiAMAUf3Kgwfjh/9WXaVUZy5Vm2k8w9naEy2NEUyUjwzmoG10ynB9lYJZgCr14KRJapwmpaDidJCJEeJGwFs1nbZB2GYCyk889498h29mIxEuyGOxUCMmkAvlnAODc6lmOr4qFoE73hv5uDg+9SJzOHD/xulOr9SsLLbTcYPMoc92Kqcex2xU5lWylYU50N1qmvaqKh26Z3R7OtPZSD4klDnsCChF/yBao/8k5MeqJFEhjYFIIpm9zhIsj4BR7GriNrTr+d2tP/QAGuvqA7/psLQ==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Delivery-date: Fri, 22 Apr 2022 12:21:44 +0000
  • Ironport-data: A9a23:9iSXga2W/NoRz8HlevbD5bFwkn2cJEfYwER7XKvMYLTBsI5bpzEFy TMXWG6EPf3ZamrwKYojOYXk/EgAscLSm9NjQVFrpC1hF35El5HIVI+TRqvS04J+DSFhoGZPt Zh2hgzodZhsJpPkjk7xdOCn9xGQ7InQLlbGILes1htZGEk1EE/NtTo5w7Rj2tIy24Dga++wk YiaT/P3aQfNNwFcagr424rbwP+4lK2v0N+wlgVWicFj5DcypVFMZH4sDfjZw0/DaptVBoaHq 9Prl9lVyI97EyAFUbtJmp6jGqEDryW70QKm0hK6UID66vROS7BbPg/W+5PwZG8O4whlkeydx /1jsZHtESEMM5HowqcCdgYISSEgBap/reqvzXiX6aR/zmXgWl61mbBCKR9zOocVvOFqHWtJ6 PoUbigXaQyOjP63x7T9TfRwgsMkL4/gO4Z3VnNIlGmFS6p5B82SBfyVure03x9p7ixKNfPXf dYYb3xvZQ7ebgxnMVYLEpMu2uyvgxETdhUH8AvN/PZnuQA/yiQg0ITXH53oVuW6QOlumF7Gj 3Di83jQV0Ry2Nu3jGDtHmiXrvfGgCfTSI8UUrqi+ZZCn1yVg3QNTREbS1a/if24kVKlHcJSL VQO/SgjprR081akJuQRRDW9qX+A+xIZBdxZFrRg7BnXk/WLpQGEGmIDUzhNLsQ8s9M7TiAr0 VnPmM71ATtos/ueTnf1GqqokA5e8BM9dQcqDRLohyNcizU/iOnfVi7yc+s=
  • Ironport-hdrordr: A9a23:y1edz68L6nrzYm7PDDVuk+Fsdb1zdoMgy1knxilNoENuH/Bwxv rFoB1E73TJYW4qKQkdcKO7SdK9qBTnhNRICOgqTPyftWzd1FdAQ7sSibcKrweAJ8S6zJ8l6U 4CSdkyNDSTNykcsS+S2mDRfLgdKZu8gcaVbIzlvhRQpHRRGsRdBnBCe2Sm+yNNJDVuNN4cLt 6x98BHrz2vdTA8dcKgHEQIWODFupniiI/mSQRuPW9p1CC+yReTrJLqGRmR2RkTFxlVx605zG TDmwvloo2+rvCAzAPG3WO71eUZpDKh8KoDOCW/sLlXFtzesHfrWG2nYczGgNkBmpDu1L/tqq iJn/5vBbU115qbRBDOnfKk4Xic7N9p0Q6v9bbQuwqeneXpAD09EMZPnoRfb1/Q7Fchpsh11O ZR03uerIc/N2K3oM3R3am9a/hRrDvCnZPiq59is1VPFY8FLLNBp40W+01YVJ8GASLh8YgiVO 1jFtvV6vpaeU6TKymxhBgm/PW8GnAoWhuWSEkLvcKYlzBQgXBi1kMdgMgShG0J+p4xQ4RNo+ 7ELqNrnrdTSdJ+V9M1OM4RBc+sTmDdSxPFN2yfZVzhCaEcInrI74X65b0kjdvaDaDgDKFC6q gpfGkoy1LaIXiedvFm9Kc7gyzlUSG6QSnnzN1Y6txwpqD8LYCbQhG+dA==
  • List-id: Developer list for the Windows PV Drivers subproject <win-pv-devel.lists.xenproject.org>
  • Thread-index: AdhP6kAbDA19BbYtRECv6F4UUU/GxAD72daAAGmnvwAAAFp2cAAAUguAAC/5hJA=
  • Thread-topic: XenVif div by zero on Tx path after resume.

> Looking in xenbus's sync.c and suspend.c, I can't see any issue... but there 
> could be something subtle going on (like Windows deciding all DPCs will be 
> threaded under the hood).

Yeah, OK. Either that, or some code lowers the IRQL somewhere. Wouldn't be 
surprised if some code has been added to one of the resume callbacks which 
tries to block, lowers the IRQL (which is not caught because annotations / 
system doesn't know we're always meant to be at dispatch), and then it all 
falls apart.

I do have automated bugcheck analysis on our test code, so if code examination 
doesn't work, I can change the automated WinDbg commands to give us some idea 
of what's going on.


Do feel free to have a butchers at my working notes (included), which indicate 
quite clearly that the required atomicity is not there:

----------------------------------------------------------

Paul says:

The whole suspend/resume cycle with or without migrate is supposed to be 
transparent to the rest of the system; it's not any sort of known power 
transition, it's very much Xen-specific and so needs to be handled entirely 
within the PV drivers.

Even the late suspend callback runs on a single vCPU at DISPATCH, with all 
other vCPUs spinning at DISPATCH. Thus the only thing that should be able to 
pre-empt it is an interrupt. Hence there *should* be no scope for the the 
network stack to send any packets until the callback has completed its work.


However, we need to check thru logs, because what is happening appears to be at 
some variance with this.

Suite runs 167912 - 167919.

167912, No fails.
167913, OS/Install failed - unrelated.
167914, No win fails, some skipped.
167916, Job 3497156. OxDEAD in XenVif.sys - daemon.log to be downloaded.
167917, No fails.
167918, Job 3497168. OxDEAD in XenVif.sys - daemon.log to be downloaded.
167919, No fails.

OK. 3497156. Good candidate for the bug. Filtered stuff.

PdoSuspendCallbackLate:

runs 167912 - 167919\167916_3497156_daemon.log:

  45338: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539037:xen_platform_log xen platform: 
xenvif|PdoSuspendCallbackLate: (PdoSuspendCallbackLate) MCHTRACE ====>
  45339: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539141:xen_platform_log xen platform: xenvif|__PdoD0ToD3: 
(__PdoD0ToD3) MCHTRACE ====>
  45340: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539267:xen_platform_log xen platform: xenvif|FrontendSetState: 
FrontendSetState:device/vif/0: MCHTRACE ====> 'CLOSED' -> 'CLOSED'
  45341: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539383:xen_platform_log xen platform: xenvif|FrontendSetState: 
FrontendSetState:device/vif/0: MCHTRACE <=====
  45342: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539486:xen_platform_log xen platform: xenvif|__PdoD0ToD3: 
(__PdoD0ToD3) MCHTRACE <====
  45343: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539591:xen_platform_log xen platform: xenvif|__PdoD3ToD0: 
(__PdoD3ToD0) MCHTRACE ====>
  45344: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539717:xen_platform_log xen platform: xenvif|FrontendSetState: 
FrontendSetState:device/vif/0: MCHTRACE ====> 'CLOSED' -> 'CONNECTED'
  45820: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.610462:xen_platform_log xen platform: xen|BUGCHECK: ====>
  45821: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.610662:xen_platform_log xen platform: xen|BUGCHECK: 
ASSERTION_FAILURE: FFFFF80619FA7C50 FFFFF80619FA7C70 000000000000144E 
0000000000000000
  45822: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.610868:xen_platform_log xen platform: xen|BUGCHECK: FILE: 
E:\jenkins\workspace\nvif_private_martinhar_CA-355670\local\src\xenvif\transmitter.c
 LINE: 5198
  45823: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.610988:xen_platform_log xen platform: xen|BUGCHECK: TEXT: 
!NT_SUCCESS(status)
  45824: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.611094:xen_platform_log xen platform: xen|BUGCHECK: CONTEXT 
(FFFF958817CA0A30):
  45825: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.611185:xen_platform_log xen platform: xen|BUGCHECK: - GS = 002B
  45826: Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.611287:xen_platform_log xen platform: xen|BUGCHECK: - FS = 0053

So we are definitely in PdoSuspendCallbackLate, and we don't exit it, and we 
don't get round to the VifCallbackLate.

Even a cursory examination of the logs shows that xeniface is executing at a 
lower Irql, amongst other things.

Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539037:xen_platform_log xen platform: 
xenvif|PdoSuspendCallbackLate: (PdoSuspendCallbackLate) MCHTRACE ====>
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539141:xen_platform_log xen platform: xenvif|__PdoD0ToD3: 
(__PdoD0ToD3) MCHTRACE ====>
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539267:xen_platform_log xen platform: xenvif|FrontendSetState: 
FrontendSetState:device/vif/0: MCHTRACE ====> 'CLOSED' -> 'CLOSED'
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539383:xen_platform_log xen platform: xenvif|FrontendSetState: 
FrontendSetState:device/vif/0: MCHTRACE <=====
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539486:xen_platform_log xen platform: xenvif|__PdoD0ToD3: 
(__PdoD0ToD3) MCHTRACE <====
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539591:xen_platform_log xen platform: xenvif|__PdoD3ToD0: 
(__PdoD3ToD0) MCHTRACE ====>
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539717:xen_platform_log xen platform: xenvif|FrontendSetState: 
FrontendSetState:device/vif/0: MCHTRACE ====> 'CLOSED' -> 'CONNECTED'
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539817:xen_platform_log xen platform: xeniface|IoctlLog: USER: 
RTC is in local time
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.539933:xen_platform_log xen platform: 
xeniface|FdoInitialiseXSRegistryEntries: About to convert unicode string
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.540125:xen_platform_log xen platform: 
xeniface|FdoInitialiseXSRegistryEntries: About to write unicode string
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.540241:xen_platform_log xen platform: xeniface|IoctlLog: USER: 
Time Now = 2022/4/20 13:26:07.483
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.540345:xen_platform_log xen platform: xeniface|IoctlLog: USER: 
New Time = 2022/4/20 13:31:39.14
Apr 20 13:34:24 xrtmia-13-06 qemu-dm-46[29574]: 
29574@1650461664.540449:xen_platform_log xen platform: 
xenvif|TransmitterQueuePacket: BUG: !NT_SUCCESS(status)

I'll start looking at IRQL violations (blocking from DISPATCH), as well as how 
on earth the corral is supposed to work.

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.