
Re: XenVif div by zero on Tx path after resume.


  • To: win-pv-devel@xxxxxxxxxxxxxxxxxxxx
  • From: "Durrant, Paul" <xadimgnik@xxxxxxxxx>
  • Date: Tue, 19 Apr 2022 11:38:49 +0100
  • Delivery-date: Tue, 19 Apr 2022 10:38:56 +0000
  • List-id: Developer list for the Windows PV Drivers subproject <win-pv-devel.lists.xenproject.org>

On 14/04/2022 11:27, Martin Harvey wrote:
Hi Paul (and others!)

I have done a bit of digging on this, and it looks like it's due to changes 
made in the suspend/resume path. See log at the bottom for the failure case.

In summary:

- Originally I suspect suspend callbacks were early (which actually lowered 
frontend state)...
- and late, which is after the resume when devices and the system are powering 
back up.
- ... and there was some (more) synchronization between suspend callbacks and 
PDO power state changes.

For reasons which are not obvious, the frontend and pdo power states are left 
as up and running, and the
Late callback cycles the frontend state down and back up again, leaving the 
VifSuspendCallbackLate to actually take the frontend back to the final 
CONNECTED state.


That's the idea, yes.

This raises a whole bunch of questions, not least:

- Initial suspend does not lower PDO power state because it's on boot path / or 
some other reason?
- Why frontend suspend callback early just sets "online" to false, instead of 
actually lowering the state properly.

The early callbacks are run with interrupts disabled with all other vCPUs spinning, so we try to avoid doing very much there.

- Whether we actually use some suspend callbacks to request a change in system 
power, or is the suspend / resume / migrate supposed to be totally transparent?

The whole suspend/resume cycle with or without migrate is supposed to be transparent to the rest of the system; it's not any sort of known power transition, it's very much Xen-specific and so needs to be handled entirely within the PV drivers.

- How we're supposed to synchronise the Tx path with suspend / resume if the 
latter does not command some system or power state change visible to the OS 
when we request the guest suspends.

Even the late suspend callback runs on a single vCPU at DISPATCH, with all other vCPUs spinning at DISPATCH. Thus the only thing that should be able to pre-empt it is an interrupt. Hence there *should* be no scope for the network stack to send any packets until the callback has completed its work.


As it is, the suspend late callbacks happen in a deferred manner, and there's 
nothing to stop the Tx path from making a request to send a packet if the OS 
cannot / has not seen a PDO power state change for the PV network device.

As such, the current DIV by zero fix of dropping the packet seems to me to be 
an acceptable workaround. The alternative would be perhaps to explicitly 
synchronize the VIF suspend callbacks with PDO power state changes for the PV 
network device. How?

With all the power state management done in thread context, it is automatically blocked by any suspend/resume because of the vCPU corralling and the fact that the active vCPU runs the entire cycle at DISPATCH or higher. Hence no need for any further synchronization.
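
For illustration only, the drop-the-packet workaround mentioned above might look something like the sketch below. It assumes the zero divisor is a queue count that reads as zero while the frontend is not connected; that is a guess, not something the log confirms. `tx_state_t`, `num_queues`, `dropped` and `tx_select_queue` are hypothetical names, not xenvif identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for transmitter state; NOT the real xenvif structure. */
typedef struct {
    uint32_t num_queues;  /* assumed to be 0 while the frontend is not connected */
    uint64_t dropped;     /* packets dropped by the guard below */
} tx_state_t;

/*
 * Pick a queue index from a packet hash.
 * Returns false and counts a drop when there are no queues, rather than
 * evaluating hash % 0. On failure *index is left untouched.
 */
static bool tx_select_queue(tx_state_t *tx, uint32_t hash, uint32_t *index)
{
    assert(tx != NULL && index != NULL);

    if (tx->num_queues == 0) {
        tx->dropped++;    /* caller is expected to complete the packet as dropped */
        return false;
    }

    *index = hash % tx->num_queues;
    return true;
}
```

A guard like this only hides the symptom: if the corralling argument above holds, the Tx path should never observe the zero value in the first place.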


Thoughts?

XEN|DEBUG: ====> (xenvif.sys + 0000000000008A40)
xenvif|FRONTEND: PATH: device/vif/0
xenvif|FRONTEND: DEBUG CALLERS NEXT PUT PTR: 15
xenvif|FRONTEND: CALLER (0): __FrontendResume to state  (PdoResume, 
FdoAddPhysicalDeviceObject)
xenvif|FRONTEND: CALLER (1): __PdoD3ToD0 to state 3 (PdoStartDevice)
xenvif|FRONTEND: CALLER (2): VifEnable to state 4
xenvif|FRONTEND: CALLER (3): __FrontendSuspend to state 0 
(FrontendSuspendCallbackLate)
xenvif|FRONTEND: CALLER (4): __FrontendResume to state 1 
(FrontendSuspendCallbackLate)
xenvif|FRONTEND: CALLER (5): __PdoD0ToD3 to state 1 (PdoSuspendCallbackLate)
xenvif|FRONTEND: CALLER (6): __PdoD3ToD0 to state 3 (PdoSuspendCallbackLate)
xenvif|FRONTEND: CALLER (7): VifSuspendCallbackLate to state 4
xenvif|FRONTEND: CALLER (8): __FrontendSuspend to state 0 
(FrontendSuspendCallbackLate)
xenvif|FRONTEND: CALLER (9): __FrontendResume to state 1 
(FrontendSuspendCallbackLate)
xenvif|FRONTEND: CALLER (10): __PdoD0ToD3 to state 1 (PdoSuspendCallbackLate)
xenvif|FRONTEND: CALLER (11): __PdoD3ToD0 to state 3 (PdoSuspendCallbackLate)
xenvif|FRONTEND: CALLER (12): VifSuspendCallbackLate to state 4
xenvif|FRONTEND: CALLER (13): __FrontendSuspend to state 0 
(FrontendSuspendCallbackLate)
xenvif|FRONTEND: CALLER (14): __FrontendResume to state 1 
(FrontendSuspendCallbackLate)
xenvif|FRONTEND: CALLER (15): (none) to state 0

xen|BUGCHECK: ====>
xen|BUGCHECK: ASSERTION_FAILURE: FFFFF80113373A40 FFFFF80113373A60 
000000000000144E 0000000000000000
xen|BUGCHECK: FILE: 
E:\jenkins\workspace\nvif_private_martinhar_CA-355670\local\src\xenvif\transmitter.c
 LINE: 5198
xen|BUGCHECK: TEXT: !NT_SUCCESS(status)


So the question remains, how are we hitting the failure? Your source lines and mine clearly don't match. Exactly which assertion is failing?

  Paul









 

