Re: XenVif div by zero on Tx path after resume.
On 14/04/2022 11:27, Martin Harvey wrote:
> Hi Paul (and others!)
>
> I have done a bit of digging on this, and it looks like it's due to
> changes made in the suspend/resume path. See log at the bottom for the
> failure case.
>
> In summary:
>
> - Originally I suspect suspend callbacks were early (which actually
>   lowered frontend state)...
> - and late, which is after the resume when devices and the system are
>   powering back up.
> - ... and there was some (more) synchronization between suspend
>   callbacks and PDO power state changes.
>
> For reasons which are not obvious, the frontend and PDO power states
> are left as up and running, and the Late callback cycles the frontend
> state down and back up again, leaving the VifSuspendCallbackLate to
> actually take the frontend back to the final CONNECTED state.

That's the idea, yes.

> This raises a whole bunch of questions, not least:
>
> - Initial suspend does not lower PDO power state because it's on boot
>   path / or some other reason?
> - Why frontend suspend callback early just sets "online" to false,
>   instead of actually lowering the state properly.

The early callbacks are run with interrupts disabled with all other vCPUs spinning, so we try to avoid doing very much there.

> - Where we actually use some suspend callbacks to request a change in
>   system power, or is the suspend / resume / migrate supposed to be
>   totally transparent?

The whole suspend/resume cycle with or without migrate is supposed to be transparent to the rest of the system; it's not any sort of known power transition, it's very much Xen-specific and so needs to be handled entirely within the PV drivers.

> - How we're supposed to synchronise the Tx path with suspend / resume
>   if the latter does not command some system or power state change
>   visible to the OS when we request the guest suspends.

Even the late suspend callback runs on a single vCPU at DISPATCH, with all other vCPUs spinning at DISPATCH. Thus the only thing that should be able to pre-empt it is an interrupt. Hence there *should* be no scope for the network stack to send any packets until the callback has completed its work.

> As it is, the suspend late callbacks happen in a deferred manner, and
> there's nothing to stop the Tx path from making a request to send a
> packet if the OS cannot / has not seen a PDO power state change for
> the PV network device.
>
> As such, the current DIV by zero fix of dropping the packet seems to
> me to be an acceptable workaround.
>
> The alternative would be perhaps to explicitly synchronize the VIF
> suspend callbacks with PDO power state changes for the PV network
> device.

How? With all the power state management done in thread context, it is automatically blocked by any suspend/resume because of the vCPU corralling and the fact that the active vCPU runs the entire cycle at DISPATCH or higher. Hence no need for any further synchronization.

> Thoughts?
>
> XEN|DEBUG: ====> (xenvif.sys + 0000000000008A40)
> xenvif|FRONTEND: PATH: device/vif/0
> xenvif|FRONTEND: DEBUG CALLERS NEXT PUT PTR: 15
> xenvif|FRONTEND: CALLER (0): __FrontendResume to state (PdoResume, FdoAddPhysicalDeviceObject)
> xenvif|FRONTEND: CALLER (1): __PdoD3ToD0 to state 3 (PdoStartDevice)
> xenvif|FRONTEND: CALLER (2): VifEnable to state 4
> xenvif|FRONTEND: CALLER (3): __FrontendSuspend to state 0 (FrontendSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (4): __FrontendResume to state 1 (FrontendSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (5): __PdoD0ToD3 to state 1 (PdoSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (6): __PdoD3ToD0 to state 3 (PdoSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (7): VifSuspendCallbackLate to state 4
> xenvif|FRONTEND: CALLER (8): __FrontendSuspend to state 0 (FrontendSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (9): __FrontendResume to state 1 (FrontendSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (10): __PdoD0ToD3 to state 1 (PdoSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (11): __PdoD3ToD0 to state 3 (PdoSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (12): VifSuspendCallbackLate to state 4
> xenvif|FRONTEND: CALLER (13): __FrontendSuspend to state 0 (FrontendSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (14): __FrontendResume to state 1 (FrontendSuspendCallbackLate)
> xenvif|FRONTEND: CALLER (15): (none) to state 0
> xen|BUGCHECK: ====>
> xen|BUGCHECK: ASSERTION_FAILURE: FFFFF80113373A40 FFFFF80113373A60 000000000000144E 0000000000000000
> xen|BUGCHECK: FILE: E:\jenkins\workspace\nvif_private_martinhar_CA-355670\local\src\xenvif\transmitter.c LINE: 5198
> xen|BUGCHECK: TEXT: !NT_SUCCESS(status)

So the question remains, how are we hitting the failure? Your source lines and mine clearly don't match. Exactly which assertion is failing?

  Paul
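
For readers following the thread, the "drop the packet" workaround mentioned above can be illustrated with a minimal user-space sketch. It is not the xenvif code: `tx_model_t`, `tx_select_queue` and the `num_queues` and `dropped` fields are hypothetical names, and the assumption that the division in question is a modulo by a queue count that reads as zero while the frontend is not connected is a guess, since the thread does not show the faulting line.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for transmitter state; not the xenvif structure. */
typedef struct {
    uint32_t num_queues; /* assumed to be 0 while the frontend is not connected */
    uint64_t dropped;    /* packets discarded because no queue was available */
} tx_model_t;

/*
 * Pick a queue for a packet from its flow hash.
 * Returns false and counts a drop instead of evaluating hash % 0;
 * *index is left untouched in that case.
 */
bool
tx_select_queue(tx_model_t *tx, uint32_t hash, uint32_t *index)
{
    if (tx->num_queues == 0) {
        tx->dropped++;
        return false;
    }

    *index = hash % tx->num_queues;
    return true;
}
```

A caller would complete the packet as dropped when `tx_select_queue` returns false. As the reply points out, a guard like this only hides the symptom; it does not explain how the Tx path ran while the frontend was mid-cycle.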