[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: Config space access to Mediatek MT7922 doesn't work after device reset in Xen PV dom0 (regression, Linux 6.12)
On Wed, Jan 29, 2025 at 04:22:43AM +0100, Marek Marczykowski-Górecki wrote: > On Tue, Jan 28, 2025 at 09:03:15PM -0600, Bjorn Helgaas wrote: > > On Wed, Jan 29, 2025 at 03:10:49AM +0100, Marek Marczykowski-Górecki wrote: > > > On Tue, Jan 28, 2025 at 07:15:26PM -0600, Bjorn Helgaas wrote: > > > > On Fri, Jan 17, 2025 at 01:05:30PM +0100, Marek Marczykowski-Górecki > > > > wrote: > > > > > After updating PV dom0 to Linux 6.12, The Mediatek MT7922 device > > > > > reports > > > > > all 0xff when accessing its config space. This happens only after > > > > > device > > > > > reset (which is also triggered when binding the device to the > > > > > xen-pciback driver). > > > > > > > > Thanks for the report and for all the debugging you've already done! > > > > > > > > > Reproducer: > > > > > > > > > > # lspci -xs 01:00.0 > > > > > 01:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI > > > > > Express Wireless Network Adapter > > > > > 00: c3 14 16 06 00 00 10 00 00 00 80 02 10 00 00 00 > > > > > ... > > > > > # echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset > > > > > # lspci -xs 01:00.0 > > > > > 01:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI > > > > > Express Wireless Network Adapter > > > > > 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > > > > > > > > > > The same operation done on Linux 6.12 running without Xen works fine. > > > > > > > > > > git bisect points at: > > > > > > > > > > commit d591f6804e7e1310881c9224d72247a2b65039af > > > > > Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx> > > > > > Date: Tue Aug 27 18:48:46 2024 -0500 > > > > > > > > > > PCI: Wait for device readiness with Configuration RRS > > > > > > > > > > part of that commit: > > > > > @@ -1311,9 +1320,15 @@ static int pci_dev_wait(struct pci_dev *dev, > > > > > char *reset_type, int timeout) > > > > > return -ENOTTY; > > > > > } > > > > > > > > > > - pci_read_config_dword(dev, PCI_COMMAND, &id); > > > > > - if (!PCI_POSSIBLE_ERROR(id)) > > > > > - break; > > > > > + if (root && root->config_crs_sv) { > > > > > + pci_read_config_dword(dev, PCI_VENDOR_ID, > > > > > &id); > > > > > + if (!pci_bus_crs_vendor_id(id)) > > > > > + break; > > > > > + } else { > > > > > + pci_read_config_dword(dev, PCI_COMMAND, &id); > > > > > + if (!PCI_POSSIBLE_ERROR(id)) > > > > > + break; > > > > > + } > > > > > > > > > > > > > > > Adding some debugging, the PCI_VENDOR_ID read in pci_dev_wait() > > > > > returns > > > > > initially 0xffffffff. If I extend the condition with > > > > > "&& !PCI_POSSIBLE_ERROR(id)", then the issue disappear. But reading > > > > > the > > > > > patch description, it would break VF. > > > > > I'm not sure where the issue is, but given it breaks only when running > > > > > with Xen, I guess something is wrong with "Configuration RRS Software > > > > > Visibility" in that case. > > > > > > > > I'm missing something. If you get 0xffffffff, that is not the 0x0001 > > > > Vendor ID, so pci_dev_wait() should exit immediately. > > > > > > I'm not sure what is going on there either, but my _guess_ is that the > > > loop exits too early due to the above. And it makes some further actions > > > to fail. > > > > When RRS SV is enabled, reading PCI_VENDOR_ID should always return > > 0x0001 (if the device isn't ready and responds with RRS status) or the > > valid Vendor ID. I don't think it should ever return 0xffff (unless > > the device is powered off, unplugged, or broken, of course). > > Maybe it isn't really enabled when Xen is involved? > By looking at lspci of the bridge for this device, I do see RootCtl: ... > CRSVisible+, but maybe there is something else needed too? As far as I know, CRSVisible+ is all that's needed to enable this, and Linux always enables it if it's supported [3] > Just in case, full lspci -vvvs 2.2 (the bridge): > > > 00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge > (prog-if 00 [Normal decode]) > Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453 > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- > <MAbort- >SERR- <PERR- INTx- > Latency: 0, Cache Line Size: 64 bytes > Interrupt: pin ? routed to IRQ 102 > Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 > I/O behind bridge: 0000f000-00000fff [disabled] [32-bit] > Memory behind bridge: 90b00000-90bfffff [size=1M] [32-bit] > Prefetchable memory behind bridge: 8010900000-80109fffff [size=1M] > [32-bit] > Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- > <MAbort+ <SERR- <PERR- > BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B- > PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- > Capabilities: [50] Power Management version 3 > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA > PME(D0+,D1-,D2-,D3hot+,D3cold+) > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- > Capabilities: [58] Express (v2) Root Port (Slot+), IntMsgNum 0 > DevCap: MaxPayload 256 bytes, PhantFunc 0 > ExtTag+ RBE+ TEE-IO- > DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- > RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ > MaxPayload 128 bytes, MaxReadReq 512 bytes > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- > TransPend- > LnkCap: Port #4, Speed 16GT/s, Width x1, ASPM L1, Exit Latency > L1 <64us > ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+ > LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+ > ExtSynch+ ClockPM- AutWidDis- BWInt- AutBWInt- > LnkSta: Speed 5GT/s, Width x1 > TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt- > SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- > Surprise- > Slot #0, PowerLimit 75W; Interlock- NoCompl+ > SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- > LinkChg- > Control: AttnInd Unknown, PwrInd Unknown, Power- > Interlock- > SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ > Interlock- > Changed: MRL- PresDet- LinkState+ > RootCap: CRSVisible+ > RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ > CRSVisible+ > RootSta: PME ReqID 0000, PMEStatus- PMEPending- > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- > LTR+ > 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ > EETLPPrefix+, MaxEETLPPrefixes 1 > EmergencyPowerReduction Not Supported, > EmergencyPowerReductionInit- > FRS- LN System CLS Not Supported, TPHComp+ ExtTPHComp- > ARIFwd+ > AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS- > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- ARIFwd- > AtomicOpsCtl: ReqEn- EgressBlck- > IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq- > 10BitTagReq- OBFF Disabled, EETLPPrefixBlk- > LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ > 2Retimers+ DRS- > LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis- > Transmit Margin: Normal Operating Range, > EnterModifiedCompliance- ComplianceSOS- > Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB > preshoot > LnkSta2: Current De-emphasis Level: -3.5dB, > EqualizationComplete- EqualizationPhase1- > EqualizationPhase2- EqualizationPhase3- > LinkEqualizationRequest- > Retimer- 2Retimers- CrosslinkRes: unsupported > Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+ > Address: 00000000fee08000 Data: 4000 > Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device > 1453 > Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+ > Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 > Len=010 <?> > Capabilities: [270 v1] Secondary PCI Express > LnkCtl3: LnkEquIntrruptEn- PerformEqu- > LaneErrStat: 0 > Capabilities: [2a0 v1] Access Control Services > ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ > EgressCtrl- DirectTrans+ > ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ > EgressCtrl- DirectTrans- > Capabilities: [370 v1] L1 PM Substates > L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ > L1_PM_Substates+ > PortCommonModeRestoreTime=10us PortTPowerOnTime=150us > L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ > T_CommonMode=10us LTR1.2_Threshold=166912ns > L1SubCtl2: T_PwrOn=150us > Capabilities: [400 v1] Data Link Feature <?> > Capabilities: [410 v1] Physical Layer 16.0 GT/s <?> > Capabilities: [440 v1] Lane Margining at the Receiver > PortCap: Uses Driver- > PortSta: MargReady- MargSoftReady- > Kernel driver in use: pcieport > > > > > But the log at > > > > https://github.com/QubesOS/qubes-issues/issues/9689#issuecomment-2582927149 > > > > says it *doesn't* exit and eventually times out. > > > > > > Note this log is from "working" kernel, so that timeout must be > > > something else. > > > > I saw it was labeled "NO BUG" but I'm not sure it's labeled correctly > > since there are no interesting messages from the "BUG PRESENT" part. > > Awfully funny coincidence if it's unrelated. > > The timeout thing I have seen before, possibly also on a different > hardware (although I'm not 100% sure), I think it's a different issue. > > > > > And the lspci above shows ~0 data for much of the header, even though > > > > the device must be ready by then. > > > > > > > > I don't have any good ideas, but since the problem only happens with > > > > Xen, and it seems to affect more than just the Vendor ID, maybe you > > > > could instrument xen_pcibk_config_read() and see if there's something > > > > wonky going on there? > > > > > > This one is used when pcifront (from a different PV VM) is asking pciback > > > to read something. I see the issue even before starting any other VM and > > > not even attaching the device to the xen-pciback driver... > > > > The report claims the problem only happens with Xen. I'm not a Xen > > person, and I don't know how to find the relevant config accessors. > > The snippets of kernel messages I see at [1] all mention pciback, so > > that's my only clue of where to look. Bottom line, I have no idea > > what the config accessor path is, and maybe we could learn something > > by looking at whatever it is. > > AFAIK there are no separate config accessors under Xen dom0, the default > ones are used. xen-pcifront takes over PCI config space access (and few > more) only in a domU (and only for PV), when PCI passthrough is used. > Here, it didn't went that far... > > But then, Xen may intercept such access [2]. If I read it right, it > should allow all access (is_hardware_domain(dom0)==true, and also the > device is not on ro_map - otherwise reset wouldn't work at all). I guess the code at [2] is running in user mode and uses Linux syscalls for config access? Is it straceable? Can you reproduce this without Xen at all? If so, can you post a complete dmesg and complete lspci -vv somewhere? > > [1] > > https://github.com/QubesOS/qubes-issues/issues/9689#issuecomment-2582927149 > > [2] > https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;h=70150c27227661baa253af8693ff00f2ab640a98;hb=HEAD#l295 [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/probe.c?id=v6.13#n1208
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |