Re: [Xen-devel] S3 crash with VTD Queue Invalidation enabled
On Wed, Jun 5, 2013 at 11:14 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>> On 05.06.13 at 15:54, Ben Guthro <ben@xxxxxxxxxx> wrote:
>> On Wed, Jun 5, 2013 at 4:24 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>> On 04.06.13 at 23:09, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>>> On Tue, Jun 4, 2013 at 3:49 PM, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>>>> On Tue, Jun 4, 2013 at 3:20 PM, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>>>>> On Tue, Jun 4, 2013 at 10:01 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>>>>>> On 04.06.13 at 14:25, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>>>>>>> On Tue, Jun 4, 2013 at 4:54 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>>>>>> Is this perhaps having some similarity with
>>>>>>>>> http://lists.xen.org/archives/html/xen-devel/2013-04/msg00343.html?
>>>>>>>>> We're clearly running single-CPU only here and there...
>>>>>>>>
>>>>>>>> We certainly should be, as we have gone through the
>>>>>>>> disable_nonboot_cpus() by this point - and I can verify that from the
>>>>>>>> logs.
>>>>>>>
>>>>>>> I'm much more tending towards the connection here, noting that
>>>>>>> Andrew's original thread didn't really lead anywhere (i.e. we still
>>>>>>> don't know what the panic he saw was actually caused by).
>>>>>>>
>>>>>>
>>>>>> I'm starting to think you're on to something here.
>>>>>
>>>>> hmm - maybe not.
>>>>> I get the same crash with "maxcpus=1"
>>>>>
>>>>>
>>>>>
>>>>>> I've put a bunch of trace throughout the functions in qinval.c
>>>>>>
>>>>>> It seems that everything is functioning properly, up until we go
>>>>>> through the disable_nonboot_cpus() path.
>>>>>> Prior to this, I see the qinval.c functions being executed on all
>>>>>> cpus, and both drhd units
>>>>>> Afterward, it gets stuck in queue_invalidate_wait on the first drhd
>>>>>> unit.. and eventually panics.
>>>>>>
>>>>>> I'm not exactly sure what to make of this yet.
>>>>
>>>> querying status of the hardware all seems to be working correctly...
>>>> it just doesn't work with querying the QINVAL_STAT_DONE state, as far
>>>> as I can tell.
>>>>
>>>> Other register state is:
>>>>
>>>> (XEN) VER = 10
>>>> (XEN) CAP = c0000020e60262
>>>> (XEN) n_fault_reg = 1
>>>> (XEN) fault_recording_offset = 200
>>>> (XEN) fault_recording_reg_l = 0
>>>> (XEN) fault_recording_reg_h = 0
>>>> (XEN) ECAP = f0101a
>>>> (XEN) GCMD = 0
>>>> (XEN) GSTS = c7000000
>
> With
>
> #define DMA_GCMD_QIE (((u64)1) << 26)
>
> and
>
> #define DMA_GSTS_QIES (((u64)1) <<26)
>
> this means qinval is still enabled at this point.
>
>>>> (XEN) RTADDR = 137a31000
>>>> (XEN) CCMD = 800000000000000
>>>> (XEN) FSTS = 0
>>>> (XEN) FECTL = 0
>>>> (XEN) FEDATA = 4128
>>>> (XEN) FEADDR = fee0000c
>>>> (XEN) FEUADDR = 0
>>>>
>>>> (with code lifted from print_iommu_regs() )
>>>>
>>>>
>>>> None of this looks suspicious to my untrained eye - but I'm including
>>>> it here in case someone else sees something I don't.
>>>
>>> Xiantao, you certainly will want to give some advice here. I won't
>>> be able to look into this more deeply right away.
>>
>> Thanks Jan. Xiantao - I'd appreciate any insight you may have.
>>
>> One curious thing I have found, that seems buggy to me, is that
>> {dis,en}able_qinval() is being called prior to the platform quirks
>> being executed.
>> It appears they are being called through iommu_{en,dis}able_x2apic_IR()
>
> That's because this setup needs to happen when interrupt (i.e.
> APIC) initialization is happening, not when the IOMMUs get set
> up (which is a process that assumes interrupts can already be
> requested).
>
> In effect we have
>
> lapic_suspend() -> iommu_disable_x2apic_IR() ->
> disable_intremap()/disable_qinval()
>
> after
>
> iommu_suspend() -> vtd_suspend() -> disable_qinval()
>
> but the latter tail call only when !iommu_intremap, and you
> have been running with interrupt remapping enabled, so only
> the former code path would result in qinval getting disabled,
> which is after the point of the hang.
>
> Depending on whether ATS is in use, more than one invalidation
> can be done in the processing here - could you therefore check
> whether there's any sign of ATS use ("iommu=verbose" should
> make you see respective messages), and if so see whether
> disabling it ("ats=off") makes a difference?

ATS does not appear to be running:

(XEN) [VT-D]dmar.c:737: Host address width 36
(XEN) [VT-D]dmar.c:751: found ACPI_DMAR_DRHD:
(XEN) [VT-D]dmar.c:412: dmaru->address = fed90000
(XEN) [VT-D]iommu.c:1197: drhd->address = fed90000 iommu->reg = ffff82c3ffd57000
(XEN) [VT-D]iommu.c:1199: cap = c0000020e60262 ecap = f0101a
(XEN) [VT-D]dmar.c:338: endpoint: 0000:00:02.0
(XEN) [VT-D]dmar.c:751: found ACPI_DMAR_DRHD:
(XEN) [VT-D]dmar.c:412: dmaru->address = fed91000
(XEN) [VT-D]iommu.c:1197: drhd->address = fed91000 iommu->reg = ffff82c3ffd56000
(XEN) [VT-D]iommu.c:1199: cap = c9008020660262 ecap = f0105a
(XEN) [VT-D]dmar.c:354: IOAPIC: 0000:f0:1f.0
(XEN) [VT-D]dmar.c:332: MSI HPET: 0000:00:0f.0
(XEN) [VT-D]dmar.c:332: MSI HPET: 0000:00:0f.1
(XEN) [VT-D]dmar.c:332: MSI HPET: 0000:00:0f.2
(XEN) [VT-D]dmar.c:332: MSI HPET: 0000:00:0f.3
(XEN) [VT-D]dmar.c:332: MSI HPET: 0000:00:0f.4
(XEN) [VT-D]dmar.c:332: MSI HPET: 0000:00:0f.5
(XEN) [VT-D]dmar.c:332: MSI HPET: 0000:00:0f.6
(XEN) [VT-D]dmar.c:332: MSI HPET: 0000:00:0f.7
(XEN) [VT-D]dmar.c:426: flags: INCLUDE_ALL
(XEN) [VT-D]dmar.c:756: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:338: endpoint: 0000:00:1d.0
(XEN) [VT-D]dmar.c:338: endpoint: 0000:00:1a.0
(XEN) [VT-D]dmar.c:625: RMRR region: base_addr ba8d5000 end_address ba8ebfff
(XEN) [VT-D]dmar.c:756: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:338: endpoint: 0000:00:02.0
(XEN) [VT-D]dmar.c:625: RMRR region: base_addr bb800000 end_address bf9fffff

I would expect a line with "found ACPI_DMAR_ATSR" to be printed, if it was found.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
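For anyone following along, the GSTS reading in the thread can be checked by hand. The snippet below is only a sketch and is not Xen code: the helper names gsts_qinval_enabled() and gsts_set_bits() are made up for illustration, and the only thing taken from the thread is the bit-26 position given by the quoted DMA_GSTS_QIES definition.

```c
#include <stdint.h>

/* Bit position of the queued-invalidation status flag, as given by the
 * DMA_GSTS_QIES definition quoted in the thread. */
#define GSTS_QIES_BIT 26

/* Hypothetical helper: returns 1 if the qinval status bit is set in a
 * 32-bit GSTS value, 0 otherwise. */
static int gsts_qinval_enabled(uint32_t gsts)
{
    return (int)((gsts >> GSTS_QIES_BIT) & 1u);
}

/* Hypothetical helper: stores the indices of all set bits in out[],
 * highest bit first, and returns how many were found. */
static int gsts_set_bits(uint32_t gsts, int out[32])
{
    int count = 0;

    for (int bit = 31; bit >= 0; bit--) {
        if ((gsts >> bit) & 1u)
            out[count++] = bit;
    }
    return count;
}
```

With GSTS = c7000000 from the register dump, gsts_qinval_enabled() returns 1 and gsts_set_bits() reports bits 31, 30, 26, 25 and 24, which agrees with the reading that qinval is still enabled at the point of the hang.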