Re: [Xen-devel] S3 crash with VTD Queue Invalidation enabled
On Fri, Jun 14, 2013 at 1:01 PM, Ben Guthro <ben@xxxxxxxxxx> wrote:
> On Fri, Jun 14, 2013 at 4:38 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>> On 06.06.13 at 01:53, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>>> Early in the boot process, I see queue_invalidate_wait() called for
>>>> DRHD units 0 and 1
>>>> (unit 0 is wired up to the IGD; unit 1 is everything else).
>>>>
>>>> Up until i915 does the following, I see that unit being flushed with
>>>> queue_invalidate_wait():
>>>>
>>>> [    0.704537] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
>>>> [    0.704537] ENERGY_PERF_BIAS: View and update with x86_energy_p
>>>> (XEN) XXX queue_invalidate_wait:282 CPU0 DRHD0 ret=0
>>>> (XEN) XXX queue_invalidate_wait:282 CPU0 DRHD0 ret=0
>>>> [    1.983028] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to
>>>> bit banging on pin 5
>>>> [    2.253551] fbcon: inteldrmfb (fb0) is primary device
>>>> [    3.111838] Console: switching to colour frame buffer device 170x48
>>>> [    3.171631] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
>>>> [    3.171634] i915 0000:00:02.0: registered panic notifier
>>>> [    3.173339] acpi device:00: registered as cooling_device1
>>>> [    3.173401] ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
>>>> [    3.173962] input: Video Bus as
>>>> /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input4
>>>> [    3.174232] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0
>>>> on minor 0
>>>> [    3.174258] ahci 0000:00:1f.2: version 3.0
>>>> [    3.174270] xen: registering gsi 19 triggering 0 polarity 1
>>>> [    3.174274] Already setup the GSI :19
>>>>
>>>> After that, the unit never seems to be flushed.
>>
>> With queue_invalidate_wait() having a single caller - invalidate_sync() -
>> and with invalidate_sync() being called from all interrupt setup (IO-APIC
>> as well as MSI), it is quite odd for that to be the case. At least upon
>> network driver load or interface-up, this should be getting called.
>>
>>>> ...until we enter the S3 hypercall, which loops over all DRHD
>>>> units and explicitly flushes all of them via iommu_flush_all().
>>>>
>>>> It is at that point that it hangs when talking to the unit that
>>>> the IGD is plumbed up to.
>>>>
>>>> Does this point to something in the i915 driver doing something that
>>>> is incompatible with Xen?
>>>
>>> I actually separated it from the S3 hypercall, adding a new debug key,
>>> 'F', to just call iommu_flush_all().
>>> I can crash it on demand with this.
>>>
>>> Booting with "i915.modeset=0 single" (to prevent both KMS and Xorg),
>>> it does not occur.
>>> So that pretty much narrows it down to the IGD, in my mind.
>>
>> Which reminds me of a change I made several weeks back to our kernel,
>> but which isn't as easily done with pv-ops: there are a number of
>> places in the AGP and DRM code that check CONFIG_INTEL_IOMMU
>> and use intel_iommu_gfx_mapped. As you certainly know, Linux when
>> running on Xen doesn't see any IOMMU, and hence whether the config
>> option is enabled or disabled is completely unrelated to whether the
>> driver actually runs on top of an enabled IOMMU. Similarly, the setting
>> of intel_iommu_gfx_mapped cannot possibly happen when running on
>> top of Xen, as it sits in code that never gets used in this case.
>>
>> A possibly simple, but rather hacky, solution might be to always set
>> that variable when running on Xen. But that wouldn't cover the case
>> of a kernel being built without CONFIG_INTEL_IOMMU, yet in that
>> case the driver might still run with an IOMMU enabled underneath.
>> (In our case I can simply always #define intel_iommu_gfx_mapped
>> to 1, with the INTEL_IOMMU option getting forcibly disabled for the
>> Xen kernel flavors anyway. Whether that's entirely correct when
>> not running on an enabled IOMMU I can't tell yet, and don't know
>> whom to ask.)
>>
>> And that wouldn't cover the IGD getting passed through to a DomU
>> at all - obviously Xen's ability to properly drive all IOMMU operations
>> (including qinval) must not depend on the owning guest's driver code.
>>
>> I have to admit, though, that it entirely escapes me why a graphics
>> driver needs to peek into IOMMU code/state in the first place. This
>> very much smells of bad design.
>
> This all makes sense, and I agree with your assessment.
>
> Unfortunately, I went and got the machine back from our QA department
> to do some tests on this, and now I am unable to reproduce the issue
> to prove your analysis is correct.
> It was 100% reproducible a week ago, and now I can't make it happen
> using the same code base & build.
>
> It is all very strange, and smells of a race condition or an
> uninitialized variable.
> I blame alpha particles.

I did a little more bisecting of our builds, and it appears I was not actually testing the version I thought I was. Once I did some proper bisection, I found the problem had been inadvertently fixed by a change someone else committed to solve an unrelated issue.

The fix consists of the following changes:

Revert: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c79c49826270b8b0061b2fca840fc3f013c8a78a
Apply: https://lkml.org/lkml/2012/2/10/229

I don't have a good explanation as to why re-enabling PAT would change the behavior of this IOMMU feature... but I have a very reproducible test case showing that it, in fact, does.

Konrad - do you have any theories that would explain this one? Or would we rather leave this one as "Here be Dragons" and look the other way?

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel