
Re: [Xen-devel] S3 crash with VTD Queue Invalidation enabled

On Fri, Jun 14, 2013 at 4:38 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>> On 06.06.13 at 01:53, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>> Early in the boot process, I see queue_invalidate_wait() called for
>>> DRHD unit 0, and 1
>>> (unit 0 is wired up to the IGD, unit 1 is everything else)
>>> Up until i915 does the following, I see that unit being flushed with
>>> queue_invalidate_wait() :
>>> [    0.704537] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
>>> [    0.704537] ENERGY_PERF_BIAS: View and update with x86_energy_p
>>> (XEN) XXX queue_invalidate_wait:282 CPU0 DRHD0 ret=0
>>> (XEN) XXX queue_invalidate_wait:282 CPU0 DRHD0 ret=0
>>> [    1.983028] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to
>>> bit banging on pin 5
>>> [    2.253551] fbcon: inteldrmfb (fb0) is primary device
>>> [    3.111838] Console: switching to colour frame buffer device 170x48
>>> [    3.171631] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
>>> [    3.171634] i915 0000:00:02.0: registered panic notifier
>>> [    3.173339] acpi device:00: registered as cooling_device1
>>> [    3.173401] ACPI: Video Device [VID] (multi-head: yes  rom: no  post: no)
>>> [    3.173962] input: Video Bus as
>>> /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input4
>>> [    3.174232] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on 
>>> minor 0
>>> [    3.174258] ahci 0000:00:1f.2: version 3.0
>>> [    3.174270] xen: registering gsi 19 triggering 0 polarity 1
>>> [    3.174274] Already setup the GSI :19
>>> After that - the unit never seems to be flushed.
> With queue_invalidate_wait() having a single caller -
> invalidate_sync() -, and with invalidate_sync() being called from
> all interrupt setup (IO-APIC as well as MSI), that's quite odd to be
> the case. At least upon network driver load or interface-up, this
> should be getting called.
>>> ...until we enter into the S3 hypercall, which loops over all DRHD
>>> units, and explicitly flushes all of them via iommu_flush_all()
>>> It is at that point that it hangs up when talking to the device that
>>> the IGD is plumbed up to.
>>> Does this point to something in the i915 driver doing something that
>>> is incompatible with Xen?
>> I actually separated it from the S3 hypercall, adding a new debug key
>> 'F' - to just call iommu_flush_all()
>> I can crash it on demand with this.
>> Booting with "i915.modeset=0 single" (to prevent both KMS, and Xorg) -
>> it does not occur.
>> So, that pretty much narrows it down to the IGD, in my mind.
> Which reminds me of a change I did several weeks back to our kernel,
> but which isn't as easily done with pv-ops: There are a number of
> cases in the AGP and DRM code that qualify upon CONFIG_INTEL_IOMMU
> and use intel_iommu_gfx_mapped. As you certainly know, Linux when
> running on Xen doesn't see any IOMMU, and hence the config option
> being enabled or disabled is completely unrelated to whether the
> driver actually runs on top of an enabled IOMMU. Similarly the setting
> of intel_iommu_gfx_mapped cannot possibly happen when running on
> top of Xen, as it sits in code that never gets used in this case.
> A possibly simple, but rather hacky solution might be to always set
> that variable when running on Xen. But that wouldn't cover the case
> of a kernel being built without CONFIG_INTEL_IOMMU, yet in that
> case the driver might still run with an IOMMU enabled underneath.
> (In our case I can simply always #define intel_iommu_gfx_mapped
> to 1, with the INTEL_IOMMU option getting forcibly disabled for the
> Xen kernel flavors anyway. Whether that's entirely correct when
> not running on an enabled IOMMU I can't tell yet, and don't know
> whom to ask.)
> And that wouldn't cover the IGD getting passed through to a DomU
> at all - obviously Xen's ability to properly drive all IOMMU operations
> (including qinval) must not depend on the owning guest's driver code.
> I have to admit though that it entirely escapes me why a graphics
> driver needs to peek into IOMMU code/state in the first place. This
> very much smells of bad design.

This all makes sense, and I agree with your assessment.
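
To check that I am reading the "always set it on Xen" idea correctly, here
is a minimal standalone sketch of the decision, not actual kernel code. The
struct and function names are hypothetical; the fields only stand in for
CONFIG_INTEL_IOMMU, intel_iommu_gfx_mapped, and a "running on Xen" check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of what the graphics driver can observe.
 * None of these names are the kernel's own. */
struct platform {
    bool built_with_intel_iommu; /* stands in for CONFIG_INTEL_IOMMU */
    bool native_gfx_mapped;      /* stands in for intel_iommu_gfx_mapped */
    bool running_on_xen;         /* stands in for a "running on Xen" check */
};

/* Behaviour as described above: the driver trusts only the native flag,
 * and only when the IOMMU code is compiled in. On Xen the flag is never
 * set, because the code that sets it never runs. */
static bool gfx_behind_iommu_current(const struct platform *p)
{
    assert(p != NULL);
    return p->built_with_intel_iommu && p->native_gfx_mapped;
}

/* The hacky variant: on Xen, always assume the IGD may sit behind an
 * enabled IOMMU, regardless of how the kernel was configured. */
static bool gfx_behind_iommu_xen_hack(const struct platform *p)
{
    assert(p != NULL);
    return p->running_on_xen || gfx_behind_iommu_current(p);
}
```

In this model a Xen dom0 always takes the IOMMU-aware paths, including when
the kernel was built without the config option. As you point out, it still
does nothing for an IGD passed through to a DomU.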

Unfortunately, I got the machine back from our QA department to run
some tests on this, and now I am unable to reproduce the issue, so I
cannot confirm that your analysis is correct.
It was 100% reproducible a week ago, and now I can't make it happen
with the same code base and build.

It is all very strange, and smells of a race condition or an
uninitialized variable.
I blame alpha particles.
