
Re: [Xen-devel] S3 crash with VTD Queue Invalidation enabled



On Fri, Jun 14, 2013 at 1:01 PM, Ben Guthro <ben@xxxxxxxxxx> wrote:
> On Fri, Jun 14, 2013 at 4:38 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>> On 06.06.13 at 01:53, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>>> Early in the boot process, I see queue_invalidate_wait() called for
>>>> DRHD unit 0, and 1
>>>> (unit 0 is wired up to the IGD, unit 1 is everything else)
>>>>
>>>> Up until i915 does the following, I see that unit being flushed with
>>>> queue_invalidate_wait() :
>>>>
>>>> [    0.704537] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
>>>> [    0.704537] ENERGY_PERF_BIAS: View and update with x86_energy_p
>>>> (XEN) XXX queue_invalidate_wait:282 CPU0 DRHD0 ret=0
>>>> (XEN) XXX queue_invalidate_wait:282 CPU0 DRHD0 ret=0
>>>> [    1.983028] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to
>>>> bit banging on pin 5
>>>> [    2.253551] fbcon: inteldrmfb (fb0) is primary device
>>>> [    3.111838] Console: switching to colour frame buffer device 170x48
>>>> [    3.171631] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
>>>> [    3.171634] i915 0000:00:02.0: registered panic notifier
>>>> [    3.173339] acpi device:00: registered as cooling_device1
>>>> [    3.173401] ACPI: Video Device [VID] (multi-head: yes  rom: no  post: 
>>>> no)
>>>> [    3.173962] input: Video Bus as
>>>> /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input4
>>>> [    3.174232] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on 
>>>> minor 0
>>>> [    3.174258] ahci 0000:00:1f.2: version 3.0
>>>> [    3.174270] xen: registering gsi 19 triggering 0 polarity 1
>>>> [    3.174274] Already setup the GSI :19
>>>>
>>>>
>>>> After that - the unit never seems to be flushed.
>>
>> With queue_invalidate_wait() having a single caller -
>> invalidate_sync() -, and with invalidate_sync() being called from
>> all interrupt setup (IO-APIC as well as MSI), that's quite odd to be
>> the case. At least upon network driver load or interface-up, this
>> should be getting called.
>>
>>>> ...until we enter into the S3 hypercall, which loops over all DRHD
>>>> units, and explicitly flushes all of them via iommu_flush_all()
>>>>
>>>> It is at that point that it hangs up when talking to the device that
>>>> the IGD is plumbed up to.
>>>>
>>>>
>>>> Does this point to something in the i915 driver doing something that
>>>> is incompatible with Xen?
>>>
>>> I actually separated it from the S3 hypercall, adding a new debug key
>>> 'F' - to just call iommu_flush_all()
>>> I can crash it on demand with this.
>>>
>>> Booting with "i915.modeset=0 single" (to prevent both KMS, and Xorg) -
>>> it does not occur.
>>> So, that pretty much narrows it down to the IGD, in my mind.
>>
>> Which reminds me of a change I did several weeks back to our kernel,
>> but which isn't as easily done with pv-ops: There are a number of
>> cases in the AGP and DRM code that qualify upon CONFIG_INTEL_IOMMU
>> and use intel_iommu_gfx_mapped. As you certainly know, Linux when
>> running on Xen doesn't see any IOMMU, and hence the config option
>> being enabled or disabled is completely unrelated to whether the
>> driver actually runs on top of an enabled IOMMU. Similarly the setting
>> of intel_iommu_gfx_mapped cannot possibly happen when running on
>> top of Xen, as it sits in code that never gets used in this case.
>>
>> A possibly simple, but rather hacky solution might be to always set
>> that variable when running on Xen. But that wouldn't cover the case
>> of a kernel being built without CONFIG_INTEL_IOMMU, yet in that
>> case the driver might still run with an IOMMU enabled underneath.
>> (In our case I can simply always #define intel_iommu_gfx_mapped
>> to 1, with the INTEL_IOMMU option getting forcibly disabled for the
>> Xen kernel flavors anyway. Whether that's entirely correct when
>> not running on an enabled IOMMU I can't tell yet, and don't know
>> whom to ask.)
>>
>> And that wouldn't cover the IGD getting passed through to a DomU
>> at all - obviously Xen's ability to properly drive all IOMMU operations
>> (including qinval) must not depend on the owning guest's driver code.
>>
>> I have to admit though that it entirely escapes me why a graphics
>> driver needs to peek into IOMMU code/state in the first place. This
>> very much smells of bad design.
>
>
> This all makes sense, and I agree with your assessment.
>
> Unfortunately, I went and got the machine back from our QA department,
> to do some tests on this, and now I am unable to reproduce the issue,
> to prove your analysis is correct.
> It was 100% reproducible a week ago, and now I can't make it happen,
> using the same code base & build.
>
> It is all very strange, and smells of a race condition, or
> uninitialized variable.
> I blame Alpha particles.

I did a little more bisecting of our builds, and it turns out I was not
actually testing the version I thought I was. The bisection showed that
the issue had been inadvertently fixed by a change someone else
committed to solve an unrelated problem.

The following changes made the difference:

Revert:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c79c49826270b8b0061b2fca840fc3f013c8a78a

Apply:
https://lkml.org/lkml/2012/2/10/229

I don't have a good explanation as to why re-enabling PAT would change
the behavior of this IOMMU feature, but I have a very reproducible
test case showing that it does.

Konrad - do you have any theories that would explain this one?
Or, would we like to leave this one as "Here be Dragons" and look the other way?
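
For anyone following the thread, here is a toy model of the failure mode
discussed above: a flush-all loop that visits every DRHD unit and waits for
each one to acknowledge, where one unit never answers. This is only a sketch
and not Xen's code. Every name in it (fake_drhd, fake_invalidate_wait,
fake_flush_all, MAX_POLLS) is hypothetical, and the bounded poll is an
assumption made so that an unresponsive unit shows up as an error return
instead of a hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of one DRHD unit; not Xen's real data structure. */
struct fake_drhd {
    int id;
    bool responds;            /* false models a unit that never completes a wait */
    volatile uint32_t status; /* status word the "hardware" would write back */
};

enum { STATUS_DONE = 1, MAX_POLLS = 1000 };

/* Stand-in for the hardware processing a queued wait request. */
static void fake_hw_step(struct fake_drhd *u)
{
    if (u->responds)
        u->status = STATUS_DONE;
}

/* Returns 0 once the unit acknowledges, -1 if the poll budget runs out. */
static int fake_invalidate_wait(struct fake_drhd *u)
{
    u->status = 0;
    for (int i = 0; i < MAX_POLLS; i++) {
        fake_hw_step(u);
        if (u->status == STATUS_DONE)
            return 0;
    }
    return -1; /* an unbounded loop here would spin forever instead */
}

/* Flush every unit in turn; returns how many units timed out. */
static int fake_flush_all(struct fake_drhd *units, int n)
{
    assert(units != NULL && n >= 0);

    int failed = 0;
    for (int i = 0; i < n; i++) {
        int ret = fake_invalidate_wait(&units[i]);
        printf("DRHD%d ret=%d\n", units[i].id, ret);
        if (ret != 0)
            failed++;
    }
    return failed;
}
```

Calling fake_flush_all() on two units, with unit 0 marked as not responding,
reports a failure for unit 0 and success for unit 1. That mirrors the shape of
the debug-key experiment, although it says nothing about why the real unit
stops answering.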

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

