
Re: [Xen-devel] [PATCH] amd iommu: Dump flags of IO page faults



Friday, September 7, 2012, 10:54:40 AM, you wrote:

> On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
>>
>> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>>
>>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>>>
>>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>>
>>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>>>
>>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>>
>>>>>>> Hi Jan,
>>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA 
>>>>>>> fault.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Wei
>>>>>>
>>>>>>> signed-off-by: Wei Wang<wei.wang2@xxxxxxx>
>>>>>>
>>>>>>
>>>>>> I have applied the patch and the flags seem to differ between the faults:
>>>>>>
>>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 
>>>>>> 0xc2c2c2c0, flags = 0x000
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id 
>>>>>> = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device 
>>>>>> id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device 
>>>>>> id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
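The device id and flags fields in the log lines above can be decoded by hand. Below is a minimal sketch, not part of the patch: the helper names are made up for illustration, and the flag bit values are the EVENT_FLAG_* constants from Linux's amd_iommu_types.h, assumed here to describe the same 12-bit flags field that the patch prints.

```python
# Flag bit values as named in Linux's amd_iommu_types.h (EVENT_FLAG_*).
# Assumption: the Xen patch prints the same 12-bit flags field.
EVENT_FLAG_I = 0x008   # fault was raised by an interrupt request
EVENT_FLAG_PR = 0x010  # translation was present (permission fault)
EVENT_FLAG_RW = 0x020  # faulting access was a write
EVENT_FLAG_TR = 0x100  # fault was raised by a translation request


def decode_device_id(device_id):
    """Split a 16-bit PCI requester id into (bus, device, function)."""
    return (device_id >> 8) & 0xFF, (device_id >> 3) & 0x1F, device_id & 0x7


def format_bdf(device_id):
    """Render a device id the way lspci prints it, e.g. '0a:00.6'."""
    bus, dev, fn = decode_device_id(device_id)
    return "%02x:%02x.%x" % (bus, dev, fn)


def decode_flags(flags):
    """Return the individual flag bits as booleans."""
    return {
        "interrupt": bool(flags & EVENT_FLAG_I),
        "present": bool(flags & EVENT_FLAG_PR),
        "write": bool(flags & EVENT_FLAG_RW),
        "translation_request": bool(flags & EVENT_FLAG_TR),
    }


# The two distinct (device id, flags) pairs from the log above.
for device_id, flags in ((0x0A06, 0x000), (0x0700, 0x020)):
    print(format_bdf(device_id), decode_flags(flags))
```

Under these assumptions, neither fault has the interrupt bit set; device 0x0a06 maps to 0a:00.6 and device 0x0700 to 07:00.0, and only the 0x020 faults are flagged as writes.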
>>>>
>>>>> OK, so they are not interrupt requests. I guess further information from
>>>>> your system would be helpful to debug this issue:
>>>>> 1) xl info
>>>>> 2) xl list
>>>>> 3) lspci -vvv (NOTE: not in dom0 but in your guest)
>>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>>>
>>>> dom14 is not an HVM guest, it's a PV guest.
>>
>>> Ah, I see. A PV guest is quite different from HVM: it does not use p2m
>>> tables as io page tables, so the no-sharept option has no effect in this
>>> case. PV guests always use separate io page tables. There might be some
>>> incorrect mappings in the page tables. I will check this on my side.
>>
>> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept 
>> everything else the same.
>> I haven't seen any IO PAGE FAULTS after that.
>>
>> I did spot some differences in the output from lspci between xen 4.1 and 
>> 4.2, related to MSI enabled or not for the IOMMU device.
>> Have attached the xl/xm dmesg and lspci from booting with both versions.
>>
>> lspci:
>>
>> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O 
>> Memory Management Unit (IOMMU) [1002:5a23]
>>          Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit 
>> (IOMMU) [1002:5a23]
>>          Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
>> Stepping- SERR- FastB2B- DisINTx-
>>          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
>> DEVSEL=fast>TAbort-<TAbort-<MAbort->SERR-<PERR- INTx-
>>          Latency: 0
>>          Interrupt: pin A routed to IRQ 10
>>          Capabilities: [40] Secure device<?>
>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+

> Eh... That is interesting. So which dom0 are you using? There is a c/s
> in 4.2 that prevents a recent dom0 from disabling the iommu interrupt
> (changeset 25492:61844569a432). Otherwise, the iommu cannot send any events,
> including IO PAGE faults. You could try reverting dom0 to an older version,
> like a 2.6 pv_ops kernel, to see if you really have no io page faults on 4.1.

OK, I finally found the time to do some more testing. I tested 4.2 around that
changeset, and made a copy of the guest that uses HVM instead of PV.

The results:
- On xen-4.1.* and a 3.6-rc6 kernel (dom0 and domU): the passed-through video
  device works fine, both in an HVM and in a PV guest. I don't see any IO page
  faults getting reported.
- On xen-4.2 changeset < 25492 and a 3.6-rc6 kernel (dom0 and domU): the
  passed-through video device works fine, both in an HVM and in a PV guest. I
  don't see any IO page faults getting reported.
- On xen-4.2 changeset > 25492 and a 3.6-rc6 kernel (dom0 and domU): the
  passed-through video device works fine for a short while (around 5 to 10
  minutes) in a PV guest. After that, IO page faults get reported and the
  video freezes. I don't see any errors in the guest though.
- On xen-unstable tip and a 3.6-rc6 kernel (dom0 and domU):
  - PV: the passed-through video device works fine for a short while (around
    5 to 10 minutes). After that, IO page faults get reported and the video
    freezes. I don't see any errors in the guest though.
  - HVM: the passed-through video device doesn't work from the start:
    - The device is there according to lspci.
    - The video application starts fine, but delivers a green image, so the
      device is not working properly. I don't see IO page faults though.

Attached are (all taken with xen-unstable tip and the guest running as HVM, domain 15):
- xl dmesg
- Patch which adds some more info, but all values reported seem to be zero (see 
xl dmesg)
- lspci dom0
- lspci HVM guest




>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>                  Address: 00000000fee0100c  Data: 4128
>>          Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>>
>> Although it seems enabled, shouldn't the IRQ number used be much higher than 
>> 10 for MSI interrupts ?

> The IRQ number is fine. MSI vector is stored at  Data: 4128
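To make the point about the vector concrete, here is a small sketch that splits the Address and Data values from the 4.2 lspci output into their fields, following the standard x86 MSI message layout. The function name and the dictionary keys are made up for illustration, and lspci prints both values in hexadecimal.

```python
def decode_msi(address, data):
    """Split an x86 MSI address/data pair into its architectural fields."""
    return {
        "vector": data & 0xFF,                      # data bits 7:0
        "delivery_mode": (data >> 8) & 0x7,         # data bits 10:8
        "level_assert": bool(data & (1 << 14)),     # data bit 14
        "level_triggered": bool(data & (1 << 15)),  # data bit 15
        "dest_id": (address >> 12) & 0xFF,          # address bits 19:12
        "redirection_hint": bool(address & (1 << 3)),
        "logical_dest": bool(address & (1 << 2)),
    }


# Values copied from the "4.2:" MSI capability lines quoted above.
msi = decode_msi(0xFEE0100C, 0x4128)
print("vector = 0x%02x, delivery mode = %d" % (msi["vector"], msi["delivery_mode"]))
```

With these values the vector comes out as 0x28, which is independent of the legacy "IRQ 10" that lspci reports for the INTx pin.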

>>
>> There is another difference in the bridge device that's in front of the  
>> 0a:00.6 device that faults before the kernel is even booted.
>>
>> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI 
>> express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>>          Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
>> Stepping- SERR+ FastB2B- DisINTx+
>> 4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
>> DEVSEL=fast>TAbort-<TAbort-<MAbort->SERR-<PERR- INTx-
>> 4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
>> DEVSEL=fast>TAbort-<TAbort+<MAbort->SERR-<PERR- INTx-
>>          Latency: 0, Cache Line Size: 64 bytes
>>          Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>>          I/O behind bridge: 0000f000-00000fff
>>          Memory behind bridge: f9f00000-f9ffffff
>>          Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>> 4.1:    Secondary status: 66MHz- FastB2B- ParErr- 
>> DEVSEL=fast>TAbort-<TAbort-<MAbort-<SERR-<PERR-
>> 4.2:    Secondary status: 66MHz- FastB2B- ParErr- 
>> DEVSEL=fast>TAbort+<TAbort-<MAbort-<SERR-<PERR-
>>          BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort->Reset- FastB2B-
>>                  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>>          Capabilities: [50] Power Management version 3
>>                  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
>> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>                  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>>          Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>>                  DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency 
>> L0s<64ns, L1<1us
>>                          ExtTag+ RBE+ FLReset-
>>                  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
>> Unsupported-
>>                          RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>                          MaxPayload 128 bytes, MaxReadReq 128 bytes
>>                  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- 
>> TransPend-
>>                  LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, 
>> Latency L0<1us, L1<8us
>>                          ClockPM- Surprise- LLActRep+ BwNot+
>>                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- 
>> CommClk-
>>                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
>> DLActive+ BWMgmt+ ABWMgmt-
>>                  SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- 
>> Surprise-
>>                          Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
>>                  SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- 
>> HPIrq- LinkChg-
>>                          Control: AttnInd Unknown, PwrInd Unknown, Power- 
>> Interlock-
>>                  SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ 
>> Interlock-
>>                          Changed: MRL- PresDet+ LinkState+

> That is probably because of the IO_PAGE_FAULT.

> Thanks,
> Wei

>> serveerstertje:~# lspci -t
>> -[0000:00]-+-00.0
>>             +-00.2
>>             +-02.0-[0b]----00.0
>>             +-03.0-[0a]--+-00.0
>>             |            +-00.1
>>             |            +-00.2
>>             |            +-00.3
>>             |            +-00.4
>>             |            +-00.5
>>             |            +-00.6
>>             |            \-00.7
>>             +-05.0-[09]----00.0
>>             +-06.0-[08]----00.0
>>             +-0a.0-[07]----00.0
>>             +-0b.0-[06]--+-00.0
>>             |            \-00.1
>>             +-0c.0-[05]----00.0
>>             +-0d.0-[04]--+-00.0
>>             |            +-00.1
>>             |            +-00.2
>>             |            +-00.3
>>             |            +-00.4
>>             |            +-00.5
>>             |            +-00.6
>>             |            \-00.7
>>             +-11.0
>>             +-12.0
>>             +-12.2
>>             +-13.0
>>             +-13.2
>>             +-14.0
>>             +-14.3
>>             +-14.4-[03]----06.0
>>             +-14.5
>>             +-15.0-[02]--
>>             +-16.0
>>             +-16.2
>>             +-18.0
>>             +-18.1
>>             +-18.2
>>             +-18.3
>>             \-18.4
>>
>>
>>
>>
>>
>>> Thanks,
>>> Wei
>>
>>>> I will try to make a complete package, and try with one pv domain only 
>>>> where the devices are being passed through just to simplify the setup.
>>>>
>>>>
>>>>> * I would also like to know the symptoms of device 0x0700 when IO_PF
>>>>> happened. Did it stop working?
>>>>
>>>> Yes it stops working, the video capture just freezes, but the driver 
>>>> doesn't bail out.
>>>> For the USB controller (0x0a06) it starts to give errors for usbdev_open 
>>>> in the guest.
>>>>
>>>>> (BTW: I copied a few options from your boot cmd line and it worked with
>>>>> my RD890 system
>>>>
>>>>> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
>>>>> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
>>>>> apic=debug iommu=on,verbose,debug,no-sharept
>>>>
>>>>> * so, what OEM board do you have?)
>>>>
>>>> MSI 890FXA-GD70
>>>>
>>>>> Also from your log, these lines look very strange:
>>>>
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xd5, mfn=0xa4a11
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xd7, mfn=0xa4a0f
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xd9, mfn=0xa4a0d
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xdb, mfn=0xa4a0b
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xdd, mfn=0xa4a09
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xdf, mfn=0xa4a07
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xe1, mfn=0xa4a05
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xe3, mfn=0xa4a03
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xe5, mfn=0xa4a01
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xe7, mfn=0xa463f
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xe9, mfn=0xa463d
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xeb, mfn=0xa463b
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xed, mfn=0xa4639
>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>> read-only memory page. gfn=0xef, mfn=0xa4637
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id
>>>>> = 0x0a06, fault address = 0xc2c2c2c0
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>> id = 0x0700, fault address = 0xa90f8300
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>> id = 0x0700, fault address = 0xa90f8340
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>> id = 0x0700, fault address = 0xa90f8380
>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>> id = 0x0700, fault address = 0xa90f83c0
>>>>
>>>>> * they are immediately followed by the IO PAGE faults. Do you know where
>>>>> they come from? Your video card driver maybe?
>>>>
>>>>    From an HVM domain with an old (3.0.3) kernel, but the faults also occur
>>>> without this domain being started.
>>>>
>>>>
>>>>> Thanks,
>>>>> Wei
>>>>
>>>>
>>>>>> Complete xl dmesg and lspci -vvvknn attached.
>>>>>>
>>>>>> Thx
>>>>>>
>>>>>> --
>>>>>> Sander
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>


Attachment: lspci-dom0.txt
Description: Text document

Attachment: lspci-hvm.txt
Description: Text document

Attachment: patch.diff
Description: Binary data

Attachment: xl-dmesg-hvm.txt
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

