
Re: [Xen-devel] [PATCH] amd iommu: Dump flags of IO page faults



Monday, September 24, 2012, 2:24:16 PM, you wrote:

> On 09/24/2012 10:38 AM, Sander Eikelenboom wrote:
>>
>> Friday, September 7, 2012, 10:54:40 AM, you wrote:
>>
>>> On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:
>>>>
>>>> Thursday, September 6, 2012, 5:03:05 PM, you wrote:
>>>>
>>>>> On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:
>>>>>>
>>>>>> Thursday, September 6, 2012, 3:32:51 PM, you wrote:
>>>>>>
>>>>>>> On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:
>>>>>>>>
>>>>>>>> Wednesday, September 5, 2012, 4:42:42 PM, you wrote:
>>>>>>>>
>>>>>>>>> Hi Jan,
>>>>>>>>> Attached patch dumps io page fault flags. The flags show the reason of
>>>>>>>>> the fault and tell us if this is an unmapped interrupt fault or a DMA 
>>>>>>>>> fault.
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Wei
>>>>>>>>
>>>>>>>>> Signed-off-by: Wei Wang <wei.wang2@xxxxxxx>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have applied the patch and the flags seem to differ between the 
>>>>>>>> faults:
>>>>>>>>
>>>>>>>> AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 
>>>>>>>> 0xc2c2c2c0, flags = 0x000
>>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device 
>>>>>>>> id = 0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
>>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device 
>>>>>>>> id = 0x0700, fault address = 0xa8d339e0, flags = 0x020
>>>>>>>> (XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device 
>>>>>>>> id = 0x0700, fault address = 0xa8d33a40, flags = 0x020
>>>>>>
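A minimal sketch for reading the fault lines above, not part of the patch. It assumes the 16-bit device id is the usual PCI bus/device/function encoding (bus in bits 15:8, device in bits 7:3, function in bits 2:0). The flag bit values are an assumption, recalled from the EVENT_FLAG_* definitions in the Linux AMD IOMMU driver headers, and should be checked against the AMD IOMMU specification. The function names are hypothetical.

```python
import re

# Assumed flag bits (recalled from Linux's EVENT_FLAG_* definitions; verify
# against the AMD IOMMU specification before relying on them).
EVENT_FLAGS = {
    0x008: "I (interrupt request)",
    0x010: "PR (page present)",
    0x020: "RW (write access)",
    0x100: "TR (translation request)",
}

# Matches one unwrapped log line; the flags field is optional because the
# log lines from before the patch do not print it.
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT: domain = (\d+), device\s+id\s*=\s*(0x[0-9a-fA-F]+), "
    r"fault address\s*=\s*(0x[0-9a-fA-F]+)(?:, flags = (0x[0-9a-fA-F]+))?"
)


def decode_bdf(device_id):
    """Split a 16-bit device id into bus:device.function notation."""
    bus = (device_id >> 8) & 0xFF
    dev = (device_id >> 3) & 0x1F
    fn = device_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


def decode_flags(flags):
    """Return the names of the assumed flag bits that are set."""
    return [name for bit, name in sorted(EVENT_FLAGS.items()) if flags & bit]


def parse_fault(line):
    """Parse one IO_PAGE_FAULT log line, or return None if it does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    flags = int(m.group(4), 16) if m.group(4) else None
    return {
        "domain": int(m.group(1)),
        "bdf": decode_bdf(int(m.group(2), 16)),
        "address": int(m.group(3), 16),
        "flags": flags,
        "flag_names": decode_flags(flags) if flags is not None else [],
    }
```

Under these assumptions, device id 0x0a06 is 0a:00.6 and 0x0700 is 07:00.0, and flags = 0x020 has only the RW bit set, with the interrupt bit clear.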
>>>>>>> OK, so they are not interrupt requests. I guess further information from
>>>>>>> your system would be helpful to debug this issue:
>>>>>>> 1) xl info
>>>>>>> 2) xl list
>>>>>>> 3) lspci -vvv (NOTE: not in dom0 but in your guest)
>>>>>>> 4) cat /proc/iomem (in both dom0 and your hvm guest)
>>>>>>
>>>>>> dom14 is not an HVM guest, it's a PV guest.
>>>>
>>>>> Ah, I see. A PV guest is quite different from HVM: it does not use p2m
>>>>> tables as io page tables, so the no-sharept option has no effect in
>>>>> this case. PV guests always use separate io page tables. There might be
>>>>> some incorrect mappings in the page tables. I will check this on my side.
>>>>
>>>> I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept 
>>>> everything else the same.
>>>> I haven't seen any IO PAGE FAULTS after that.
>>>>
>>>> I did spot some differences in the output from lspci between xen 4.1 and 
>>>> 4.2, related to MSI enabled or not for the IOMMU device.
>>>> Have attached the xl/xm dmesg and lspci from booting with both versions.
>>>>
>>>> lspci:
>>>>
>>>> 00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O 
>>>> Memory Management Unit (IOMMU) [1002:5a23]
>>>>           Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit 
>>>> (IOMMU) [1002:5a23]
>>>>           Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- 
>>>> ParErr- Stepping- SERR- FastB2B- DisINTx-
>>>>           Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
>>>> DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>           Latency: 0
>>>>           Interrupt: pin A routed to IRQ 10
>>>>           Capabilities: [40] Secure device <?>
>>>> 4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+
>>
>>> Eh... That is interesting. So which dom0 are you using? There is a c/s
>>> in 4.2 to prevent a recent dom0 from disabling the iommu interrupt
>>> (changeset 25492:61844569a432). Otherwise, the iommu cannot send any
>>> events, including IO PAGE faults. You could try to revert dom0 to an old
>>> version like 2.6 pv_ops to see if you really have no io page faults on 4.1.
>>
>> OK, I finally got the time to do some more testing. I tested 4.2 around
>> that changeset, and made a copy of the guest using HVM instead of PV.
>>
>> The results:
>> - On xen-4.1.* and a 3.6-rc6 kernel (dom0 and domU): the video device
>>   passed through works fine, both in an HVM and in a PV guest. I don't see
>>   IO page faults getting reported.
>> - On xen-4.2 changeset < 25492 and a 3.6-rc6 kernel (dom0 and domU): the
>>   video device passed through works fine, both in an HVM and in a PV guest.
>>   I don't see IO page faults getting reported.
>> - On xen-4.2 changeset > 25492 and a 3.6-rc6 kernel (dom0 and domU): the
>>   video device passed through works fine for a short while (around 5 to 10
>>   minutes) in a PV guest. After that IO page faults get reported and the
>>   video freezes. I don't see any errors in the guest though.
>> - On xen-unstable tip and a 3.6-rc6 kernel (dom0 and domU):
>>   - PV: the video device passed through works fine for a short while
>>     (around 5 to 10 minutes). After that IO page faults get reported and
>>     the video freezes. I don't see any errors in the guest though.
>>   - HVM: the video device passed through doesn't work from the start:
>>     - The device is there according to lspci.
>>     - The video application starts fine, but delivers a green image, so
>>       the device is not working properly. I don't see IO page faults
>>       though.
>>
>> Attached are (all with xen-unstable tip and the guest as HVM, domain 15):
>> - xl dmesg
>> - Patch which adds some more info, but all values reported seem to be zero
>>   (see xl dmesg)
>> - lspci dom0
>> - lspci HVM guest

> Hi,
> Thanks for the information, it is very helpful for debugging. I hope I 
> can start to look at this right after sending my next iommu patch 
> queue upstream. Another question: did you see this issue on a system 
> with a single pv/hvm guest, or only on a system with about 16 running 
> VMs?

The issue of the hvm not giving a video image also happens when it's the first 
and only guest running after a cold boot.

> Thanks,
> Wei

>>
>>
>>
>>>> 4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>>>                   Address: 00000000fee0100c  Data: 4128
>>>>           Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>>>>
>>>> Although it seems enabled, shouldn't the IRQ number used be much higher 
>>>> than 10 for MSI interrupts ?
>>
>>> The IRQ number is fine. MSI vector is stored at  Data: 4128
>>
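A minimal sketch of how the Address and Data values above can be read, assuming the standard x86 MSI message layout and that lspci prints both fields in hexadecimal. The function name and the dictionary keys are hypothetical, chosen only for illustration.

```python
def decode_msi(address, data):
    """Decode an x86 MSI address/data pair into its main fields.

    Assumes the conventional x86 layout: destination id in address bits
    19:12, vector in data bits 7:0, delivery mode in data bits 10:8.
    """
    return {
        "dest_id": (address >> 12) & 0xFF,
        "redirection_hint": (address >> 3) & 0x1,
        "dest_mode_logical": (address >> 2) & 0x1,
        "vector": data & 0xFF,
        "delivery_mode": (data >> 8) & 0x7,
        "level_assert": (data >> 14) & 0x1,
        "trigger_level": (data >> 15) & 0x1,
    }


# Values taken from the lspci output quoted in this thread.
msi = decode_msi(0x00000000FEE0100C, 0x4128)
```

With these assumptions, Data 4128 carries vector 0x28 in its low byte, so the "IRQ 10" shown by lspci is unrelated to the MSI vector that is actually programmed.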
>>>>
>>>> There is another difference in the bridge device that's in front of the  
>>>> 0a:00.6 device that faults before the kernel is even booted.
>>>>
>>>> 00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge 
>>>> (PCI express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
>>>>           Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
>>>> ParErr- Stepping- SERR+ FastB2B- DisINTx+
>>>> 4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
>>>> DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>> 4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
>>>> DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>>>>           Latency: 0, Cache Line Size: 64 bytes
>>>>           Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
>>>>           I/O behind bridge: 0000f000-00000fff
>>>>           Memory behind bridge: f9f00000-f9ffffff
>>>>           Prefetchable memory behind bridge: 
>>>> 00000000fff00000-00000000000fffff
>>>> 4.1:    Secondary status: 66MHz- FastB2B- ParErr- 
>>>> DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>>>> 4.2:    Secondary status: 66MHz- FastB2B- ParErr- 
>>>> DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-
>>>>           BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort->Reset- FastB2B-
>>>>                   PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>>>>           Capabilities: [50] Power Management version 3
>>>>                   Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
>>>> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>>>>                   Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>>>>           Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>>>>                   DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency 
>>>> L0s<64ns, L1<1us
>>>>                           ExtTag+ RBE+ FLReset-
>>>>                   DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
>>>> Unsupported-
>>>>                           RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>>>>                           MaxPayload 128 bytes, MaxReadReq 128 bytes
>>>>                   DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- 
>>>> TransPend-
>>>>                   LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, 
>>>> Latency L0<1us, L1<8us
>>>>                           ClockPM- Surprise- LLActRep+ BwNot+
>>>>                   LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- 
>>>> CommClk-
>>>>                           ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>>>>                   LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
>>>> DLActive+ BWMgmt+ ABWMgmt-
>>>>                   SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- 
>>>> Surprise-
>>>>                           Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
>>>>                   SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- 
>>>> HPIrq- LinkChg-
>>>>                           Control: AttnInd Unknown, PwrInd Unknown, Power- 
>>>> Interlock-
>>>>                   SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- 
>>>> PresDet+ Interlock-
>>>>                           Changed: MRL- PresDet+ LinkState+
>>
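The 4.1 versus 4.2 differences in the bridge output above are easy to miss by eye. The following is a rough sketch for comparing two lspci flag lines, assuming each flag is a name followed by "+" or "-" and optionally prefixed by "<" or ">". The helper names are hypothetical, and this is not an lspci feature.

```python
import re

# A flag is an optional direction marker, a name, and a trailing + or -.
FLAG_RE = re.compile(r"[<>]?[A-Za-z0-9]+[+-]")


def parse_flags(line):
    """Map each flag name (including any < or > prefix) to '+' or '-'."""
    return {tok[:-1]: tok[-1] for tok in FLAG_RE.findall(line)}


def changed_flags(old_line, new_line):
    """Return {flag: (old_state, new_state)} for flags that differ."""
    old, new = parse_flags(old_line), parse_flags(new_line)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


# The two Status lines quoted above for bridge 00:03.0.
status_41 = ("Status: Cap+ 66MHz- UDF- FastB2B- ParErr- "
             "DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-")
status_42 = ("Status: Cap+ 66MHz- UDF- FastB2B- ParErr- "
             "DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-")
diff = changed_flags(status_41, status_42)
```

For the two Status lines this reports only the received target abort flag (<TAbort) going from "-" to "+". The same call on the Secondary status lines would show >TAbort changing.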
>>> That is probably because of the IO_PAGE_FAULT.
>>
>>> Thanks,
>>> Wei
>>
>>>> serveerstertje:~# lspci -t
>>>> -[0000:00]-+-00.0
>>>>              +-00.2
>>>>              +-02.0-[0b]----00.0
>>>>              +-03.0-[0a]--+-00.0
>>>>              |            +-00.1
>>>>              |            +-00.2
>>>>              |            +-00.3
>>>>              |            +-00.4
>>>>              |            +-00.5
>>>>              |            +-00.6
>>>>              |            \-00.7
>>>>              +-05.0-[09]----00.0
>>>>              +-06.0-[08]----00.0
>>>>              +-0a.0-[07]----00.0
>>>>              +-0b.0-[06]--+-00.0
>>>>              |            \-00.1
>>>>              +-0c.0-[05]----00.0
>>>>              +-0d.0-[04]--+-00.0
>>>>              |            +-00.1
>>>>              |            +-00.2
>>>>              |            +-00.3
>>>>              |            +-00.4
>>>>              |            +-00.5
>>>>              |            +-00.6
>>>>              |            \-00.7
>>>>              +-11.0
>>>>              +-12.0
>>>>              +-12.2
>>>>              +-13.0
>>>>              +-13.2
>>>>              +-14.0
>>>>              +-14.3
>>>>              +-14.4-[03]----06.0
>>>>              +-14.5
>>>>              +-15.0-[02]--
>>>>              +-16.0
>>>>              +-16.2
>>>>              +-18.0
>>>>              +-18.1
>>>>              +-18.2
>>>>              +-18.3
>>>>              \-18.4
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>> Wei
>>>>
>>>>>> I will try to make a complete package, and try with one pv domain only 
>>>>>> where the devices are being passed through just to simplify the setup.
>>>>>>
>>>>>>
>>>>>>> * I would also like to know the symptoms of device 0x0700 when IO_PF
>>>>>>> happened. Did it stop working?
>>>>>>
>>>>>> Yes it stops working, the video capture just freezes, but the driver 
>>>>>> doesn't bail out.
>>>>>> For the USB controller (0x0a06) it starts to give errors for usbdev_open 
>>>>>> in the guest.
>>>>>>
>>>>>>> (BTW: I copied a few options from your boot cmd line and it worked with
>>>>>>> my RD890 system
>>>>>>
>>>>>>> dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
>>>>>>> cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
>>>>>>> apic=debug iommu=on,verbose,debug,no-sharept
>>>>>>
>>>>>>> * so, what OEM board do you have?)
>>>>>>
>>>>>> MSI 890FXA-GD70
>>>>>>
>>>>>>> Also from your log, these lines looks very strange:
>>>>>>
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xd5, mfn=0xa4a11
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xd7, mfn=0xa4a0f
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xd9, mfn=0xa4a0d
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xdb, mfn=0xa4a0b
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xdd, mfn=0xa4a09
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xdf, mfn=0xa4a07
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xe1, mfn=0xa4a05
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xe3, mfn=0xa4a03
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xe5, mfn=0xa4a01
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xe7, mfn=0xa463f
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xe9, mfn=0xa463d
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xeb, mfn=0xa463b
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xed, mfn=0xa4639
>>>>>>> (XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
>>>>>>> read-only memory page. gfn=0xef, mfn=0xa4637
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id
>>>>>>> = 0x0a06, fault address = 0xc2c2c2c0
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>>>> id = 0x0700, fault address = 0xa90f8300
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>>>> id = 0x0700, fault address = 0xa90f8340
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>>>> id = 0x0700, fault address = 0xa90f8380
>>>>>>> (XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
>>>>>>> id = 0x0700, fault address = 0xa90f83c0
>>>>>>
>>>>>>> * they are directly followed by the IO PAGE faults. Do you know where
>>>>>>> they are from? Your video card driver maybe?
>>>>>>
>>>>>> From an HVM domain with an old (3.0.3) kernel, but the faults also 
>>>>>> occur without this domain being started.
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Wei
>>>>>>
>>>>>>
>>>>>>>> Complete xl dmesg and lspci -vvvknn attached.
>>>>>>>>
>>>>>>>> Thx
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sander
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>





_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

