
Re: [Xen-devel] [PATCH] amd iommu: Dump flags of IO page faults



On 09/24/2012 10:38 AM, Sander Eikelenboom wrote:

Friday, September 7, 2012, 10:54:40 AM, you wrote:

On 09/07/2012 09:32 AM, Sander Eikelenboom wrote:

Thursday, September 6, 2012, 5:03:05 PM, you wrote:

On 09/06/2012 03:50 PM, Sander Eikelenboom wrote:

Thursday, September 6, 2012, 3:32:51 PM, you wrote:

On 09/06/2012 12:59 AM, Sander Eikelenboom wrote:

Wednesday, September 5, 2012, 4:42:42 PM, you wrote:

Hi Jan,
The attached patch dumps IO page fault flags. The flags show the reason for
the fault and tell us whether it is an unmapped interrupt fault or a DMA fault.

Thanks,
Wei

Signed-off-by: Wei Wang <wei.wang2@xxxxxxx>


I have applied the patch and the flags seem to differ between the faults:

AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0x0a06, fault address = 
0xc2c2c2c0, flags = 0x000
(XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 
0x0a06, fault address = 0xc2c2c2c0, flags = 0x000
(XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 
0x0700, fault address = 0xa8d339e0, flags = 0x020
(XEN) [2012-09-05 20:54:16] AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 
0x0700, fault address = 0xa8d33a40, flags = 0x020
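
As a rough aid for reading the flags values above, here is a minimal Python
sketch that parses an IO_PAGE_FAULT line and names the bits that are set. The
bit names and positions are an assumption: they follow the IO_PAGE_FAULT
event-log entry of the AMD IOMMU specification and presume that the printed
value is the 12-bit flags field starting at GN. The patch itself is not quoted
in this thread, so treat the mapping as illustrative. The helper names
(parse_fault, decode_flags) are made up for this example.

```python
import re

# ASSUMED bit layout of the printed "flags" value (AMD IOMMU specification,
# IO_PAGE_FAULT event entry). Not confirmed by the patch in this thread.
FLAG_BITS = {
    0x001: "GN",  # guest/nested address
    0x002: "NX",  # no-execute
    0x004: "US",  # user/supervisor
    0x008: "I",   # interrupt request
    0x010: "PR",  # page present
    0x020: "RW",  # write access
    0x040: "PE",  # permission violation
    0x080: "RZ",  # reserved bit set
    0x100: "TR",  # translation request
}

# \s* / \s+ tolerate the line wrapping introduced by mail clients.
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT:\s*domain\s*=\s*(\d+),\s*device\s+id\s*=\s*(0x[0-9a-f]+),"
    r"\s*fault\s+address\s*=\s*(0x[0-9a-f]+),\s*flags\s*=\s*(0x[0-9a-f]+)",
    re.IGNORECASE,
)


def decode_flags(flags):
    """Return the names of all bits set in flags, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flags & bit]


def parse_fault(line):
    """Parse one (possibly wrapped) IO_PAGE_FAULT message, or return None."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "domain": int(m.group(1)),
        "device_id": int(m.group(2), 16),
        "address": int(m.group(3), 16),
        "flags": int(m.group(4), 16),
    }


if __name__ == "__main__":
    sample = ("AMD-Vi: IO_PAGE_FAULT: domain = 14, device id = 0x0700, "
              "fault address = 0xa8d339e0, flags = 0x020")
    fault = parse_fault(sample)
    print(fault, decode_flags(fault["flags"]))
```

Under that assumed layout, 0x020 has only RW set and 0x000 has no bits set, so
the I (interrupt) bit 0x008 is clear in both cases.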

OK, so they are not interrupt requests. I guess further information from
your system would be helpful to debug this issue:
1) xl info
2) xl list
3) lspci -vvv (NOTE: not in dom0 but in your guest)
4) cat /proc/iomem (in both dom0 and your hvm guest)

dom14 is not an HVM guest, it's a PV guest.

Ah, I see. A PV guest is quite different from HVM: it does not use the p2m
tables as IO page tables, so the no-sharept option has no effect in this case.
PV guests always use separate IO page tables. There might be some
incorrect mappings in those page tables. I will check this on my side.

I have reverted the machine to xen-4.1.4-pre (changeset 23353) and kept 
everything else the same.
I haven't seen any IO PAGE FAULTS after that.

I did spot some differences in the output from lspci between xen 4.1 and 4.2, 
related to MSI enabled or not for the IOMMU device.
Have attached the xl/xm dmesg and lspci from booting with both versions.

lspci:

00:00.2 Generic system peripheral [0806]: ATI Technologies Inc RD990 I/O Memory 
Management Unit (IOMMU) [1002:5a23]
          Subsystem: ATI Technologies Inc RD990 I/O Memory Management Unit 
(IOMMU) [1002:5a23]
          Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
          Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
          Latency: 0
          Interrupt: pin A routed to IRQ 10
          Capabilities: [40] Secure device<?>
4.1:    Capabilities: [54] MSI: Enable- Count=1/1 Maskable- 64bit+

Eh... That is interesting. So which dom0 are you using? There is a c/s
in 4.2 that prevents recent dom0 kernels from disabling the IOMMU interrupt
(changeset 25492:61844569a432). Otherwise, the IOMMU cannot send any events,
including IO page faults. You could try reverting dom0 to an old version such
as 2.6 pv_ops to see if you really have no IO page faults on 4.1.

OK, I finally got the time to do some more testing: I tested 4.2 around that
changeset and made a copy of the guest using HVM instead of PV.

The results:
- On xen-4.1.* and a 3.6-rc6 kernel (dom0 and domU): the video device passed
  through works fine, both in an HVM and in a PV guest. I don't see IO page
  faults getting reported.
- On xen-4.2 changeset < 25492 and a 3.6-rc6 kernel (dom0 and domU): the video
  device passed through works fine, both in an HVM and in a PV guest. I don't
  see IO page faults getting reported.
- On xen-4.2 changeset > 25492 and a 3.6-rc6 kernel (dom0 and domU): the video
  device passed through works fine for a short while (around 5 to 10 minutes)
  in a PV guest. After that, IO page faults get reported and the video freezes.
  I don't see any errors in the guest though.
- On xen-unstable tip and a 3.6-rc6 kernel (dom0 and domU):
  PV:  the video device passed through works fine for a short while (around 5
       to 10 minutes). After that, IO page faults get reported and the video
       freezes. I don't see any errors in the guest though.
  HVM: the video device passed through doesn't work from the start:
       - The device is there according to lspci.
       - The video application starts fine, but delivers a green image, so the
         device is not working properly. I don't see IO page faults though.

Attached are (all with xen-unstable tip and the guest as HVM, domain 15):
- xl dmesg
- Patch which adds some more info, but all values reported seem to be zero (see 
xl dmesg)
- lspci dom0
- lspci HVM guest

Hi,
Thanks for the information, it is very helpful for debugging. I hope I can
start to look at this right after sending my next iommu patch queue upstream.
Another question: did you see this issue on a system with a single PV/HVM
guest, or only on a system with about 16 running VMs?

Thanks,
Wei




4.2:    Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
                  Address: 00000000fee0100c  Data: 4128
          Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+

Although it seems enabled, shouldn't the IRQ number used be much higher than 10 
for MSI interrupts ?

The IRQ number is fine. The MSI vector is stored in the Data field (Data: 4128).
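
To see where the vector sits in those values, here is a small sketch that
splits the Address/Data pair from the lspci output into fields. It assumes the
standard x86 MSI message layout (vector in Data bits 0-7, delivery mode in
bits 8-10, destination ID in Address bits 12-19) and that lspci prints both
values in hexadecimal. decode_msi is a hypothetical helper name, not something
from Xen or pciutils.

```python
def decode_msi(address, data):
    """Split an MSI address/data pair into fields.

    ASSUMPTION: standard x86 MSI message format; values taken from the
    hexadecimal 'Address:' and 'Data:' fields printed by lspci.
    """
    return {
        "dest_id": (address >> 12) & 0xFF,        # target APIC ID
        "redirection_hint": (address >> 3) & 1,   # RH bit
        "logical_dest_mode": (address >> 2) & 1,  # DM bit
        "vector": data & 0xFF,                    # interrupt vector
        "delivery_mode": (data >> 8) & 0x7,       # 0 = fixed, 1 = lowest priority
        "level_assert": (data >> 14) & 1,
        "level_triggered": (data >> 15) & 1,      # 0 = edge
    }


if __name__ == "__main__":
    # Values from the 4.2 lspci output quoted above.
    fields = decode_msi(0x00000000FEE0100C, 0x4128)
    for name, value in fields.items():
        print(f"{name:18} = {value:#x}")
```

With those assumptions, Data 4128 decodes to vector 0x28, while the "IRQ 10"
on the Interrupt line reports the legacy pin routing rather than the MSI
vector.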


There is another difference in the bridge device that's in front of the  
0a:00.6 device that faults before the kernel is even booted.

00:03.0 PCI bridge [0604]: ATI Technologies Inc RD890 PCI to PCI bridge (PCI 
express gpp port C) [1002:5a17] (prog-if 00 [Normal decode])
          Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx+
4.1:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
4.2:    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- 
DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
          Latency: 0, Cache Line Size: 64 bytes
          Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
          I/O behind bridge: 0000f000-00000fff
          Memory behind bridge: f9f00000-f9ffffff
          Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
4.1:    Secondary status: 66MHz- FastB2B- ParErr- 
DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
4.2:    Secondary status: 66MHz- FastB2B- ParErr- 
DEVSEL=fast >TAbort+ <TAbort- <MAbort- <SERR- <PERR-
          BridgeCtl: Parity+ SERR+ NoISA+ VGA- MAbort->Reset- FastB2B-
                  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
          Capabilities: [50] Power Management version 3
                  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
          Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                  DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s<64ns, 
L1<1us
                          ExtTag+ RBE+ FLReset-
                  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
Unsupported-
                          RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                          MaxPayload 128 bytes, MaxReadReq 128 bytes
                  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- 
TransPend-
                  LnkCap: Port #1, Speed 5GT/s, Width x8, ASPM L0s L1, Latency 
L0<1us, L1<8us
                          ClockPM- Surprise- LLActRep+ BwNot+
                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- 
CommClk-
                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
DLActive+ BWMgmt+ ABWMgmt-
                  SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- 
Surprise-
                          Slot #3, PowerLimit 10.000W; Interlock- NoCompl+
                  SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- 
HPIrq- LinkChg-
                          Control: AttnInd Unknown, PwrInd Unknown, Power- 
Interlock-
                  SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ 
Interlock-
                          Changed: MRL- PresDet+ LinkState+

That is probably because of the IO_PAGE_FAULT.

Thanks,
Wei

serveerstertje:~# lspci -t
-[0000:00]-+-00.0
             +-00.2
             +-02.0-[0b]----00.0
             +-03.0-[0a]--+-00.0
             |            +-00.1
             |            +-00.2
             |            +-00.3
             |            +-00.4
             |            +-00.5
             |            +-00.6
             |            \-00.7
             +-05.0-[09]----00.0
             +-06.0-[08]----00.0
             +-0a.0-[07]----00.0
             +-0b.0-[06]--+-00.0
             |            \-00.1
             +-0c.0-[05]----00.0
             +-0d.0-[04]--+-00.0
             |            +-00.1
             |            +-00.2
             |            +-00.3
             |            +-00.4
             |            +-00.5
             |            +-00.6
             |            \-00.7
             +-11.0
             +-12.0
             +-12.2
             +-13.0
             +-13.2
             +-14.0
             +-14.3
             +-14.4-[03]----06.0
             +-14.5
             +-15.0-[02]--
             +-16.0
             +-16.2
             +-18.0
             +-18.1
             +-18.2
             +-18.3
             \-18.4





Thanks,
Wei

I will try to make a complete package, and try with one pv domain only where 
the devices are being passed through just to simplify the setup.


* I would also like to know the symptoms of device 0x0700 when IO_PF
happened. Did it stop working?

Yes it stops working, the video capture just freezes, but the driver doesn't 
bail out.
For the USB controller (0x0a06) it starts to give errors for usbdev_open in the 
guest.
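
For reference, the device ids in the fault messages can be turned into the
bus:device.function notation used by lspci. A minimal sketch, assuming the id
is the usual 16-bit PCI requester ID (bus in the upper 8 bits, device in the
next 5, function in the low 3). device_id_to_bdf is a made-up helper name.

```python
def device_id_to_bdf(device_id):
    """Format a 16-bit PCI requester ID as 'bb:dd.f' (lspci style).

    ASSUMPTION: the 'device id' printed by the IOMMU code is the standard
    bus[15:8] / device[7:3] / function[2:0] encoding.
    """
    bus = (device_id >> 8) & 0xFF
    dev = (device_id >> 3) & 0x1F
    fn = device_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


if __name__ == "__main__":
    for dev_id in (0x0A06, 0x0700):
        print(f"{dev_id:#06x} -> {device_id_to_bdf(dev_id)}")
```

With this encoding 0x0a06 maps to 0a:00.6 and 0x0700 maps to 07:00.0, which
matches the devices shown in the lspci -t tree.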

(BTW: I copied a few options from your boot cmd line and it worked with
my RD890 system

dom0_mem=1024M,max:1024M loglvl=all loglvl_guest=all console_timestamps
cpuidle cpufreq=xen noreboot debug lapic=debug apic_verbosity=debug
apic=debug iommu=on,verbose,debug,no-sharept

* so, which OEM board do you have?)

MSI 890FXA-GD70

Also, from your log, these lines look very strange:

(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xd5, mfn=0xa4a11
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xd7, mfn=0xa4a0f
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xd9, mfn=0xa4a0d
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xdb, mfn=0xa4a0b
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xdd, mfn=0xa4a09
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xdf, mfn=0xa4a07
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xe1, mfn=0xa4a05
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xe3, mfn=0xa4a03
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xe5, mfn=0xa4a01
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xe7, mfn=0xa463f
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xe9, mfn=0xa463d
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xeb, mfn=0xa463b
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xed, mfn=0xa4639
(XEN) [2012-09-04 15:54:35] hvm.c:2435:d15 guest attempted write to
read-only memory page. gfn=0xef, mfn=0xa4637
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 0, device id
= 0x0a06, fault address = 0xc2c2c2c0
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
id = 0x0700, fault address = 0xa90f8300
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
id = 0x0700, fault address = 0xa90f8340
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
id = 0x0700, fault address = 0xa90f8380
(XEN) [2012-09-04 16:13:56] AMD-Vi: IO_PAGE_FAULT: domain = 14, device
id = 0x0700, fault address = 0xa90f83c0

* they are immediately followed by the IO page faults. Do you know where
they come from? Your video card driver maybe?

     From an HVM domain with an old (3.0.3) kernel, but the faults also occur 
without this domain being started.


Thanks,
Wei


Complete xl dmesg and lspci -vvvknn attached.

Thx

--
Sander












_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

