
Re: [Xen-devel] Commit 1aeb1156fa43fe2cd2b5003995b20466cd19a622: "x86 don't change affinity with interrupt unmasked", APCI errors and assorted pci trouble



Wednesday, April 1, 2015, 4:43:04 PM, you wrote:

> On 30/03/15 14:26, Sander Eikelenboom wrote:
>> Monday, March 30, 2015, 1:04:26 PM, you wrote:
>>
>>> On 28/03/15 20:10, Sander Eikelenboom wrote:
>>>> Saturday, March 28, 2015, 6:30:39 PM, you wrote:
>>>>
>>>>> On 28/03/15 15:34, Sander Eikelenboom wrote:
>>>>>> Hi Jan,
>>>>>>
>>>>>> Commit 1aeb1156fa43fe2cd2b5003995b20466cd19a622:
>>>>>> "x86 don't change affinity with interrupt unmasked",
>>>>>> gives trouble on my AMD box, symptoms:
>>>>>> - APIC errors in xl dmesg that weren't previously there:
>>>>>>   (XEN) [2015-03-26 20:35:37.085] IOAPIC[0]: Set PCI routing entry (6-13 -> 0x88 -> IRQ 13 Mode:0 Active:0)
>>>>>>   (XEN) [2015-03-26 20:35:37.101] PCI: Using MCFG for segment 0000 bus 00-ff
>>>>>>   (XEN) [2015-03-26 20:35:37.097] IOAPIC[0]: Set PCI routing entry (6-8 -> 0x58 -> IRQ 8 Mode:0 Active:0)
>>>>>>   (XEN) [2015-03-26 20:35:37.112] IOAPIC[0]: Set PCI routing entry (6-18 -> 0xb8 -> IRQ 18 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.189] IOAPIC[0]: Set PCI routing entry (6-17 -> 0xc0 -> IRQ 17 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-29 -> 0xc8 -> IRQ 53 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-24 -> 0xd0 -> IRQ 48 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-30 -> 0xd8 -> IRQ 54 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-12 -> 0x21 -> IRQ 36 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-13 -> 0x29 -> IRQ 37 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.421] IOAPIC[1]: Set PCI routing entry (7-16 -> 0x31 -> IRQ 40 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.495] IOAPIC[1]: Set PCI routing entry (7-28 -> 0x39 -> IRQ 52 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.498] IOAPIC[0]: Set PCI routing entry (6-16 -> 0x89 -> IRQ 16 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.498] IOAPIC[1]: Set PCI routing entry (7-14 -> 0xa9 -> IRQ 38 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:37.548] IOAPIC[0]: Set PCI routing entry (6-22 -> 0xb9 -> IRQ 22 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:39.620] IOAPIC[1]: Set PCI routing entry (7-9 -> 0xc1 -> IRQ 33 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:39.646] IOAPIC[1]: Set PCI routing entry (7-8 -> 0xc9 -> IRQ 32 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:39.647] IOAPIC[1]: Set PCI routing entry (7-23 -> 0xd1 -> IRQ 47 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:41.732] IOAPIC[1]: Set PCI routing entry (7-5 -> 0xd9 -> IRQ 29 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:41.779] IOAPIC[1]: Set PCI routing entry (7-4 -> 0x22 -> IRQ 28 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:41.803] mm.c:803: d0: Forcing read-only access to MFN fed00
>>>>>>   (XEN) [2015-03-26 20:35:41.894] IOAPIC[0]: Set PCI routing entry (6-19 -> 0x2a -> IRQ 19 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:42.057] IOAPIC[1]: Set PCI routing entry (7-22 -> 0x72 -> IRQ 46 Mode:1 Active:1)
>>>>>>   (XEN) [2015-03-26 20:35:42.093] IOAPIC[1]: Set PCI routing entry (7-27 -> 0x8a -> IRQ 51 Mode:1 Active:1)
>>>>>>
>>>>>>   these:
>>>>>>   (XEN) [2015-03-26 20:35:42.205] APIC error on CPU0: 00(40)
>>>>>>   (XEN) [2015-03-26 20:35:42.372] APIC error on CPU0: 40(40)
>>>>>>
>>>>>>   (XEN) [2015-03-26 20:35:42.691] d0 attempted to change d0v1's CR4 flags 00000660 -> 00000760
>>>>>>   (XEN) [2015-03-26 20:35:42.691] IOAPIC[1]: Set PCI routing entry (7-1 -> 0x9a -> IRQ 25 Mode:1 Active:1)
>>>>>>
>>>>>>   and this one:
>>>>>>   (XEN) [2015-03-26 20:35:42.707] APIC error on CPU0: 40(40)
>>>>>>   (XEN) [2015-03-26 20:35:43.958] d0 attempted to change d0v0's CR4 flags 00000660 -> 00000760
>>>>>>   (XEN) [2015-03-26 20:35:43.970] d0 attempted to change d0v2's CR4 flags 00000660 -> 00000760
>>>>>>   (XEN) [2015-03-26 20:35:43.988] d0 attempted to change d0v3's CR4 flags 00000660 -> 00000760
>>>>>>   (XEN) [2015-03-26 20:35:43.992] d0 attempted to change d0v4's CR4 flags 00000660 -> 00000760
>>>>>>   (XEN) [2015-03-26 20:35:43.996] d0 attempted to change d0v5's CR4 flags 00000660 -> 00000760
>>>>>>   (d1) [2015-03-26 20:40:42.220] mapping kernel into physical memory
>>>>>>   (d1) [2015-03-26 20:40:42.220] about to get started...
>>>>>>
>>>>>>
>>>>>> - random failures on dom0 SATA devices; the SATA controller is using
>>>>>>   multiple MSI interrupts.
>>>>>>
>>>>>> - failures on XHCI controllers passed through to an HVM guest which uses
>>>>>>   MSI-X interrupts, leading to these in the guest dmesg:
>>>>>>   [  350.246548] xhci_hcd 0000:00:05.0: Looking for event-dma 000000003cdf7140 trb-start 000000003cdf7240 trb-end 000000003cdf7240 seg-start 000000003cdf7000 seg-end 000000003cdf73f0
>>>>>>   [  350.246548] xhci_hcd 0000:00:05.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
>>>>>>   [  350.246548] xhci_hcd 0000:00:05.0: Looking for event-dma 000000003cdf7150 trb-start 000000003cdf7240 trb-end 000000003cdf7240 seg-start 000000003cdf7000 seg-end 000000003cdf73f0
>>>>>>   [  350.246548] xhci_hcd 0000:00:05.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 1 comp_code 1
>>>>>>
>>>>>>
>>>>>> Reverting this specific commit makes all the troubles go away ..
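
The IO-APIC routing entries quoted in the report above can be checked mechanically for out-of-range vectors. What follows is only a minimal sketch: the regular expression is inferred from the log lines as they appear in this thread (once unwrapped), not taken from the Xen source, and the names `ROUTE_RE`, `parse_route` and `SAMPLE_LINES` are hypothetical. Vectors 0-15 are reserved on the local APIC, so anything below 0x10 would be suspicious.

```python
import re

# Pattern inferred from the log lines in this thread (an assumption, not
# derived from the Xen source code).
ROUTE_RE = re.compile(
    r"IOAPIC\[(?P<ioapic>\d+)\]: Set PCI routing entry "
    r"\((?P<apic_id>\d+)-(?P<pin>\d+) -> 0x(?P<vector>[0-9a-fA-F]+) "
    r"-> IRQ (?P<irq>\d+) Mode:(?P<mode>\d+) Active:(?P<active>\d+)\)"
)

FIRST_LEGAL_VECTOR = 0x10  # vectors 0-15 are reserved on the local APIC


def parse_route(line):
    """Return the fields of one routing-entry line as ints, or None."""
    m = ROUTE_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    entry = {k: int(v) for k, v in fields.items() if k != "vector"}
    entry["vector"] = int(fields["vector"], 16)  # vector is printed in hex
    return entry


# Three lines copied from the report, with the mail wrapping undone.
SAMPLE_LINES = [
    "(XEN) [2015-03-26 20:35:37.085] IOAPIC[0]: Set PCI routing entry (6-13 -> 0x88 -> IRQ 13 Mode:0 Active:0)",
    "(XEN) [2015-03-26 20:35:37.420] IOAPIC[1]: Set PCI routing entry (7-12 -> 0x21 -> IRQ 36 Mode:1 Active:1)",
    "(XEN) [2015-03-26 20:35:37.101] PCI: Using MCFG for segment 0000 bus 00-ff",
]

entries = [e for e in map(parse_route, SAMPLE_LINES) if e is not None]
illegal = [e for e in entries if e["vector"] < FIRST_LEGAL_VECTOR]

for e in entries:
    print(f"IRQ {e['irq']:3d} -> vector {e['vector']:#04x}")
print("entries with a reserved vector:", len(illegal))
```

Every IO-APIC entry in the quoted log uses a vector of 0x21 or higher, so a check like this would not flag any of them. It says nothing about the MSI and MSI-X interrupts used by the SATA and XHCI controllers, which do not appear in these lines.
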
>>>>> That is unfortunate, as conceptually the identified patch definitely
>>>>> fixes a bug.
>>>>> The "APIC error" messages have bit 6 set, which is "Receive Illegal
>>>>> Vector".  i.e. a device has attempted to deliver an interrupt with a
>>>>> vector field less than 16.  I presume that this means that the device is
>>>>> ending up with a malformed data field programmed into it.
>>>>> Can you identify the PCI sbdf's of the problematic devices, and collect
>>>>> debug-keys Q, M and i on a working system so I can identify precisely
>>>>> which of the MSI interrupt drivers is in use (Xen has several, depending
>>>>> on exact hardware circumstance).  If you can, the same debug-keys with
>>>>> the problematic changeset present might also be interesting.
>>>>> ~Andrew
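
For reference, the bit-6 reading above can be reproduced with a small decoder for the local APIC Error Status Register (ESR). This is a sketch, not Xen code: the bit names follow the architectural ESR layout, while the helper names and the regular expression are hypothetical, and it assumes the two hex fields in the "APIC error" line are two successive ESR readings.

```python
import re

# Architectural local APIC ESR bit assignments (bits 0-7).
ESR_BITS = {
    0: "Send checksum error",
    1: "Receive checksum error",
    2: "Send accept error",
    3: "Receive accept error",
    4: "Redirectable IPI",
    5: "Send illegal vector",
    6: "Receive illegal vector",
    7: "Illegal register address",
}

# Assumed line format, inferred from the log lines quoted in this thread.
APIC_ERR_RE = re.compile(
    r"APIC error on CPU(\d+): ([0-9a-fA-F]{2})\(([0-9a-fA-F]{2})\)"
)


def decode_esr(value):
    """Return the names of all error bits set in an ESR value."""
    return [name for bit, name in ESR_BITS.items() if value & (1 << bit)]


def parse_apic_error(line):
    """Return (cpu, first_reading_errors, second_reading_errors) or None."""
    m = APIC_ERR_RE.search(line)
    if m is None:
        return None
    cpu = int(m.group(1))
    first = int(m.group(2), 16)
    second = int(m.group(3), 16)
    return cpu, decode_esr(first), decode_esr(second)


line = "(XEN) [2015-03-26 20:35:42.205] APIC error on CPU0: 00(40)"
print(parse_apic_error(line))
```

A value of 0x40 has only bit 6 set, which matches the "Receive Illegal Vector" interpretation given above.
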
>>>> Hi Andrew,
>>>>
>>>> The passed through xhci is 08:00.0
>>>> The SATA controller is 00:11.0
>>>>
>>>> Most clear failure is on the xhci controller.
>>>>
>>>> The working and non-working configs differ only in the revert of the
>>>> mentioned commit.
>>>>
>>>> Attached are:
>>>>
>>>> - lspci in dom0 of the working config
>>>> - serial-log of the working config (with debug-keys Q, M and i after full
>>>>   boot and guest start)
>>>> - serial-log of the non-working config (with debug-keys Q, M and i after
>>>>   full boot and guest start)
>>> Thanks.
>>> As an utter longshot, can you give this patch a try?  Could you also see
>>> about capturing an lspci in dom0 while the bad situation is manifesting
>>> itself?
>>> ~Andrew
>> Hi Andrew,
>>
>> lspci of the non-working case is attached; there are some differences
>> compared to the working case, but on a different device than I expected.
>> (BTW, I'm running with the ivrs_ioapic[6]=00:14.0 override because
>> the BIOS tables do not properly specify the SB IOAPIC.)
>>
>> I tried the patch but couldn't notice any difference;
>> the lspci output was exactly the same as in the non-working case
>> that is attached.
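
When comparing lspci captures like the ones attached, the MSI data value that `lspci -vv` prints for a device can be split into its fields to see which vector the device would signal. The sketch below assumes the plain x86 MSI data layout (vector in bits 7:0, delivery mode in bits 10:8, level in bit 14, trigger mode in bit 15). With interrupt remapping active, as is likely on a box using an IOMMU, the field may hold a remapping index instead of a vector, so treat the result with caution. The sample values are hypothetical and are not taken from the attachments in this thread.

```python
FIRST_LEGAL_VECTOR = 0x10  # vectors 0-15 are reserved on the local APIC


def decode_msi_data(data):
    """Split a 16-bit MSI data value into its x86 fields (no remapping assumed)."""
    return {
        "vector": data & 0xFF,
        "delivery_mode": (data >> 8) & 0x7,
        "level_assert": bool(data & (1 << 14)),
        "level_triggered": bool(data & (1 << 15)),
    }


def vector_is_legal(vector):
    """True if the vector is outside the reserved 0-15 range."""
    return vector >= FIRST_LEGAL_VECTOR


# Hypothetical sample values for illustration only.
for raw in (0x4021, 0x0008):
    fields = decode_msi_data(raw)
    status = "ok" if vector_is_legal(fields["vector"]) else "reserved vector"
    print(f"data={raw:#06x} vector={fields['vector']:#04x} ({status})")
```

A data value whose low byte is below 0x10 would be the kind of malformed programming that could trigger a "Receive Illegal Vector" error on delivery.
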

> I still can't find a plausible reason for this failure, given the
> change, which suggests that it might be a pre-existing subtle issue
> uncovered by the change.

> ~Andrew

I'm running memtest at the moment, just to rule that out.
You never know whether April Fools' Day is fooling me by
pretending to be Friday the 13th or something like that.

So please ignore this for the moment ...

--
Sander


