Re: [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register

On 08/06/15 08:42, Jan Beulich wrote:
>>>> On 07.06.15 at 08:23, <mst@xxxxxxxxxx> wrote:
>> On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote:
>>> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
>>>>>>> On 20.04.15 at 15:43, <mst@xxxxxxxxxx> wrote:
>>>>> On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
>>>>>>>>> On 13.04.15 at 14:47, <mst@xxxxxxxxxx> wrote:
>>>>>>> Can you check device capabilities register, offset 0x4 within
>>>>>>> pci express capability structure?
>>>>>>> Bit 15 is Role-Based Error Reporting.
>>>>>>> Is it set?
>>>>>>> The spec says:
>>>>>>>         On platforms where robust error handling and PC-compatible
>>>>>>>         Configuration Space probing is required, it is suggested that
>>>>>>>         software or firmware have the Unsupported Request Reporting
>>>>>>>         Enable bit Set for Role-Based Error Reporting Functions, but
>>>>>>>         clear for 1.0a Functions. Software or firmware can distinguish
>>>>>>>         the two classes of Functions by examining the Role-Based Error
>>>>>>>         Reporting bit in the Device Capabilities register.
>>>>>> Yes, that bit is set.
>>>>> curiouser and curiouser.
>>>>> So with functions that do support Role-Based Error Reporting, we have
>>>>> this:
>>>>>   With device Functions implementing Role-Based Error Reporting, setting
>>>>>   the Unsupported Request Reporting Enable bit will not interfere with
>>>>>   PC-compatible Configuration Space probing, assuming that the severity
>>>>>   for UR is left at its default of non-fatal. However, setting the
>>>>>   Unsupported Request Reporting Enable bit will enable the Function to
>>>>>   report UR errors detected with posted Requests, helping avoid this
>>>>>   case for potential silent data corruption.
>>>> I still don't see what the PC-compatible config space probing has to
>>>> do with our issue.
>>> I'm not sure but I think it's listed here because it causes a ton of URs
>>> when device scan probes unimplemented functions.
>>>>> did firmware reconfigure this device to report URs as fatal errors then?
>>>> No, the Unsupported Request Error Severity flag is zero.
>>> OK, that's the correct configuration, so how come the box crashes when
>>> there's a UR then?
>> Ping - any update on this?
> Not really. All we concluded so far is that _maybe_ the bridge, upon
> seeing the UR, generates a Master Abort, rendering the whole thing
> fatal. Otoh the respective root port also has
> - Received Master Abort set in its Secondary Status register (but
>   that's also already the case in the log that we have before the UR
>   occurs, i.e. that doesn't mean all that much),
> - Received System Error set in its Secondary Status register (and
>   after the UR the sibling endpoint [UR originating from 83:00.0,
>   sibling being 83:00.1] also shows Signaled System Error set).

Disabling Memory decode in the command register could also result in a
completion timeout on the root port issuing a transaction towards the PCI
device in question. PCIe completion timeouts can be escalated to Fatal AER
errors, which trigger system firmware to inject NMIs into the host.
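
For reference, "disabling Memory decode" means clearing the Memory Space
Enable bit (bit 1) of the 16-bit Command register at config offset 0x04.
Below is a minimal Python sketch of that bit manipulation. The helper names
are made up for illustration, and the code only operates on an integer; it
does not touch real hardware.

```python
# Bit masks in the 16-bit PCI Command register (config space offset 0x04).
PCI_COMMAND_IO = 0x0001      # I/O Space Enable
PCI_COMMAND_MEMORY = 0x0002  # Memory Space Enable ("Memory decode")
PCI_COMMAND_MASTER = 0x0004  # Bus Master Enable


def describe_command(value):
    """Return the names of the decode/bus-master bits set in a Command value."""
    names = []
    if value & PCI_COMMAND_IO:
        names.append("io")
    if value & PCI_COMMAND_MEMORY:
        names.append("memory")
    if value & PCI_COMMAND_MASTER:
        names.append("bus_master")
    return names


def disable_memory_decode(value):
    """Clear Memory Space Enable and leave every other bit untouched."""
    return value & ~PCI_COMMAND_MEMORY & 0xFFFF


if __name__ == "__main__":
    before = 0x0007  # hypothetical value: I/O, memory and bus master enabled
    after = disable_memory_decode(before)
    print(hex(after), describe_command(after))
```

Once that bit is clear the device no longer claims memory transactions, so
a later access to one of its memory BARs has nothing to respond to it.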

Unsupported requests can also be escalated to Fatal AER errors (which would
again trigger system firmware to inject an NMI).

Here is an example AER setup for a PCIe root port. You can see that UnsupReq
errors are masked and so do not trigger errors. CmpltTO (completion timeout)
errors are not masked, and they are treated as Fatal because the
corresponding bit in the Uncorrectable Severity register is set.

Capabilities: [148 v1] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol+
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
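
To make the mask/severity reading of that dump explicit, here is a rough
Python sketch that classifies each uncorrectable error from the UEMsk and
UESvrt flag lines. The function names are hypothetical, and the strings are
simply the two lines from the dump above; a different port will have
different settings.

```python
# UEMsk and UESvrt flag lines taken from the lspci dump in this mail.
UEMSK = ("DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- "
         "MalfTLP- ECRC- UnsupReq+ ACSViol+")
UESVRT = ("DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ "
          "MalfTLP+ ECRC- UnsupReq- ACSViol-")


def parse_flags(line):
    """Turn 'Name+ Other-' into {'Name': True, 'Other': False}."""
    return {tok[:-1]: tok.endswith("+") for tok in line.split()}


def classify(mask_line, severity_line):
    """Map each error name to 'masked', 'fatal' or 'non-fatal'."""
    mask = parse_flags(mask_line)
    severity = parse_flags(severity_line)
    result = {}
    for name, masked in mask.items():
        if masked:
            result[name] = "masked"      # mask bit set: not reported
        elif severity[name]:
            result[name] = "fatal"       # unmasked, severity bit set
        else:
            result[name] = "non-fatal"   # unmasked, severity bit clear
    return result


if __name__ == "__main__":
    for name, outcome in classify(UEMSK, UESVRT).items():
        print(f"{name:10s} {outcome}")
```

For this port the sketch reports UnsupReq as masked and CmpltTO as fatal,
which matches the reading given above.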

A root port completion timeout will also result in the master abort bit being
set.
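
The master abort and system error bits mentioned in this thread live in the
bridge's Secondary Status register (type 1 header, offset 0x1E). Here is a
small Python sketch for decoding a raw value of that register. The helper
name and the sample value are illustrative only and are not taken from the
affected box.

```python
# Error bits of the 16-bit Secondary Status register of a PCI-to-PCI
# bridge or root port (type 1 configuration header, offset 0x1E).
SEC_STATUS_BITS = {
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Received System Error",
    15: "Detected Parity Error",
}


def decode_secondary_status(value):
    """Return the names of the error bits set in a Secondary Status value."""
    return [name for bit, name in sorted(SEC_STATUS_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    sample = 0x6000  # hypothetical value with bits 13 and 14 set
    print(decode_secondary_status(sample))
```

These bits are write-1-to-clear and stay set until software clears them, so
a bit that is already set in an earlier log says little about when the
event happened.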

Typically system firmware clears the error in the AER registers after it has
processed it, so the operating system may not be able to determine what error
triggered the NMI in the first place.

>> So can we chalk this up to hardware bugs on a specific box?
> I have to admit that I'm still very uncertain whether to consider all
> this correct behavior, a firmware flaw, or a hardware bug.
I believe the correct behaviour is happening, but a PCIe completion timeout is
occurring instead of an unsupported request.


> Jan
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
