
RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enabling in XEN



Christoph/Egger, thanks very much for the reply; see my comments below.

>-----Original Message-----
>From: Frank.Vanderlinden@xxxxxxx [mailto:Frank.Vanderlinden@xxxxxxx] 
>Sent: February 26, 2009 1:33
>To: Christoph Egger
>Cc: Jiang, Yunhong; Kleen, Andi; 
>xen-devel@xxxxxxxxxxxxxxxxxxx; Keir Fraser; Ke, Liping; Gavin Maltby
>Subject: Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enabling in XEN
>
>Christoph Egger wrote:
>> On Wednesday 25 February 2009 03:25:12 Jiang, Yunhong wrote:
>>
>>> So, Frank/Egger, can I assume followed are consensus currently?
>>>
>>> 1) MCE is handled by Xen HV totally, while guest's vMCE handler will only
>>> works for itself.
>>> 2) Xen present a virtual #MC to guest through MSR access
>>> emulation. (Xen will do the translation if needed).
>>> 3) Guest's unmodified MCE handler will handle the vMCE injected.
>>> 4) Dom0 will get all log/telemetry through hypercall.
>>> 5) The action taken by xen will be passed to dom0 through the telemetry
>>> mechanism.
>> 
>> Mostly. Regarding 2) I would like to discuss first how to handle errors
>> impacting multiple contiguous physical pages which are non-contiguous
>> in guest physical space.
>> 
>> And I also want to discuss about how to do recovery actions requiring
>> PCI access. One example for this is
>> Shanghai's "L3 Cache Index Disable"-Feature.
>> Xen delegates PCI config space to Dom0 and
>> via PCI passthrough partly to DomU.
>> That means, if registers in PCI config space are independently
>> accessible by Xen, Dom0 and/or DomU, they can interfere with each other.
>> Therefore, we need to
>> a) clearly define who handles what and
>> b) define some rules based on a)
>> c) discuss how to handle Dom0/DomU going wild
>>     and break the rules defined in b)
>
>I also agree on the approach in principle, but would like to see these
>points addressed. For non-contiguous pages, I suppose Xen could deliver
>multiple #vMCEs to the guest, split into contiguous parts. The vmce code
>seems to be set up to be able to do this.

For the contiguous-pages issue, I agree with Gavin that an error spanning
contiguous physical pages should be delivered as multiple #MCs, so this should
be OK.
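
As a rough illustration of the "split into contiguous parts" idea only (this is
not taken from the vMCE code; the struct and function names are hypothetical),
the sketch below assumes we already have, for each affected machine page in
order, the guest frame number that backs it. It groups those guest frames into
guest-contiguous runs, one run per virtual #MC that would be delivered.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one guest-contiguous range of affected frames. */
struct gpfn_run {
    uint64_t start;  /* first guest frame number in the run */
    size_t   count;  /* number of consecutive guest frames */
};

/*
 * gpfn[i] is the guest frame backing the i-th affected machine page.
 * Stores up to max_runs runs in out[] and returns the total number of
 * runs found, which may exceed max_runs (the extra runs are not stored).
 */
static size_t split_into_gpfn_runs(const uint64_t *gpfn, size_t n,
                                   struct gpfn_run *out, size_t max_runs)
{
    size_t runs = 0;

    for (size_t i = 0; i < n; i++) {
        if (i > 0 && gpfn[i] == gpfn[i - 1] + 1) {
            /* Extends the current run, if that run was stored. */
            if (runs <= max_runs)
                out[runs - 1].count++;
            continue;
        }
        /* Start a new run. */
        if (runs < max_runs) {
            out[runs].start = gpfn[i];
            out[runs].count = 1;
        }
        runs++;
    }
    return runs;
}
```

Each returned run would then correspond to one injected vMCE covering
`count` guest pages starting at `start`.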

For the PCI config space issue, Christoph, can you please share more
information on it (or provide a document, as Frank suggested)? For example: is
it for CE (correctable errors) or UC (uncorrectable errors)? Is it in the PCI
range or the PCI-E range (i.e. accessed through 0xCF8/0xCFC or through
MMCONFIG)? How is the device's BDF calculated? My current understanding follows.

First, if it is a CE, Xen will do nothing and dom0 will take the recovery
action. If it is a UC, Xen will take action when all CPUs are in softirq
context, and dom0 will not take action, so it should be OK.

Second, in a Xen environment, as I understand it, the CPU is owned by the Xen
HV, so I'm not sure whether Xen should be made aware when dom0 disables the L3
cache (in the CE case). That is, should dom0 disable the cache directly, or
should it use a hypercall to ask Xen to do that? Keir may be able to give us
more suggestions.

For item c), currently Xen and dom0 can both access configuration space, while
domU does so through the PCI frontend/backend. Because the PCI backend only
covers devices assigned to domU, we don't need to worry about domU, and dom0
should be trusted. However, one thing left is that if this range is beyond
0x100 (i.e. in the PCI-E range), we need to add MMCONFIG support in Xen,
although that can be added simply.
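
To make the 0x100 point concrete, here is a minimal sketch (not Xen code; the
function names are made up for illustration) of how a bus/device/function (BDF)
plus register offset is encoded for the two standard access methods. The legacy
0xCF8/0xCFC mechanism carries only an 8-bit register offset, while MMCONFIG
gives each function a 4 KB window, which is why offsets at or above 0x100 need
MMCONFIG under the standard encoding.

```c
#include <stdint.h>

/*
 * Legacy mechanism: value written to I/O port 0xCF8 before accessing
 * the data port 0xCFC. Bit 31 is the enable bit; the register offset
 * is dword-aligned and limited to 0x00-0xFF.
 */
static inline uint32_t pci_cf8_address(uint8_t bus, uint8_t dev,
                                       uint8_t fn, uint8_t reg)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1f) << 11)
         | ((uint32_t)(fn & 0x7) << 8)
         | (reg & 0xfcu);
}

/*
 * MMCONFIG: byte offset from the MMCONFIG base address. Each function
 * gets 4 KB, so register offsets 0x000-0xFFF are reachable.
 */
static inline uint64_t pci_mmcfg_offset(uint8_t bus, uint8_t dev,
                                        uint8_t fn, uint16_t reg)
{
    return ((uint64_t)bus << 20)
         | ((uint64_t)(dev & 0x1f) << 15)
         | ((uint64_t)(fn & 0x7) << 12)
         | (reg & 0xfffu);
}

/* Hypothetical helper: does this register lie in the extended range? */
static inline int pci_reg_needs_mmcfg(uint16_t reg)
{
    return reg >= 0x100;
}
```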

Thanks
-- Yunhong Jiang

>
>As for the Shanghai feature: Christoph, are there any documents 
>available on that feature? What kind of errors are delivered 
>(corrected/correctable)?
>
>- Frank
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

