
Re: [Xen-devel] [PATCH 0/2] MCA support with page offlining



Hi Wang,

Thank you very much for your comments.
I will post an updated patch later.

Thanks,
KAZ

From: "Wang, Shane" <shane.wang@xxxxxxxxx>
Subject: RE: [Xen-devel] [PATCH 0/2] MCA support with page offlining
Date: Mon, 22 Dec 2008 09:52:28 +0800

> Hi KAZ,
> 
> Regarding your latest patch, I have been thinking about the following code.
> 
> +     /* check whether this page has already been registered */
> +     list_for_each_entry(page, &page_offlining, list) {
> +             if (page == pg)
> +                     goto out;
> +     }
> 
> An x86_page_offlining_t entry e is put into the list, so should this be "if (e->page == pg)"?
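> 
> A minimal sketch of that corrected check, reusing the x86_page_offlining_t
> wrapper and the page_offlining list from the patch (untested):
> 
>      x86_page_offlining_t *e;
> 
>      /* walk the wrapper entries and compare the embedded page pointer */
>      list_for_each_entry(e, &page_offlining, list) {
>              if (e->page == pg)
>                      goto out;
>      }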
> 
> +
> +     /* check whether the page already has the 'reserved' attribute */
> +     if (pg->count_info & PGC_reserved) {
> +             printk(XENLOG_DEBUG "Page offlining: ( %lx ) failure.\n",
> +                    maddr);
> +             return 1;
> +     }
> +
> +     /* add attribute 'reserved' and register the page */
> +     get_page(pg, d);
> +     pg->count_info |= PGC_reserved;
> +
> +     e = xmalloc(x86_page_offlining_t);
> +     BUG_ON(e == NULL);
> +     e->page = pg;
> +     list_add(&e->list, &page_offlining);
> 
> Since the page struct is itself a list entry (it contains "struct list_head
> list"), you can take the page off the domain's page list and put it onto your
> page_offlining list. Certainly, the domain may still be using it; you can mark
> it with your PGC_reserved flag first and then do the move in the free function.
> I think this is what Jiang Yunhong suggested :)
> What do you think of these suggestions? PS: in that case,
> alloc_bitmap[]/heap[][][]/avail[] should be considered.
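> 
> A rough sketch of that idea, reusing the page's own 'list' field instead of a
> separate wrapper (illustrative only; it assumes the free path is the right
> place to divert the page and that the heap lock is held there):
> 
>      /* at offline time: only mark the page; the owner may still hold it */
>      pg->count_info |= PGC_reserved;
> 
>      /* later, in the free path, divert the page instead of freeing it */
>      if ( pg->count_info & PGC_reserved )
>      {
>          list_del(&pg->list);                  /* off the domain/heap list  */
>          list_add(&pg->list, &page_offlining); /* onto the offline list     */
>          return;                               /* never returns to the heap */
>      }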
> 
> For the code piece
> 
> +                if ( !list_empty(&heap(node, zone, j)) ) {
> +                    pg = list_entry(heap(node, zone, j).next, struct page_info, list);
> +                    if (!(pg->count_info & PGC_reserved))
> +                        goto found;
> +                    else
> +                        printk(XENLOG_DEBUG "Page %p(%lx) is not to be allocated.\n",
> +                               pg, page_to_maddr(pg));
> +                }
> 
> This mechanism does not seem efficient enough: once the first page in
> heap(node, zone, j) is marked PGC_reserved, the remaining pages on that list
> can no longer be allocated even though they are not offlined, and it does not
> solve the potential bug I mentioned.
> Moreover, the original code (i.e. without your patch) just looks for a block of
> sufficient order and takes the first match, which is enough for its purpose. So
> I am wondering why, when the first list entry cannot meet the requirement, you
> do not enumerate heap(node, zone, j) to try the second one ... but instead
> directly return "not to be allocated"?
> 
> Thanks.
> Shane
> 
> -----Original Message-----
> From: SUZUKI Kazuhiro [mailto:kaz@xxxxxxxxxxxxxx] 
> Sent: December 22, 2008 8:42
> To: Jiang, Yunhong
> Cc: Wang, Shane; xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] [PATCH 0/2] MCA support with page offlining
> 
> Hi,
> 
> > But I'm a bit confused by some changes in this patch.
> 
> Sorry, I attached the wrong patch; this is the correct one.
> 
> Thanks,
> KAZ
> 
> Signed-off-by: Kazuhiro Suzuki <kaz@xxxxxxxxxxxxxx>
> 
> 
> From: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
> Subject: RE: [Xen-devel] [PATCH 0/2] MCA support with page offlining
> Date: Fri, 19 Dec 2008 18:48:11 +0800
> 
> > SUZUKI Kazuhiro <mailto:kaz@xxxxxxxxxxxxxx> wrote:
> > > Hi Yunhong and Shane,
> > > 
> > > Thank you for your comments and reviews.
> > > 
> > >> For page offlining, it may not be a good idea to put the pages into an
> > >> array; a list would be better.
> > > 
> > > I modified the patch to register offlined pages in a list instead of an array.
> > > 
> > >> I guess the code targets offlining domain pages only, right? What about
> > >> free pages and Xen pages? If so, there is no need for the following check
> > >> when allocating free pages, since offlined pages will never be freed back
> > >> into heap()()(). If not, the following may have a bug.
> > > 
> > > Yes, I assumed that page offlining was needed only for domain pages. If Xen
> > > pages are impacted, the current implementation simply crashes the
> > > hypervisor.
> > 
> > We have an internal patch for a similar purpose, covering page offlining
> > caused by #MC as well as other reasons. We can base our work on your patch if
> > needed.
> > 
> > But I'm a bit confused by some changes in this patch.
> > In your previous version, if a page was marked PGC_reserved, it would not be
> > allocated anymore. However, in this version, when a page is marked
> > PGC_reserved it is simply not freed, so does that mean the page is never
> > removed from its current list? That seems a bit hackish to me.
> > 
> > What I think we should do for page offlining is:
> > a) Mark the page as PGC_reserved (or another name like PGC_broken).
> > b) If it is free, go to step c; otherwise, wait until it is freed by the
> > owner and then go to step c.
> > c) Remove the page from the buddy system, move it to a special, separate list
> > (i.e. not in the heap[][][] anymore), and return the other pages of the block
> > to the buddy allocator.
> > 
> > There is some debate about step b: if the page is owned by a guest, we could
> > replace it with a new page through the p2m table and avoid waiting until it
> > is freed. We don't do that currently because it is a bit complex to achieve.
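> > 
> > A rough sketch of steps a)-c), purely illustrative (page_is_free() and
> > reclaim_broken_page() are hypothetical names, not from any actual patch):
> > 
> > static LIST_HEAD(broken_pages);          /* pages pulled out of the buddy   */
> > 
> > /* c) called from the free path once the owner releases the page */
> > void reclaim_broken_page(struct page_info *pg)
> > {
> >     list_del(&pg->list);                  /* out of heap[][][]               */
> >     list_add(&pg->list, &broken_pages);   /* onto the special, separate list */
> >     /* the remaining pages of the buddy block go back to the allocator */
> > }
> > 
> > void offline_page(struct page_info *pg)
> > {
> >     /* a) mark the page so nobody allocates it again */
> >     pg->count_info |= PGC_reserved;
> > 
> >     /* b) if it is still owned, just wait: the free path calls back later */
> >     if ( !page_is_free(pg) )
> >         return;
> > 
> >     reclaim_broken_page(pg);
> > }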
> > 
> > What is your opinion on this?
> > 
> > Thanks
> > Yunhong Jiang
> > 
> > > 
> > > I attach an updated patch for the Xen part, which also includes some bug fixes.
> > > 
> > > Thanks,
> > > KAZ
> > > 
> > > Signed-off-by: Kazuhiro Suzuki <kaz@xxxxxxxxxxxxxx>
> > > 
> > > 
> > > From: "Wang, Shane" <shane.wang@xxxxxxxxx>
> > > Subject: RE: [Xen-devel] [PATCH 0/2] MCA support with page offlining
> > > Date: Tue, 16 Dec 2008 19:10:00 +0800
> > > 
> > >> For page offlining, it may not be a good idea to put the pages into an
> > >> array; a list would be better.
> > >> 
> > >> +        pg->count_info |= PGC_reserved;
> > >> +        page_offlining[num_page_offlining++] = pg;
> > >> 
> > >> I guess the code targets offlining domain pages only, right? What about
> > >> free pages and Xen pages? If so, there is no need for the following check
> > >> when allocating free pages, since offlined pages will never be freed back
> > >> into heap()()(). If not, the following may have a bug.
> > >> 
> > >> +                if ( !list_empty(&heap(node, zone, j)) ) {
> > >> +                    pg = list_entry(heap(node, zone, j).next, struct page_info, list);
> > >> +                    if (!(pg->count_info & PGC_reserved))
> > >> +                        goto found;
> > >> +                    else
> > >> +                        printk(XENLOG_DEBUG "Page %p(%lx) is not to be allocated.\n",
> > >> +                               pg, page_to_maddr(pg));
> > >> +                }
> > >> 
> > >> If a free page (other than pg) within the range pg to pg+(1U<<j) is
> > >> offlined, that range risks being allocated along with the offlined page.
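> > >> 
> > >> In other words, the allocation check would have to cover every page of the
> > >> order-j block, roughly like this (a code fragment only; 'i' and
> > >> 'skip_this_block' are illustrative):
> > >> 
> > >>     unsigned int i;
> > >> 
> > >>     /* reject the block if any page inside it is marked reserved */
> > >>     for ( i = 0; i < (1U << j); i++ )
> > >>         if ( (pg + i)->count_info & PGC_reserved )
> > >>             goto skip_this_block;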
> > >> 
> > >> Shane
> > >> 
> > >> Jiang, Yunhong wrote:
> > >>> xen-devel-bounces@xxxxxxxxxxxxxxxxxxx <> wrote:
> > >>>> Hi all,
> > >>>> 
> > >>>> I previously posted about MCA support for Intel64. That patch only had a
> > >>>> function to log the MCA error data received from the hypervisor.
> > >>>> 
> > >>>> http://lists.xensource.com/archives/html/xen-devel/2008-09/msg00876.html
> > >>>> 
> > >>>> I attach patches that support not only error logging but also a page
> > >>>> offlining function. The page where an MCA occurs is offlined and never
> > >>>> reused. A new flag, 'PGC_reserved', was added to the page's count_info to
> > >>>> mark the impacted page.
> > >>>> 
> > >>>> I know it would be better to implement page offlining for general use,
> > >>>> but in this first step I implemented it specifically for MCA.
> > >>> 
> > >>> Maybe the requirements for MCA page offlining are a bit different from
> > >>> normal page offlining, so taking it as a first step may be a good choice :)
> > >>> 
> > >>> As for your current page_offlining, I'm not sure why the PGC_reserved page
> > >>> should not be freed. Also, with the following code, won't heap(node, zone,
> > >>> j) become unallocatable? Maybe we can create a special list to hold all
> > >>> those pages and remove them from the heap list?
> > >>> 
> > >>> +                if ( !list_empty(&heap(node, zone, j)) ) {
> > >>> +                    pg = list_entry(heap(node, zone, j).next, struct page_info, list);
> > >>> +                    if (!(pg->count_info & PGC_reserved))
> > >>> +                        goto found;
> > >>> +                    else
> > >>> +                        printk(XENLOG_DEBUG "Page %p(%lx) is not to be allocated.\n",
> > >>> +                               pg, page_to_maddr(pg));
> > >>> +                }
> > >>> 
> > >>> 
> > >>>> 
> > >>>> I also implemented the MCA handler for Dom0, which supports shutting
> > >>>> down the remote domain where an MCA occurred. If the MCA occurred on a
> > >>>> DomU, Dom0 notifies the DomU. If the notification fails, Dom0 calls the
> > >>>> SCHEDOP_remote_shutdown hypercall.
> > >>>> 
> > >>>> [1/2] xen part: mca-support-with-page-offlining-xen.patch
> > >>> 
> > >>> We are not sure we really need to pass all #MC information to dom0 first
> > >>> and have dom0 notify the domU. Xen knows about everything, so it should
> > >>> have enough knowledge to decide whether or not to inject a virtual #MC
> > >>> into the guest. Of course, this does not impact your patch.
> > >>> 
> > >>>> [2/2] linux/x86_64 part: mca-support-with-page-offlining-linux.patch
> > >>> 
> > >>> As for how to inject a virtual #MC into a guest (including dom0), I think
> > >>> we need to consider the following points:
> > >>> 
> > >>> a) The benefit of reusing the guest's #MC handler. #MC handlers are well
> > >>> known to be difficult to test, and the native guest handler may have been
> > >>> tested more widely. Also, #MC handlers improve as time goes on, and
> > >>> reusing the guest's MCA handler lets us share in those improvements.
> > >>> b) Maintaining a PV handler for different OS versions may not be easy,
> > >>> especially as hardware improves and the kernel gains better support for
> > >>> error handling/containment.
> > >>> c) The #MC handler may need model-specific information to decide on an
> > >>> action, while a guest (other than dom0) sees virtualized CPUID information.
> > >>> d) The guest's MCA handler may require physical information about the #MC,
> > >>> like the number of the CPU on which it happened.
> > >>> e) For an HVM domain, a PV handler will be difficult (consider Windows
> > >>> guests).
> > >>> 
> > >>> We have several options for supporting a virtual #MC to the guest:
> > >>> 
> > >>> Option 1 is what is currently implemented. A PV #MC handler is implemented
> > >>> in the guest. This PV handler gets MCA information from the Xen hypervisor
> > >>> through a hypercall, including the MCA MSR values plus some additional
> > >>> information, like which physical CPU the MCA happened on. Option 1 helps
> > >>> with issue d), but we need to maintain a PV handler and cannot benefit
> > >>> from the native handler. It also does not resolve issue c) very well.
> > >>> 
> > >>> Option 2: Xen provides MCA MSR virtualization so that the guest's native
> > >>> #MC handler can run without changes. This benefits from the guest's #MC
> > >>> handler, but it will be difficult to get model-specific information, and
> > >>> there is no physical information.
> > >>> 
> > >>> Option 3 uses a PV #MC handler in the guest as in option 1, but the
> > >>> interface between Xen and the guest is abstract events, like "offline the
> > >>> offending page" or "terminate the current execution context". This should
> > >>> be straightforward for Linux, but may be difficult for Windows and other
> > >>> OSes.
> > >>> 
> > >>> Currently we are considering option 2, providing MCA MSR virtualization to
> > >>> the guest; dom0 can also benefit from such support (if the guest has a
> > >>> different CPUID than the native one, we will either keep the guest running
> > >>> or kill it based on the error code). Of course, the current mechanism of
> > >>> passing MCA information from Xen to dom0 will still be useful, but it will
> > >>> be used for logging purposes or for correctable errors. What do you think
> > >>> about this?
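> > >>> 
> > >>> As an illustrative-only sketch of option 2 (not from any actual patch):
> > >>> intercept a guest RDMSR of an MCA bank MSR and hand back values recorded
> > >>> at #MC time instead of the physical register contents. All names here are
> > >>> hypothetical.
> > >>> 
> > >>> #include <stdint.h>
> > >>> 
> > >>> #define VMCE_MSR_MC0_CTL  0x400         /* architectural base of MCi_CTL */
> > >>> #define VMCE_NR_BANKS     6
> > >>> 
> > >>> /* per-bank CTL, STATUS, ADDR, MISC captured when the #MC was handled */
> > >>> static uint64_t vbank[VMCE_NR_BANKS][4];
> > >>> 
> > >>> static int vmce_rdmsr(uint32_t msr, uint64_t *val)
> > >>> {
> > >>>     uint32_t idx;
> > >>> 
> > >>>     if ( msr < VMCE_MSR_MC0_CTL ||
> > >>>          msr >= VMCE_MSR_MC0_CTL + 4 * VMCE_NR_BANKS )
> > >>>         return 0;                       /* not an MCA bank MSR           */
> > >>> 
> > >>>     idx = msr - VMCE_MSR_MC0_CTL;
> > >>>     *val = vbank[idx / 4][idx % 4];     /* return the virtualized value  */
> > >>>     return 1;
> > >>> }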
> > >>> 
> > >>> Thanks
> > >>> Yunhong Jiang
> > >>> 
> > >>>> Signed-off-by: Kazuhiro Suzuki <kaz@xxxxxxxxxxxxxx>
> > >>>> 
> > >>>> Thanks,
> > >>>> KAZ
> > >>>> 
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

