
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

To: Jan Beulich <JBeulich@xxxxxxxx>
From: Haozhong Zhang <haozhong.zhang@xxxxxxxxx>
Date: Wed, 9 Mar 2016 20:22:59 +0800
Cc: Juergen Gross <JGross@xxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Xiao Guangrong <guangrong.xiao@xxxxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>
Delivery-date: Wed, 09 Mar 2016 12:23:14 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>
Mail-followup-to: Jan Beulich <JBeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Xiao Guangrong <guangrong.xiao@xxxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>, Juergen Gross <JGross@xxxxxxxx>, Keir Fraser <keir@xxxxxxx>

On 03/08/16 02:27, Jan Beulich wrote:
> >>> On 08.03.16 at 10:15, <haozhong.zhang@xxxxxxxxx> wrote:
> > More thoughts on reserving NVDIMM space for per-page structures
> > 
> > Currently, a per-page struct for managing mapping of NVDIMM pages may
> > include following fields:
> > 
> > struct nvdimm_page
> > {
> >     uint64_t mfn;        /* MFN of SPA of this NVDIMM page */
> >     uint64_t gfn;        /* GFN where this NVDIMM page is mapped */
> >     domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
> >     int      is_broken;  /* Is this NVDIMM page broken? (for MCE) */
> > }
> > 
> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
> > nvdimm_page structures would occupy 12 GB space, which is too hard to
> > fit in the normal ram on a small memory host. However, for smaller
> > NVDIMMs and/or hosts with large ram, those structures may still be able
> > to fit in the normal ram. In the latter circumstance, nvdimm_page
> > structures are stored in the normal ram, so they can be accessed more
> > quickly.
> 
> Not sure how you came to the above structure - it's the first time
> I see it, yet figuring out what information it needs to hold is what
> this design process should be about. For example, I don't see why
> it would need to duplicate M2P / P2M information. Nor do I see why
> per-page data needs to hold the address of a page (struct
> page_info also doesn't). And whether storing a domain ID (rather
> than a pointer to struct domain, as in struct page_info) is the
> correct thing is also to be determined (rather than just stated).
> 
> Otoh you make no provisions at all for any kind of ref counting.
> What if a guest wants to put page tables into NVDIMM space?
> 
> Since all of your calculations are based upon that fixed assumption
> on the structure layout, I'm afraid they're not very meaningful
> without first settling on what data needs tracking in the first place.
> 
> Jan
> 

Let me re-explain the choice of data structures and where to put them.

For handling MCE for NVDIMM, we need to track the following data:
(1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which are
    used to check whether an MCE is for NVDIMM.
(2) the GFN to which an NVDIMM page is mapped, which is used to determine the
    address put in the vMCE.
(3) the domain to which an NVDIMM page is mapped, which is used to
    determine whether a vMCE needs to be injected and where it will be
    injected.
(4) a flag to mark whether an NVDIMM page is broken, which is used to
    avoid mapping broken pages to guests.

For granting NVDIMM pages (e.g. xen-blkback/netback),
(5) a reference counter is needed for each NVDIMM page.

The above data can be organized as follows:

* For (1) SPA ranges, we can record them in a global data structure,
  e.g. a list

    struct list_head nvdimm_iset_list;

    struct nvdimm_iset
    {
         uint64_t           base;  /* starting SPA of this interleave set */
         uint64_t           size;  /* size of this interleave set */
         struct nvdimm_page *pages; /* information for individual pages in this interleave set */
         struct list_head   list;
    };
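
  To illustrate how (1) would be used on the MCE path, here is a minimal
  sketch of the range check. It is not taken from any Xen code: the
  struct is a simplified stand-in that uses a plain 'next' pointer
  instead of struct list_head so that it compiles on its own, and the
  name nvdimm_find_iset() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct nvdimm_iset: 'next' replaces the
 * struct list_head member, and the 'pages' pointer is omitted. */
struct nvdimm_iset_sketch {
    uint64_t base;                    /* starting SPA of the interleave set */
    uint64_t size;                    /* size of the interleave set in bytes */
    struct nvdimm_iset_sketch *next;  /* next set, or NULL at the end */
};

/* Return the interleave set that contains 'spa', or NULL if the address
 * does not belong to any recorded NVDIMM range. */
static const struct nvdimm_iset_sketch *
nvdimm_find_iset(const struct nvdimm_iset_sketch *head, uint64_t spa)
{
    const struct nvdimm_iset_sketch *is;

    for (is = head; is != NULL; is = is->next) {
        /* 'spa - base < size' avoids overflowing 'base + size'. */
        if (spa >= is->base && spa - is->base < is->size)
            return is;
    }
    return NULL;
}
```

  A NULL result would mean the MCE is not for NVDIMM and should go
  through the existing handling.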

* For (2) GFN, an intuitive place to get this information is the M2P
  table machine_to_phys_mapping[].  However, the address range of NVDIMM
  is not required to be contiguous with normal ram, so, if NVDIMM starts
  at an address much higher than the end address of normal ram, the
  resulting M2P table may be too large to fit in the normal ram.
  Therefore, we choose not to put GFNs of NVDIMM in the M2P table.

  Another possible solution is to extend page_info to include the GFN
  for NVDIMM and use frame_table. A benefit of this solution is that the
  other data (3)-(5) can be obtained from page_info as well. However,
  for the same reason as for machine_to_phys_mapping[], and because the
  large number of page_info structures required for large NVDIMMs may
  consume lots of ram, page_info and frame_table do not seem to be a
  good place either.
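
  To put a rough number on the size concern, the sketch below computes
  the footprint of a table that has one entry per page frame from MFN 0
  up to the highest address. It rests on simplifying assumptions of
  mine: 4 KiB pages, and a table that is fully populated (no holes left
  unallocated). The helper name dense_table_bytes() is hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Bytes needed by a fully populated table with one 'entry_bytes'-sized
 * slot per page frame, covering MFN 0 up to the frame that holds the
 * byte just below 'highest_spa_end'. */
static uint64_t dense_table_bytes(uint64_t highest_spa_end,
                                  uint64_t entry_bytes)
{
    uint64_t page_size = 1ULL << SKETCH_PAGE_SHIFT;
    uint64_t nr_frames = (highest_spa_end + page_size - 1) >> SKETCH_PAGE_SHIFT;

    return nr_frames * entry_bytes;
}
```

  With 8-byte entries, dense_table_bytes(1ULL << 36, 8) gives 128 MiB
  for a host whose memory ends at 64 GiB, while
  dense_table_bytes(1ULL << 46, 8) gives 128 GiB if an NVDIMM range ends
  at 64 TiB.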

* In the end, we choose to introduce a new data structure for the above
  per-page data (2)-(5):

    struct nvdimm_page
    {
        struct domain *domain;    /* for (3) */
        uint64_t      gfn;        /* for (2) */
        unsigned long count_info; /* for (4) and (5), same as page_info->count_info */
        /* other fields if needed, e.g. lock */
    };

  (Indeed, the MFN does not need to be stored.)

  On each NVDIMM interleave set, we could reserve an area to place an
  array of nvdimm_page structures for pages in that interleave set. In
  addition, the corresponding global nvdimm_iset structure is set to
  point to this array via its 'pages' field.
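
  As a rough sketch of how such an array could be sized and indexed
  (assuming 4 KiB pages; the struct is a stand-in with a void pointer in
  place of struct domain *, and all helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Stand-in for struct nvdimm_page; 'void *' replaces 'struct domain *'
 * so that the sketch compiles on its own. */
struct nvdimm_page_sketch {
    void          *domain;
    uint64_t       gfn;
    unsigned long  count_info;
};

/* Number of pages in an interleave set of 'size' bytes. */
static uint64_t nvdimm_nr_pages(uint64_t size)
{
    return size >> SKETCH_PAGE_SHIFT;
}

/* Bytes to reserve for the per-page array of one interleave set. */
static uint64_t nvdimm_array_bytes(uint64_t size)
{
    return nvdimm_nr_pages(size) * sizeof(struct nvdimm_page_sketch);
}

/* Index into the 'pages' array for an SPA inside the interleave set.
 * The caller must already have checked that 'spa' is in range. */
static uint64_t nvdimm_spa_to_index(uint64_t base, uint64_t size,
                                    uint64_t spa)
{
    assert(spa >= base && spa - base < size);
    return (spa - base) >> SKETCH_PAGE_SHIFT;
}
```

  On an LP64 build the stand-in struct is 24 bytes, so a 2 TiB
  interleave set has 2^29 pages and would need 12 GiB reserved. Because
  the index follows from the SPA and the base of the set, the array
  entry does not have to record the page address itself.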

* One disadvantage of the above solution is that accessing NVDIMM is
  slower than accessing normal ram, so usage scenarios that require
  frequent accesses to nvdimm_page structures may suffer from poor
  performance. Therefore, we may add a boot parameter to allow users to
  choose normal ram for the above nvdimm_page arrays if their hosts have
  plenty of ram.

  One thing I do not know is what percentage of ram used/reserved by
  Xen itself is considered acceptable. If such a limit exists and the
  boot parameter is given, we could let Xen choose the faster ram as
  long as the limit has not been reached.
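
  The placement decision could look roughly like the sketch below. Every
  name in it is hypothetical, including the enum, the function and the
  idea of passing the limit as a whole-number percentage; it only shows
  the comparison, not how the inputs would be obtained.

```c
#include <stdbool.h>
#include <stdint.h>

enum nvdimm_meta_place {
    META_ON_NVDIMM,  /* keep the nvdimm_page array in the reserved area */
    META_IN_RAM      /* keep the nvdimm_page array in normal ram */
};

/* prefer_ram:  whether the (hypothetical) boot parameter was given
 * total_ram:   bytes of normal ram on the host
 * xen_used:    bytes of ram already used/reserved by Xen itself
 * array_bytes: bytes needed for the nvdimm_page array
 * max_percent: acceptable share of ram for Xen, in percent */
static enum nvdimm_meta_place
choose_meta_place(bool prefer_ram, uint64_t total_ram, uint64_t xen_used,
                  uint64_t array_bytes, unsigned int max_percent)
{
    if (!prefer_ram)
        return META_ON_NVDIMM;

    /* Compare (xen_used + array_bytes) / total_ram with max_percent / 100
     * without dividing. The byte counts are assumed to be far below 2^57,
     * so multiplying by at most 100 cannot overflow 64 bits. */
    if ((xen_used + array_bytes) * 100 <= total_ram * (uint64_t)max_percent)
        return META_IN_RAM;

    return META_ON_NVDIMM;
}
```

  For example, with a 5% limit, a 12 GiB array plus 10 GiB already in
  use fits on a 1 TiB host, whereas the same array does not fit on a
  64 GiB host and would stay on the NVDIMM.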

Any comments?

Thanks,
Haozhong
