[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

>>> On 09.03.16 at 13:22, <haozhong.zhang@xxxxxxxxx> wrote:
> On 03/08/16 02:27, Jan Beulich wrote:
>> >>> On 08.03.16 at 10:15, <haozhong.zhang@xxxxxxxxx> wrote:
>> > More thoughts on reserving NVDIMM space for per-page structures
>> > 
>> > Currently, a per-page struct for managing mapping of NVDIMM pages may
>> > include following fields:
>> > 
>> > struct nvdimm_page
>> > {
>> >     uint64_t mfn;        /* MFN of SPA of this NVDIMM page */
>> >     uint64_t gfn;        /* GFN where this NVDIMM page is mapped */
>> >     domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
>> >     int      is_broken;  /* Is this NVDIMM page broken? (for MCE) */
>> > }
>> > 
>> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
>> > nvdimm_page structures would occupy 12 GB space, which is too hard to
>> > fit in the normal ram on a small memory host. However, for smaller
>> > NVDIMMs and/or hosts with large ram, those structures may still be able
>> > to fit in the normal ram. In the latter circumstance, nvdimm_page
>> > structures are stored in the normal ram, so they can be accessed more
>> > quickly.
>> Not sure how you came to the above structure - it's the first time
>> I see it, yet figuring out what information it needs to hold is what
>> this design process should be about. For example, I don't see why
>> it would need to duplicate M2P / P2M information. Nor do I see why
>> per-page data needs to hold the address of a page (struct
>> page_info also doesn't). And whether storing a domain ID (rather
>> than a pointer to struct domain, as in struct page_info) is the
>> correct think is also to be determined (rather than just stated).
>> Otoh you make no provisions at all for any kind of ref counting.
>> What if a guest wants to put page tables into NVDIMM space?
>> Since all of your calculations are based upon that fixed assumption
>> on the structure layout, I'm afraid they're not very meaningful
>> without first settling on what data needs tracking in the first place.
> I should reexplain the choice of data structures and where to put them.
> For handling MCE for NVDIMM, we need to track following data:
> (1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which are
>     used to check whether a MCE is for NVDIMM.
> (2) GFN to which a NVDIMM page is mapped, which is used to determine the
>     address put in vMCE.
> (3) the domain to which a NVDIMM page is mapped, which is used to
>     determine whether a vMCE needs to be injected and where it will be
>     injected.
> (4) a flag to mark whether a NVDIMM page is broken, which is used to
>     avoid mapping broken page to guests.
> For granting NVDIMM pages (e.g. xen-blkback/netback),
> (5) a reference counter is needed for each NVDIMM page
> Above data can be organized as below:
> * For (1) SPA ranges, we can record them in a global data structure,
>   e.g. a list
>     struct list_head nvdimm_iset_list;
>     struct nvdimm_iset
>     {
>          uint64_t           base;  /* starting SPA of this interleave set */
>          uint64_t           size;  /* size of this interleave set */
>          struct nvdimm_page *pages;/* information for individual pages in 
> this interleave set */
>          struct list_head   list;
>     };
> * For (2) GFN, an intuitive place to get this information is from M2P
>   table machine_to_phys_mapping[].  However, the address of NVDIMM is
>   not required to be contiguous with normal ram, so, if NVDIMM starts
>   from an address that is much higher than the end address of normal
>   ram, it may result in a M2P table that maybe too large to fit in the
>   normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
>   table.

Any page that _may_ be used by a guest as normal RAM page
must have its mach->phys translation entered in the M2P. That's
because a r/o variant of that table is part of the hypervisor ABI
for PV guests. Size considerations simply don't apply here - the
table may be sparse (guests are required to deal with accesses
potentially faulting), and the 256Gb of virtual address space set
aside for it cover all memory up to the 47-bit boundary (there's
room for doubling this). Memory at addresses with bit 47 (or
higher) set would need a complete overhaul of that mechanism,
and whatever new mechanism we may pick would mean old
guests won#t be able to benefit.

>   Another possible solution is to extend page_info to include GFN for
>   NVDIMM and use frame_table. A benefit of this solution is that other
>   data (3)-(5) can be got from page_info as well. However, due to the
>   same reason for machine_to_phys_mapping[] and the concern that the
>   large number of page_info structures required for large NVDIMMs may
>   consume lots of ram, page_info and frame_table seems not a good place
>   either.

For this particular item struct page_info is the wrong place
anyway, due to what I've said above. Also extension
suggestions of struct page_info are quite problematic, as any
such implies a measurable increase on the memory overhead
the hypervisor incurs. Plus the structure right now is (with the
exception of the bigmem configuration) a carefully arranged
for power of two in size.

> * At the end, we choose to introduce a new data structure for above
>   per-page data (2)-(5)
>     struct nvdimm_page
>     {
>         struct domain *domain;    /* for (3) */
>         uint64_t      gfn;        /* for (2) */
>         unsigned long count_info; /* for (4) and (5), same as 
> page_info->count_info */
>         /* other fields if needed, e.g. lock */
>     }

So that again leaves unaddressed the question of what you
imply to do when a guest elects to use such a page as page
table. I'm afraid any attempt of yours to invent something that
is not struct page_info will not be suitable for all possible needs.

>   On each NVDIMM interleave set, we could reserve an area to place an
>   array of nvdimm_page structures for pages in that interleave set. In
>   addition, the corresponding global nvdimm_iset structure is set to
>   point to this array via its 'pages' field.

And I see no problem doing exactly that, just for an array of
struct page_info.

>   One thing I have no idea is what percentage of ram used/reserved by
>   Xen itself is considered as acceptable. If it exists and a boot
>   parameter is given, we could let Xen choose the faster ram when
>   the percentage has not been reached.

I think a conservative default would be to always place the
control structures in NVDIMM space, unless requested to be
put in RAM via command line option.


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.