
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen



> From: Zhang, Haozhong
> Sent: Monday, February 01, 2016 1:44 PM
> 
[...]
> 
> 1.2 ACPI Support
> 
>  ACPI provides two kinds of support for NVDIMM. First, NVDIMM
>  devices are described by firmware (BIOS/EFI) to the OS via the
>  ACPI-defined NVDIMM Firmware Interface Table (NFIT). Second, several
>  NVDIMM functions, including operations on namespace labels,
>  S.M.A.R.T. and hotplug, are provided by ACPI methods (_DSM and
>  _FIT).
> 
> 1.2.1 NFIT
> 
>  NFIT is a new system description table added in ACPI v6 with
>  signature "NFIT". It contains a set of structures.
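
For reference while reading: my understanding of the NFIT layout is the
standard ACPI table header, 4 reserved bytes, then a list of typed
sub-structures (SPA ranges, region mappings, control regions, ...). A
rough C sketch of a parser's view of it, with struct and field names of
my own choosing rather than copied from any real header:

#include <stdint.h>

/* Standard 36-byte ACPI system description table header. */
struct acpi_sdt_header {
    char     signature[4];        /* "NFIT" */
    uint32_t length;              /* total table length in bytes */
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    char     creator_id[4];
    uint32_t creator_revision;
} __attribute__((packed));

/* NFIT = header + 4 reserved bytes + a list of typed sub-structures. */
struct nfit_table {
    struct acpi_sdt_header header;
    uint32_t reserved;
    /* variable-length list of sub-structures follows */
} __attribute__((packed));

/* Every sub-structure starts with a 2-byte type and 2-byte length, so
 * a parser can walk the table without knowing every structure type. */
struct nfit_subtable_header {
    uint16_t type;                /* 0 = SPA Range Structure, ... */
    uint16_t length;
} __attribute__((packed));

static void nfit_walk(const struct nfit_table *nfit)
{
    const uint8_t *p   = (const uint8_t *)(nfit + 1);
    const uint8_t *end = (const uint8_t *)nfit + nfit->header.length;

    while (p + sizeof(struct nfit_subtable_header) <= end) {
        const struct nfit_subtable_header *sub =
            (const struct nfit_subtable_header *)p;
        if (sub->length == 0)
            break;                /* malformed table, stop walking */
        /* dispatch on sub->type here */
        p += sub->length;
    }
}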

Can I consider NFIT alone as the minimal requirement, with the other
parts (_DSM and _FIT) being optional?

> 
> 
> 2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
> 
> 2.1 NVDIMM Driver in Linux Kernel
> 
[...]
> 
>  Userspace applications can mmap(2) the whole pmem region into their
>  own virtual address space. The Linux kernel maps the system physical
>  address range occupied by pmem into the virtual address space, so
>  that normal memory loads/stores, combined with the proper flushing
>  instructions, are applied to the underlying pmem NVDIMM regions.
> 
>  Alternatively, a DAX file system can be made on /dev/pmemX. Files on
>  that file system can be used in the same way as above. As the Linux
>  kernel maps the system address space range occupied by those files
>  on NVDIMM to the virtual address space, reads/writes on those files
>  are applied to the underlying NVDIMM regions as well.
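
If I read this right, a pmem-aware application today goes through the
file/device path and mmap(2), roughly as below (a minimal sketch; the
device path, length and msync-based flush are just illustrative
choices, not taken from the doc):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* /dev/pmem0 itself, or a file on a DAX-mounted fs, both end up
     * mapped to the underlying NVDIMM region. */
    int fd = open("/dev/pmem0", O_RDWR);
    if (fd < 0)
        return 1;

    size_t len = 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Ordinary stores go straight to pmem ... */
    memcpy(p, "hello pmem", 11);

    /* ... but still need an explicit flush; msync here for simplicity,
     * a real pmem library would flush the dirtied cache lines. */
    msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return 0;
}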

Does it mean that only the file-based interface is supported by Linux
today, and that a pmem-aware application cannot use a normal memory
allocation interface like malloc for this purpose?

> 
> 2.2 vNVDIMM Implementation in KVM/QEMU
> 
>  (1) Address Mapping
> 
>   As described before, the host Linux NVDIMM driver provides a block
>   device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
>   region. QEMU can then mmap(2) that device into its virtual address
>   space (buf). QEMU is responsible for finding a guest physical
>   address range that is large enough to hold /dev/pmem0. QEMU then
>   passes the virtual address of the mmapped buf to the KVM API
>   KVM_SET_USER_MEMORY_REGION, which maps, via EPT, the host physical
>   address range backing buf to the guest physical address range where
>   the virtual pmem device will be.
> 
>   In this way, all guest writes/reads on the virtual pmem device are
>   applied directly to the host one.
> 
>   Besides, the above implementation also allows backing a virtual
>   pmem device with an mmapped regular file or a piece of ordinary
>   RAM.
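
For readers less familiar with the KVM side, that registration step
boils down to a single ioctl on the VM fd, roughly as sketched below
(slot number, size and guest physical address are arbitrary
illustrative values, not from the doc):

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Map an already-mmapped pmem buffer into the guest physical address
 * space of an existing KVM VM (vm_fd). */
static int map_vnvdimm(int vm_fd, void *buf, size_t size,
                       uint64_t guest_phys_addr)
{
    struct kvm_userspace_memory_region region = {
        .slot            = 10,               /* an unused memslot */
        .flags           = 0,
        .guest_phys_addr = guest_phys_addr,  /* where the vNVDIMM lives */
        .memory_size     = size,
        .userspace_addr  = (uint64_t)(uintptr_t)buf, /* QEMU's mmap of
                                                        /dev/pmem0 */
    };

    /* KVM installs EPT mappings for this range on guest access. */
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}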

What's the point of backing pmem with ordinary RAM? I can buy into
the value of the file-backed option, which, although slower, does
preserve the persistency attribute. However, with the RAM-backed
method there is no persistency, which violates the guest's
expectation.

Btw, how is persistency guaranteed in KVM/QEMU across guest power
off/on? I guess that since the QEMU process is killed, the allocated
pmem will be freed, so you may have to switch to the file-backed
method to keep persistency (however, the copy would take time for a
large pmem chunk). Or will you find some way to keep pmem managed
separately from the QEMU life-cycle (in which case pmem is not
efficiently reused)?

> 3. Design of vNVDIMM in Xen
> 
> 3.2 Address Mapping
> 
> 3.2.2 Alternative Design
> 
>  Jan Beulich's comments [7] on my question "why must pmem resource
>  management and partition be done in hypervisor":
>  | Because that's where memory management belongs. And PMEM,
>  | other than PBLK, is just another form of RAM.
>  | ...
>  | The main issue is that this would imo be a layering violation
> 
>  George Dunlap's comments [8]:
>  | This is not the case for PMEM.  The whole point of PMEM (correct me if
>    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
>  | I'm wrong) is to be used for long-term storage that survives over
>  | reboot.  It matters very much that a guest be given the same PRAM
>  | after the host is rebooted that it was given before.  It doesn't make
>  | any sense to manage it the way Xen currently manages RAM (i.e., that
>  | you request a page and get whatever Xen happens to give you).
>  |
>  | So if Xen is going to use PMEM, it will have to invent an entirely new
>  | interface for guests, and it will have to keep track of those
>  | resources across host reboots.  In other words, it will have to
>  | duplicate all the work that Linux already does.  What do we gain from
>  | that duplication?  Why not just leverage what's already implemented in
>  | dom0?
>  and [9]:
>  | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
>  | then you're right -- it is just another form of RAM, that should be
>  | treated no differently than say, lowmem: a fungible resource that can be
>  | requested by setting a flag.
> 
>  However, pmem is used more as persistent storage than as fungible
>  ram, and my design is for the former usage. I would like to leave
>  the detection, driver and partitioning (either through namespaces
>  or file systems) of NVDIMM to the Dom0 Linux kernel.

After reading the whole introduction, I vote for this option too. One
immediate reason for a resource to be managed in Xen is that Xen
itself also uses it, e.g. normal RAM; in that case Xen has to control
the whole resource to protect itself from Dom0 and other user VMs.
For a resource that Xen does not use at all, it is reasonable to
leave it to Dom0, which reduces code duplication and unnecessary
maintenance burden on the Xen side, as we have done for the whole PCI
sub-system and other I/O peripherals. I'm not sure whether there is
future value in using pmem within Xen itself; at least for now the
primary requirement is exposing pmem to guests. From that angle,
reusing the NVDIMM driver in Dom0 looks like the better choice, with
less enabling effort to catch up with KVM.

Thanks
Kevin



 

