[Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen


The following document describes the design of adding vNVDIMM support
for Xen. Any comments are welcome.


1. Background
 1.1 Access Mechanisms: Persistent Memory and Block Window
 1.2 ACPI Support
  1.2.1 NFIT
  1.2.2 _DSM and _FIT
 1.3 Namespace
 1.4 clwb/clflushopt/pcommit
2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
 2.1 NVDIMM Driver in Linux Kernel
 2.2 vNVDIMM Implementation in KVM/QEMU
3. Design of vNVDIMM in Xen
 3.1 Guest clwb/clflushopt/pcommit Enabling
 3.2 Address Mapping
  3.2.1 My Design
  3.2.2 Alternative Design
 3.3 Guest ACPI Emulation
  3.3.1 My Design
  3.3.2 Alternative Design 1: switching to QEMU
  3.3.3 Alternative Design 2: keeping in Xen

Non-Volatile DIMM or NVDIMM is a type of RAM device that provides
persistent storage and retains data across reboot and even power
failures. This document describes the design to support virtual NVDIMM
devices or vNVDIMM in Xen. 

The rest of this document is organized as follows.
 - Section 1 briefly introduces the background knowledge of NVDIMM
   hardware, which is used by other parts of this document.

 - Section 2 briefly introduces the current/future NVDIMM/vNVDIMM
   support in Linux kernel/KVM/QEMU. They will affect the vNVDIMM
   design in Xen.

 - Section 3 proposes design details of vNVDIMM in Xen. Several
   alternatives are also listed in this section.

1. Background

1.1 Access Mechanisms: Persistent Memory and Block Window

 NVDIMM provides two access mechanisms: byte-addressable persistent
 memory (pmem) and block window (pblk). An NVDIMM can contain multiple
 ranges and each range can be accessed through either pmem or pblk
 (but not both).

 Byte-addressable persistent memory mechanism (pmem) maps NVDIMM or
 ranges of NVDIMM into the system physical address (SPA) space, so
 that software can access NVDIMM via normal memory loads and
 stores. If the virtual address is used, then MMU will translate it to
 the physical address.

 In a virtualized environment, we can pass through a pmem range, or
 part of it, to a guest by mapping it in EPT (i.e. mapping the guest
 vNVDIMM physical address to the host NVDIMM physical address), so that
 guest accesses are applied directly to the host NVDIMM device without
 interception by the hypervisor.

 Block window mechanism (pblk) provides one or multiple block windows
 (BW). Each BW is composed of a command register, a status register
 and an 8 KB aperture register. Software fills in the direction of the
 transfer (read/write), the start address (LBA) and the size of the
 data to transfer on NVDIMM. If nothing goes wrong, the transferred
 data can be read or written through the aperture register. The status
 and errors of the transfer can be obtained from the status
 register. Other vendor-specific commands and status can be
 implemented for BW as well. Details of the block window access
 mechanism can be found in [3].

 In a virtualized environment, different pblk regions on a
 single NVDIMM device may be accessed by different guests, so the
 hypervisor needs to emulate BW, which would introduce a high overhead
 for I/O intensive workloads.

 Therefore, we are going to only implement pmem for vNVDIMM. The rest
 of this document will mostly concentrate on pmem.

1.2 ACPI Support

 ACPI provides two kinds of support for NVDIMM. First, NVDIMM
 devices are described by firmware (BIOS/EFI) to OS via ACPI-defined
 NVDIMM Firmware Interface Table (NFIT). Second, several functions of
 NVDIMM, including operations on namespace labels, S.M.A.R.T and
 hotplug, are provided by ACPI methods (_DSM and _FIT).

1.2.1 NFIT

 NFIT is a new system description table added in ACPI v6 with
 signature "NFIT". It contains a set of structures.

 - System Physical Address Range Structure
   (SPA Range Structure)

   SPA range structure describes system physical address ranges
   occupied by NVDIMMs and types of regions.

   If Address Range Type GUID field of a SPA range structure is "Byte
   Addressable Persistent Memory (PM) Region", then the structure
   describes an NVDIMM region that is accessed via pmem. The System
   Physical Address Range Base and Length fields describe the start
   system physical address and the length that is occupied by that
   NVDIMM region.

   A SPA range structure is identified by a non-zero SPA range
   structure index.

   Note: [1] reserves E820 type 7: OSPM must comprehend this memory as
         having non-volatile attributes and handle distinct from
         conventional volatile memory (in Table 15-312 of [1]). The
         memory region supports byte-addressable non-volatility. E820
         type 12 (OEM defined) may also be used for legacy NVDIMM
         prior to ACPI v6.

   Note: Besides OS, EFI firmware may also parse NFIT for booting
         drives (Section of [5]).

 - Memory Device to System Physical Address Range Mapping Structure
   (Range Mapping Structure)

   An NVDIMM region described by a SPA range structure can be
   interleaved across multiple NVDIMM devices. A range mapping
   structure is used to describe the single mapping on each NVDIMM
   device. It describes the size and the offset in a SPA range that an
   NVDIMM device occupies. It may refer to an Interleave Structure
   that contains details of the entire interleave set. This
   information is used in pblk by the NVDIMM driver for address
   translation.

   The NVDIMM device described by the range mapping structure is
   identified by a unique NFIT Device Handle.

 Details of NFIT and other structures can be found in Section 5.25 in [1].
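
 To make the SPA range structure more concrete, here is a minimal C
 sketch of its layout and of the check for a pmem region. The field
 layout and the GUID value reflect my reading of the NFIT section of
 [1] and should be verified against it. The names nfit_spa_range,
 pmem_region_guid and nfit_spa_is_pmem are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed in-memory view of the NFIT SPA Range Structure (type 0). */
struct nfit_spa_range {
    uint16_t type;                /* 0 for a SPA Range Structure */
    uint16_t length;              /* size of this structure in bytes */
    uint16_t range_index;         /* non-zero SPA range structure index */
    uint16_t flags;
    uint32_t reserved;
    uint32_t proximity_domain;
    uint8_t  range_type_guid[16]; /* Address Range Type GUID */
    uint64_t spa_base;            /* System Physical Address Range Base */
    uint64_t spa_length;          /* System Physical Address Range Length */
    uint64_t mem_attr;            /* memory mapping attribute */
} __attribute__((packed));

/*
 * "Byte Addressable Persistent Memory (PM) Region" GUID,
 * 66F0D379-B4F3-4074-AC43-0D3318B78CDC, in ACPI byte order
 * (value as I understand it from [1]).
 */
static const uint8_t pmem_region_guid[16] = {
    0x79, 0xd3, 0xf0, 0x66, 0xf3, 0xb4, 0x74, 0x40,
    0xac, 0x43, 0x0d, 0x33, 0x18, 0xb7, 0x8c, 0xdc,
};

/* Return non-zero if the structure describes a region accessed via pmem. */
static int nfit_spa_is_pmem(const struct nfit_spa_range *spa)
{
    assert(spa != NULL);
    return spa->type == 0 && spa->range_index != 0 &&
           memcmp(spa->range_type_guid, pmem_region_guid, 16) == 0;
}
```

 A consumer would walk the NFIT, call nfit_spa_is_pmem() on each type 0
 structure, and take spa_base and spa_length as the pmem range.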

1.2.2 _DSM and _FIT

 The ACPI namespace device uses _HID of ACPI0012 to identify the root
 NVDIMM interface device. An ACPI namespace device is also present
 under the root device for each NVDIMM device. These ACPI namespace
 devices are defined in SSDT.

 _DSM methods are present under the root device and each NVDIMM
 device. _DSM methods are used by drivers to access the label storage
 area, get health information, perform vendor-specific commands,
 etc. Details of all _DSM methods can be found in [4].

 _FIT method is under the root device and evaluated by OSPM to get
 NFIT of hotplugged NVDIMM. The hotplugged NVDIMM is indicated to OS
 using ACPI Namespace device with PNPID of PNP0C80 and the device
 object notification value is 0x80. Details of NVDIMM hotplug can be
 found in Section 9.20 of [1].

1.3 Namespace

 [2] describes a mechanism to sub-divide NVDIMMs into namespaces,
 which are logical units of storage similar to SCSI LUNs and NVM
 Express namespaces.

 The namespace information is described by namespace labels stored in
 the persistent label storage area on each NVDIMM device. The label
 storage area is excluded from the range mapped by the SPA range
 structure and can only be accessed via _DSM methods.

 There are two types of namespaces defined in [2]: persistent memory
 namespaces and block namespaces. Persistent memory namespaces are
 built only on pmem NVDIMM regions, while block namespaces are built
 only on pblk regions. Only one persistent memory namespace is allowed
 for a pmem NVDIMM region.

 Besides being accessed via _DSM, namespaces are managed and
 interpreted by software. OS vendors may decide to not follow [2] and
 store other types of information in the label storage area.

1.4 clwb/clflushopt/pcommit

 Writes to NVDIMM may be held in CPU caches, so certain flushing
 operations should be performed to make them persistent on
 NVDIMM. clwb is preferred over clflushopt and clflush for flushing
 writes from caches to memory. A subsequent pcommit then makes them
 finally persistent (power failure protected) on NVDIMM.

 Details of clwb/clflushopt/pcommit can be found in Chapter 10 of [6].
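
 As a rough illustration of the flushing step, the following C sketch
 walks every cache line that overlaps a written range and hands it to
 a flush callback. The 64-byte line size, the names flush_range and
 count_line, and the callback interface are assumptions made for this
 example. The callback here only counts lines so that the sketch runs
 on any machine; it does not execute clwb or pcommit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumed cache line size in bytes */

typedef void (*flush_line_fn)(const void *line);

/*
 * Call flush_line once for every cache line overlapping
 * [addr, addr + len) and return the number of lines visited.
 */
static size_t flush_range(const void *addr, size_t len,
                          flush_line_fn flush_line)
{
    uintptr_t start, end, p;
    size_t n = 0;

    assert(flush_line != NULL);
    if (len == 0)
        return 0;

    start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    end = (uintptr_t)addr + len;
    for (p = start; p < end; p += CACHE_LINE_SIZE) {
        flush_line((const void *)p);
        n++;
    }
    return n;
}

/*
 * Stand-in callback that only counts calls. A real one would execute
 * clwb (or clflushopt/clflush) on the line. After the loop the caller
 * would issue pcommit, with the fences that [6] requires around it.
 */
static size_t lines_flushed;

static void count_line(const void *line)
{
    (void)line;
    lines_flushed++;
}
```

 For example, an 8-byte store that starts 60 bytes into a cache line
 touches two lines, so flush_range() invokes the callback twice.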

2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU

2.1 NVDIMM Driver in Linux Kernel

 Linux kernel since 4.2 has supported ACPI-defined NVDIMM devices.

 NVDIMM driver in Linux probes NVDIMM devices through ACPI (i.e. NFIT
 and _FIT). It is also responsible for parsing the namespace labels on
 each NVDIMM device, recovering namespaces after power failure (as
 described in [2]) and handling NVDIMM hotplug. There are also some
 vendor drivers to perform vendor-specific operations on NVDIMMs
 (e.g. via _DSM).

 Unlike ordinary ram, NVDIMM is used more like a persistent storage
 drive because of its persistence. For each persistent memory
 namespace, or each label-less pmem NVDIMM range, the NVDIMM driver
 implements a block device interface (/dev/pmemX).

 A userspace application can mmap(2) the whole pmem into its own
 virtual address space. Linux kernel maps the system physical address
 space range occupied by pmem into the virtual address space, so that
 normal memory loads/stores, with proper flushing instructions, are
 applied to the underlying pmem NVDIMM regions.

 Alternatively, a DAX file system can be made on /dev/pmemX. Files on
 that file system can be used in the same way as above. As Linux
 kernel maps the system address space range occupied by those files on
 NVDIMM to the virtual address space, reads/writes on those files are
 applied to the underlying NVDIMM regions as well.
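
 The mmap(2) usage described above can be sketched in C as follows.
 This is only an illustration: the helper names map_backing and
 persist_store are invented here, and the example is meant to be run
 against a regular file standing in for /dev/pmemX. It uses msync(2)
 as the portable way to push stores out; on real pmem the flushing
 instructions of Section 1.4 would be the relevant mechanism.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map len bytes of path shared and writable; return NULL on failure. */
static void *map_backing(const char *path, size_t len)
{
    void *buf;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return NULL;
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return buf == MAP_FAILED ? NULL : buf;
}

/*
 * Copy the string s (including its terminator) to offset off of the
 * mapping with ordinary stores, then ask the kernel to write it back.
 * Returns 0 on success and -1 on error or if the string does not fit.
 */
static int persist_store(void *map, size_t map_len, size_t off,
                         const char *s)
{
    size_t n;

    assert(map != NULL && s != NULL);
    n = strlen(s) + 1;
    if (off > map_len || n > map_len - off)
        return -1;
    memcpy((char *)map + off, s, n);
    return msync(map, map_len, MS_SYNC);
}
```

 With a DAX file system on /dev/pmemX, the same two helpers would be
 applied to a file on that file system instead.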

2.2 vNVDIMM Implementation in KVM/QEMU

 An overview of vNVDIMM implementation in KVM (Linux kernel v4.2) / QEMU (commit
 70d1fb9 and patches in-review/future) is shown in the following figure.

 Guest                             GPA |                    | /dev/pmem0 |
           parse        evaluate                            ^            ^
            ACPI          _DSM                              |            |
              |            |                                |            |
              V            V                                |            |
          +-------+    +-------+                            |            |
 QEMU     | vACPI |    | v_DSM |                            |            |
          +-------+    +-------+                            |            |
                           ^                                |            |
                           | Read/Write                     |            |
                           V                                |            |
          +...+--------------------+...+-----------+        |            |
    VA    |   | Label Storage Area |   |    buf    |        |            |
          +...+--------------------+...+-----------+        |            |
                                       ^  mmap(2)  ^        |            |
                                       |           +--------~------------+
                                       |                    |            |
 Linux/KVM                             +--------------------+            |
                                                            |            |
                                                SPA    |    | /dev/pmem0 |
                                                            Host NVDIMM Driver
 HW                                                          +------------+
                                                             |   NVDIMM   |

 One part not shown in the above figure is the enabling of guest
 clwb/clflushopt/pcommit, which exposes those instructions to the guest
 via guest cpuid.

 Besides instruction enabling, there are two primary parts of vNVDIMM
 implementation in KVM/QEMU.

 (1) Address Mapping

  As described before, the host Linux NVDIMM driver provides a block
  device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
  region. QEMU can then mmap(2) that device into its virtual address
  space (buf). QEMU is responsible for finding a proper guest physical
  address space range that is large enough to hold /dev/pmem0. Then
  QEMU passes the virtual address of the mmapped buf to the KVM API
  KVM_SET_USER_MEMORY_REGION, which maps in EPT the host physical
  address range of buf to the guest physical address space range where
  the virtual pmem device will be.

  In this way, all guest writes/reads on the virtual pmem device are
  applied directly to the host one.

  Besides, the above implementation also allows a virtual pmem device
  to be backed by a mmapped regular file or a piece of ordinary ram.

 (2) Guest ACPI Emulation

  As the guest system physical address and the size of the virtual pmem
  device are determined by QEMU, QEMU is responsible for emulating the
  guest NFIT and SSDT. Basically, it builds the guest NFIT and its
  sub-structures that describe the virtual NVDIMM topology, and a
  guest SSDT that defines the ACPI namespace devices of the virtual
  NVDIMM.

  As a portion of a host pmem device, a regular file or ordinary ram
  can be used to back the guest pmem device, the label storage area on
  host pmem cannot always be passed through to the guest. Therefore,
  guest reads/writes on the label storage area are emulated by QEMU. As
  described before, the _DSM method is used by OSPM to access the
  label storage area, and therefore it is emulated by QEMU. The _DSM
  buffer is registered as MMIO, and its guest physical address and
  size are described in the guest ACPI. Every command/status
  read/write from the guest is trapped and emulated by QEMU.

  Guest _FIT method will be implemented similarly in the future.

3. Design of vNVDIMM in Xen

 As in KVM/QEMU, enabling vNVDIMM in Xen is composed of three parts:
 (1) Guest clwb/clflushopt/pcommit enabling,
 (2) Address mapping, and
 (3) Guest ACPI emulation.

 The rest of this section presents the design of each part in
 turn. The basic design principle is to reuse existing code in the
 Linux NVDIMM driver and QEMU as much as possible. Following recent
 discussions on both the Xen and QEMU mailing lists about the v1 patch
 series, alternative designs are also listed below.

3.1 Guest clwb/clflushopt/pcommit Enabling

 The instruction enabling is simple and we do the same work as in KVM/QEMU.
 - All three instructions are exposed to guest via guest cpuid.
 - L1 guest pcommit is never intercepted by Xen.
 - L1 hypervisor is allowed to intercept L2 guest pcommit.

3.2 Address Mapping

3.2.1 My Design

 The overview of this design is shown in the following figure.

                 Dom0                         |               DomU
 QEMU                                         |
     +...+--------------------+...+-----+     |
  VA |   | Label Storage Area |   | buf |     |
     +...+--------------------+...+-----+     |
                     ^            ^     ^     |
                     |            |     |     |
                     V            |     |     |
     +-------+   +-------+        mmap(2)     |
     | vACPI |   | v_DSM |        |     |     |        +----+------------+
     +-------+   +-------+        |     |     |   SPA  |    | /dev/pmem0 |
         ^           ^     +------+     |     |        +----+------------+
 --------|-----------|-----|------------|--   |             ^            ^
         |           |     |            |     |             |            |
         |    +------+     +------------~-----~-------------+            |
         |    |            |            |     |        XEN_DOMCTL_memory_mapping
         |    |            |            +-----~--------------------------+
         |    |            |            |     |
         |    |       +----+------------+     |
 Linux   |    |   SPA |    | /dev/pmem0 |     |     +------+   +------+
         |    |       +----+------------+     |     | ACPI |   | _DSM |
         |    |                   ^           |     +------+   +------+
         |    |                   |           |         |          |
         |    |               Dom0 Driver     |   hvmloader/xl     |
         |    +-------------------~---------------------~----------+
 Xen     |                        |                     |
 HW                                         |    NVDIMM   |

 This design treats host NVDIMM devices as ordinary MMIO devices:
 (1) Dom0 Linux NVDIMM driver is responsible for detecting (through
     NFIT) and driving host NVDIMM devices (implementing the block
     device interface). Namespaces and file systems on host NVDIMM
     devices are handled by Dom0 Linux as well.

 (2) QEMU mmap(2)s the pmem NVDIMM device (/dev/pmem0) into its
     virtual address space (buf).

 (3) QEMU gets the host physical address of buf, i.e. the host system
     physical address that is occupied by /dev/pmem0, and calls Xen
     hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.

 (ACPI part is described in Section 3.3 later)

 Steps (1) and (2) above are already done in current QEMU. Only (3)
 needs to be implemented in QEMU. No change is needed in Xen for
 address mapping in this design.

 Open: It seems no system call/ioctl is provided by Linux kernel to
       get the physical address from a virtual address.
       /proc/<qemu_pid>/pagemap provides information about the mapping
       from VA to PA. Is it an acceptable solution to let QEMU parse
       this file to get the physical address?

 Open: For a large pmem, mmap(2) very likely does not map all SPA
       occupied by pmem at the beginning, i.e. QEMU may not be able to
       get all SPA of pmem from buf (in virtual address space) when
       calling XEN_DOMCTL_memory_mapping.
       Can mmap flag MAP_LOCKED or mlock(2) be used to force the
       entire pmem to be mapped?
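
 For the first open question, parsing pagemap could look roughly like
 the C sketch below. It relies on the entry format as I understand it
 from the kernel's pagemap documentation (one 64-bit entry per page,
 bit 63 = present, bits 0-54 = page frame number). The helper names
 are made up, and the sketch reads /proc/self/pagemap rather than the
 pagemap of another process. Note that recent kernels are expected to
 hide the PFN from unprivileged readers, and whether useful entries
 are reported for a pmem mapping at all is exactly what is still open.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define PAGEMAP_PRESENT  (1ULL << 63)       /* page is present in RAM */
#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1) /* bits 0-54: PFN */

static int pagemap_present(uint64_t entry)
{
    return (entry & PAGEMAP_PRESENT) != 0;
}

static uint64_t pagemap_pfn(uint64_t entry)
{
    return entry & PAGEMAP_PFN_MASK;
}

/*
 * Translate va of the calling process. Returns 0 and stores the
 * physical address in *pa on success, or -1 if pagemap cannot be read
 * or the page is not present.
 */
static int va_to_pa(const void *va, uint64_t *pa)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uintptr_t addr = (uintptr_t)va;
    uint64_t entry;
    off_t off;
    int fd, ok;

    assert(pa != NULL);
    if (page_size <= 0)
        return -1;
    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* One 8-byte entry per virtual page, indexed by page number. */
    off = (off_t)(addr / (uintptr_t)page_size * sizeof(entry));
    ok = pread(fd, &entry, sizeof(entry), off) == (ssize_t)sizeof(entry);
    close(fd);
    if (!ok || !pagemap_present(entry))
        return -1;

    *pa = pagemap_pfn(entry) * (uint64_t)page_size +
          addr % (uintptr_t)page_size;
    return 0;
}
```

 QEMU would have to repeat this lookup for every page of buf, which is
 one reason the second open question about pre-faulting matters.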

3.2.2 Alternative Design

 Jan Beulich's comments [7] on my question "why must pmem resource
 management and partition be done in hypervisor":
 | Because that's where memory management belongs. And PMEM,
 | other than PBLK, is just another form of RAM.
 | ...
 | The main issue is that this would imo be a layering violation

 George Dunlap's comments [8]:
 | This is not the case for PMEM.  The whole point of PMEM (correct me if
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ used as fungible ram
 | I'm wrong) is to be used for long-term storage that survives over
 | reboot.  It matters very much that a guest be given the same PRAM
 | after the host is rebooted that it was given before.  It doesn't make
 | any sense to manage it the way Xen currently manages RAM (i.e., that
 | you request a page and get whatever Xen happens to give you).
 | So if Xen is going to use PMEM, it will have to invent an entirely new
 | interface for guests, and it will have to keep track of those
 | resources across host reboots.  In other words, it will have to
 | duplicate all the work that Linux already does.  What do we gain from
 | that duplication?  Why not just leverage what's already implemented in
 | dom0?
 and [9]:
 | Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
 | then you're right -- it is just another form of RAM, that should be
 | treated no differently than say, lowmem: a fungible resource that can be
 | requested by setting a flag.

 However, pmem is used more as persistent storage than fungible ram,
 and my design is for the former usage. I would like to leave the
 detection, driver and partition (either through namespace or file
 systems) of NVDIMM in Dom0 Linux kernel.

 I notice that current XEN_DOMCTL_memory_mapping does not make a sanity
 check on the physical address and size passed from the caller
 (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
 aware of the SPA range of pmem so that it can refuse to map physical
 addresses that are in neither normal ram nor pmem.

 Instead of duplicating in Xen the detection code (parsing NFIT and
 evaluating _FIT) of Dom0 Linux kernel, we decide to patch Dom0 Linux
 kernel to pass parameters of host pmem NVDIMM devices to Xen:
 (1) Add a global
       struct rangeset pmem_rangeset
     in Xen hypervisor to record all SPA ranges of detected pmem devices.
     Each range in pmem_rangeset corresponds to a pmem device.

 (2) Add a hypercall
       XEN_SYSCTL_add_pmem_range
     (should it be a sysctl or a platform op?)
     that receives a pair of parameters (addr: starting SPA of pmem
     region, len: size of pmem region) and adds a range (addr, addr +
     len - 1) to pmem_rangeset.

 (3) Add a hypercall
       XEN_DOMCTL_pmem_mapping
     that takes the same parameters as XEN_DOMCTL_memory_mapping and
     maps a given host pmem range to guest. It checks whether the
     given host pmem range is in the pmem_rangeset before making the
     actual mapping.

 (4) Patch Linux NVDIMM driver to call XEN_SYSCTL_add_pmem_range
     whenever it detects a pmem device.

 (5) Patch QEMU to use XEN_DOMCTL_pmem_mapping for mapping host pmem
     devices.
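
 The bookkeeping behind steps (1) to (3) can be sketched in plain C as
 below. This is not Xen code: it replaces Xen's rangeset with a small
 fixed-size array so that the example is self-contained, and the names
 pmem_range, pmem_add_range and pmem_range_contains are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PMEM_RANGES 16 /* arbitrary limit for this sketch */

struct pmem_range {
    uint64_t start; /* first SPA of the pmem device */
    uint64_t end;   /* last SPA of the pmem device (inclusive) */
};

static struct pmem_range pmem_ranges[MAX_PMEM_RANGES];
static size_t nr_pmem_ranges;

/* Step (2): record the range (addr, addr + len - 1). */
static int pmem_add_range(uint64_t addr, uint64_t len)
{
    if (len == 0 || addr + len - 1 < addr ||
        nr_pmem_ranges == MAX_PMEM_RANGES)
        return -1;
    pmem_ranges[nr_pmem_ranges].start = addr;
    pmem_ranges[nr_pmem_ranges].end = addr + len - 1;
    nr_pmem_ranges++;
    return 0;
}

/*
 * Step (3): return 1 only if (addr, addr + len - 1) lies entirely
 * inside one recorded pmem range, so the mapping may proceed.
 */
static int pmem_range_contains(uint64_t addr, uint64_t len)
{
    uint64_t end;
    size_t i;

    assert(nr_pmem_ranges <= MAX_PMEM_RANGES);
    if (len == 0 || addr + len - 1 < addr)
        return 0;
    end = addr + len - 1;
    for (i = 0; i < nr_pmem_ranges; i++)
        if (addr >= pmem_ranges[i].start && end <= pmem_ranges[i].end)
            return 1;
    return 0;
}
```

 With this check in place, a request from QEMU that extends past the
 end of a recorded pmem device, or that wraps around the address
 space, is refused before any mapping is made.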

3.3 Guest ACPI Emulation

3.3.1 My Design

 Guest ACPI emulation is composed of two parts: building guest NFIT
 and SSDT that defines ACPI namespace devices for NVDIMM, and
 emulating guest _DSM.

 (1) Building Guest ACPI Tables

  This design reuses and extends hvmloader's existing mechanism that
  loads passthrough ACPI tables from binary files to load NFIT and
  SSDT tables built by QEMU:
  1) Because the current QEMU does not build any ACPI tables when
     it runs as the Xen device model, this design needs to patch QEMU
     to build NFIT and SSDT (so far only NFIT and SSDT) in this case.

  2) QEMU copies NFIT and SSDT to the end of guest memory below
     4G. The guest address and size of those tables are written into
     xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).

  3) hvmloader is patched to probe and load device model passthrough
     ACPI tables from above xenstore keys. The detected ACPI tables
     are then appended to the end of existing guest ACPI tables just
     like what current construct_passthrough_tables() does.

  Reasons for this design are listed below:
  - NFIT and SSDT in question are quite self-contained, i.e. they do
    not refer to other ACPI tables and not conflict with existing
    guest ACPI tables in Xen. Therefore, it is safe to copy them from
    QEMU and append to existing guest ACPI tables.

  - A primary portion of current and future vNVDIMM implementation is
    about building ACPI tables. This design also leaves the emulation
    of _DSM to QEMU, which needs to stay consistent with the NFIT and
    SSDT that QEMU itself builds. Therefore, reusing NFIT and SSDT from
    QEMU can ease the maintenance.

  - Anthony's work to pass ACPI tables from the toolstack to hvmloader
    does not move building SSDT (and NFIT) to toolstack, so this
    design can still put them in hvmloader.

 (2) Emulating Guest _DSM

  Because the same NFIT and SSDT are used, we can leave the emulation
  of guest _DSM to QEMU. Just as it does with KVM, QEMU registers
  the _DSM buffer as an MMIO region with Xen, and then all guest
  evaluations of _DSM are trapped and emulated by QEMU.

3.3.2 Alternative Design 1: switching to QEMU

 Stefano Stabellini's comments [10]:
 | I don't think it is wise to have two components which both think are
 | in control of generating ACPI tables, hvmloader (soon to be the
 | toolstack with Anthony's work) and QEMU. From an architectural
 | perspective, it doesn't look robust to me.
 | Could we take this opportunity to switch to QEMU generating the whole
 | set of ACPI tables?

 So an alternative design could be switching to QEMU to generate the
 whole set of guest ACPI tables. In this way, no conflict would
 arise between the two agents, QEMU and hvmloader. (Is this what
 Stefano Stabellini means by 'robust'?)

 However, the code that builds ACPI tables in QEMU is quite different
 from that in hvmloader. As ACPI tables are important for
 OS to boot and operate devices, it's critical to ensure ACPI tables
 built by QEMU would not break existing guests on Xen. Though I
 believe it could be done after a thorough investigation and
 adjustment, it may take quite a lot of work and tests and should be
 a separate project from enabling vNVDIMM in Xen.

3.3.3 Alternative Design 2: keeping in Xen

 As an alternative to switching to QEMU, another design would be
 building NFIT and SSDT in hvmloader or toolstack.

 The number and parameters of sub-structures in guest NFIT vary
 according to different vNVDIMM configurations that cannot be decided
 at compile-time. In contrast, current hvmloader and toolstack can
 only build static ACPI tables, i.e. their contents are decided
 statically at compile-time and are independent of the guest
 configuration. In order to build guest NFIT at runtime, this design
 may take the following steps:
 (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
     options.

 (2) QEMU accepts above options, figures out the start SPA range
     address/size/NVDIMM device handles/..., and writes them in
     xenstore. No ACPI table is built by QEMU.

 (3) Either xl or hvmloader reads above parameters from xenstore and
     builds the NFIT table.
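
 Step (3) amounts to assembling a table whose length depends on the
 guest configuration and then fixing up the ACPI checksum. A minimal
 C sketch is given below. It assumes the standard 36-byte ACPI table
 header followed by 4 reserved bytes in NFIT, as I read [1]. The
 function names and the "SKETCH" OEM ID are placeholders, and the
 sub-structures are taken as an already encoded byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ACPI_HEADER_LEN   36 /* standard ACPI table header */
#define NFIT_RESERVED_LEN 4  /* reserved bytes after the header */

/* Sum of all bytes modulo 256; a valid ACPI table sums to 0. */
static uint8_t acpi_sum(const uint8_t *table, size_t len)
{
    uint8_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum += table[i];
    return sum;
}

/*
 * Build an NFIT in buf from pre-encoded sub-structures. Returns the
 * table length, or 0 if buf is too small.
 */
static size_t nfit_build(uint8_t *buf, size_t buf_len,
                         const uint8_t *structs, size_t structs_len)
{
    size_t body = ACPI_HEADER_LEN + NFIT_RESERVED_LEN;
    size_t total = body + structs_len;

    assert(buf != NULL);
    if (total < body || total > buf_len || total > UINT32_MAX)
        return 0;

    memset(buf, 0, body);
    memcpy(buf, "NFIT", 4);               /* signature */
    buf[4] = (uint8_t)(total & 0xff);     /* length, little-endian */
    buf[5] = (uint8_t)((total >> 8) & 0xff);
    buf[6] = (uint8_t)((total >> 16) & 0xff);
    buf[7] = (uint8_t)((total >> 24) & 0xff);
    buf[8] = 1;                           /* revision */
    memcpy(buf + 10, "SKETCH", 6);        /* OEM ID, placeholder */
    if (structs_len > 0)
        memcpy(buf + body, structs, structs_len);

    /* Checksum byte (offset 9) makes the whole table sum to zero. */
    buf[9] = (uint8_t)(0 - acpi_sum(buf, total));
    return total;
}
```

 The number of sub-structures only changes structs_len, so the same
 routine covers any vNVDIMM configuration read from xenstore.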

 For guest SSDT, it would take more work. The ACPI namespace devices
 are defined in SSDT by AML, so an AML builder would be needed to
 generate those definitions at runtime.

 This alternative design still needs more work than the first design.

[1] ACPI Specification v6
[2] NVDIMM Namespace Specification
[3] NVDIMM Block Window Driver Writer's Guide
[4] NVDIMM DSM Interface Example
[5] UEFI Specification v2.6
[6] Intel Architecture Instruction Set Extensions Programming Reference
[7] http://www.gossamer-threads.com/lists/xen/devel/414945#414945
[8] http://www.gossamer-threads.com/lists/xen/devel/415658#415658
[9] http://www.gossamer-threads.com/lists/xen/devel/415681#415681
[10] http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg00271.html
