
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen



On 10/12/16 05:32 -0600, Jan Beulich wrote:
On 12.10.16 at 12:33, <haozhong.zhang@xxxxxxxxx> wrote:
The layout is shown as the following diagram.

+---------------+-----------+-------+----------+--------------+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel    |   Table   | Block | for Xen  |              |
+---------------+-----------+-------+----------+--------------+
                \_____________________ _______________________/
                                      V
                                 /dev/pmem0

I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.


Isn't this OS-reserved area still not OS-agnostic, as it requires the
OS to know where the reserved area is?  Or do you mean it becomes
OS-agnostic once it's defined by a protocol that is accepted by all
OSes?

Let me list another two methods that just came to my mind.

1. The first method extends the usage of the super block used by the
  current Linux kernel to reserve space on pmem.

  The current Linux kernel places a super block with the following
  structure near the beginning of a pmem namespace.

   struct nd_pfn_sb {
           u8 signature[PFN_SIG_LEN];
           u8 uuid[16];
           u8 parent_uuid[16];
           __le32 flags;
           __le16 version_major;
           __le16 version_minor;
           __le64 dataoff; /* relative to namespace_base + start_pad */
           __le64 npfns;
           __le32 mode;
           /* minor-version-1 additions for section alignment */
           __le32 start_pad;
           __le32 end_trunc;
           /* minor-version-2 record the base alignment of the mapping */
           __le32 align;
           u8 padding[4000];
           __le64 checksum;
   };

   Two interesting fields here are 'dataoff' and 'mode':
   - 'dataoff' indicates the offset where the data area starts,
     i.e., IIUC, the part that can be accessed via /dev/pmemN or
     /dev/daxN.
   - 'mode' indicates whether Linux keeps struct page for this
     namespace in RAM (= PFN_MODE_RAM) or on the device itself
     (= PFN_MODE_PMEM).

   Currently for Linux, only 'mode' is customizable, while 'dataoff'
   is not. If mode == PFN_MODE_RAM, no reservation for struct page is
   made on the device, and the data area starts almost immediately
   after the super block, except for a small reserved area in between
   for other structures and alignment. If mode == PFN_MODE_PMEM, the
   size of the reservation is decided by the kernel, i.e. 64 bytes per
   struct page.

   I propose to make the size of the reserved area customizable,
   e.g. via ioctl and ndctl.
   - If mode == PFN_MODE_PMEM and
     * if the given reserved size is large enough to hold what an OS
       (not limited to Linux) wants to put in it, then the OS just
       starts using it as desired;
     * if the given reserved size is not enough, then the OS reports
       an error and may take other fallback actions.
   - If mode == PFN_MODE_RAM and
     * if the reserved size is zero, then it's the current way that
       Linux uses the device;
     * if the reserved size is non-zero, I would like to reserve this
       case for hypervisor (right now, namely Xen hypervisor)
       usage. That is, the OS should not use the reserved area
       itself. For Xen, we could add a function to the Xen driver in
       the kernel to report the reserved area to the hypervisor (a
       rough sketch follows after this list).
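
   To make the intended handling concrete, below is a minimal, purely
   illustrative sketch of the size check an OS might perform against
   such a customizable reservation; claim_reserved_area() and
   xen_report_pmem_reservation() are made-up names, not existing
   kernel or Xen interfaces.

    /* Illustrative sketch only; not existing kernel code. */
    #include <stdint.h>
    #include <errno.h>

    struct pmem_reserved_area {
            uint64_t base;   /* start of the reserved area on the device */
            uint64_t size;   /* reserved size recorded in the super block */
    };

    /* Stand-in for a hypothetical call reporting the range to Xen. */
    static int xen_report_pmem_reservation(uint64_t base, uint64_t size)
    {
            (void)base;
            (void)size;
            return 0;        /* pretend the hypervisor accepted the range */
    }

    static int claim_reserved_area(const struct pmem_reserved_area *rsv,
                                   uint64_t needed, int mode_is_ram)
    {
            /* Reservation too small: report an error and fall back. */
            if (rsv->size < needed)
                    return -ENOSPC;

            /* PFN_MODE_RAM + non-zero reservation: the OS does not use
             * the area itself, it only reports it to the hypervisor. */
            if (mode_is_ram)
                    return xen_report_pmem_reservation(rsv->base, rsv->size);

            /* PFN_MODE_PMEM with enough space: the OS uses it as desired. */
            return 0;
    }

    int main(void)
    {
            /* Example: a 2 MiB reservation, of which Xen wants 1 MiB. */
            struct pmem_reserved_area rsv = { .base = 0x1000, .size = 2u << 20 };

            return claim_reserved_area(&rsv, 1u << 20, 1) ? 1 : 0;
    }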

  I guess this might be the OS-agnostic way Jan expects, but Dan may
  object to it.


2. Layer another pseudo device on top of the block device
  (e.g. /dev/pmemN) provided by the NVDIMM driver.

  This pseudo device can reserve space according to the user's
  requirement. The reservation information can be persistently
  recorded in a super block placed before the reserved area.

  This pseudo device also implements another pseudo block device to
  allow the non-reserved area to be accessed as a block device (we
  could even implement it as DAX-capable).

                                              pseudo block device
                                            /---------^-----------\
+------------------+-------+---------------+-----------------------+
|  whatever used   | Super |  reserved by  |                       |
| by NVDIMM driver | Block | pseudo device |                       |
+------------------+-------+---------------+-----------------------+
                    \_____________________ _______________________/
                                          V
                                      /dev/pmem0
                               (provided by NVDIMM driver)

  In order to make this work across different OSes, it requires that
  other OSes recognize the same types of pmem block devices made by
  Linux and implement a driver for the pseudo device (a sketch of the
  super block such a driver would need to understand follows below).
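
  As a purely hypothetical illustration, the on-pmem super block of
  the pseudo device might record something like the following; every
  field and name here is made up for illustration and does not exist
  in any current driver.

   /* Hypothetical on-pmem super block for the pseudo device. */
   #include <stdint.h>

   struct pseudo_dev_sb {
           uint8_t  signature[16];  /* identifies the pseudo device format */
           uint8_t  uuid[16];       /* identity of this particular device */
           uint16_t version_major;
           uint16_t version_minor;
           uint64_t reserved_off;   /* start of the reserved area, from device base */
           uint64_t reserved_size;  /* user-specified size of the reserved area */
           uint64_t data_off;       /* start of the non-reserved, exported area */
           uint64_t checksum;       /* integrity check over this super block */
   };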

  This is inspired by Dan's reply at
  https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg00651.html.

  However, it's essentially the same as my partition solution, so I
  guess Jan will still dislike it.


Any comments?

The assumption of course is that the reserved area holds no
persistent data. If that assumption didn't hold, you'd have to
have per-OS reserved areas anyway (as many of them as
there might be OSes [planned to get] installed on a particular
system).


No persistent data should be placed in the reserved area.

Thanks,
Haozhong
