
Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen



> > > As above, if linux driver detects the signature "NVDIMM_PFN_INFO" and
> > > a matched checksum, it will know it's safe to write to the reserved
> > > area. Otherwise, it will treat the pmem namespace as a raw device and
> > > store page structs in the normal RAM.
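
As an aside, a minimal sketch of the kind of check the driver does (the
field subset and the externally supplied checksum argument are
illustrative; Linux actually computes a fletcher64 over the superblock):

    #include <stdint.h>
    #include <string.h>

    #define PFN_SIG_LEN 16

    struct nd_pfn_sb {                          /* illustrative subset */
            uint8_t  signature[PFN_SIG_LEN];    /* "NVDIMM_PFN_INFO" */
            /* ... uuid, flags, version, dataoff, npfns ... */
            uint64_t checksum;
    };

    /* Return 1 if the namespace carries a valid pfn superblock, i.e. the
     * reserved area is initialized and safe to reuse; 0 -> raw device. */
    static int pfn_sb_valid(const struct nd_pfn_sb *sb, uint64_t csum)
    {
            if (memcmp(sb->signature, "NVDIMM_PFN_INFO", PFN_SIG_LEN))
                    return 0;
            return sb->checksum == csum;        /* fletcher64 in Linux */
    }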
> > 
> > OK, so my worry is that we will have a divergence. Which is that
> > the system admin creates this under ndctl v0, boots Xen and uses it.
> > Then moves the NVDIMM to another machine which has ndctl v1 and
> > he/she boots Linux.
> > 
> > Linux gets all confused b/c the region has something it can't understand
> > and the user is very angry.
> > 
> > So it sounds like the size that ndctl reserves MUST be baked into an ABI
> > and made sure to expand if needed.
> >
> 
> ndctl is a management tool which passes all its requests to the driver
> via sysfs, so any compatibility issues across different versions of Linux
> would actually be introduced by the different versions of the drivers.
> 
> All newer versions of drivers should provide backwards compatibility
> with previous versions (which is the current drivers'
> behavior). However, forwards compatibility is hard to preserve,
> e.g.
>  - an old version w/o reserved area support (e.g. the one in Linux
>    kernel 4.2) recognizes a pmem namespace w/ a reserved area as a raw
>    device and may write to the reserved area. If it's a Xen reserved
>    area and the driver is in dom0, the dom0 kernel will crash.

Yikes!
>    
>  - the same crash would happen if an old version driver w/ reserved
>    area support but w/o Xen reserved area support (e.g. the one in Linux
>    kernel 4.7) is used for a pmem namespace w/ a Xen reserved area.
> 
> For the cross-OS compatibility, there is an effort to standardize the
> reservation. In the meantime, only Linux is capable of handling such
> pmem namespaces with a reserved area.

It may be good to mention these difficulties you enumerated in the design
doc, so that if somebody does end up in this position and searches for it,
they can find a reference.

> 
> > ..snip..
> > > > This "balloon out" is interesting. You are effectively telling Linux
> > > > to ignore a certain range of 'struct page_info', so that if somebody
> > > > uses /sys/debug/kernel/page_walk it won't blow up? (As the kerne
> > > > can't read the struct page_info anymore).
> > > >
> > > > How would you do this? Simulate an NVDIMM unplug?
> > > 
> > > s/page_info/page/ (struct page for linux, struct page_info for xen)
> > > 
> > > As in Jan's comment, "balloon out" is a confusing name here.
> > > Basically, it's to remove the reserved area from certain resource
> > > structs in the nvdimm driver, to prevent it from being accessed outside
> > > the driver via those structs. And the nvdimm driver does not map the
> > > reserved area, so I think it cannot be touched via page_walk.
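
So conceptually something like this on the Linux side (request_mem_region
is the real kernel API; the function wrapped around it is made-up
scaffolding for illustration):

    #include <linux/ioport.h>

    /* Illustrative only: claim the Xen-reserved span inside the pmem
     * region so no other kernel user can request or map it. */
    static int pmem_hide_xen_reserved(resource_size_t rsv_start,
                                      resource_size_t rsv_len)
    {
            if (!request_mem_region(rsv_start, rsv_len, "xen-reserved"))
                    return -EBUSY;
            return 0;
    }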
> > 
> > OK, I need to read the Linux code more to make sure I am
> > not missing something.
> > 
> > Basically the question that keeps revolving in my head is:
> > 
> > Why is this even necessary?
> > 
> > Let me expand - it feels like (and I think I am missing something
> > here) that we are crippling the Linux driver so that it won't
> > break - b/c if it tried to access the 'struct page_info' in this
> > reserved region it would crash. So we eliminate that, and make
> > the driver believe the region exists (is reserved), but it can't
> > use it. And instead use the normal RAM pages to keep track
> > of the NVDIMM SPAs.
> > 
> > Or perhaps not keep track at all and just treat the whole
> > NVDIMM as opaque MMIO that is inaccessible?
> >
> 
> If we trust the driver in the dom0 kernel to always do the correct thing
> (and we can trust it, right?), no crash will happen. However, as Jan
> commented
> (https://lists.xenproject.org/archives/html/xen-devel/2016-08/msg00433.html):
> 
> | Right now Dom0 isn't allowed to access any memory in use by Xen
> | (and not explicitly shared), and I don't think we should deviate
> | from that model for pmem.
> 
> the Xen hypervisor must explicitly disallow dom0 from accessing the
> reserved area.

Right.
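
I.e. something along these lines on the Xen side (the names here are
hypothetical; the real check would sit in Xen's existing
mapping-permission paths):

    #include <xen/errno.h>

    /* Hypothetical bounds of the pmem area Xen reserved for its own
     * frame table, in machine frame numbers. */
    static unsigned long pmem_rsv_start_mfn, pmem_rsv_end_mfn;

    /* Reject any dom0 attempt to map a reserved pmem frame, just as
     * Xen rejects mappings of its own private frames. */
    static int check_pmem_mapping(unsigned long mfn)
    {
            if (mfn >= pmem_rsv_start_mfn && mfn < pmem_rsv_end_mfn)
                    return -EPERM;
            return 0;
    }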
> 
> > But how will that work if there is a DAX filesystem on it?
> > The ext4 needs some mechanism to access the files that are there.
> > (Otherwise you couldn't use the fiemap ioctl to find the SPAs).
> >
> 
> No, the file system does not touch the reserved area. If a reserved

Ah, OK!
> area exists, the start SPA of /dev/pmem0 reported via sysfs is the
> start SPA of the reserved area, so fiemap can still work.
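
For the record, resolving a file extent to an SPA from userspace could
look like this (FS_IOC_FIEMAP is the real ioctl; the sysfs attribute the
namespace base comes from is an assumption and varies by kernel version):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    /* Usage: ./spa <file-on-dax-fs> <namespace-base-spa>
     * The base SPA would come from sysfs, e.g. the namespace's
     * 'resource' attribute (exact path is an assumption here). */
    int main(int argc, char **argv)
    {
            struct { struct fiemap fm; struct fiemap_extent ext[1]; } m;
            uint64_t base_spa;
            int fd;

            if (argc < 3)
                    return 1;
            base_spa = strtoull(argv[2], NULL, 0);
            fd = open(argv[1], O_RDONLY);
            memset(&m, 0, sizeof(m));
            m.fm.fm_length = ~0ULL;         /* map the whole file */
            m.fm.fm_extent_count = 1;       /* first extent is enough here */
            if (fd < 0 || ioctl(fd, FS_IOC_FIEMAP, &m.fm) < 0 ||
                m.fm.fm_mapped_extents == 0) {
                    perror("fiemap");
                    return 1;
            }
            printf("first extent SPA: 0x%llx\n",
                   (unsigned long long)(base_spa + m.ext[0].fe_physical));
            return 0;
    }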
> 
> > [see below]
> > > 
> > > > 
> > > > But if you do that how will SMART tools work anymore? And
> > > > who would do the _DSM checks on the health of the NVDIMM?
> > > >
> > > 
> > > A userspace SMART tool cannot access the reserved area, so I think it
> > > can still work. I haven't looked at the implementation of any SMART
> > > tools for NVDIMM, but I guess they would eventually call the driver to
> > > evaluate the ARS _DSM, which reports the bad blocks. As long as the
> > > driver does not return the bad blocks in the reserved area to SMART
> > > tools (which I suppose should be handled by the driver itself), SMART
> > > tools should work fine.
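
I.e. the driver would clamp ranges before surfacing them, along these
lines (entirely hypothetical helper; a range spanning the whole reserved
area would need to be split in two, which is omitted for brevity):

    #include <stdint.h>

    /* Hypothetical: clamp a bad-block range [*start, *start + *len)
     * reported by ARS so the part overlapping the reserved area
     * [rsv_start, rsv_end) is never surfaced to userspace tools.
     * Returns 0 if the range is entirely hidden. */
    static int filter_badrange(uint64_t *start, uint64_t *len,
                               uint64_t rsv_start, uint64_t rsv_end)
    {
            uint64_t end = *start + *len;

            if (*start >= rsv_start && end <= rsv_end)
                    return 0;                    /* fully inside: hide */
            if (*start < rsv_start && end > rsv_start)
                    *len = rsv_start - *start;   /* truncate the tail */
            else if (*start >= rsv_start && *start < rsv_end) {
                    *start = rsv_end;            /* skip over the front */
                    *len = end - rsv_end;
            }
            return 1;
    }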
> > > 
> > > > /me scratches his head. Perhaps the answers are later in this
> > > > design..
> > 
> > So I think I figured out the issue here!!
> > 
> > You just want the Linux kernel driver to use normal RAM
> > pages to keep track of the NVDIMM SPA ranges.
> 
> Yes, this is what the current driver does for a raw device.
> 
> > As in treat the NVDIMM as if it is normal RAM?
> 
> If you are talking about the location of the page structs, then yes. The
> page structs for the NVDIMM are put in normal RAM just like the page
> structs for normal RAM. But the NVDIMM itself can never, for example, be
> allocated via the kernel memory allocator (buddy/slab/etc.).

Right. I was thinking of page struct location.
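
And a quick back-of-the-envelope on why that space matters (assuming the
usual 64-byte struct page and 4 KiB pages on x86_64 - neither is set in
stone):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t nvdimm_bytes = 1ULL << 40;      /* 1 TiB NVDIMM */
            uint64_t pages   = nvdimm_bytes / 4096;  /* 4 KiB pages */
            uint64_t reserve = pages * 64;           /* 64 B per struct page */

            /* ~1.6% of capacity: a 1 TiB NVDIMM needs ~16 GiB of page
             * structs - a lot to ask of normal RAM, hence the on-NVDIMM
             * reservation. */
            printf("reserve: %llu MiB\n",
                   (unsigned long long)(reserve >> 20));
            return 0;
    }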
> 
> > 
> > [Or is Linux treating this area as an MMIO region (in which case it does
> > not need struct page_info)??]
> >
> > And then Xen can use this reserved region for its own
> > purpose!
> > 
> > Perhaps then the section that explains this 'reserved region' could
> > say something along:
> > 
> > "We need to keep track of the SPAs. The guest NVDIMM 'file'
> > on the NVDIMM may be in the worst case be randomly and in descending
> > discontingous order (say from the end of the NVDIMM), we need
> > to keep track of each of the SPAs. The reason is that we need
> > the SPAs when we populate the guest EPT.
> > 
> > As such we can store the guest SPA in memory (linear array?)
> > or red-black tree, or any other - but all of them will consume
> > "normal RAM". And with sufficient large enough NVDIMM we may
> > not have enough 'normal RAM' to store this.
> > 
> > Also we only need to know these SPAs during guest creation,
> > destruction, ballooning, etc - hence we may store them on the
> > NVDIMM itself. Fortunatly for us the ndctl and Linux are
> > available which carve out right after the namespace region (128kb)
> > and 'reserved region' which the OS can use to store its
> > struct page_info to cover the full range of the NVDIMM.
> > 
> > The complexity in this is that:
> >  - We MUST make sure Linux does not try to use it while
> >    we use it.
> >  - That the size of this 'reserved region' is sufficiently
> >    large for our 'struct page_info' structure.
> >  - The layout has an ABI baked in.
> >  - Linux fs'es with DAX support MUST be able to mlock these SPA
> >    regions (so that nobody tries to remove the 'file' while
> >    a guest is using it).
> 
> I need to check whether linux currently does this.
> 
> >  - Linux fs'es with DAX support MUST be able to resize the
> >    'file', thereby using more of the SPAs and rewriting the
> >    properties of the file on DAX (which should then cause a
> >    memory hotplug ACPI event in the guest, treating the new size of
> >    the file as a new NFIT region?)
> >
> 
> Currently my plan is to disallow such resizing, and possibly other
> changes from outside the guest, while it is being used by a guest (akin
> to a disk), in the first implementation. It's mostly for simplicity and
> we can add it in the future. For hotplug, we can pass another file to the
> guest as a new pmem namespace.
> 
> > "
> > 
> > I think that covers it?
> > ..snip..
> > > > >  Our design takes the following method to avoid and detect collisions.
> > > > >  1) The data layout of area where QEMU copies its NFIT and ACPI
> > > > >     namespace devices is organized as below:
> > > > 
> > > > Why can't this be expressed in XenStore?
> > > > 
> > > > You could have 
> > > > /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, type}
> > > > ?
> > > >
> > > 
> > > If XenStore can be used, then it could save some guest memory.
> > 
> > It is also easier than relying on the format of a blob in memory.
> > > 
> > > This is a general mechanism to pass ACPI objects and is not limited to
> > > NVDIMM, so it means QEMU may pass a lot of entries. I'm not sure
> > > XenStore is still a proper place when the number is large. Maybe we
> > > should put an upper limit on the number of entries.
> > 
> > Why put a limit on it? It should easily handle thousands of <name>.
> > And the only attributes you have under <name> are just address,
> > length and type.
> >
> 
> OK, if it's not a problem, I will use XenStore to pass that
> information.
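
A sketch of what writing one entry could look like with libxenstore
(xs_open/xs_write are the real API; the dm-acpi path layout is just the
one proposed above, not an established ABI):

    #include <stdio.h>
    #include <string.h>
    #include <xenstore.h>

    /* Write one dm-acpi entry for 'domid': vals[] holds pre-formatted
     * address/length/type strings, e.g. "0xfc000000", "0x1000", "NFIT". */
    static int write_dm_acpi(struct xs_handle *xsh, int domid,
                             const char *name, const char *vals[3])
    {
            static const char *leaf[3] = { "address", "length", "type" };
            char path[256];
            int i;

            for (i = 0; i < 3; i++) {
                    snprintf(path, sizeof(path),
                             "/local/domain/%d/hvmloader/dm-acpi/%s/%s",
                             domid, name, leaf[i]);
                    if (!xs_write(xsh, XBT_NULL, path, vals[i],
                                  strlen(vals[i])))
                            return -1;
            }
            return 0;
    }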
> 
> > .. snip..
> > > > > 4.3.2 Emulating Guest _DSM
> > > > > 
> > > > >  Our design leaves the emulation of guest _DSM to QEMU. Just as what
> > > > >  it does with KVM, QEMU registers the _DSM buffer as MMIO region with
> > > > >  Xen and then all guest evaluations of _DSM are trapped and emulated
> > > > >  by QEMU.
> > > > 
> > > > Sweet!
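
For reference, the trap-and-emulate setup inside QEMU would be the usual
MemoryRegionOps pattern (memory_region_init_io is QEMU's real internal
API; the handler bodies and the init helper below are placeholders):

    /* Inside QEMU (illustrative; assumes QEMU-internal headers). */
    static uint64_t dsm_read(void *opaque, hwaddr addr, unsigned size)
    {
            /* Guest reads the _DSM buffer: return the output QEMU
             * prepared when it emulated the last method call. */
            return 0;    /* placeholder */
    }

    static void dsm_write(void *opaque, hwaddr addr,
                          uint64_t val, unsigned size)
    {
            /* Guest wrote the _DSM input buffer: decode and emulate. */
    }

    static const MemoryRegionOps dsm_ops = {
            .read = dsm_read,
            .write = dsm_write,
            .endianness = DEVICE_LITTLE_ENDIAN,
    };

    /* At device realize time (illustrative context): every guest access
     * to this page then traps into QEMU and is emulated, as with KVM. */
    static void dsm_init(MemoryRegion *mr, Object *owner, void *opaque)
    {
            memory_region_init_io(mr, owner, &dsm_ops, opaque,
                                  "nvdimm-dsm", 4096);
    }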
> > > > 
> > > > So one question that I am not sure has been answered: with the
> > > > 'struct page_info' being removed from dom0, how will OEM _DSM methods
> > > > operate? For example some of the AML code may ask to poke
> > > > at specific SPAs, but how will Linux do this properly without
> > > > 'struct page_info' being available?
> > > >
> > > 
> > > (s/page_info/page/)
> > > 
> > > The current Intel NVDIMM driver in Linux does not evaluate any OEM
> > > _DSM method, so I'm not sure whether the kernel has to access an NVDIMM
> > > page while evaluating a _DSM.
> > > 
> > > The closest one in my mind, though not an OEM _DSM, is function 1
> > > of the ARS _DSM, which takes as inputs a start SPA and a length in
> > > bytes. After the kernel supplies the inputs, the scrubbing of the
> > > specified area is done by the hardware and does not require any
> > > mappings in the OS.
> > 
> > <nods>
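
For reference, the Start-ARS payload really is just that pair plus some
bookkeeping (the shape follows Linux's uapi ndctl definitions; shown here
for illustration, so double-check it against your headers):

    #include <stdint.h>

    /* ACPI Address Range Scrub, function 1 (Start ARS): the OS only
     * supplies an SPA range; the platform does the scrubbing. */
    struct nd_cmd_ars_start {
            uint64_t address;        /* start SPA */
            uint64_t length;         /* length in bytes */
            uint16_t type;           /* persistent and/or volatile scrub */
            uint8_t  flags;
            uint8_t  reserved[5];
            uint32_t status;         /* output: filled by firmware */
            uint32_t scrub_time;     /* output: estimated scrub time */
    } __attribute__((packed));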
> > > 
> > > Any example of such OEM _DSM methods?
> > 
> > I can't think of any right now - but that is the danger of OEMs - they
> > may decide to do something .. ill-advised. Hence having it work
> > the same way as Linux is what we should strive for.
> > 
> 
> I see: though the evaluation itself does not use any software
> maintained mappings, the driver may use them when handling the result of
> an evaluation, e.g. the ARS _DSM reports bad blocks in the reserved area
> and the driver may then have to access the reserved area (though this
> could never happen in the current kernel because the driver does ARS
> before reservation).
> 
> Currently there is no OEM _DSM support in the Linux kernel, so I cannot
> think of a concrete solution. However, if such an OEM _DSM comes, we may
> add Xen-specific handling to the driver or introduce a way in the nvdimm
> driver framework to avoid accessing the reserved area in certain
> circumstances (e.g. when used in Xen dom0).

Thanks!
> 
> Thanks,
> Haozhong

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
