
Re: Understanding osdep_xenforeignmemory_map mmap behaviour



Juergen Gross <jgross@xxxxxxxx> writes:

> On 24.08.22 17:58, Alex Bennée wrote:
>> Juergen Gross <jgross@xxxxxxxx> writes:
>> 
>>> On 24.08.22 13:22, Alex Bennée wrote:
>>>> Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx> writes:
>>>>
>>>>> On 24/08/2022 10:19, Viresh Kumar wrote:
>>>>>> On 24-03-22, 06:12, Juergen Gross wrote:
>>>>>>> For a rather long time we were using "normal" user pages for this
>>>>>>> purpose, which were just locked into memory for doing the hypercall.
>>>>>>>
>>>>>>> Unfortunately there have been very rare problems with that approach, as
>>>>>>> the Linux kernel can set a user page related PTE to invalid for short
>>>>>>> periods of time, which led to EFAULT in the hypervisor when trying to
>>>>>>> access the hypercall data.
>>>>>>>
>>>>>>> In Linux this can be avoided only by using kernel memory, which is the
>>>>>>> reason why the hypercall buffers are allocated and mmap()-ed through the
>>>>>>> privcmd driver.
>>>>>> Hi Juergen,
>>>>>>
>>>>>> I understand why we moved from user pages to kernel pages, but I don't
>>>>>> fully understand why we need to make two separate calls to map the
>>>>>> guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
>>>>>>
>>>>>> Why aren't we doing all of it from mmap() itself? I hacked it up to
>>>>>> check on it and it works fine if we do it all from mmap() itself.
>>>> As I understand it, the MMAPBATCH ioctl is being treated like every
>>>> other hypercall proxied through the ioctl interface. That makes sense
>>>> from the point of view of having a consistent interface to the
>>>> hypervisor, but not from the point of view of providing a consistent
>>>> userspace interface for mapping memory which doesn't care about the
>>>> hypervisor details.
>>>> The privcmd_mmapbatch_v2 interface is slightly richer than what you
>>>> could expose via mmap() because it allows the handling of partial
>>>> mappings with what I presume is a per-page *err array. If you issued the
>>>> hypercall directly from the mmap() and one of the pages wasn't mapped by
>>>> the hypervisor you would have to unwind everything before returning
>>>> EFAULT to the user.
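
(For reference, the existing two-step flow and the per-page *err array
look roughly like this; a sketch only, with the privcmd UAPI header path
and error handling hand-waved:)

/* Two-step privcmd flow: reserve VA space with mmap() on the privcmd
 * device, then ask the hypervisor to populate it via
 * IOCTL_PRIVCMD_MMAPBATCH_V2.  Per-page failures come back through the
 * err array instead of forcing the whole mapping to be unwound. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <xen/sys/privcmd.h>   /* privcmd_mmapbatch_v2; path varies by distro */

static void *map_foreign_range(int privcmd_fd, domid_t dom,
                               const xen_pfn_t *gfns, unsigned int nr,
                               int *err /* nr entries, per-page status */)
{
    size_t len = (size_t)nr * sysconf(_SC_PAGESIZE);

    /* Step 1: carve out a VA range backed by the privcmd device. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, privcmd_fd, 0);
    if (addr == MAP_FAILED)
        return NULL;

    /* Step 2: ask Xen (via privcmd) to map the guest frames there. */
    struct privcmd_mmapbatch_v2 batch = {
        .num  = nr,
        .dom  = dom,
        .addr = (uint64_t)(uintptr_t)addr,
        .arr  = gfns,
        .err  = err,
    };
    if (ioctl(privcmd_fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &batch) < 0) {
        munmap(addr, len);
        return NULL;
    }

    /* err[i] may still hold e.g. -ENOENT for pages the hypervisor refused
     * to map; the rest of the range remains usable. */
    return addr;
}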
>>>>
>>>>>> Aren't we abusing the Linux userspace ABI here? Standard userspace
>>>>>> code would expect just mmap() to be enough to map the memory. Yes,
>>>>>> the current user, Xen itself, is adapted to make two calls, but it
>>>>>> breaks as soon as we want to use something that relies on the Linux
>>>>>> userspace ABI.
>>>>>>
>>>>>> For instance, in our case, where we are looking to create
>>>>>> hypervisor-agnostic virtio backends, the rust-vmm library [1] issues
>>>>>> mmap() only and expects it to work. It doesn't know it is running on a
>>>>>> Xen system, nor should it.
>>>>>
>>>>> Use /dev/xen/hypercall which has a sane ABI for getting "safe" memory.
>>>>> privcmd is very much not sane.
>>>>>
>>>>> In practice you'll need to use both.  /dev/xen/hypercall for getting
>>>>> "safe" memory, and /dev/xen/privcmd for issuing hypercalls for now.
>>>> I'm unsure what is meant by safe memory here. privcmd_buf_mmap() looks
>>>> like it just allocates a bunch of GFP_KERNEL pages rather than
>>>> interacting with the hypervisor directly. Are these the same pages that
>>>> get used when you eventually call privcmd_ioctl_mmap_batch()?
>>>
>>> privcmd_buf_mmap() allocates kernel pages which are used for data being
>>> accessed by the hypervisor when doing the hypercall later. This is a
>>> generic interface used for all hypercalls, not only for
>>> privcmd_ioctl_mmap_batch().
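
(Concretely, getting a "safe" buffer is just an mmap() of
/dev/xen/hypercall; a minimal sketch, error handling omitted:)

/* Allocate hypercall-safe buffer pages by mmap()-ing /dev/xen/hypercall.
 * The privcmd-buf driver backs the mapping with kernel-allocated pages,
 * so the hypervisor never hits a transiently invalid PTE while a
 * hypercall argument buffer is in flight. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *alloc_hypercall_buffer(size_t nr_pages)
{
    size_t len = nr_pages * sysconf(_SC_PAGESIZE);
    int fd = open("/dev/xen/hypercall", O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return NULL;

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping keeps the pages alive */
    return buf == MAP_FAILED ? NULL : buf;
}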
>>>
>>>> The fact that /dev/xen/hypercall is provided by xen_privcmdbuf_dev is a
>>>> little confusing TBH.
>>>> Anyway, the goal here is to provide a non-Xen-aware userspace with a
>>>> standard userspace API to access the guest's memory. Perhaps messing
>>>
>>> This is what the Xen-related libraries are meant for. Your decision to
>>> ignore those is backfiring now.
>> We didn't ignore them - the initial version of the xen-vhost-master
>> binary was built in Rust and linked to the Xen libraries. We are,
>> however, in the process of moving to pure Rust (with the xen-sys
>> crate being a pure Rust ioctl/hypercall wrapper).
>
> Ah, okay, I wasn't aware of this.
>
>> However, I was under the impression there were two classes of
>> hypercalls: ABI-stable ones which won't change (which is all we are
>> planning to implement for xen-sys) and non-stable ones which would need
>> mediating by the Xen libs. We are hoping we can do all of VirtIO with
>> just the stable ABI.
>
> Okay.
>
>> 
>>>> around with the semantics of the /dev/xen/[hypercall|privcmd] device
>>>> nodes is too confusing.
>>>> Maybe we could instead:
>>>>    1. Have the Xen-aware VMM ask to make the guest's memory visible to
>>>>       the host kernel's address space.
>>>
>>> Urgh. This would be a major breach of the Xen security concept.
>>>
>>>>    2. When this is done, explicitly create a device node to represent
>>>>       it (/dev/xen/dom-%d-mem?)
>>>>    3. Pass this new device to the non-Xen-aware userspace, which uses
>>>>       the standard mmap() call to make the kernel pages visible to
>>>>       userspace.
>>>> Does that make sense?
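
To illustrate the intent (purely hypothetical - no such node exists
today), the non-Xen-aware side would then need nothing beyond plain
POSIX calls:

/* Hypothetical consumer of the proposed /dev/xen/dom-<N>-mem node: a
 * non-Xen-aware backend maps guest RAM with nothing but open()/mmap(),
 * treating the GPA offset as the file offset. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_guest_ram(int domid, uint64_t gpa_offset, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/xen/dom-%d-mem", domid); /* hypothetical */

    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)gpa_offset);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}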
>>>
>>> Maybe from your point of view, but not from the Xen architectural point
>>> of view IMHO. You are basically removing the main security advantages of
>>> Xen by creating a kernel interface for easily mapping arbitrary guest
>>> memory.
>> We are not talking about doing an end-run around the Xen architecture.
>> The guest still has to instruct the hypervisor to grant access to its
>> memory. Currently this is a global thing (i.e. whole address space or
>> nothing), but obviously more fine-grained grants can be done on a
>> transaction-by-transaction basis, although we are exploring more
>> efficient mechanisms for this (shared pools and carve-outs).
>
> Happy to hear that.
>
>> This does raise questions for the mmap() interface though: either each
>> individually granted region would need to be mapped into the dom0
>> userspace virtual address space, or perhaps a new flag for mmap() so we
>> can map the whole address space but expect SIGBUS faults if we access
>> something that hasn't been granted.
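
(The SIGBUS variant would push the usual sigsetjmp()/siglongjmp recovery
pattern onto the backend; a sketch only - the mmap() flag itself is
hypothetical, only the signal handling below is standard POSIX:)

#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

static sigjmp_buf access_env;

static void sigbus_handler(int sig)
{
    (void)sig;
    siglongjmp(access_env, 1);      /* unwind out of the faulting access */
}

/* Returns false if the page at addr was never granted by the guest and
 * the (hypothetical) whole-address-space mapping raised SIGBUS. */
static bool try_read_byte(const volatile char *addr, char *out)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigbus_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(access_env, 1))
        return false;               /* faulted: page not granted */

    *out = *addr;
    return true;
}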
>
> Do I understand that correctly? You want the guest to grant a memory
> region to the backend, and the backend should be able to map this region
> not using grants, but using guest physical addresses?

Yes - although it doesn't have to be the whole GPA range. The vhost-user
protocol communicates the offsets into the GPA space at which the various
memory regions exist.
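
(For reference, each region in the vhost-user memory table is described
along these lines - field names roughly as in the existing vhost-user
implementations, sketched here in C:)

#include <stdint.h>

/* One entry of the vhost-user memory table: the front-end tells the
 * backend where each region sits in guest physical address space, so
 * only the shared regions need to be mapped, not the whole GPA range. */
struct vhost_user_memory_region {
    uint64_t guest_phys_addr;  /* offset into guest physical address space */
    uint64_t memory_size;      /* length of the region in bytes */
    uint64_t userspace_addr;   /* where the front-end VMM mapped it */
    uint64_t mmap_offset;      /* offset into the fd sent with the message */
};

/* Translating a guest physical address into a backend pointer: */
static inline void *gpa_to_ptr(const struct vhost_user_memory_region *r,
                               uint8_t *mmap_base, uint64_t gpa)
{
    if (gpa < r->guest_phys_addr ||
        gpa >= r->guest_phys_addr + r->memory_size)
        return NULL;                /* not covered by this region */
    return mmap_base + r->mmap_offset + (gpa - r->guest_phys_addr);
}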

>
>
> Juergen
>


-- 
Alex Bennée



 

