
Re: Understanding osdep_xenforeignmemory_map mmap behaviour



Juergen Gross <jgross@xxxxxxxx> writes:

> On 24.08.22 17:58, Alex Bennée wrote:
>> Juergen Gross <jgross@xxxxxxxx> writes:
>> 
>>> On 24.08.22 13:22, Alex Bennée wrote:
>>>> Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx> writes:
>>>>
>>>>> On 24/08/2022 10:19, Viresh Kumar wrote:
>>>>>> On 24-03-22, 06:12, Juergen Gross wrote:
>>>>>>> For a rather long time we were using "normal" user pages for this
>>>>>>> purpose, which were just locked into memory for doing the hypercall.
>>>>>>>
>>>>>>> Unfortunately there have been very rare problems with that approach, as
>>>>>>> the Linux kernel can set a user page related PTE to invalid for short
>>>>>>> periods of time, which led to EFAULT in the hypervisor when trying to
>>>>>>> access the hypercall data.
>>>>>>>
>>>>>>> In Linux this can be avoided only by using kernel memory, which is the
>>>>>>> reason why the hypercall buffers are allocated and mmap()-ed through the
>>>>>>> privcmd driver.
>>>>>> Hi Juergen,
>>>>>>
>>>>>> I understand why we moved from user pages to kernel pages, but I don't
>>>>>> fully understand why we need to make two separate calls to map the
>>>>>> guest memory, i.e. mmap() followed by ioctl(IOCTL_PRIVCMD_MMAPBATCH).
>>>>>>
>>>>>> Why aren't we doing all of it from mmap() itself? I hacked it up to
>>>>>> check on it and it works fine if we do it all from mmap() itself.
>>>> As I understand it, the MMAPBATCH ioctl is being treated like every
>>>> other hypercall proxied through the ioctl interface. That makes sense
>>>> from the point of view of having a consistent interface to the
>>>> hypervisor, but not from the point of view of providing a consistent
>>>> userspace interface for mapping memory which doesn't care about the
>>>> hypervisor details.
>>>> The privcmd_mmapbatch_v2 interface is slightly richer than what you
>>>> could expose via mmap() because it allows the handling of partial
>>>> mappings with what I presume is a per-page *err array. If you issued the
>>>> hypercall directly from the mmap() and one of the pages wasn't mapped by
>>>> the hypervisor you would have to unwind everything before returning
>>>> EFAULT to the user.
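
(For reference, the existing two-step flow and the per-page *err array
look roughly like this; a sketch only, with the privcmd UAPI header path
and error handling hand-waved:)

/* Two-step privcmd flow: reserve VA space with mmap() on the privcmd
 * device, then ask the hypervisor to populate it via
 * IOCTL_PRIVCMD_MMAPBATCH_V2.  Per-page failures come back through the
 * err array instead of forcing the whole mapping to be unwound. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <xen/sys/privcmd.h>   /* privcmd_mmapbatch_v2; path varies by distro */

static void *map_foreign_range(int privcmd_fd, domid_t dom,
                               const xen_pfn_t *gfns, unsigned int nr,
                               int *err /* nr entries, per-page status */)
{
    size_t len = (size_t)nr * sysconf(_SC_PAGESIZE);

    /* Step 1: carve out a VA range backed by the privcmd device. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, privcmd_fd, 0);
    if (addr == MAP_FAILED)
        return NULL;

    /* Step 2: ask Xen (via privcmd) to map the guest frames there. */
    struct privcmd_mmapbatch_v2 batch = {
        .num  = nr,
        .dom  = dom,
        .addr = (uint64_t)(uintptr_t)addr,
        .arr  = gfns,
        .err  = err,
    };
    if (ioctl(privcmd_fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &batch) < 0) {
        munmap(addr, len);
        return NULL;
    }

    /* err[i] may still hold e.g. -ENOENT for pages the hypervisor refused
     * to map; the rest of the range remains usable. */
    return addr;
}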
>>>>
>>>>>> Aren't we abusing the Linux userspace ABI here? Standard userspace
>>>>>> code would expect just mmap() to be enough to map the memory. Yes,
>>>>>> the current user, Xen itself, is adapted to make two calls, but it
>>>>>> breaks as soon as we want to use something that relies on the Linux
>>>>>> userspace ABI.
>>>>>>
>>>>>> For instance, in our case, where we are looking to create
>>>>>> hypervisor-agnostic virtio backends, the rust-vmm library [1] issues
>>>>>> mmap() only and expects it to work. It doesn't know it is running on a
>>>>>> Xen system, nor should it.
>>>>>
>>>>> Use /dev/xen/hypercall which has a sane ABI for getting "safe" memory.
>>>>> privcmd is very much not sane.
>>>>>
>>>>> In practice you'll need to use both.  /dev/xen/hypercall for getting
>>>>> "safe" memory, and /dev/xen/privcmd for issuing hypercalls for now.
>>>> I'm unsure what is meant by safe memory here. privcmd_buf_mmap() looks
>>>> like it just allocates a bunch of GFP_KERNEL pages rather than
>>>> interacting with the hypervisor directly. Are these the same pages that
>>>> get used when you eventually call privcmd_ioctl_mmap_batch()?
>>>
>>> privcmd_buf_mmap() allocates kernel pages which are used for data being
>>> accessed by the hypervisor when doing the hypercall later. This is a
>>> generic interface used for all hypercalls, not only for
>>> privcmd_ioctl_mmap_batch().
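
(Concretely, getting a "safe" buffer is just an mmap() of
/dev/xen/hypercall; a minimal sketch, error handling omitted:)

/* Allocate hypercall-safe buffer pages by mmap()-ing /dev/xen/hypercall.
 * The privcmd-buf driver backs the mapping with kernel-allocated pages,
 * so the hypervisor never hits a transiently invalid PTE while a
 * hypercall argument buffer is in flight. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *alloc_hypercall_buffer(size_t nr_pages)
{
    size_t len = nr_pages * sysconf(_SC_PAGESIZE);
    int fd = open("/dev/xen/hypercall", O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return NULL;

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping keeps the pages alive */
    return buf == MAP_FAILED ? NULL : buf;
}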
>>>
>>>> The fact that /dev/xen/hypercall is provided by xen_privcmdbuf_dev is a
>>>> little confusing TBH.
>>>> Anyway, the goal here is to provide a non-Xen-aware userspace with a
>>>> standard userspace API to access the guest's memory. Perhaps messing
>>>
>>> This is what the Xen-related libraries are meant for. Your decision to
>>> ignore those is backfiring now.
>> We didn't ignore them - the initial version of the xen-vhost-master
>> binary was built in Rust and linked to the Xen libraries. We are,
>> however, in the process of moving to pure Rust (with the xen-sys
>> crate being a pure Rust ioctl/hypercall wrapper).
>
> Ah, okay, I wasn't aware of this.
>
>> However, I was under the impression there were two classes of
>> hypercalls: ABI-stable ones which won't change (which is all we are
>> planning to implement for xen-sys) and non-stable ones which would need
>> mediating by the Xen libs. We are hoping we can do all of VirtIO with
>> just the stable ABI.
>
> Okay.
>
>> 
>>>> around with the semantics of the /dev/xen/[hypercall|privcmd] device
>>>> nodes is too confusing.
>>>> Maybe we could instead:
>>>>    1. Have the Xen-aware VMM ask to make the guest's memory visible to
>>>>       the host kernel's address space.
>>>
>>> Urgh. This would be a major breach of the Xen security concept.
>>>
>>>>    2. When this is done, explicitly create a device node to represent
>>>>       it (/dev/xen/dom-%d-mem?)
>>>>    3. Pass this new device to the non-Xen-aware userspace, which uses
>>>>       the standard mmap() call to make the kernel pages visible to
>>>>       userspace.
>>>> Does that make sense?
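
To illustrate the intent (purely hypothetical - no such node exists
today), the non-Xen-aware side would then need nothing beyond plain
POSIX calls:

/* Hypothetical consumer of the proposed /dev/xen/dom-<N>-mem node: a
 * non-Xen-aware backend maps guest RAM with nothing but open()/mmap(),
 * treating the GPA offset as the file offset. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_guest_ram(int domid, uint64_t gpa_offset, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/xen/dom-%d-mem", domid); /* hypothetical */

    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)gpa_offset);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}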
>>>
>>> Maybe from your point of view, but not from the Xen architectural point
>>> of view IMHO. You are basically removing the main security advantages of
>>> Xen by creating a kernel interface for easily mapping arbitrary guest
>>> memory.
>> We are not talking about doing an end-run around the Xen architecture.
>> The guest still has to instruct the hypervisor to grant access to its
>> memory. Currently this is a global thing (i.e. whole address space or
>> nothing), but obviously more fine-grained grants can be done on a
>> transaction-by-transaction basis, although we are exploring more
>> efficient mechanisms for this (shared pools and carve-outs).
>
> Happy to hear that.
>
>> This does raise questions for the mmap() interface though: either each
>> individually granted region would need to be mapped into the dom0
>> userspace virtual address space, or perhaps a new flag for mmap() so we
>> can map the whole address space but expect SIGBUS faults if we access
>> something that hasn't been granted.
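
(The SIGBUS variant would push the usual sigsetjmp()/siglongjmp recovery
pattern onto the backend; a sketch only - the mmap() flag itself is
hypothetical, only the signal handling below is standard POSIX:)

#include <setjmp.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

static sigjmp_buf access_env;

static void sigbus_handler(int sig)
{
    (void)sig;
    siglongjmp(access_env, 1);      /* unwind out of the faulting access */
}

/* Returns false if the page at addr was never granted by the guest and
 * the (hypothetical) whole-address-space mapping raised SIGBUS. */
static bool try_read_byte(const volatile char *addr, char *out)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigbus_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(access_env, 1))
        return false;               /* faulted: page not granted */

    *out = *addr;
    return true;
}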
>
> Do I understand that correctly? You want the guest to grant a memory
> region to the backend, and the backend should be able to map this region
> not using grants, but using guest physical addresses?

Yes - although it doesn't have to be the whole GPA range. The vhost-user
protocol communicates the offsets into the GPA space at which the various
memory regions exist.
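
(For reference, each region in the vhost-user memory table is described
along these lines - field names roughly as in the existing vhost-user
implementations, sketched here in C:)

#include <stdint.h>

/* One entry of the vhost-user memory table: the front-end tells the
 * backend where each region sits in guest physical address space, so
 * only the shared regions need to be mapped, not the whole GPA range. */
struct vhost_user_memory_region {
    uint64_t guest_phys_addr;  /* offset into guest physical address space */
    uint64_t memory_size;      /* length of the region in bytes */
    uint64_t userspace_addr;   /* where the front-end VMM mapped it */
    uint64_t mmap_offset;      /* offset into the fd sent with the message */
};

/* Translating a guest physical address into a backend pointer: */
static inline void *gpa_to_ptr(const struct vhost_user_memory_region *r,
                               uint8_t *mmap_base, uint64_t gpa)
{
    if (gpa < r->guest_phys_addr ||
        gpa >= r->guest_phys_addr + r->memory_size)
        return NULL;                /* not covered by this region */
    return mmap_base + r->mmap_offset + (gpa - r->guest_phys_addr);
}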

>
>
> Juergen
>


-- 
Alex Bennée



 

