
Re: [Xen-devel] [PATCH 1/5] docs/qemu-deprivilege: Revise and update with status and future plans


  • To: <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: George Dunlap <george.dunlap@xxxxxxxxxx>
  • Date: Fri, 26 Oct 2018 13:45:05 +0100
  • Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>, Konrad Wilk <konrad.wilk@xxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>, Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>, Julien Grall <julien.grall@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Anthony Perard <anthony.perard@xxxxxxxxxx>, Ian Jackson <ian.jackson@xxxxxxxxxx>
  • Delivery-date: Fri, 26 Oct 2018 12:45:28 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Openpgp: preference=signencrypt

Ping?  It's been nearly 3 weeks with only minor review for this series.

 -George

On 10/05/2018 05:56 PM, George Dunlap wrote:
> docs/qemu-deprivilege.txt had some basic instructions for using
> dm_restrict, but it was incomplete, misleading, and stale.
> 
> Update the docs in a number of ways.
> 
> First, separate user-facing documentation and technical description
> into docs/features and docs/design, respectively.
> 
> In the feature doc:
> 
> * Introduce a section mentioning minimum versions of Linux, Xen, and
> qemu required (TBD)
> 
> * Fix the discussion of qemu userid.  Mention xen-qemuuser-range-base,
> and provide example shell code that actually has some hope of working
> (instead of failing out after creating 900 userids).
> 
> * Describe how to enable restrictions, as well as features which
> probably don't or definitely don't work.
> 
> In the design doc, introduce a "Technical Details" section which
> describes specifically what restrictions are currently done, and also
> what restrictions we are looking at doing in the future.
> 
> The idea here is that as we implement the various items for the
> future, we move them from "Restrictions still to do" to "Restrictions
> done".  This can also act as a design document -- a place for public
> discussion of what can or should be done and how.
> 
> Also add an entry to SUPPORT.md.
> 
> Signed-off-by: George Dunlap <george.dunlap@xxxxxxxxxx>
> ---
> Changes since v2:
> - Extraneous privcmd / evtchn instances aren't closed
> - Expand description of how to test fd deprivileging
> - Rework and clarify two namespace sections, give reference for QEMU NAK
> - Add more information about migration technical challenges
> - In UID section, mention possibility of container ID collisions.
> - Fix name of design document.
> - Add SUPPORT.md statement.  Specify Linux, to make sure that FreeBSD is
>   evaluated separately.
> - Mention that `-sandbox` is a blacklist and why
> 
> Changes since v1:
> - Break into two, and move into appropriate directories (rather than 'misc')
> - Updated version requirements
> - Distinguish between features which "don't yet work" and features which we
>   never expect to work
> - Update description of xen-restrict functionality
> - Reorder and expand further restrictions
> - Make it more clear which restrictions are available on Linux only
> - Include detailed description of how to kill a process
> - Add RLIMIT_NPROC as something we can do without further changes to qemu
> - Document the need to check for the sandbox feature before using it
> 
> Thank you to Ross Lagerwall, whose description of what XenServer is
> doing formed much of the basis for the text here.
> 
> CC: Ian Jackson <ian.jackson@xxxxxxxxxx>
> CC: Wei Liu <wei.liu2@xxxxxxxxxx>
> CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> CC: Jan Beulich <jbeulich@xxxxxxxx>
> CC: Tim Deegan <tim@xxxxxxx>
> CC: Konrad Wilk <konrad.wilk@xxxxxxxxxx>
> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> CC: Julien Grall <julien.grall@xxxxxxx>
> CC: Anthony Perard <anthony.perard@xxxxxxxxxx>
> CC: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
> ---
>  SUPPORT.md                            |  20 ++
>  docs/designs/qemu-deprivilege.md      | 322 ++++++++++++++++++++++++++
>  docs/features/qemu-deprivilege.pandoc |  94 ++++++++
>  docs/misc/qemu-deprivilege.txt        |  36 ---
>  4 files changed, 436 insertions(+), 36 deletions(-)
>  create mode 100644 docs/designs/qemu-deprivilege.md
>  create mode 100644 docs/features/qemu-deprivilege.pandoc
>  delete mode 100644 docs/misc/qemu-deprivilege.txt
> 
> diff --git a/SUPPORT.md b/SUPPORT.md
> index 3727446b83..b5e7e44fb3 100644
> --- a/SUPPORT.md
> +++ b/SUPPORT.md
> @@ -525,6 +525,26 @@ Vulnerabilities of a device model stub domain
>  to a hostile driver domain (either compromised or untrusted)
>  are excluded from security support.
>  
> +### Device Model Deprivileging
> +
> +    Status, Linux: Tech Preview, with limited support
> +
> +This means adding extra restrictions to a device model running in
> +domain 0 in order to prevent a compromised device model from attacking
> +the rest of the system.
> +
> +"Tech preview with limited support" means we will not issue XSAs for
> +the _additional_ functionality provided by the feature; but we will
> +issue XSAs in the event that enabling this feature opens up a security
> +hole that would not be present with the feature disabled.
> +
> +For example, while this is classified as tech preview, a bug in libxl
> +which failed to change the user ID of QEMU would not receive an XSA,
> +since without this feature the user ID wouldn't be changed. But a
> +change which made it possible for a compromised guest to read
> +arbitrary files on the host filesystem without compromising QEMU would
> +be issued an XSA, since that does weaken security.
> +
>  ### KCONFIG Expert
>  
>      Status: Experimental
> diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-deprivilege.md
> new file mode 100644
> index 0000000000..d3c6495030
> --- /dev/null
> +++ b/docs/designs/qemu-deprivilege.md
> @@ -0,0 +1,322 @@
> +# Introduction
> +
> +The goal of deprivileging qemu is this: Even if there is a bug (for
> +example in qemu) which permits a domain to gain control of the device
> +model, the compromised device model process is prevented from
> +violating the system's overall security properties.  Ie, a guest
> +cannot "escape" from the virtualisation by using a qemu bug.
> +
> +This document lists the various technical measures which we either
> +have taken, or plan to take to effect this goal.  Some of them are
> +required to be considered secure (that is, there are known attack
> +vectors which they close); others are "just in case" (that is, there
> +are no known attack vectors, but we perform the restrictions to reduce
> +the possibility of unknown attack vectors).
> +
> +# Restrictions done
> +
> +The following restrictions are currently implemented.
> +
> +## Having qemu switch user
> +
> +'''Description''': As mentioned above, having QEMU switch to a
> +non-root user, one per domain id.  Not being the root user limits what
> +a compromised QEMU process can do to the system, and having one user
> +per domain id limits what a compromised QEMU process can do to the
> +QEMU processes of other VMs.
> +
> +'''Implementation''': The toolstack adds the following to the qemu command-line:
> +
> +    -runas <uid>:<gid>
> +
> +'''How to test''':
> +
> +    grep '[UG]id' /proc/<qpid>/status
> +
> +'''Testing Status''': Not tested
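
The check above could also be scripted rather than eyeballed.  A minimal
sketch in C, assuming the usual `/proc/<pid>/status` layout in which the
`Uid:` and `Gid:` lines each carry four IDs (real, effective, saved,
filesystem).  `status_ids` and `status_ids_all_equal` are hypothetical
helper names for illustration, not anything in the tree:

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan a /proc/<pid>/status-style stream for the line starting with `key`
 * ("Uid:" or "Gid:") and store its four IDs (real, effective, saved,
 * filesystem) in ids[].  Returns 0 on success, -1 if the line is missing
 * or malformed.
 */
int status_ids(FILE *f, const char *key, unsigned long ids[4])
{
    char line[256];
    size_t klen = strlen(key);

    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, klen) != 0)
            continue;
        if (sscanf(line + klen, "%lu %lu %lu %lu",
                   &ids[0], &ids[1], &ids[2], &ids[3]) == 4)
            return 0;
        return -1;
    }
    return -1;
}

/* Returns 1 if all four IDs on the `key` line equal `expected`, else 0. */
int status_ids_all_equal(FILE *f, const char *key, unsigned long expected)
{
    unsigned long ids[4];
    int i;

    if (status_ids(f, key, ids) != 0)
        return 0;
    for (i = 0; i < 4; i++)
        if (ids[i] != expected)
            return 0;
    return 1;
}
```

A caller would fopen() `/proc/<qpid>/status` and pass the uid it asked
`-runas` to use; a QEMU that failed to drop any of the four IDs would make
`status_ids_all_equal` return 0.
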
> +
> +## Xen library / file-descriptor restrictions
> +
> +'''Description''': Close and restrict Xen-related file descriptors.
> +Specifically:
> + * Close all xenstore-related file descriptors
> + * Make sure that all open instances of `privcmd` and `evtchn` file
> +descriptors have had `IOCTL_PRIVCMD_RESTRICT` and
> +`IOCTL_EVTCHN_RESTRICT_DOMID` ioctls called on them, respectively.
> +
> +FIXME: Double-check the correctness of the above
> +
> +'''Implementation''': Toolstack adds the following to the qemu command-line:
> +
> +    -xen-domid-restrict
> +
> +'''How to test''':
> +
> +Use `fishdescriptor` to pull a file descriptor from a running QEMU,
> +then use `depriv-fd-checker` to check that it has the desired
> +properties, and that hypercalls which are meant to fail do fail.  (In
> +Debian `fishdescriptor` can be found in the binary package
> +`chiark-scripts`; the `depriv-fd-checker` is included in the Xen
> +source tree.)
> +
> +'''Testing status''': Tested
> +
> +# Restrictions / improvements still to do
> +
> +This lists potential restrictions still to do.  It is meant to be
> +listed in order of ease of implementation, with low-hanging fruit
> +first.
> +
> +## Chroot
> +
> +'''Description''': Qemu runs in its own chroot, such that even if it
> +could call an 'open' command of some sort, there would be nothing for
> +it to see.
> +
> +'''Implementation''': The toolstack creates a directory in the libxl "run-dir"; e.g.
> +`/var/run/xen/qemu-root-<domid>`
> +
> +Then adds the following to the qemu command-line:
> +
> +    -chroot /var/run/xen/qemu-root-<domid>
> +     
> +'''How to test''':  Check `/proc/<qpid>/root`
> +     
> +'''Tested''': Not tested
> +
> +## Namespaces for unused functionality (Linux only)
> +
> +'''Description''': QEMU doesn't use the functionality associated with
> +mount and IPC namespaces. (IPC namespaces control non-file-based IPC
> +mechanisms within the kernel; unix and network sockets are not
> +affected by this.)  Making separate namespaces for these for QEMU
> +won't affect normal operation, but it does mean that even if other
> +restrictions fail, the process won't be able to even name system mount
> +points or existing non-file-based IPC descriptors to attempt to attack
> +them.
> +
> +'''Implementation''':
> +
> +In theory this could be done in QEMU (similar to -sandbox, -runas,
> +-chroot, and so on), but a patch doing this in QEMU was NAKed upstream
> +(see [qemu-namespaces]). They preferred that this was done as a setup step by
> +whatever executes QEMU; i.e., have the process which exec's QEMU first
> +call:
> +
> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
> +     
> +'''How to test''':  Check `/proc/<qpid>/ns/[ipc,mnt]`
> +
> +'''Tested''': Not tested
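
One way the "how to test" step could be automated: on Linux, two
`/proc/<pid>/ns/<type>` entries refer to the same namespace exactly when
their device and inode numbers match.  The sketch below compares two
processes that way.  `same_namespace` is a hypothetical helper name, and
examining another user's process this way generally needs root:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Compare the namespace of type `ns` ("ipc", "mnt", ...) of two processes.
 * Returns 1 if both are in the same namespace, 0 if they differ, and -1 if
 * either /proc entry cannot be examined.
 */
int same_namespace(pid_t a, pid_t b, const char *ns)
{
    char pa[64], pb[64];
    struct stat sa, sb;

    snprintf(pa, sizeof(pa), "/proc/%ld/ns/%s", (long)a, ns);
    snprintf(pb, sizeof(pb), "/proc/%ld/ns/%s", (long)b, ns);

    /* stat() follows the magic symlink to the namespace object itself. */
    if (stat(pa, &sa) != 0 || stat(pb, &sb) != 0)
        return -1;

    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

For a deprivileged QEMU one would expect
`same_namespace(qemu_pid, getpid(), "ipc")` and the same call with `"mnt"`
to return 0 when run from the toolstack's own namespaces.
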
> +
> +[qemu-namespaces]: https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg04723.html
> +
> +### Basic RLIMITs
> +
> +'''Description''': A number of limits on the resources that a given
> +process / userid is allowed to consume.  These can limit the ability
> +of a compromised QEMU process to DoS domain 0 by exhausting various
> +resources available to it.
> +
> +'''Implementation'''
> +
> +Limits that can be implemented immediately without much effort:
> + - RLIMIT_FSIZE (file size) to 256KiB.
> + - RLIMIT_NPROC (after uid changes to a unique uid)
> +
> +Probably not necessary but why not:
> + - RLIMIT_CORE: 0
> + - RLIMIT_MSGQUEUE: 0
> + - RLIMIT_LOCKS: 0
> + - RLIMIT_MEMLOCK: 0
> + 
> +Note: mlock() is used by QEMU only when both "realtime" and "mlock"
> +are specified; this does not apply to QEMU running as a Xen DM.
> +   
> +'''How to test''': Check `/proc/<qpid>/limits`
> +
> +'''Tested''': Not tested
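
As a rough illustration of how small this is, here is a sketch that
applies the limits listed above to the calling process, the idea being
that it would run in the child just before exec'ing QEMU.  The values come
from the list above; `apply_basic_rlimits` is a hypothetical name, and
RLIMIT_NPROC is deliberately left out because it only makes sense once the
uid has been changed.  The Linux-specific resources are guarded so the
sketch still builds elsewhere:

```c
#include <stddef.h>
#include <sys/resource.h>

struct limit {
    int resource;
    rlim_t value;
};

static const struct limit basic_limits[] = {
    { RLIMIT_FSIZE, 256 * 1024 },   /* 256KiB */
    { RLIMIT_CORE, 0 },
#ifdef RLIMIT_MEMLOCK
    { RLIMIT_MEMLOCK, 0 },
#endif
#ifdef RLIMIT_MSGQUEUE
    { RLIMIT_MSGQUEUE, 0 },
#endif
#ifdef RLIMIT_LOCKS
    { RLIMIT_LOCKS, 0 },
#endif
};

/*
 * Lower both the soft and hard limit of each resource to the value in the
 * table.  Limits that are already lower are left alone, so this never
 * needs privilege.  Returns 0 on success, -1 on the first failure.
 */
int apply_basic_rlimits(void)
{
    size_t i;

    for (i = 0; i < sizeof(basic_limits) / sizeof(basic_limits[0]); i++) {
        struct rlimit rl;

        if (getrlimit(basic_limits[i].resource, &rl) != 0)
            return -1;
        if (rl.rlim_max > basic_limits[i].value)
            rl.rlim_max = basic_limits[i].value;
        if (rl.rlim_cur > basic_limits[i].value)
            rl.rlim_cur = basic_limits[i].value;
        if (setrlimit(basic_limits[i].resource, &rl) != 0)
            return -1;
    }
    return 0;
}
```

Because the hard limit is lowered as well, the process (and anything it
exec's) cannot raise the limits again without privilege.
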
> +
> +### Further RLIMITs
> +
> +RLIMIT_AS limits the total amount of memory; but this includes the
> +virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
> +fiddles with this; it would be straightforward to make it *set* the
> +rlimit to what it thinks a sensible limit is.
> +
> +Other things that would take some cleverness / changes to QEMU to
> +utilize due to ordering constraints:
> + - RLIMIT_NOFILE (after all necessary files are opened)
> +
> +### libxl UID cleanup
> +
> +'''Description''': Domain IDs are reused, and thus restricted UIDs are
> +reused.  If a compromised QEMU can fork (due to seccomp or
> +RLIMIT_NPROC limits being ineffective for some reason), it may avoid
> +being killed when its domain dies, then wait until the domain ID is
> +reused again, at which point it will have control over the domain in
> +question (which probably belongs to someone else).
> +
> +libxl should kill all UIDs associated with a domain both when the VM
> +is destroyed, and before starting a VM with the same UID.
> +
> +'''Implementation''': This is unnecessarily tricky.
> +
> +The kill() system call can have three kinds of targets:
> + - A single pid
> + - A process group
> + - "Every process except me to which I am allowed to send a signal" (-1)
> +
> +Targeting a single pid is racy and likely to be beaten by the
> +following loop:
> +
> +    while(1) {
> +        if(fork())
> +         _exit(0);
> +    }          
> +
> +That is, by the time you've read the process list and found the
> +process id you want to kill, that process has exited and there is a
> +new process whose pid you don't know about.
> +
> +Targeting a process group will be ineffective, as unprivileged
> +processes are allowed to make their own process groups.
> +
> +kill(-1) can be used but must be done with care.  Consider the
> +following code, for example:
> +
> +    setuid(target_uid);
> +    kill(-1, 9);
> +
> +This looks like it will do the trick; but by setting all of the user
> +ids (effective, real, and saved), it opens the 'killing' process up to
> +being killed by the target process:
> +
> +    while(1) {
> +        if(fork())
> +            _exit(0);
> +        else
> +            kill(-1, 9);
> +    }
> +
> +Fortunately there is an asymmetry we can take advantage of.  From the
> +POSIX spec:
> +
> +> For a process to have permission to send a signal to a process
> +> designated by pid, unless the sending process has appropriate
> +> privileges, the real or effective user ID of the sending process shall
> +> match the real or saved set-user-ID of the receiving process.
> +
> +The solution is to allocate a second "reaper" uid that is only used to kill
> +target processes.  We set the euid of the killing process to the `target_uid`,
> +but the ruid of the killing process to `reaper_uid`, leaving the suid of the
> +killing process as 0:
> +
> +    setresuid(reaper_uid, target_uid, 0);
> +    kill(-1, 9);
> +
> +NOTE: We cannot use `setreuid(reaper_uid, target_uid)` here, as that
> +will set *both* euid *and* suid to `target_uid`, making the killing
> +process vulnerable to the target process again.
> +
> +Since this will kill all other `reaper_uid` processes as well, we must
> +either allocate a separate `reaper_uid` per domain, or use locking to
> +ensure that only one killing process is active at a time.
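
To make the asymmetry concrete, here is a small model of the POSIX rule
quoted above.  It is a model only: it never calls kill() or changes any
IDs.  `struct creds` and `may_signal` are names made up for illustration,
and treating euid 0 as "appropriate privileges" is a simplifying
assumption:

```c
#include <stdbool.h>
#include <sys/types.h>

/* The three user IDs of a process that matter for signal permission. */
struct creds {
    uid_t ruid;   /* real */
    uid_t euid;   /* effective */
    uid_t suid;   /* saved set-user-ID */
};

/*
 * POSIX rule: the real or effective UID of the sender must match the real
 * or saved set-user-ID of the receiver, unless the sender is privileged
 * (modelled here as euid 0).
 */
bool may_signal(const struct creds *sender, const struct creds *receiver)
{
    if (sender->euid == 0)
        return true;

    return sender->ruid == receiver->ruid ||
           sender->ruid == receiver->suid ||
           sender->euid == receiver->ruid ||
           sender->euid == receiver->suid;
}
```

With a compromised QEMU at {T, T, T}: a killer that did `setuid(T)` is
also {T, T, T} and can be signalled back; one that did `setreuid(R, T)`
is {R, T, T} and is still reachable through its saved ID; one that did
`setresuid(R, T, 0)` is {R, T, 0}, which can signal the target but cannot
be signalled by it.  Two killers sharing the same R can signal each other
through their real IDs, which is why a per-domain `reaper_uid` or locking
is needed.
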
> +
> +## libxl: Treat QMP connection as untrusted
> +
> +'''Description''': Currently libxl talks with QEMU via QMP; but its
> +interactions have not historically been considered from a security point of
> +view.  For example, qmp_synchronous_send() waits for a response from
> +QEMU, which a compromised QEMU could simply not send (thus preventing
> +the toolstack from making forward progress).
> +
> +'''Implementation''': Audit toolstack interactions with QEMU which
> +happen after the guest has started running, and assume QEMU has been
> +compromised.
> +
> +### seccomp filtering (Linux only)
> +
> +'''Description''': Turn on seccomp filtering to disable syscalls which
> +QEMU doesn't need. 
> +
> +'''Implementation''': Enable from the command-line:
> +
> +    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
> +
> +`elevateprivileges` is currently required to allow `-runas` to work.
> +Removing this requirement would mean making sure that the uid change
> +happened before the seccomp2 call, perhaps by changing the uid before
> +executing QEMU.  (But this would then require other changes to create
> +the QMP socket, VNC socket, and so on).
> +
> +It should be noted that `-sandbox` is implemented as a blacklist, not
> +a whitelist; that is, it disables known-unused functionality which may
> +be harmful, rather than disabling all functionality except that known
> +to be safe and needed.  This is unfortunately necessary since qemu
> +doesn't know what system calls libraries might end up making.  (See
> +[lwn-seccomp] for a more complete discussion.)
> +
> +This feature is not on by default and may not be available in all
> +environments.  We therefore need to either:
> + 1. Require that this feature be enabled to build qemu
> + 2. Check for `-sandbox` support at runtime before using it
> +
> +[lwn-seccomp]: https://lwn.net/Articles/738694/
> +
> +### Disks
> +
> +The chroot (and seccomp?) happens late enough such that QEMU can
> +initialize itself and open its disks. If you want to add a disk at run
> +time or insert a CD, you can't pass a path because QEMU is
> +chrooted. Instead use the add-fd QMP command and use
> +/dev/fdset/<fdset-id> as the path.
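
For reference, add-fd relies on the file descriptor being passed over the
QMP unix socket as SCM_RIGHTS ancillary data.  The sketch below shows only
that underlying mechanism, not the QMP framing around it; `send_fd` and
`recv_fd` are hypothetical helper names, and the one-byte payload is just
a convention of this sketch so that there is some data to carry the
ancillary message:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send `fd` over the unix socket `sock`.  Returns 0 on success, else -1. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;            /* forces correct alignment */
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd().  Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number referring to the same open file,
so the toolstack can open the disk image outside the chroot and the
chrooted QEMU never needs a usable path.
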
> +
> +A further layer of restriction could be to set RLIMIT_NOFILE to '0',
> +and hand all disks over QMP.
> +
> +## Migration
> +
> +When calling xen-save-devices-state, since QEMU is running in a chroot
> +it is not useful to pass a filename (it doesn't even have write access
> +inside the chroot). Instead, give it an open fd using the add-fd
> +mechanism.
> +
> +Additionally, all the restrictions need to be applied to the qemu
> +started up on the post-migration side.  One issue that needs to be
> +solved is how to signal the toolstack on restore that qemu is ready
> +for the domain to be started (since this is normally done via
> +xenstore, and at this point the xenstore connections will have been
> +closed).
> +
> +### Network namespacing (Linux only)
> +
> +Enter QEMU into its own network namespace (in addition to mount & IPC
> +namespaces):
> +
> +    unshare(CLONE_NEWNET);
> +
> +QEMU does actually use the network namespace as a Xen DM for two
> +purposes: 1) To set up network tap devices 2) To open vnc connections.
> +
> +#### Network
> +
> +If QEMU runs in its own network namespace, it can't open the tap
> +device itself because the interface won't be visible outside of its
> +own namespace. So instead, have the toolstack open the device and pass
> +it as an fd on the command-line:
> +
> +    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
> +
> +#### VNC
> +
> +If QEMU runs in its own network namespace, it is not straightforward
> +to listen on a TCP socket outside of its own network namespace. One
> +option would be to use VNC over a UNIX socket:
> +
> +    -vnc unix:/var/run/xen/vnc-<domid>
> +
> +However, this would break functionality in the general case; I think
> +we need to have the toolstack open a socket and pass the fd to QEMU
> +(which requires changes to QEMU).
> +
> diff --git a/docs/features/qemu-deprivilege.pandoc b/docs/features/qemu-deprivilege.pandoc
> new file mode 100644
> index 0000000000..6fb48f3e40
> --- /dev/null
> +++ b/docs/features/qemu-deprivilege.pandoc
> @@ -0,0 +1,94 @@
> +% QEMU Deprivileging / dm_restrict
> +% Revision 1
> +
> +\clearpage
> +
> +# Basics
> +
> +---------------- ----------------------------------------------------
> +         Status: **Tech Preview**
> +
> +Architecture(s): x86
> +
> +   Component(s): toolstack
> +
> +---------------- ----------------------------------------------------
> +
> +# Overview
> +
> +By default, the QEMU device model is run in domain 0.  If an attacker
> +can gain control of a QEMU process, it could easily take control of a
> +system.
> +
> +dm_restrict is a set of operations to restrict QEMU running in domain
> +0.  It consists of two halves:
> +
> + 1. Mechanisms to restrict QEMU to only being able to affect its own
> +domain
> + 2. Mechanisms to restrict QEMU's ability to interact with domain 0.
> +
> +# User details
> +
> +## Getting the right versions of software
> +
> +Linux: 4.11+
> +
> +Qemu: 3.0+ (Or the version that comes with Xen 4.12+)
> +
> +## Setting up a userid range
> +
> +For maximum security, libxl needs to run the devicemodel for each
> +domain under a user id (UID) corresponding to its domain id.  There
> +are 32752 possible domain IDs, and so libxl needs 32752 user ids set
> +aside for it.
> +
> +The simplest and most effective way to do this is to allocate a
> +contiguous block of UIDs, and create a single user named
> +`xen-qemuuser-range-base` with the first UID.  For example, under Debian:
> +
> +    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
> +
> +NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
> +to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
> +UIDs however; UID 65535 is reserved for an invalid value, and 65534 is
> +normally allocated to "nobody".  Additionally, some container systems
> +have proposed using the upper 32 bits of the uid for a container ID.
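
A worked example of the arithmetic may help here.  The sketch below
assumes that the UID for a domain is simply the UID of
`xen-qemuuser-range-base` plus the domain ID; that mapping is an
assumption of this sketch rather than something the text above spells out.
`domid_to_uid` and `DOMID_LIMIT` are hypothetical names:

```c
#include <stdint.h>

/* Number of possible domain IDs, per the text above. */
#define DOMID_LIMIT 32752u

/*
 * Assumed mapping: uid = base_uid + domid.
 * Returns 0 and stores the result in *uid, or -1 if domid is out of range
 * or the sum would not fit in a 32-bit UID.
 */
int domid_to_uid(uint32_t base_uid, uint32_t domid, uint32_t *uid)
{
    if (domid >= DOMID_LIMIT)
        return -1;
    if (base_uid > UINT32_MAX - domid)
        return -1;

    *uid = base_uid + domid;
    return 0;
}
```

With the base of 65536 from the adduser example, domain 5 would map to UID
65541 and the highest domain ID (32751) to UID 98287, so under this
assumption the block 65536 to 98287 must be kept free of other users.
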
> +
> +Another, less-secure way is to run all QEMUs as the same UID.  To do
> +this, create a user named `xen-qemuuser-shared`; for example:
> +
> +    adduser --no-create-home --system xen-qemuuser-shared
> +
> +## Domain config changes
> +
> +The core domain config change is to add the following line to the
> +domain configuration:
> +
> +    dm_restrict=1
> +
> +This will perform a number of restrictions, outlined below in the
> +'Technical details' section.
> +
> +# Technical details
> +
> +See docs/designs/qemu-deprivilege.md for technical details.
> +
> +# Limitations
> +
> +The following features still need to be implemented:
> + * Inserting a new cdrom while the guest is running (xl cdrom-insert)
> + * Migration / save / restore
> +
> +Additionally, getting PCI passthrough to work securely would require a
> +significant rework of how passthrough works at the moment.  It may be
> +implemented at some point but is not a near-term priority.
> +
> +See SUPPORT.md for security support status.
> +
> +# History
> +
> +------------------------------------------------------------------------
> +Date       Revision Version  Notes
> +---------- -------- -------- -------------------------------------------
> +2018-09-14 1        Xen 4.12 Imported from docs/misc
> +---------- -------- -------- -------------------------------------------
> diff --git a/docs/misc/qemu-deprivilege.txt b/docs/misc/qemu-deprivilege.txt
> deleted file mode 100644
> index 58b86a3908..0000000000
> --- a/docs/misc/qemu-deprivilege.txt
> +++ /dev/null
> @@ -1,36 +0,0 @@
> -For security reasons, libxl tries to pass a non-root username to QEMU as
> -argument. During initialization QEMU calls setuid and setgid with the
> -user ID and the group ID of the user passed as argument.
> -Libxl looks for the following users in this order:
> -
> -1) a user named "xen-qemuuser-domid$domid",
> -Where $domid is the domid of the domain being created.
> -This requires the reservation of 65535 uids from xen-qemuuser-domid1
> -to xen-qemuuser-domid65535. To use this mechanism, you might want to
> -create a large number of users at installation time. For example:
> -
> -for ((i=1; i<65536; i++))
> -do
> -    adduser --no-create-home --system xen-qemuuser-domid$i
> -done
> -
> -You might want to consider passing --group to adduser to create a new
> -group for each new user.
> -
> -
> -2) a user named "xen-qemuuser-shared"
> -As a fall back if both 1) fails, libxl will use a single user for
> -all QEMU instances. The user is named xen-qemuuser-shared. This is
> -less secure but still better than running QEMU as root. Using this is as
> -simple as creating just one more user on your host:
> -
> -adduser --no-create-home --system xen-qemuuser-shared
> -
> -
> -3) root
> -As a last resort, libxl will start QEMU as root.
> -
> -
> -Please note that running QEMU as non-root causes several features like
> -migration and PCI passthrough to not work properly and may prevent the guest
> -from booting.
> 

