
Re: [Xen-devel] [PATCH v4 1/6] docs/qemu-deprivilege: Revise and update with status and future plans



> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxxx] On Behalf
> Of George Dunlap
> Sent: 05 November 2018 18:07
> To: xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>; Wei Liu
> <wei.liu2@xxxxxxxxxx>; Konrad Wilk <konrad.wilk@xxxxxxxxxx>; Andrew Cooper
> <Andrew.Cooper3@xxxxxxxxxx>; Tim (Xen.org) <tim@xxxxxxx>; George Dunlap
> <George.Dunlap@xxxxxxxxxx>; Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>;
> Julien Grall <julien.grall@xxxxxxx>; Jan Beulich <jbeulich@xxxxxxxx>;
> Anthony Perard <anthony.perard@xxxxxxxxxx>; Ian Jackson
> <Ian.Jackson@xxxxxxxxxx>
> Subject: [Xen-devel] [PATCH v4 1/6] docs/qemu-deprivilege: Revise and
> update with status and future plans
> 
> docs/qemu-deprivilege.txt had some basic instructions for using
> dm_restrict, but it was incomplete, misleading, and stale.
> 
> Update the docs in a number of ways.
> 
> First, separate user-facing documentation and technical description
> into docs/features and docs/designs, respectively.
> 
> In the feature doc:
> 
> * Introduce a section mentioning minimum versions of Linux, Xen, and
> qemu required (TBD)
> 
> * Fix the discussion of qemu userid.  Mention xen-qemuuser-range-base,
> and provide example shell code that actually has some hope of working
> (instead of failing out after creating 900 userids).
> 
> * Describe how to enable restrictions, as well as features which
> probably don't or definitely don't work.
> 
> In the design doc, introduce a "Technical Details" section which
> describes specifically what restrictions are currently done, and also
> what restrictions we are looking at doing in the future.
> 
> The idea here is that as we implement the various items for the
> future, we move them from "Restrictions still to do" to "Restrictions
> done".  This can also act as a design document -- a place for public
> discussion of what can or should be done and how.
> 
> Also add an entry to SUPPORT.md.
> 
> Signed-off-by: George Dunlap <george.dunlap@xxxxxxxxxx>
> ---
> Changes since v3:
> - Fix typo (32->16)
> - Use an example value not close to the `nobody` uids, but still a
>   multiple of 2^16.
> - Mention that using a multiple of 2^16 may have advantages.
> - Have the example create a group as well
> - Reorganize two comments on the "range-base" method for clarity
> 
> Changes since v2:
> - Extraneous privcmd / evtchn instances aren't closed
> - Expand description of how to test fd deprivileging
> - Rework and clarify two namespace sections, give reference for QEMU NAK
> - Add more information about migration technical challenges
> - In UID section, mention possibility of container ID collisions.
> - Fix name of design document.
> - Add SUPPORT.md statement.  Specify Linux, to make sure that FreeBSD is
>   evaluated separately.
> - Mention that `-sandbox` is a blacklist and why
> 
> Changes since v1:
> - Break into two, and move into appropriate directories (rather than
> 'misc')
> - Updated version requirements
> - Distinguish between features which "don't yet work" and features which
> we never expect to work
> - Update description of xen-restrict functionality
> - Reorder and expand further restrictions
> - Make it more clear which restrictions are available on Linux only
> - Include detailed description of how to kill a process
> - Add RLIMIT_NPROC as something we can do without further changes to qemu
> - Document the need to check for the sandbox feature before using it
> 
> Thank you to Ross Lagerwall, whose description of what XenServer is
> doing formed much of the basis for the text here.
> 
> CC: Ian Jackson <ian.jackson@xxxxxxxxxx>
> CC: Wei Liu <wei.liu2@xxxxxxxxxx>
> CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> CC: Jan Beulich <jbeulich@xxxxxxxx>
> CC: Tim Deegan <tim@xxxxxxx>
> CC: Konrad Wilk <konrad.wilk@xxxxxxxxxx>
> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> CC: Julien Grall <julien.grall@xxxxxxx>
> CC: Anthony Perard <anthony.perard@xxxxxxxxxx>
> CC: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
> ---
>  docs/designs/qemu-deprivilege.md      | 322 ++++++++++++++++++++++++++
>  docs/features/qemu-deprivilege.pandoc | 101 ++++++++
>  docs/misc/qemu-deprivilege.txt        |  36 ---
>  3 files changed, 423 insertions(+), 36 deletions(-)
>  create mode 100644 docs/designs/qemu-deprivilege.md
>  create mode 100644 docs/features/qemu-deprivilege.pandoc
>  delete mode 100644 docs/misc/qemu-deprivilege.txt
> 
> diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-deprivilege.md
> new file mode 100644
> index 0000000000..787ae1ac7c
> --- /dev/null
> +++ b/docs/designs/qemu-deprivilege.md
> @@ -0,0 +1,322 @@
> +# Introduction
> +
> +The goal of deprivileging qemu is this: Even if there is a bug (for
> +example in qemu) which permits a domain to gain control of the device
> +model, the compromised device model process is prevented from
> +violating the system's overall security properties.  I.e., a guest
> +cannot "escape" from the virtualisation by using a qemu bug.
> +
> +This document lists the various technical measures which we either
> +have taken, or plan to take to effect this goal.  Some of them are
> +required to be considered secure (that is, there are known attack
> +vectors which they close); others are "just in case" (that is, there
> +are no known attack vectors, but we perform the restrictions to reduce
> +the possibility of unknown attack vectors).
> +
> +# Restrictions done
> +
> +The following restrictions are currently implemented.
> +
> +## Having qemu switch user
> +
> +'''Description''': As mentioned above, having QEMU switch to a
> +non-root user, one per domain id.  Not being the root user limits what
> +a compromised QEMU process can do to the system, and having one user
> +per domain id limits what a compromised QEMU process can do to the
> +QEMU processes of other VMs.
> +
> +'''Implementation''': The toolstack adds the following to the qemu command-line:
> +
> +    -runas <uid>:<gid>
> +
> +'''How to test''':
> +
> +    grep '[UG]id' /proc/<qpid>/status
> +
> +'''Testing Status''': Not tested
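
The same check can be done programmatically. Below is a minimal sketch in C,
assuming a Linux /proc; the helper names `read_status_ids` and
`ids_all_equal` are illustrative and not part of Xen. After `-runas`, all
four ids (real, effective, saved, filesystem) of the QEMU process would be
expected to equal the per-domain uid or gid.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Read the four ids (real, effective, saved, filesystem) from the "Uid:" or
 * "Gid:" line of /proc/<pid>/status.  `field` must be "Uid:" or "Gid:".
 * Returns 0 on success, -1 on failure.
 */
int read_status_ids(pid_t pid, const char *field, unsigned long ids[4])
{
    char path[64], line[256];
    FILE *f;
    int ret = -1;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, field, strlen(field)) != 0)
            continue;
        if (sscanf(line + strlen(field), "%lu %lu %lu %lu",
                   &ids[0], &ids[1], &ids[2], &ids[3]) == 4)
            ret = 0;
        break;
    }
    fclose(f);
    return ret;
}

/* Returns 1 if all four ids equal `expected`, 0 if not, -1 on error. */
int ids_all_equal(pid_t pid, const char *field, unsigned long expected)
{
    unsigned long ids[4];

    if (read_status_ids(pid, field, ids) < 0)
        return -1;
    return ids[0] == expected && ids[1] == expected &&
           ids[2] == expected && ids[3] == expected;
}
```

A test harness could call `ids_all_equal(qpid, "Uid:", expected_uid)` and
treat anything other than 1 as a failure.
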
> +
> +## Xen library / file-descriptor restrictions
> +
> +'''Description''': Close and restrict Xen-related file descriptors.
> +Specifically:
> + * Close all xenstore-related file descriptors
> + * Make sure that all open instances of `privcmd` and `evtchn` file
> +descriptors have had `IOCTL_PRIVCMD_RESTRICT` and
> +`IOCTL_EVTCHN_RESTRICT_DOMID` ioctls called on them, respectively.
> +
> +FIXME: Double-check the correctness of the above

Presumably this should go away before commit ^

> +
> +'''Implementation''': Toolstack adds the following to the qemu command-line:
> +
> +    -xen-domid-restrict
> +
> +'''How to test''':
> +
> +Use `fishdescriptor` to pull a file descriptor from a running QEMU,
> +then use `depriv-fd-checker` to check that it has the desired
> +properties, and that hypercalls which are meant to fail do fail.  (In
> +Debian `fishdescriptor` can be found in the binary package
> +`chiark-scripts`; the `depriv-fd-checker` is included in the Xen
> +source tree.)
> +
> +'''Testing status''': Tested
> +
> +# Restrictions / improvements still to do
> +
> +This lists potential restrictions still to do.  It is meant to be
> +listed in order of ease of implementation, with low-hanging fruit
> +first.
> +
> +## Chroot
> +
> +'''Description''': Qemu runs in its own chroot, such that even if it
> +could call an 'open' command of some sort, there would be nothing for
> +it to see.
> +
> +'''Implementation''': The toolstack creates a directory in the libxl "run-dir"; e.g.
> +`/var/run/xen/qemu-root-<domid>`
> +
> +Then adds the following to the qemu command-line:
> +
> +    -chroot /var/run/xen/qemu-root-<domid>
> +
> +'''How to test''':  Check `/proc/<qpid>/root`
> +
> +'''Tested''': Not tested
> +
> +## Namespaces for unused functionality (Linux only)
> +
> +'''Description''': QEMU doesn't use the functionality associated with
> +mount and IPC namespaces. (IPC namespaces control non-file-based IPC
> +mechanisms within the kernel; unix and network sockets are not
> +affected by this.)  Making separate namespaces for these for QEMU
> +won't affect normal operation, but it does mean that even if other
> +restrictions fail, the process won't be able to even name system mount
> +points or existing non-file-based IPC descriptors to attempt to attack
> +them.
> +
> +'''Implementation''':
> +
> +In theory this could be done in QEMU (similar to -sandbox, -runas,
> +-chroot, and so on), but a patch doing this in QEMU was NAKed upstream
> +(see [qemu-namespaces]). They preferred that this was done as a setup step by
> +whatever executes QEMU; i.e., have the process which exec's QEMU first
> +call:
> +
> +    unshare(CLONE_NEWNS | CLONE_NEWIPC)
> +
> +'''How to test''':  Check `/proc/<qpid>/ns/[ipc,mnt]`
> +
> +'''Tested''': Not tested
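
One way to automate this check is to compare the namespace identifiers of the
QEMU process with those of the toolstack. The sketch below assumes Linux,
where `/proc/<pid>/ns/<name>` is a symlink whose target names the namespace;
`read_ns_id` and `same_namespace` are hypothetical helpers, and reading
another process's links generally needs sufficient privilege.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the namespace identifier of `pid` for namespace `ns` ("ipc", "mnt",
 * "net", ...), for example "ipc:[4026531839]".
 * Returns 0 on success, -1 on error.
 */
int read_ns_id(pid_t pid, const char *ns, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    if (len == 0)
        return -1;
    snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, ns);
    n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Returns 1 if both pids share namespace `ns`, 0 if not, -1 on error. */
int same_namespace(pid_t a, pid_t b, const char *ns)
{
    char ida[64], idb[64];

    if (read_ns_id(a, ns, ida, sizeof(ida)) < 0 ||
        read_ns_id(b, ns, idb, sizeof(idb)) < 0)
        return -1;
    return strcmp(ida, idb) == 0;
}
```

For a QEMU started after `unshare(CLONE_NEWNS | CLONE_NEWIPC)`, a check such
as `same_namespace(qpid, getpid(), "ipc")` run from the toolstack would be
expected to return 0.
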
> +
> +[qemu-namespaces]: https://lists.gnu.org/archive/html/qemu-devel/2017-10/msg04723.html
> +
> +### Basic RLIMITs
> +
> +'''Description''': A number of limits on the resources that a given
> +process / userid is allowed to consume.  These can limit the ability
> +of a compromised QEMU process to DoS domain 0 by exhausting various
> +resources available to it.
> +
> +'''Implementation'''
> +
> +Limits that can be implemented immediately without much effort:
> + - RLIMIT_FSIZE (file size) to 256KiB.
> + - RLIMIT_NPROC (after uid changes to a unique uid)
> +
> +Probably not necessary but why not:
> + - RLIMIT_CORE: 0
> + - RLIMIT_MSGQUEUE: 0
> + - RLIMIT_LOCKS: 0
> + - RLIMIT_MEMLOCK: 0
> +
> +Note: mlock() is used by QEMU only when both "realtime" and "mlock"
> +are specified; this does not apply to QEMU running as a Xen DM.
> +
> +'''How to test''': Check `/proc/<qpid>/limits`
> +
> +'''Tested''': Not tested
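
A rough sketch of how the process that executes QEMU might apply these
limits, assuming it does so in the child between fork() and exec(). The
function names are illustrative only; `RLIMIT_MSGQUEUE` and `RLIMIT_LOCKS`
are Linux-specific, and `RLIMIT_NPROC` is left out because a sensible value
depends on the per-domain uid setup.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Set both the soft and the hard limit of `resource` to `value`. */
int set_limit(int resource, rlim_t value)
{
    struct rlimit rl = { .rlim_cur = value, .rlim_max = value };

    return setrlimit(resource, &rl);
}

/* Return the current soft limit, or RLIM_INFINITY if it cannot be read. */
rlim_t soft_limit(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) < 0)
        return RLIM_INFINITY;
    return rl.rlim_cur;
}

/* Apply the limits listed above to the calling process. */
int apply_basic_limits(void)
{
    static const struct { int resource; rlim_t value; } limits[] = {
        { RLIMIT_FSIZE,    256 * 1024 },
        { RLIMIT_CORE,     0 },
        { RLIMIT_MSGQUEUE, 0 },
        { RLIMIT_LOCKS,    0 },
        { RLIMIT_MEMLOCK,  0 },
    };
    size_t i;

    for (i = 0; i < sizeof(limits) / sizeof(limits[0]); i++)
        if (set_limit(limits[i].resource, limits[i].value) < 0)
            return -1;
    return 0;
}
```

Because the hard limit is lowered together with the soft limit, an
unprivileged process cannot raise either value again afterwards.
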
> +
> +### Further RLIMITs
> +
> +RLIMIT_AS limits the total amount of memory; but this includes the
> +virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
> +fiddles with this; it would be straightforward to make it *set* the
> +rlimit to what it thinks a sensible limit is.
> +
> +Other things that would take some cleverness / changes to QEMU to
> +utilize due to ordering constraints:
> + - RLIMIT_NOFILE (after all necessary files are opened)
> +
> +### libxl UID cleanup
> +
> +'''Description''': Domain IDs are reused, and thus restricted UIDs are
> +reused.  If a compromised QEMU can fork (due to seccomp or
> +RLIMIT_NPROC limits being ineffective for some reason), it may avoid
> +being killed when its domain dies, then wait until the domain ID is
> +reused again, at which point it will have control over the domain in
> +question (which probably belongs to someone else).
> +
> +libxl should kill all UIDs associated with a domain both when the VM
> +is destroyed, and before starting a VM with the same UID.
> +
> +'''Implementation''': This is unnecessarily tricky.
> +
> +The kill() system call can have three kinds of targets:
> + - A single pid
> + - A process group
> + - "Every process except me to which I am allowed to send a signal" (-1)
> +
> +Targeting a single pid is racy and likely to be beaten by the
> +following loop:
> +
> +    while(1) {
> +        if(fork())
> +            _exit(0);
> +    }
> +
> +That is, by the time you've read the process list and found the
> +process id you want to kill, that process has exited and there is a
> +new process whose pid you don't know about.
> +
> +Targeting a process group will be ineffective, as unprivileged
> +processes are allowed to make their own process groups.
> +
> +kill(-1) can be used but must be done with care.  Consider the
> +following code, for example:
> +
> +    setuid(target_uid);
> +    kill(-1, 9);
> +
> +This looks like it will do the trick; but by setting all of the user
> +ids (effective, real, and saved), it opens the 'killing' process up to
> +being killed by the target process:
> +
> +    while(1) {
> +        if(fork())
> +            _exit(0);
> +        else
> +            kill(-1, 9);
> +    }
> +
> +Fortunately there is an asymmetry we can take advantage of.  From the
> +POSIX spec:
> +
> +> For a process to have permission to send a signal to a process
> +> designated by pid, unless the sending process has appropriate
> +> privileges, the real or effective user ID of the sending process shall
> +> match the real or saved set-user-ID of the receiving process.
> +
> +The solution is to allocate a second "reaper" uid that is only used to kill
> +target processes.  We set the euid of the killing process to the `target_uid`,
> +but the ruid of the killing process to `reaper_uid`, leaving the suid of the
> +killing process as 0:
> +
> +    setresuid(reaper_uid, target_uid, 0);
> +    kill(-1, 9);
> +
> +NOTE: We cannot use `setreuid(reaper_uid, target_uid)` here, as that
> +will set *both* euid *and* suid to `target_uid`, making the killing
> +process vulnerable to the target process again.
> +
> +Since this will kill all other `reaper_uid` processes as well, we must
> +either allocate a separate `reaper_uid` per domain, or use locking to
> +ensure that only one killing process is active at a time.
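
The asymmetry can be checked with a small model of the POSIX rule quoted
above. This is only a sketch of the permission check for unprivileged
senders, not kernel or libxl code: capabilities such as CAP_KILL are not
modelled, and the `struct ids` type, the function names and the uid values
used with them are made up for illustration.

```c
#include <stdbool.h>
#include <sys/types.h>

/* The three user ids that matter for signal permission checks. */
struct ids {
    uid_t ruid, euid, suid;
};

/*
 * POSIX rule for unprivileged senders: the sender's real or effective uid
 * must match the receiver's real or saved uid.
 */
bool may_signal(struct ids sender, struct ids receiver)
{
    return sender.ruid == receiver.ruid || sender.ruid == receiver.suid ||
           sender.euid == receiver.ruid || sender.euid == receiver.suid;
}

/* Ids after setuid(target) as root: all three become `target`. */
struct ids after_setuid(uid_t target)
{
    return (struct ids){ target, target, target };
}

/* Ids after setresuid(reaper, target, 0). */
struct ids after_setresuid_reaper(uid_t reaper, uid_t target)
{
    return (struct ids){ reaper, target, 0 };
}

/* Ids after setreuid(reaper, target) as root: suid follows euid. */
struct ids after_setreuid(uid_t reaper, uid_t target)
{
    return (struct ids){ reaper, target, target };
}
```

With a compromised QEMU modelled as `after_setuid(target)`, the model
reproduces the three cases discussed: the plain setuid() killer and the
target can signal each other, the setresuid() killer can signal the target
but not the reverse, and the setreuid() killer is exposed again through its
saved uid.
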
> +
> +## libxl: Treat QMP connection as untrusted
> +
> +'''Description''': Currently libxl talks with QEMU via QMP; but its
> +interactions have not historically been considered from a security point of
> +view.  For example, qmp_synchronous_send() waits for a response from
> +QEMU, which a compromised QEMU could simply not send (thus preventing
> +the toolstack from making forward progress).
> +
> +'''Implementation''': Audit toolstack interactions with QEMU which
> +happen after the guest has started running, and assume QEMU has been
> +compromised.
> +
> +### seccomp filtering (Linux only)
> +
> +'''Description''': Turn on seccomp filtering to disable syscalls which
> +QEMU doesn't need.
> +
> +'''Implementation''': Enable from the command-line:
> +
> +    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
> +
> +`elevateprivileges` is currently required to allow `-runas` to work.
> +Removing this requirement would mean making sure that the uid change
> +happened before the seccomp2 call, perhaps by changing the uid before
> +executing QEMU.  (But this would then require other changes to create
> +the QMP socket, VNC socket, and so on).
> +
> +It should be noted that `-sandbox` is implemented as a blacklist, not
> +a whitelist; that is, it disables known-unused functionality which may
> +be harmful, rather than disabling all functionality except that known
> +to be safe and needed.  This is unfortunately necessary since qemu
> +doesn't know what system calls libraries might end up making.  (See
> +[lwn-seccomp] for a more complete discussion.)
> +
> +This feature is not on by default and may not be available in all
> +environments.  We therefore need to either:
> + 1. Require that this feature be enabled to build qemu
> + 2. Check for `-sandbox` support at runtime before using it
> +
> +[lwn-seccomp]: https://lwn.net/Articles/738694/
> +
> +### Disks
> +
> +The chroot (and seccomp?) happens late enough such that QEMU can
> +initialize itself and open its disks. If you want to add a disk at run
> +time or insert a CD, you can't pass a path because QEMU is
> +chrooted. Instead use the add-fd QMP command and use
> +/dev/fdset/<fdset-id> as the path.
> +
> +A further layer of restriction could be to set RLIMIT_NOFILE to '0',
> +and hand all disks over QMP.
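
For reference, `add-fd` receives the descriptor as SCM_RIGHTS ancillary data
sent along with the command on the QMP UNIX socket. The following is a
minimal sketch, assuming `qmp` is an already connected QMP socket on which
`qmp_capabilities` has been negotiated; `send_with_fd` and `qmp_add_fd` are
illustrative names, and reading QEMU's reply is omitted.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Send `len` bytes of `data` over the UNIX socket `sock`, attaching `fd` as
 * SCM_RIGHTS ancillary data.  Returns 0 on success, -1 on failure.
 */
int send_with_fd(int sock, const void *data, size_t len, int fd)
{
    union {
        struct cmsghdr align;               /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == (ssize_t)len ? 0 : -1;
}

/* Ask QEMU to add `fd` to fd set `fdset_id`. */
int qmp_add_fd(int qmp, int fd, int fdset_id)
{
    char cmd[128];
    int n = snprintf(cmd, sizeof(cmd),
                     "{\"execute\":\"add-fd\","
                     "\"arguments\":{\"fdset-id\":%d}}\n", fdset_id);

    if (n < 0 || (size_t)n >= sizeof(cmd))
        return -1;
    return send_with_fd(qmp, cmd, (size_t)n, fd);
}
```

Once QEMU has acknowledged the command, the disk can be referred to as
`/dev/fdset/<fdset-id>` in place of a path.
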
> +
> +## Migration
> +
> +When calling xen-save-devices-state, since QEMU is running in a chroot
> +it is not useful to pass a filename (it doesn't even have write access
> +inside the chroot). Instead, give it an open fd using the add-fd
> +mechanism.
> +
> +Additionally, all the restrictions need to be applied to the qemu
> +started up on the post-migration side.  One issue that needs to be
> +solved is how to signal the toolstack on restore that qemu is ready
> +for the domain to be started (since this is normally done via
> +xenstore, and at this point the xenstore connections will have been
> +closed).

I thought Anthony had fixed this now?

  Paul

> +
> +### Network namespacing (Linux only)
> +
> +Enter QEMU into its own network namespace (in addition to mount & IPC
> +namespaces):
> +
> +    unshare(CLONE_NEWNET);
> +
> +QEMU does actually use the network namespace as a Xen DM for two
> +purposes: 1) To set up network tap devices 2) To open vnc connections.
> +
> +#### Network
> +
> +If QEMU runs in its own network namespace, it can't open the tap
> +device itself because the interface won't be visible outside of its
> +own namespace. So instead, have the toolstack open the device and pass
> +it as an fd on the command-line:
> +
> +    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
> +
> +#### VNC
> +
> +If QEMU runs in its own network namespace, it is not straightforward
> +to listen on a TCP socket outside of its own network namespace. One
> +option would be to use VNC over a UNIX socket:
> +
> +    -vnc unix:/var/run/xen/vnc-<domid>
> +
> +However, this would break functionality in the general case; I think
> +we need to have the toolstack open a socket and pass the fd to QEMU
> +(which requires changes to QEMU).
> +
> diff --git a/docs/features/qemu-deprivilege.pandoc b/docs/features/qemu-deprivilege.pandoc
> new file mode 100644
> index 0000000000..f941525189
> --- /dev/null
> +++ b/docs/features/qemu-deprivilege.pandoc
> @@ -0,0 +1,101 @@
> +% QEMU Deprivileging / dm_restrict
> +% Revision 1
> +
> +\clearpage
> +
> +# Basics
> +
> +---------------- ----------------------------------------------------
> +         Status: **Tech Preview**
> +
> +Architecture(s): x86
> +
> +   Component(s): toolstack
> +
> +---------------- ----------------------------------------------------
> +
> +# Overview
> +
> +By default, the QEMU device model is run in domain 0.  If an attacker
> +can gain control of a QEMU process, it could easily take control of a
> +system.
> +
> +dm_restrict is a set of operations to restrict QEMU running in domain
> +0.  It consists of two halves:
> +
> + 1. Mechanisms to restrict QEMU to only being able to affect its own
> +domain
> + 2. Mechanisms to restrict QEMU's ability to interact with domain 0.
> +
> +# User details
> +
> +## Getting the right versions of software
> +
> +Linux: 4.11+
> +
> +Qemu: 3.0+ (Or the version that comes with Xen 4.12+)
> +
> +## Setting up a group and userid range
> +
> +For maximum security, libxl needs to run the devicemodel for each
> +domain under a user id (UID) corresponding to its domain id.  There
> +are 32752 possible domain IDs, and so libxl needs 32752 user ids set
> +aside for it.  Setting up a group for all devicemodels to run as is
> +also recommended.
> +
> +The simplest and most effective way to do this is to allocate a
> +contiguous block of UIDs, and create a single user named
> +`xen-qemuuser-range-base` with the first UID.  For example, under
> +Debian:
> +
> +    adduser --system --uid 131072 --group --no-create-home xen-qemuuser-range-base
> +
> +Two comments on this method:
> +
> +  1. Most modern systems have 32-bit UIDs, and so can in theory go up
> +to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
> +UIDs however; UID 65535 is reserved for an invalid value, and 65534 is
> +normally allocated to "nobody".
> +  2. Additionally, some container systems have proposed using the
> +upper 16 bits of the uid for a container ID.  Using a multiple of 2^16
> +for the range base (as is done above) will result in all UIDs being
> +interpreted by such systems as a single container ID.
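
The arithmetic behind the range-base method can be sketched as follows,
assuming the per-domain uid is simply the base uid plus the domain id. The
function names are hypothetical; `DOMID_COUNT` reflects the 32752 domain ids
mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

#define DOMID_COUNT 32752u   /* number of possible domain ids */

/*
 * Compute the uid for `domid` given the uid of xen-qemuuser-range-base.
 * Returns false if domid is out of range or the sum would overflow.
 */
bool dm_uid_for_domid(uint32_t range_base, uint32_t domid, uint32_t *uid)
{
    if (domid >= DOMID_COUNT || range_base > UINT32_MAX - domid)
        return false;
    *uid = range_base + domid;
    return true;
}

/*
 * True if every uid in [range_base, range_base + DOMID_COUNT) shares the
 * same upper 16 bits, i.e. would map to a single "container id".
 */
bool range_in_single_container(uint32_t range_base)
{
    if (range_base > UINT32_MAX - (DOMID_COUNT - 1))
        return false;
    return (range_base >> 16) == ((range_base + DOMID_COUNT - 1) >> 16);
}
```

With the example base of 131072 (2 * 2^16), the highest uid in the range is
131072 + 32751 = 163823, which is still below 3 * 2^16 = 196608, so the
whole range keeps the same upper 16 bits.
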
> +
> +Another, less-secure way is to run all QEMUs as the same UID.  To do
> +this, create a user named `xen-qemuuser-shared`; for example:
> +
> +    adduser --no-create-home --system xen-qemuuser-shared
> +
> +## Domain config changes
> +
> +The core domain config change is to add the following line to the
> +domain configuration:
> +
> +    dm_restrict=1
> +
> +This will perform a number of restrictions, outlined below in the
> +'Technical details' section.
> +
> +# Technical details
> +
> +See docs/designs/qemu-deprivilege.md for technical details.
> +
> +# Limitations
> +
> +The following features still need to be implemented:
> + * Inserting a new cdrom while the guest is running (xl cdrom-insert)
> + * Migration / save / restore
> +
> +Additionally, getting PCI passthrough to work securely would require a
> +significant rework of how passthrough works at the moment.  It may be
> +implemented at some point but is not a near-term priority.
> +
> +See SUPPORT.md for security support status.
> +
> +# History
> +
> +------------------------------------------------------------------------
> +Date       Revision Version  Notes
> +---------- -------- -------- -------------------------------------------
> +2018-09-14 1        Xen 4.12 Imported from docs/misc
> +---------- -------- -------- -------------------------------------------
> diff --git a/docs/misc/qemu-deprivilege.txt b/docs/misc/qemu-deprivilege.txt
> deleted file mode 100644
> index 58b86a3908..0000000000
> --- a/docs/misc/qemu-deprivilege.txt
> +++ /dev/null
> @@ -1,36 +0,0 @@
> -For security reasons, libxl tries to pass a non-root username to QEMU as
> -argument. During initialization QEMU calls setuid and setgid with the
> -user ID and the group ID of the user passed as argument.
> -Libxl looks for the following users in this order:
> -
> -1) a user named "xen-qemuuser-domid$domid",
> -Where $domid is the domid of the domain being created.
> -This requires the reservation of 65535 uids from xen-qemuuser-domid1
> -to xen-qemuuser-domid65535. To use this mechanism, you might want to
> -create a large number of users at installation time. For example:
> -
> -for ((i=1; i<65536; i++))
> -do
> -    adduser --no-create-home --system xen-qemuuser-domid$i
> -done
> -
> -You might want to consider passing --group to adduser to create a new
> -group for each new user.
> -
> -
> -2) a user named "xen-qemuuser-shared"
> -As a fall back if both 1) fails, libxl will use a single user for
> -all QEMU instances. The user is named xen-qemuuser-shared. This is
> -less secure but still better than running QEMU as root. Using this is as
> -simple as creating just one more user on your host:
> -
> -adduser --no-create-home --system xen-qemuuser-shared
> -
> -
> -3) root
> -As a last resort, libxl will start QEMU as root.
> -
> -
> -Please note that running QEMU as non-root causes several features like
> -migration and PCI passthrough to not work properly and may prevent the guest
> -from booting.
> --
> 2.19.1
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxxx
> https://lists.xenproject.org/mailman/listinfo/xen-devel