[Xen-devel] [PATCH 1/5] docs/qemu-deprivilege: Revise and update with status and future plans

docs/qemu-deprivilege.txt had some basic instructions for using
dm_restrict, but it was incomplete, misleading, and stale.

Update the docs in a number of ways.

First, separate user-facing documentation and technical description
into docs/features and docs/design, respectively.

In the feature doc:

* Introduce a section mentioning minimum versions of Linux, Xen, and
qemu required (TBD)

* Fix the discussion of qemu userid.  Mention xen-qemuuser-range-base,
and provide example shell code that actually has some hope of working
(instead of failing out after creating 900 userids).

* Describe how to enable restrictions, as well as features which
probably don't or definitely don't work.

In the design doc, introduce a "Technical Details" section which
describes specifically what restrictions are currently done, and also
what restrictions we are looking at doing in the future.

The idea here is that as we implement the various items for the
future, we move them from "Restrictions still to do" to "Restrictions
done".  This can also act as a design document -- a place for public
discussion of what can or should be done and how.

Also add an entry to SUPPORT.md.

Signed-off-by: George Dunlap <george.dunlap@xxxxxxxxxx>
Changes since v2:
- Extraneous privcmd / evtchn instances aren't closed
- Expand description of how to test fd deprivileging
- Rework and clarify two namespace sections, give reference for QEMU NAK
- Add more information about migration technical challenges
- In UID section, mention possibility of container ID collisions.
- Fix name of design document.
- Add SUPPORT.md statement.  Specify Linux, to make sure that FreeBSD is
  evaluated separately.
- Mention that `-sandbox` is a blacklist and why

Changes since v1:
- Break into two, and move into appropriate directories (rather than 'misc')
- Updated version requirements
- Distinguish between features which "don't yet work" and features which we 
never expect to work
- Update description of xen-restrict functionality
- Reorder and expand further restrictions
- Make it more clear which restrictions are available on Linux only
- Include detailed description of how to kill a process
- Add RLIMIT_NPROC as something we can do without further changes to qemu
- Document the need to check for the sandbox feature before using it

Thank you to Ross Lagerwall, whose description of what XenServer is
doing formed much of the basis for the text here.

 SUPPORT.md                            |  20 ++
 docs/designs/qemu-deprivilege.md      | 322 ++++++++++++++++++++++++++
 docs/features/qemu-deprivilege.pandoc |  94 ++++++++
 docs/misc/qemu-deprivilege.txt        |  36 ---
 4 files changed, 436 insertions(+), 36 deletions(-)
 create mode 100644 docs/designs/qemu-deprivilege.md
 create mode 100644 docs/features/qemu-deprivilege.pandoc
 delete mode 100644 docs/misc/qemu-deprivilege.txt

diff --git a/SUPPORT.md b/SUPPORT.md
index 3727446b83..b5e7e44fb3 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -525,6 +525,26 @@ Vulnerabilities of a device model stub domain
 to a hostile driver domain (either compromised or untrusted)
 are excluded from security support.
+### Device Model Deprivileging
+    Status, Linux: Tech Preview, with limited support
+This means adding extra restrictions to a device model running in
+domain 0 in order to prevent a compromised device model from attacking
+the rest of the system.
+"Tech preview with limited support" means we will not issue XSAs for
+the _additional_ functionality provided by the feature; but we will
+issue XSAs in the event that enabling this feature opens up a security
+hole that would not be present with the feature disabled.
+For example, while this is classified as tech preview, a bug in libxl
+which failed to change the user ID of QEMU would not receive an XSA,
+since without this feature the user ID wouldn't be changed. But a
+change which made it possible for a compromised guest to read
+arbitrary files on the host filesystem without compromising QEMU would
+be issued an XSA, since that does weaken security.
 ### KCONFIG Expert
     Status: Experimental
diff --git a/docs/designs/qemu-deprivilege.md b/docs/designs/qemu-deprivilege.md
new file mode 100644
index 0000000000..d3c6495030
--- /dev/null
+++ b/docs/designs/qemu-deprivilege.md
@@ -0,0 +1,322 @@
+# Introduction
+The goal of deprivileging qemu is this: Even if there is a bug (for
+example in qemu) which permits a domain to gain control of the device
+model, the compromised device model process is prevented from
+violating the system's overall security properties.  I.e., a guest
+cannot "escape" from the virtualisation by using a qemu bug.
+This document lists the various technical measures which we either
+have taken, or plan to take to effect this goal.  Some of them are
+required to be considered secure (that is, there are known attack
+vectors which they close); others are "just in case" (that is, there
+are no known attack vectors, but we perform the restrictions to reduce
+the possibility of unknown attack vectors).
+# Restrictions done
+The following restrictions are currently implemented.
+## Having qemu switch user
+'''Description''': As mentioned above, having QEMU switch to a
+non-root user, one per domain id.  Not being the root user limits what
+a compromised QEMU process can do to the system, and having one user
+per domain id limits what a compromised QEMU process can do to the
+QEMU processes of other VMs.
+'''Implementation''': The toolstack adds the following to the qemu command-line:
+    -runas <uid>:<gid>
+'''How to test''':
+    grep '[UG]id' /proc/<qpid>/status
+'''Testing Status''': Not tested
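The same check can be scripted from a test program. The following is a minimal sketch, not code from the Xen tree: `read_proc_uids()` is a hypothetical helper that reads the real, effective and saved UIDs from the `Uid:` line of `/proc/<pid>/status` on Linux.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical test helper: read the real, effective and saved UIDs of
 * process `pid` from the "Uid:" line of /proc/<pid>/status.
 * Returns 0 on success, -1 if the file or the line cannot be read.
 */
static int read_proc_uids(pid_t pid, unsigned *ruid, unsigned *euid,
                          unsigned *suid)
{
    char path[64], line[256];
    int ret = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        /* Line format: "Uid:\t<real>\t<effective>\t<saved>\t<fs>" */
        if (sscanf(line, "Uid: %u %u %u", ruid, euid, suid) == 3) {
            ret = 0;
            break;
        }
    }
    fclose(f);
    return ret;
}
```

A test would call this with the QEMU pid and compare all three values against the UID the toolstack was expected to select for that domain.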
+## Xen library / file-descriptor restrictions
+'''Description''': Close and restrict Xen-related file descriptors.
+ * Close all xenstore-related file descriptors
+ * Make sure that all open instances of `privcmd` and `evtchn` file
+descriptors have had `IOCTL_PRIVCMD_RESTRICT` and
+`IOCTL_EVTCHN_RESTRICT_DOMID` ioctls called on them, respectively.
+FIXME: Double-check the correctness of the above
+'''Implementation''': Toolstack adds the following to the qemu command-line:
+    -xen-domid-restrict
+'''How to test''':
+Use `fishdescriptor` to pull a file descriptor from a running QEMU,
+then use `depriv-fd-checker` to check that it has the desired
+properties, and that hypercalls which are meant to fail do fail.  (In
+Debian `fishdescriptor` can be found in the binary package
+`chiark-scripts`; the `depriv-fd-checker` is included in the Xen
+source tree.)
+'''Testing status''': Tested
+# Restrictions / improvements still to do
+This lists potential restrictions still to do.  It is meant to be
+listed in order of ease of implementation, with low-hanging fruit
+first.
+## Chroot
+'''Description''': Qemu runs in its own chroot, such that even if it
+could call an 'open' command of some sort, there would be nothing for
+it to see.
+'''Implementation''': The toolstack creates a directory in the libxl
+"run-dir" (e.g. `/var/run/xen/qemu-root-<domid>`), then adds the
+following to the qemu command-line:
+    -chroot /var/run/xen/qemu-root-<domid>
+'''How to test''':  Check `/proc/<qpid>/root`
+'''Tested''': Not tested
+## Namespaces for unused functionality (Linux only)
+'''Description''': QEMU doesn't use the functionality associated with
+mount and IPC namespaces. (IPC namespaces control non-file-based IPC
+mechanisms within the kernel; unix and network sockets are not
+affected by this.)  Making separate namespaces for these for QEMU
+won't affect normal operation, but it does mean that even if other
+restrictions fail, the process won't be able to even name system mount
+points or existing non-file-based IPC descriptors to attempt to attack
+them.
+In theory this could be done in QEMU (similar to -sandbox, -runas,
+-chroot, and so on), but a patch doing this in QEMU was NAKed upstream
+(see [qemu-namespaces]). They preferred that this be done as a setup step by
+whatever executes QEMU; i.e., have the process which exec's QEMU first
+enter new mount and IPC namespaces (for example with `unshare()`).
+'''How to test''':  Check `/proc/<qpid>/ns/[ipc,mnt]`
+'''Tested''': Not tested
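As an illustration of the setup step described in this section, here is a hedged sketch of what the process that exec's QEMU might do on Linux. The helper names `depriv_ns_flags()` and `depriv_enter_namespaces()` are hypothetical and do not exist in libxl; the optional network namespace refers to the "Network namespacing" item further down this document.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for unshare() and the CLONE_NEW* flags */
#endif
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/*
 * Hypothetical helper: compute the unshare() flags for the device
 * model.  Mount and IPC namespaces are always unshared; the network
 * namespace only when the caller passes tap/VNC fds to QEMU itself.
 */
static int depriv_ns_flags(bool unshare_net)
{
    int flags = CLONE_NEWNS | CLONE_NEWIPC;

    if (unshare_net)
        flags |= CLONE_NEWNET;
    return flags;
}

/*
 * Meant to be called in the child between fork() and exec of QEMU.
 * Needs CAP_SYS_ADMIN.  Returns 0 on success, -1 with errno set.
 */
static int depriv_enter_namespaces(bool unshare_net)
{
    return unshare(depriv_ns_flags(unshare_net));
}
```

Doing this before any uid change matters, since creating these namespaces requires privilege that the process will no longer have once it runs as the per-domain user.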
+### Basic RLIMITs
+'''Description''': A number of limits on the resources that a given
+process / userid is allowed to consume.  These can limit the ability
+of a compromised QEMU process to DoS domain 0 by exhausting various
+resources available to it.
+Limits that can be implemented immediately without much effort:
+ - RLIMIT_FSIZE (file size) to 256KiB.
+ - RLIMIT_NPROC (after uid changes to a unique uid)
+Probably not necessary but why not:
+Note: mlock() is used by QEMU only when both "realtime" and "mlock"
+are specified; this does not apply to QEMU running as a Xen DM.
+'''How to test''': Check `/proc/<qpid>/limits`
+'''Tested''': Not tested
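A minimal sketch of the `RLIMIT_FSIZE` item, assuming the limit is applied in the child process before exec'ing QEMU so that it is inherited. `depriv_set_fsize_limit()` and `DEPRIV_FSIZE_LIMIT` are hypothetical names, not existing libxl symbols.

```c
#include <assert.h>
#include <sys/resource.h>

/* 256KiB, the RLIMIT_FSIZE value suggested in the text. */
#define DEPRIV_FSIZE_LIMIT ((rlim_t)256 * 1024)

/*
 * Hypothetical helper: lower both the soft and the hard file-size
 * limit.  Lowering the hard limit means an unprivileged process
 * cannot raise it again.  Returns 0 on success, -1 with errno set.
 */
static int depriv_set_fsize_limit(void)
{
    struct rlimit rl = {
        .rlim_cur = DEPRIV_FSIZE_LIMIT,
        .rlim_max = DEPRIV_FSIZE_LIMIT,
    };

    return setrlimit(RLIMIT_FSIZE, &rl);
}
```

After this call, `/proc/<qpid>/limits` should report 262144 bytes for "Max file size" in both the soft and hard columns.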
+### Further RLIMITs
+RLIMIT_AS limits the total amount of memory; but this includes the
+virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
+fiddles with this; it would be straightforward to make it *set* the
+rlimit to what it thinks a sensible limit is.
+Other things that would take some cleverness / changes to QEMU to
+utilize due to ordering constraints:
+ - RLIMIT_NOFILE (after all necessary files are opened)
+### libxl UID cleanup
+'''Description''': Domain IDs are reused, and thus restricted UIDs are
+reused.  If a compromised QEMU can fork (due to seccomp or
+RLIMIT_NPROC limits being ineffective for some reason), it may avoid
+being killed when its domain dies, then wait until the domain ID is
+reused again, at which point it will have control over the domain in
+question (which probably belongs to someone else).
+libxl should kill all UIDs associated with a domain both when the VM
+is destroyed, and before starting a VM with the same UID.
+'''Implementation''': This is unnecessarily tricky.
+The kill() system call can have three kinds of targets:
+ - A single pid
+ - A process group
+ - "Every process except me to which I am allowed to send a signal" (-1)
+Targeting a single pid is racy and likely to be beaten by the
+following loop:
+    while(1) {
+        if(fork())
+           _exit(0);
+    }    
+That is, by the time you've read the process list and found the
+process id you want to kill, that process has exited and there is a
+new process whose pid you don't know about.
+Targeting a process group will be ineffective, as unprivileged
+processes are allowed to make their own process groups.
+kill(-1) can be used but must be done with care.  Consider the
+following code, for example:
+    setuid(target_uid);
+    kill(-1, 9);
+This looks like it will do the trick; but by setting all of the user
+ids (effective, real, and saved), it opens the 'killing' process up to
+being killed by the target process:
+    while(1) {
+        if(fork())
+            _exit(0);
+        else
+            kill(-1, 9);
+    }
+Fortunately there is an asymmetry we can take advantage of.  From the
+POSIX spec:
+> For a process to have permission to send a signal to a process
+> designated by pid, unless the sending process has appropriate
+> privileges, the real or effective user ID of the sending process shall
+> match the real or saved set-user-ID of the receiving process.
+The solution is to allocate a second "reaper" uid that is only used to kill
+target processes.  We set the euid of the killing process to the `target_uid`,
+but the ruid of the killing process to `reaper_uid`, leaving the suid of the
+killing process as 0:
+    setresuid(reaper_uid, target_uid, 0);
+    kill(-1, 9);
+NOTE: We cannot use `setreuid(reaper_uid, target_uid)` here, as that
+will set *both* euid *and* suid to `target_uid`, making the killing
+process vulnerable to the target process again.
+Since this will kill all other `reaper_uid` processes as well, we must
+either allocate a separate `reaper_uid` per domain, or use locking to
+ensure that only one killing process is active at a time.
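To make the asymmetry concrete, the POSIX permission rule quoted above can be modelled as a small predicate. This is an illustrative sketch only: `struct cred`, `may_signal()`, `reaper_cred()` and `target_cred()` are hypothetical names, and the model deliberately ignores senders with "appropriate privileges".

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical model of a process's three user IDs. */
struct cred {
    uid_t ruid, euid, suid;
};

/*
 * POSIX kill() permission rule for unprivileged senders: the sender's
 * real or effective uid must match the receiver's real or saved uid.
 */
static bool may_signal(struct cred sender, struct cred receiver)
{
    return sender.ruid == receiver.ruid || sender.ruid == receiver.suid ||
           sender.euid == receiver.ruid || sender.euid == receiver.suid;
}

/* Credentials after setresuid(reaper_uid, target_uid, 0). */
static struct cred reaper_cred(uid_t reaper_uid, uid_t target_uid)
{
    struct cred c = { .ruid = reaper_uid, .euid = target_uid, .suid = 0 };
    return c;
}

/* Credentials of a compromised QEMU: all three ids are target_uid. */
static struct cred target_cred(uid_t target_uid)
{
    struct cred c = { .ruid = target_uid, .euid = target_uid,
                      .suid = target_uid };
    return c;
}
```

With made-up values such as `reaper_uid` 70000 and `target_uid` 65541, `may_signal(reaper_cred(70000, 65541), target_cred(65541))` is true while the reverse direction is false. A killer that had simply called `setuid(target_uid)` would have the same credentials as the target, so both directions would be true.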
+## libxl: Treat QMP connection as untrusted
+'''Description''': Currently libxl talks with QEMU via QMP; but its
+interactions have not historically been considered from a security point of
+view.  For example, qmp_synchronous_send() waits for a response from
+QEMU, which a compromised QEMU could simply not send (thus preventing
+the toolstack from making forward progress).
+'''Implementation''': Audit toolstack interactions with QEMU which
+happen after the guest has started running, and assume QEMU has been
+compromised.
+### seccomp filtering (Linux only)
+'''Description''': Turn on seccomp filtering to disable syscalls which
+QEMU doesn't need. 
+'''Implementation''': Enable from the command-line:
+    -sandbox 
+`elevateprivileges` is currently required to allow `-runas` to work.
+Removing this requirement would mean making sure that the uid change
+happened before the seccomp2 call, perhaps by changing the uid before
+executing QEMU.  (But this would then require other changes to create
+the QMP socket, VNC socket, and so on).
+It should be noted that `-sandbox` is implemented as a blacklist, not
+a whitelist; that is, it disables known-unused functionality which may
+be harmful, rather than disabling all functionality except that known
+to be safe and needed.  This is unfortunately necessary since qemu
+doesn't know what system calls libraries might end up making.  (See
+[lwn-seccomp] for a more complete discussion.)
+This feature is not on by default and may not be available in all
+environments.  We therefore need to either:
+ 1. Require that this feature be enabled to build qemu
+ 2. Check for `-sandbox` support at runtime before using it
+[lwn-seccomp]: https://lwn.net/Articles/738694/
+### Disks
+The chroot (and seccomp?) happens late enough such that QEMU can
+initialize itself and open its disks. If you want to add a disk at run
+time or insert a CD, you can't pass a path because QEMU is
+chrooted. Instead use the add-fd QMP command and use
+/dev/fdset/<fdset-id> as the path.
+A further layer of restriction could be to set RLIMIT_NOFILE to '0',
+and hand all disks over QMP.
+## Migration
+When calling xen-save-devices-state, since QEMU is running in a chroot
+it is not useful to pass a filename (it doesn't even have write access
+inside the chroot). Instead, give it an open fd using the add-fd
+QMP command.
+Additionally, all the restrictions need to be applied to the qemu
+started up on the post-migration side.  One issue that needs to be
+solved is how to signal the toolstack on restore that qemu is ready
+for the domain to be started (since this is normally done via
+xenstore, and at this point the xenstore connections will have been
+closed).
+### Network namespacing (Linux only)
+Enter QEMU into its own network namespace (in addition to mount & IPC
+namespaces):
+    unshare(CLONE_NEWNET);
+QEMU does actually use the network namespace as a Xen DM for two
+purposes: 1) To set up network tap devices 2) To open vnc connections.
+#### Network
+If QEMU runs in its own network namespace, it can't open the tap
+device itself because the interface won't be visible outside of its
+own namespace. So instead, have the toolstack open the device and pass
+it as an fd on the command-line:
+    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
+#### VNC
+If QEMU runs in its own network namespace, it is not straightforward
+to listen on a TCP socket outside of its own network namespace. One
+option would be to use VNC over a UNIX socket:
+    -vnc unix:/var/run/xen/vnc-<domid>
+However, this would break functionality in the general case; I think
+we need to have the toolstack open a socket and pass the fd to QEMU
+(which requires changes to QEMU).
diff --git a/docs/features/qemu-deprivilege.pandoc b/docs/features/qemu-deprivilege.pandoc
new file mode 100644
index 0000000000..6fb48f3e40
--- /dev/null
+++ b/docs/features/qemu-deprivilege.pandoc
@@ -0,0 +1,94 @@
+% QEMU Deprivileging / dm_restrict
+% Revision 1
+# Basics
+---------------- ----------------------------------------------------
+         Status: **Tech Preview**
+Architecture(s): x86
+   Component(s): toolstack
+---------------- ----------------------------------------------------
+# Overview
+By default, the QEMU device model is run in domain 0.  If an attacker
+can gain control of a QEMU process, it could easily take control of a
+system.
+dm_restrict is a set of operations to restrict QEMU running in domain
+0.  It consists of two halves:
+ 1. Mechanisms to restrict QEMU to only being able to affect its own
+    domain.
+ 2. Mechanisms to restrict QEMU's ability to interact with domain 0.
+# User details
+## Getting the right versions of software
+Linux: 4.11+
+Qemu: 3.0+ (Or the version that comes with Xen 4.12+)
+## Setting up a userid range
+For maximum security, libxl needs to run the device model for each
+domain under a user id (UID) corresponding to its domain id.  There
+are 32752 possible domain IDs, and so libxl needs 32752 user ids set
+aside for it.
+The simplest and most effective way to do this is to allocate a
+contiguous block of UIDs, and create a single user named
+`xen-qemuuser-range-base` with the first UID.  For example, under Debian:
+    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
+NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
+to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
+UIDs however; UID 65535 is reserved for an invalid value, and 65534 is
+normally allocated to "nobody".  Additionally, some container systems
+have proposed using the upper 16 bits of the uid for a container ID.
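The mapping from domain ID to UID that this setup enables can be sketched as follows. This assumes the per-domain UID is the UID of `xen-qemuuser-range-base` plus the domain ID, which the text implies but does not state outright; `depriv_lookup_base()`, `depriv_domid_to_uid()` and `DEPRIV_NR_DOMIDS` are hypothetical names.

```c
#include <assert.h>
#include <pwd.h>
#include <stdint.h>
#include <sys/types.h>

/* Number of possible domain IDs, as stated in the text. */
#define DEPRIV_NR_DOMIDS 32752u

/* Look up the first UID of the reserved block; -1 if no such user. */
static int depriv_lookup_base(uid_t *base)
{
    struct passwd *pw = getpwnam("xen-qemuuser-range-base");

    if (!pw)
        return -1;
    *base = pw->pw_uid;
    return 0;
}

/*
 * Hypothetical helper: map a domain ID onto the reserved UID block.
 * Returns 0 and stores the result in *uid, or -1 if domid is out of
 * range or the sum would reach the invalid value (uid_t)-1.
 */
static int depriv_domid_to_uid(uid_t base, uint32_t domid, uid_t *uid)
{
    if (domid >= DEPRIV_NR_DOMIDS)
        return -1;
    if (base >= (uid_t)-1 - domid)
        return -1;
    *uid = base + domid;
    return 0;
}
```

With the Debian example above (base UID 65536), domain 5 would run as UID 65541, and the block would extend up to UID 98287.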
+Another, less-secure way is to run all QEMUs as the same UID.  To do
+this, create a user named `xen-qemuuser-shared`; for example:
+    adduser --no-create-home --system xen-qemuuser-shared
+## Domain config changes
+The core domain config change is to add the following line to the
+domain configuration:
+    dm_restrict=1
+This will perform a number of restrictions, outlined below in the
+'Technical details' section.
+# Technical details
+See docs/designs/qemu-deprivilege.md for technical details.
+# Limitations
+The following features still need to be implemented:
+ * Inserting a new cdrom while the guest is running (xl cdrom-insert)
+ * Migration / save / restore
+Additionally, getting PCI passthrough to work securely would require a
+significant rework of how passthrough works at the moment.  It may be
+implemented at some point but is not a near-term priority.
+See SUPPORT.md for security support status.
+# History
+Date       Revision Version  Notes
+---------- -------- -------- -------------------------------------------
+2018-09-14 1        Xen 4.12 Imported from docs/misc
+---------- -------- -------- -------------------------------------------
diff --git a/docs/misc/qemu-deprivilege.txt b/docs/misc/qemu-deprivilege.txt
deleted file mode 100644
index 58b86a3908..0000000000
--- a/docs/misc/qemu-deprivilege.txt
+++ /dev/null
@@ -1,36 +0,0 @@
-For security reasons, libxl tries to pass a non-root username to QEMU as
-argument. During initialization QEMU calls setuid and setgid with the
-user ID and the group ID of the user passed as argument.
-Libxl looks for the following users in this order:
-1) a user named "xen-qemuuser-domid$domid",
-Where $domid is the domid of the domain being created.
-This requires the reservation of 65535 uids from xen-qemuuser-domid1
-to xen-qemuuser-domid65535. To use this mechanism, you might want to
-create a large number of users at installation time. For example:
-for ((i=1; i<65536; i++))
-    adduser --no-create-home --system xen-qemuuser-domid$i
-You might want to consider passing --group to adduser to create a new
-group for each new user.
-2) a user named "xen-qemuuser-shared"
-As a fall back if both 1) fails, libxl will use a single user for
-all QEMU instances. The user is named xen-qemuuser-shared. This is
-less secure but still better than running QEMU as root. Using this is as
-simple as creating just one more user on your host:
-adduser --no-create-home --system xen-qemuuser-shared
-3) root
-As a last resort, libxl will start QEMU as root.
-Please note that running QEMU as non-root causes several features like
-migration and PCI passthrough to not work properly and may prevent the guest
-from booting.

