
Re: [Xen-devel] [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans



On 03/22/2018 06:24 PM, George Dunlap wrote:
snip
-for ((i=1; i<65536; i++))
+# Introduction
+
+# Setup
+
+## Getting the right versions of software
+
+Linux 4.XX

(For dom0 kernel...)

Requires 4.11 for the ability to restrict dmop calls.

+
+Xen 4.XX

Requires 4.11 for the dmop calls needed to make VGA work.

+
+Qemu: Requires patches not yet in any release
+
+## Setting up a userid range
+
+For maximum security, libxl needs to run the devicemodel for each
+domain under a user id (UID) corresponding to its domain id.  There
+are 32752 possible domain IDs, and so libxl needs 32752 user ids set
+aside for it.
+
+The simplest and most effective way to do this is to allocate a
+contiguous block of UIDs, and create a single user named
+`xen-qemuuser-range-base` with the first UID.  For example, under Debian:
+
+    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
+
+An alternate way is to create 32752 distinct users with the name
+`xen-qemuuser-domid$domid`, doing something like the following:
+
+for ((i=1; i<=32751; i++))
  do
-    adduser --no-create-home --system xen-qemuuser-domid$i
+    adduser --no-create-home --system --uid $(($i-1+65536)) xen-qemuuser-domid$i
  done
-You might want to consider passing --group to adduser to create a new
-group for each new user.
+FIXME: Test the above script to see if it works
+
+NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
+to 2^31 (or 2^32 if uids are unsigned).  POSIX only guarantees 16-bit
+UIDs however.  UID 65535 is reserved for an invalid value, and 65534
+is normally allocated to "nobody".
+
+Another, less-secure way is to run all QEMUs as the same UID.  To do
+this, create a user named `xen-qemuuser-shared`; for example:
+
+    adduser --no-create-home --system xen-qemuuser-shared
+
+## Domain config changes
+
+The core domain config change is to add the following line to the
+domain configuration:
+
+    dm_restrict=1
+
+This will perform a number of restrictions, outlined below in the
+'Technical details' section.
+
+Remove non-functioning default features:
+
+    vga="none"

I'm not sure what this means?

+
+Other features expected not to work include:
+* Inserting a new cdrom while the guest is running (xl cd-insert)
+* migration / save / restore

The above two features could be made to work if the toolstack drives QEMU correctly.

+* PCI passthrough

This one requires a fair amount of Xen & QEMU changes to have a chance of working.

+
+# Technical details
+
+## Restrictions done
+
+### Having qemu switch user
+
+'''Description''': As mentioned above, having qemu switch to a non-root user, one per
+domain id.
+
+'''Implementation''': The toolstack adds the following to the qemu command-line:
+
+    -runas <uid>:<gid>
+
+'''Testing Status''': Not tested
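
For illustration (this isn't in the patch): with the range-base scheme from the Setup section, the per-domain UID would presumably be base_uid + domid - 1, matching the adduser loop earlier. A minimal sketch of the toolstack-side computation, where the user name and the mapping are assumptions:

    /* Hypothetical helper: derive the -runas argument for a domain,
     * assuming the xen-qemuuser-range-base scheme described earlier
     * (domid 1 maps to the base UID, as in the adduser loop above). */
    #include <pwd.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        struct passwd *pw = getpwnam("xen-qemuuser-range-base");
        int domid = argc > 1 ? atoi(argv[1]) : 0;

        if (!pw || domid < 1) {
            fprintf(stderr, "need a valid domid and the range-base user\n");
            return 1;
        }

        printf("-runas %u:%u\n",
               (unsigned)(pw->pw_uid + domid - 1), (unsigned)pw->pw_gid);
        return 0;
    }
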
+
+### Xen restrictions
+
+'''Description''': Close and restrict Xen-related file descriptors.
+Specifically, make sure that only one `privcmd` instance is open, and
+that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.

Just to clarify, we call IOCTL_PRIVCMD_RESTRICT on the `privcmd` fds and IOCTL_EVTCHN_RESTRICT_DOMID on the evtchn fds which remain open. There is no requirement to have only one instance of each.

+
+XXX Also, make sure that only one `xenstore` fd remains open, and that
+it's restricted.

The current implementation closes _all_ xenstore fds and doesn't need to make use of xenstore after going into restricted mode.

+
+'''Implementation''': Toolstack adds the following to the qemu command-line:
+
+    -xen-domid-restrict
+
+'''Testing status''': Not tested XXX
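
To make the above concrete, a rough userspace sketch of restricting the two kinds of fds, using the IOCTL_PRIVCMD_RESTRICT and IOCTL_EVTCHN_RESTRICT_DOMID ioctls mentioned earlier. The header locations are an assumption and vary by distro; this is an illustration, not the libxl code:

    /* Illustration only: restrict an open privcmd fd and an open evtchn
     * fd to a single domain.  Ioctl names come from the Linux privcmd
     * and evtchn drivers; header paths are assumed. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <xen/privcmd.h>   /* IOCTL_PRIVCMD_RESTRICT */
    #include <xen/evtchn.h>    /* IOCTL_EVTCHN_RESTRICT_DOMID */

    static int restrict_fds(int privcmd_fd, int evtchn_fd, uint16_t domid)
    {
        uint16_t dom = domid;   /* domid_t is a 16-bit value */
        struct ioctl_evtchn_restrict_domid er = { .domid = domid };

        /* Limit the privcmd fd to operations targeting this domain only. */
        if (ioctl(privcmd_fd, IOCTL_PRIVCMD_RESTRICT, &dom) < 0)
            return -1;

        /* Limit the evtchn fd to ports bound to this domain only. */
        if (ioctl(evtchn_fd, IOCTL_EVTCHN_RESTRICT_DOMID, &er) < 0)
            return -1;

        return 0;
    }
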
+
+## Restrictions still to do
+
+### Chroot
+
+'''Description''': Qemu runs in its own chroot, such that even if it
+could call an 'open' command of some sort, there would be nothing for
+it to see.
+
+'''Implementation''': The toolstack creates a directory such as:
+`/var/run/qemu/root-<domid>`
+
+Then add the following to the qemu command-line:
+
+    -chroot /var/run/qemu/root-<domid>
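
As an aside, setting up the empty chroot directory is trivial for the toolstack; something along these lines (path layout as in the example above) before invoking QEMU would do:

    /* Sketch: create an empty, root-owned chroot directory for a domain.
     * Path layout follows the /var/run/qemu/root-<domid> example above. */
    #include <stdio.h>
    #include <sys/stat.h>

    int make_chroot_dir(int domid)
    {
        char path[64];

        snprintf(path, sizeof(path), "/var/run/qemu/root-%d", domid);

        /* Mode 000: nothing inside, and the deprivileged QEMU cannot
         * create anything inside it either. */
        return mkdir(path, 0);
    }
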
+
+### Namespaces for unused functionality
+
+'''Descripiton''': Enter QEMU into its own mount & IPC namespaces.

Spelling: Descripiton

+This means that even if other restrictions fail, the process won't
+even be able to name system mount points or existing non-file-based
+IPC descriptors to attempt to attack them.
+
+'''Implementation''':
+
+In theory this could be done in QEMU (similar to -sandbox, -runas,
+-chroot, and so on), but a patch doing this in QEMU was NAKed
+upstream. They preferred that this be done as a setup step by
+whatever executes QEMU; i.e., have the process which exec's QEMU first
+call:
+
+    unshare(CLONE_NEWNS | CLONE_NEWIPC)
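
Concretely, the wrapper sequence would look roughly like the following sketch (the QEMU binary path and arguments are placeholders):

    /* Sketch: detach from the host's mount and IPC namespaces, then exec
     * QEMU.  Binary path and arguments are placeholders. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (unshare(CLONE_NEWNS | CLONE_NEWIPC) < 0) {
            perror("unshare");
            return 1;
        }

        execl("/usr/bin/qemu-system-i386", "qemu-system-i386",
              "-xen-domid", "1", (char *)NULL);
        perror("execl");        /* only reached if exec failed */
        return 1;
    }
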
+
+### seccomp filtering
+
+'''Description''': Turn on seccomp filtering to disable syscalls which
+QEMU doesn't need.
+
+'''Implementation''': Enable from the command-line:
+
+    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
+
+`elevateprivileges` is currently required to allow `-runas` to work.
+Removing this requirement would mean making sure that the uid change
+happened before the seccomp2 call, perhaps by changing the uid before
+executing QEMU.  (But this would then require other changes to create
+the QMP socket, VNC socket, and so on).
+
+### Basic RLIMITs
+
+'''Description''': A number of limits on the resources that a given
+process / userid is allowed to consume.  These can limit the ability
+of a compromised QEMU process to DoS domain 0 by exhausting various
+resources available to it.
+
+'''Implementaiton'''

Spelling: Implementaiton

+
+Limits that can be implemented immediately without much effort:
+ - RLIMIT_FSIZE (file size): 256KiB
+
+Probably not necessary but why not:
+ - RLIMIT_CORE: 0
+ - RLIMIT_MSGQUEUE: 0
+ - RLIMIT_LOCKS: 0 XXX Check
+ - RLIMIT_MEMLOCK: 0
+   mlock() is used only when both "realtime" and "mlock" are specified.
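
For reference, applying these is just a handful of setrlimit() calls before exec'ing QEMU; a minimal sketch using the values suggested above (the exact numbers are of course still up for discussion):

    /* Sketch: apply the "easy" limits listed above before exec'ing QEMU. */
    #include <stdio.h>
    #include <sys/resource.h>

    static int set_limit(int resource, rlim_t value)
    {
        struct rlimit rl = { .rlim_cur = value, .rlim_max = value };
        return setrlimit(resource, &rl);
    }

    int apply_basic_rlimits(void)
    {
        if (set_limit(RLIMIT_FSIZE, 256 * 1024) < 0 ||  /* 256KiB file size */
            set_limit(RLIMIT_CORE, 0) < 0 ||
            set_limit(RLIMIT_MSGQUEUE, 0) < 0 ||
            set_limit(RLIMIT_LOCKS, 0) < 0 ||
            set_limit(RLIMIT_MEMLOCK, 0) < 0) {
            perror("setrlimit");
            return -1;
        }
        return 0;
    }
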
+
+### Further RLIMITs
+
+RLIMIT_AS limits the total amount of memory; but this includes the
+virtual memory which QEMU uses as a mapcache.  xen-mapcache.c already
+fiddles with this; it would be straightforward to make it *set* the
+rlimit to what it thinks a sensible limit is.
+
+Other things that would take some cleverness / changes to QEMU to
+utilize due to ordering constraints:
+ - RLIMIT_NPROC (after uid changes to a unique uid)
+ - RLIMIT_NOFILE (after all necessary files are opened)
+
+### libxl UID cleanup
+
+'''Description''': Domain IDs are reused, and thus restricted UIDs are
+reused.  If a compromised QEMU can fork (due to seccomp or
+RLIMIT_NPROC limits being ineffective for some reason), it may avoid
+being killed when its domain dies, then wait until the domain ID is
+reused again, at which point it will have control over the domain in
+question (which probably belongs to someone else).
+
+libxl should kill all UIDs associated with a domain both when the VM
+is destroyed, and before starting a VM with the same UID.
+
+'''Implementation''': Needs to be researched; it's difficult to do in
+a way that's not racy (e.g., we can't simply look at all processes,
+find the pids corresponding to uids, and then kill those, as a
+continually forking process could (potentially) elude this approach).
+Rumor has it there's a "kill all processes with my UID" system call,
+or something of that nature.
+
+kill(-1,sig) sends a signal to "every process to which the calling
+process has permission to send a signal".  So in theory:
+  setuid(X)
+  kill(-1,KILL)
+should do the trick.
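
A slightly fleshed-out version of that idea, for concreteness (forking a throwaway child keeps the toolstack process itself out of harm's way; purely a sketch):

    /* Sketch: kill every process running as the given UID by dropping a
     * forked child to that UID and signalling everything it may signal. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int kill_uid(uid_t uid)
    {
        pid_t pid = fork();

        if (pid < 0)
            return -1;

        if (pid == 0) {
            /* Child: after setuid(uid), the only processes it may signal
             * are those running as 'uid', i.e. the leftover QEMUs. */
            if (setuid(uid) < 0)
                _exit(1);
            kill(-1, SIGKILL);
            _exit(0);
        }

        /* Parent (still root): wait so the cleanup is synchronous. */
        return waitpid(pid, NULL, 0) < 0 ? -1 : 0;
    }
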
+
+### Disks
+
+The chroot (and seccomp?) happens late enough such that QEMU can
+initialize itself and open its disks. If you want to add a disk at run
+time or insert a CD, you can't pass a path because QEMU is
+chrooted. Instead use the add-fd QMP command and use
+/dev/fdset/<fdset-id> as the path.
+
+A further layer of restriction could be to set RLIMIT_NOFILE to '0',
+and hand all disks over via QMP.
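
For whoever implements this: the fd reaches QEMU as SCM_RIGHTS ancillary data on the QMP UNIX socket, sent alongside the add-fd command, and the resulting fd set is then referenced as /dev/fdset/<fdset-id>. A rough sketch of the sending side follows; the JSON string and fdset-id are illustrative, and capability negotiation and reply handling are omitted:

    /* Sketch: pass an already-open disk fd to QEMU over the QMP UNIX
     * socket, attached to an "add-fd" command as SCM_RIGHTS data.
     * QEMU can then open the image via /dev/fdset/<fdset-id>. */
    #include <string.h>
    #include <sys/socket.h>

    int qmp_send_fd(int qmp_sock, int disk_fd)
    {
        const char cmd[] =
            "{ \"execute\": \"add-fd\", \"arguments\": { \"fdset-id\": 1 } }\n";
        struct iovec iov = { .iov_base = (void *)cmd, .iov_len = sizeof(cmd) - 1 };
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        /* Attach the disk fd as ancillary data on the same message. */
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &disk_fd, sizeof(int));

        return sendmsg(qmp_sock, &msg, 0) < 0 ? -1 : 0;
    }
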
+
+### Migration
+
+When calling xen-save-devices-state, since QEMU is running in a chroot
+it is not useful to pass a filename (it doesn't even have write access
+inside the chroot). Instead, give it an open fd using the add-fd
+mechanism.
+
+### Network namespacing
+
+Enter QEMU into its own network namespace (in addition to mount & IPC
+namespaces).  Basically change the 'unshare' call to be as follows:
+
+    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)

It might be clearer if this was merged with the other Namespacing section or at least put immediately afterwards.

+
+### Network
+If QEMU runs in its own network namespace, it can't open the tap
+device itself because the interface won't be visible outside of its
+own namespace. So instead, have the toolstack open the device and pass
+it as an fd on the command-line:
-2) a user named "xen-qemuuser-shared"
-As a fall back if both 1) fails, libxl will use a single user for
-all QEMU instances. The user is named xen-qemuuser-shared. This is
-less secure but still better than running QEMU as root. Using this is as
-simple as creating just one more user on your host:
+    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
-adduser --no-create-home --system xen-qemuuser-shared
+### VNC
+If QEMU runs in its own network namespace, it is not straightforward
+to listen on a TCP socket outside of its own network namespace. One
+option would be to use VNC over a UNIX socket:
-3) root
-As a last resort, libxl will start QEMU as root.
+    -vnc unix:/var/run/xen/vnc-<domid>
+However, this would break functionality in the general case; I think
+we need to have the toolstack open a socket and pass the fd to QEMU
+(which requires changes to QEMU).
-Please note that running QEMU as non-root causes several features like
-migration and PCI passthrough to not work properly and may prevent the guest
-from booting.


Although there are still a lot of todos, this looks generally good and is a big improvement on the previous document.

Cheers,
--
Ross Lagerwall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

