Re: [Xen-devel] [PATCH] docs/qemu-deprivilege: Revise and update with status and future plans
On 03/22/2018 06:24 PM, George Dunlap wrote:

snip

-for ((i=1; i<65536; i++))
+# Introduction
+
+# Setup
+
+## Getting the right versions of software
+
+Linux 4.XX (For dom0 kernel...)

Requires 4.11 for the ability to restrict dmop calls.

+
+Xen 4.XX

Requires 4.11 to get required dmop calls to make VGA work.

+
+Qemu:

Requires patches not yet in any release

+
+## Setting up a userid range
+
+For maximum security, libxl needs to run the devicemodel for each
+domain under a user id (UID) corresponding to its domain id. There
+are 32752 possible domain IDs, and so libxl needs 32752 user ids set
+aside for it.
+
+The simplest and most effective way to do this is to allocate a
+contiguous block of UIDs, and create a single user named
+`xen-qemuuser-range-base` with the first UID. For example, under Debian:
+
+    adduser --no-create-home --uid 65536 --system xen-qemuuser-range-base
+
+An alternate way is to create 32752 distinct users with the name
+`xen-qemuuser-domid$domid`, doing something like the following:
+
+for ((i=1; i<=32751; i++))
 do
-    adduser --no-create-home --system xen-qemuuser-domid$i
+    adduser --no-create-home --system --uid $(($i-1+65536)) xen-qemuuser-domid$i
 done
-You might want to consider passing --group to adduser to create a new
-group for each new user.
+FIXME: Test the above script to see if it works
+
+NOTE: Most modern systems have 32-bit UIDs, and so can in theory go up
+to 2^31 (or 2^32 if uids are unsigned). POSIX only guarantees 16-bit
+UIDs however. UID 65535 is reserved for an invalid value, and 65534
+is normally allocated to "nobody".
+
+Another, less-secure way is to run all QEMUs as the same UID.
+To do this, create a user named `xen-qemuuser-shared`; for example:
+
+    adduser --no-create-home --system xen-qemuuser-shared
+
+## Domain config changes
+
+The core domain config change is to add the following line to the
+domain configuration:
+
+    dm_restrict=1
+
+This will perform a number of restrictions, outlined below in the
+'Technical details' section.
+
+Remove non-functioning default features:
+
+    vga="none"

I'm not sure what this means?

+
+Other features expected not to work include:
+* Inserting a new cdrom while the guest is running (xl cdrom-insert)
+* migration / save / restore

The above two features could be made to work if the toolstack drives
QEMU correctly.

+* PCI passthrough

This one requires a fair amount of Xen & QEMU changes to have a chance
of working.

+
+# Technical details
+
+## Restrictions done
+
+### Having qemu switch user
+
+'''Description''': As mentioned above, having qemu switch to a non-root user, one per
+domain id.
+
+'''Implementation''': The toolstack adds the following to the qemu command-line:
+
+    -runas <uid>:<gid>
+
+'''Testing Status''': Not tested
+
+### Xen restrictions
+
+'''Description''': Close and restrict Xen-related file descriptors.
+Specifically, make sure that only one `privcmd` instance is open, and
+that the IOCTL_EVTCHN_RESTRICT_DOMID ioctl has been called.

Just to clarify, we call IOCTL_PRIVCMD_RESTRICT on the `privcmd` fds and
IOCTL_EVTCHN_RESTRICT_DOMID on the evtchn fds which remain open. There is
no requirement to have only one instance of each.

+
+XXX Also, make sure that only one `xenstore` fd remains open, and that
+it's restricted.

The current implementation closes _all_ xenstore fds and doesn't need to
make use of xenstore after going into restricted mode.
+
+'''Implementation''': Toolstack adds the following to the qemu command-line:
+
+-xen-domid-restrict
+
+'''Testing status''': Not tested XXX
+
+## Restrictions still to do
+
+### Chroot
+
+'''Description''': Qemu runs in its own chroot, such that even if it
+could call an 'open' command of some sort, there would be nothing for
+it to see.
+
+'''Implementation''': The toolstack creates a directory such as:
+`/var/run/qemu/root-<domid>`
+
+Then add the following to the qemu command-line:
+
+    -chroot /var/run/qemu/root-<domid>
+
+### Namespaces for unused functionality
+
+'''Descripiton''': Enter QEMU into its own mount & IPC namespaces.

Spelling: Descripiton

+This means that even if other restrictions fail, the process won't be
+able to even name system mount points or exsting non-file-based IPC
+descriptors to attempt to attack them.
+
+'''Implementation''':
+
+In theory this could be done in QEMU (similar to -sandbox, -runas,
+-chroot, and so on), but a patch doing this in QEMU was NAKed
+upstream. They preferred that this was done as a setup step by
+whatever executes QEMU; i.e., have the process which exec's QEMU first
+call:
+
+    unshare(CLONE_NEWNS | CLONE_NEWIPC)
+
+### seccomp filtering
+
+'''Description''': Turn on seccomp filtering to disable syscalls which
+QEMU doesn't need:
+
+'''Implementation''': Enable from the command-line:
+
+    -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=deny,resourcecontrol=deny
+
+`elevateprivileges` is currently required to allow `-runas` to work.
+Removing this requirement would mean making sure that the uid change
+happened before the seccomp2 call, perhaps by changing the uid before
+executing QEMU. (But this would then require other changes to create
+the QMP socket, VNC socket, and so on).
+
+### Basic RLIMITs
+
+'''Description''': A number of limits on the resources that a given
+process / userid is allowed to consume.
+These can limit the ability of a compromised QEMU process to DoS
+domain 0 by exhausting various resources available to it.
+
+'''Implementaiton'''

Spelling: Implementaiton

+
+Limits that can be implemented immediately without much effort:
+ - RLIMIT_FSIZE (file size): 256KiB
+
+Probably not necessary but why not:
+ - RLIMIT_CORE: 0
+ - RLIMIT_MSGQUEUE: 0
+ - RLIMIT_LOCKS: 0 XXX Check
+ - RLIMIT_MEMLOCK: 0
+

mlock() is used only when both "realtime" and "mlock" are specified.

+
+### Further RLIMITs
+
+RLIMIT_AS limits the total amount of memory; but this includes the
+virtual memory which QEMU uses as a mapcache. xen-mapcache.c already
+fiddles with this; it would be straightforward to make it *set* the
+rlimit to what it thinks a sensible limit is.
+
+Other things that would take some cleverness / changes to QEMU to
+utilize due to ordering constrants:
+ - RLIMIT_NPROC (after uid changes to a unique uid)
+ - RLIMIT_NOFILES (after all necessary files are opened)
+
+### libxl UID cleanup
+
+'''Description''': Domain IDs are reused, and thus restricted UIDs are
+reused. If a compromised QEMU can fork (due to seccomp or
+RLIMIT_NPROC limits being ineffective for some reason), it may avoid
+being killed when its domain dies, then wait until the domain ID is
+reused again, at which point it will have control over the domain in
+question (which probably belongs to someone else).
+
+libxl should kill all UIDs associated with a domain both when the VM
+is destroyed, and before starting a VM with the same UID.
+
+'''Implementation''': Needs to be researched; it's difficult to do in
+a way that's not racy (e.g., we can't simply look at all processes,
+find the pids corresponding to uids, and then kill those, as a
+continually forking process could (potentially) elude this process.
+Rumor has it there's a "kill all processes with my UID" system call,
+or something of that nature.
+
+kill(-1,sig) sends a signal to "every process to which the calling
+process has permission to send a signal". So in theory:
+  setuid(X)
+  kill(-1,KILL)
+should do the trick.
+
+### Disks
+
+The chroot (and seccomp?) happens late enough such that QEMU can
+initialize itself and open its disks. If you want to add a disk at run
+time via or insert a CD, you can't pass a path because QEMU is
+chrooted. Instead use the add-fd QMP command and use
+/dev/fdset/<fdset-id> as the path.
+
+A further layer of restriction could be to set RLIMIT_NOFILES to '0',
+and hand all disks over QMP.
+
+## Migration
+
+When calling xen-save-devices-state, since QEMU is running in a chroot
+it is not useful to pass a filename (it doesn't even have write access
+inside the chroot). Instead, give it an open fd using the add-fd
+mechanism.
+
+### Network namespacing
+
+Enter QEMU into its own network namespace (in addition to mount & IPC
+namespaces). Basically change the 'unshare' call to be as follows:
+
+    unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC)

It might be clearer if this was merged with the other Namespacing
section or at least put immediately afterwards.

+
+### Network
+If QEMU runs in its own network namespace, it can't open the tap
+device itself because the interface won't be visible outside of its
+own namespace. So instead, have the toolstack open the device and pass
+it as an fd on the command-line:
-2) a user named "xen-qemuuser-shared"
-As a fall back if both 1) fails, libxl will use a single user for
-all QEMU instances. The user is named xen-qemuuser-shared. This is
-less secure but still better than running QEMU as root. Using this is as
-simple as creating just one more user on your host:
+    -device rtl8139,netdev=tapnet0,mac=... -netdev tap,id=tapnet0,fd=<tapfd>
-adduser --no-create-home --system xen-qemuuser-shared
+### VNC
+If QEMU runs in its own network namespace, it is not straightforward
+to listen on a TCP socket outside of its own network namespace.
+One option would be to use VNC over a UNIX socket:
-3) root
-As a last resort, libxl will start QEMU as root.
+    -vnc unix:/var/run/xen/vnc-<domid>
+However, this would break functionality in the general case; I think
+we need to have the toolstack open a socket and pass the fd to QEMU
+(which requires changes to QEMU).
-Please note that running QEMU as non-root causes several features like
-migration and PCI passthrough to not work properly and may prevent the guest
-from booting.

Although there are still a lot of todos, this looks generally good and is
a big improvement on the previous document.

Cheers,
--
Ross Lagerwall
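
On the "libxl UID cleanup" section: the setuid(X) followed by kill(-1,KILL)
idea could be sketched roughly as below. This is only an illustration in
Python, not what libxl does. The function name `kill_all_with_uid` and the
injectable `setuid` / `kill` parameters are made up here so that the flow can
be exercised without root. It assumes Linux behaviour, where kill(-1, sig)
does not signal the calling process itself.

```python
import os
import signal


def kill_all_with_uid(uid, setuid=os.setuid, kill=os.kill):
    """Fork a helper that becomes `uid` and signals everything it may signal.

    Returns True if the helper exited cleanly, False otherwise.
    `setuid` and `kill` are injectable purely for illustration and testing;
    they are hypothetical knobs, not part of any Xen or libxl interface.
    """
    if uid == 0:
        # kill(-1) as root would take down (nearly) the whole system.
        raise ValueError("refusing to run kill(-1) as root")

    pid = os.fork()
    if pid == 0:
        # Child: drop to the target uid first. If that fails we must not
        # reach kill(-1), which would otherwise run with the old identity.
        status = 1
        try:
            setuid(uid)
            try:
                kill(-1, signal.SIGKILL)
            except ProcessLookupError:
                pass  # no other process could be signalled
            status = 0
        finally:
            # Leave immediately, without running the parent's cleanup code.
            os._exit(status)

    # Parent: wait for the helper and report whether it succeeded.
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
```

Doing the uid change in a forked helper keeps the caller's own identity
untouched, and a failed setuid() ends the helper before any signal is sent.
Whether this closes the fork race described in the document is exactly the
part that still needs to be researched.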