Xen project Mailing List

Re: [Xen-devel] [PATCH 9/9] libxl: Kill QEMU with "reaper" ruid

To: Ian Jackson <Ian.Jackson@xxxxxxxxxx>

From: George Dunlap <George.Dunlap@xxxxxxxxxx>

Date: Wed, 28 Nov 2018 18:33:31 +0000

Accept-language: en-GB, en-US

Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Wei Liu <wei.liu2@xxxxxxxxxx>

Delivery-date: Wed, 28 Nov 2018 18:34:56 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Thread-index: AQHUg1AZgOk0J7W6/kivG4TEwizLCaVlYOcAgAAZboA=

Thread-topic: [PATCH 9/9] libxl: Kill QEMU with "reaper" ruid

> On Nov 28, 2018, at 5:02 PM, Ian Jackson <ian.jackson@xxxxxxxxxx> wrote: > > George Dunlap writes ("[PATCH 9/9] libxl: Kill QEMU with "reaper" ruid"): >> Using kill(-1) to killing an untrusted dm process with the real uid >> equal to the dm_uid isn't guaranteed to succeed: the process in >> question may be able to kill the reaper process after the setresuid() >> and before the kill(). > ... >> +static int libxl__get_reaper_uid(libxl__gc *gc) >> +{ >> + struct passwd *user_base, user_pwbuf; >> + int ret; >> + ret = userlookup_helper_getpwnam(gc, LIBXL_QEMU_USER_RANGE_BASE, >> + &user_pwbuf, &user_base); > > ret should be r I think. > > Also I think you need to handle errors properly ? Ie set and check > errno. Don’t I want to pass up the errno values set by the getpwnam functions? Although I suppose I should also make sure that the uid I return is not zero... > >> const char *libxl__domain_device_model(libxl__gc *gc, >> const libxl_domain_build_info *info) >> { >> @@ -2719,12 +2728,62 @@ void libxl__destroy_device_model(libxl__egc *egc, > ... >> - ret = setresuid(dm_uid, dm_uid, 0); >> + fd = open(lockfile, O_RDWR|O_CREAT, 0666); >> + if (fd < 0) { >> + /* All other errno: EBADF, EINVAL, ENOLCK, EWOULDBLOCK */ >> + LOGED(ERROR, domid, >> + "unexpected error while trying to open lockfile %s, >> errno=%d", >> + lockfile, errno); >> + goto kill; > > More gotos! I doubt this error handling is right. > > I'm also not convinced that it is sensible to handle the case where we > have multiple per-domain uids but no reaper uid. This turns host > configuration errors into systems that are less secure than they > should be in a really obscure way. At this point, we have a target_uid but no way of getting a lock for reaper_uid. We have three options: 1. Don’t kill(-1) at all. 2. Try to kill(-1) with setresuid(target_uid, target_uid, 0) 3. kill(-1) with setresuid(reaper_uid, target_uid, 0) without holding the lock. #1 means that a rogue qemu will not be destroyed. #2 means that there’s a race, whereby sometimes the rogue qemu is destroyed, and sometimes the reaper process is destroyed by the rogue qemu first. #3 means there’s a race, whereby sometimes everything works fine, sometimes both the rogue qemu and *the reaper process from another domain* is destroyed, and sometimes this reaper process is killed by the reaper process from another domain (leaving the rogue qemu alive). I think #1 is obviously the best option. > >> + /* Try to lock the file */ >> + while (flock(fd, LOCK_EX)) { > > CODING_STYLE, no call inside the condition please. I c&p’d this from libxl_internal.c:libxl__lock_domain_userdata(). I take it you’re referring to the “ERROR HANDLING” section of CODING_STYLE? It wasn’t obvious to me that refers to while() loops. I take it you’d prefer "while(true) { rc = flock(); if(!rc) break; …}” instead? And if I may make a minor suggestion: This is the second time in this series you’ve said “don’t do X” for fairly common code idioms without giving me guidance as to what you’d like to see instead. > Overall I think this stuff needs a different error handling approach: > > * We should distinguish expected and reasonable configurations, where > we fall back to less secure methods, from other unexpected > situations. > > * In other unexpected situations (whether bad host configuration or > syscall errors or whatever) we should make a best effort to > destroy as much as we can. Which is what the code does. > > * But crucially in such situations (i) overall destroy ao should > return a failure error code (ii) the domain itself should not be > destroyed in Xen. This means that `xl destroy' fails, and can be > repeated after the problem is corrected. This means that in such a situation, we might successfully kill the devicemodel, but leave a zombie domain lying around. I suppose that might be the least-bad option, as 1) would be more likely to alert the administrator to fix the configuration error, and 2) the domain would hold the domid, such that any unkilled qemu processes wouldn’t be able to cause mischief on other domains. Probably some of these principles should be in a comment somewhere. -George _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.