[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 9/9] libxl: Kill QEMU with "reaper" ruid

> On Nov 28, 2018, at 5:02 PM, Ian Jackson <ian.jackson@xxxxxxxxxx> wrote:
> George Dunlap writes ("[PATCH 9/9] libxl: Kill QEMU with "reaper" ruid"):
>> Using kill(-1) to killing an untrusted dm process with the real uid
>> equal to the dm_uid isn't guaranteed to succeed: the process in
>> question may be able to kill the reaper process after the setresuid()
>> and before the kill().
> ...
>> +static int libxl__get_reaper_uid(libxl__gc *gc)
>> +{
>> +    struct passwd *user_base, user_pwbuf;
>> +    int ret;
>> +    ret = userlookup_helper_getpwnam(gc, LIBXL_QEMU_USER_RANGE_BASE,
>> +                                         &user_pwbuf, &user_base);
> ret should be r I think.
> Also I think you need to handle errors properly ?  Ie set and check
> errno.

Don’t I want to pass up the errno values set by the getpwnam functions?

Although I suppose I should also make sure that the uid I return is not zero...

>> const char *libxl__domain_device_model(libxl__gc *gc,
>>                                        const libxl_domain_build_info *info)
>> {
>> @@ -2719,12 +2728,62 @@ void libxl__destroy_device_model(libxl__egc *egc,
> ...
>> -        ret = setresuid(dm_uid, dm_uid, 0);
>> +        fd = open(lockfile, O_RDWR|O_CREAT, 0666);
>> +        if (fd < 0) {
>> +            /* All other errno: EBADF, EINVAL, ENOLCK, EWOULDBLOCK */
>> +            LOGED(ERROR, domid,
>> +                  "unexpected error while trying to open lockfile %s, 
>> errno=%d",
>> +                  lockfile, errno);
>> +            goto kill;
> More gotos!  I doubt this error handling is right.
> I'm also not convinced that it is sensible to handle the case where we
> have multiple per-domain uids but no reaper uid.  This turns host
> configuration errors into systems that are less secure than they
> should be in a really obscure way.

At this point, we have a target_uid but no way of getting a lock for 
reaper_uid.  We have three options:

1. Don’t kill(-1) at all.
2. Try to  kill(-1) with setresuid(target_uid, target_uid, 0)
3. kill(-1) with setresuid(reaper_uid, target_uid, 0) without holding the lock.

#1 means that a rogue qemu will not be destroyed.

#2 means that there’s a race, whereby sometimes the rogue qemu is destroyed, 
and sometimes the reaper process is destroyed by the rogue qemu first.

#3 means there’s a race, whereby sometimes everything works fine, sometimes 
both the rogue qemu and *the reaper process from another domain* is destroyed, 
and sometimes this reaper process is killed by the reaper process from another 
domain (leaving the rogue qemu alive).

I think #1 is obviously the best option.

>> +        /* Try to lock the file */
>> +        while (flock(fd, LOCK_EX)) {
> CODING_STYLE, no call inside the condition please.

I c&p’d this from libxl_internal.c:libxl__lock_domain_userdata().

I take it you’re referring to the “ERROR HANDLING” section of CODING_STYLE?  It 
wasn’t obvious to me that refers to while() loops.

I take it you’d prefer "while(true) { rc = flock(); if(!rc) break; …}” instead?

And if I may make a minor suggestion: This is the second time in this series 
you’ve said “don’t do X” for fairly common code idioms without giving me 
guidance as to what you’d like to see instead. 

> Overall I think this stuff needs a different error handling approach:
> * We should distinguish expected and reasonable configurations, where
>   we fall back to less secure methods, from other unexpected
>   situations.
> * In other unexpected situations (whether bad host configuration or
>   syscall errors or whatever) we should make a best effort to
>   destroy as much as we can.

Which is what the code does.

> * But crucially in such situations (i) overall destroy ao should
>   return a failure error code (ii) the domain itself should not be
>   destroyed in Xen.  This means that `xl destroy' fails, and can be
>   repeated after the problem is corrected.

This means that in such a situation, we might successfully kill the 
devicemodel, but leave a zombie domain lying around.

I suppose that might be the least-bad option, as 1) would be more likely to 
alert the administrator to fix the configuration error, and 2) the domain would 
hold the domid, such that any unkilled qemu processes wouldn’t be able to cause 
mischief on other domains.

Probably some of these principles should be in a comment somewhere.


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.