[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [RFC 2/4] HVM x86 deprivileged mode: Create deprivileged page tables



On 07/08/15 14:19, Ben Catterall wrote:
> On 06/08/15 20:52, Andrew Cooper wrote:
>> On 06/08/15 17:45, Ben Catterall wrote:
>>> The paging structure mappings for the deprivileged mode are added
>>> to the monitor page table for HVM guests. The entries are generated by
>>> walking the page tables and mapping in new pages. If a higher-level
>>> page table
>>> mapping exists, we attempt to traverse all the way down to a leaf
>>> page and add
>>> the page there. Access bits are flipped as needed.
>>>
>>> The page entries are generated for deprivileged .text, .data and a
>>> stack. The
>>> data is copied from sections allocated by the linker. The mappings
>>> are setup in
>>> an unused portion of the Xen virtual address space. The pages are
>>> mapped in as
>>> user mode accessible, with NX bits set for the data and stack
>>> regions and the
>>> code region is set to be executable and read-only.
>>>
>>> The needed pages are allocated on the HAP page heap and are
>>> deallocated when
>>> those heap pages are deallocated (on domain destruction).
>>>
>>> Signed-off-by: Ben Catterall <Ben.Catterall@xxxxxxxxxx>
>>> ---
>>>   xen/arch/x86/hvm/Makefile          |   1 +
>>>   xen/arch/x86/hvm/deprivileged.c    | 441
>>> +++++++++++++++++++++++++++++++++++++
>>>   xen/arch/x86/mm/hap/hap.c          |  10 +-
>>>   xen/arch/x86/xen.lds.S             |  19 ++
>>>   xen/include/asm-x86/config.h       |  10 +-
>>>   xen/include/xen/hvm/deprivileged.h |  94 ++++++++
>>>   6 files changed, 573 insertions(+), 2 deletions(-)
>>>   create mode 100644 xen/arch/x86/hvm/deprivileged.c
>>>   create mode 100644 xen/include/xen/hvm/deprivileged.h
>>>
>>> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
>>> index 794e793..bd83ba3 100644
>>> --- a/xen/arch/x86/hvm/Makefile
>>> +++ b/xen/arch/x86/hvm/Makefile
>>> @@ -16,6 +16,7 @@ obj-y += pmtimer.o
>>>   obj-y += quirks.o
>>>   obj-y += rtc.o
>>>   obj-y += save.o
>>> +obj-y += deprivileged.o
>>
>> We attempt to keep objects linked in alphabetical order, where possible.
>>
> apologies
>>>   obj-y += stdvga.o
>>>   obj-y += vioapic.o
>>>   obj-y += viridian.o
>>> diff --git a/xen/arch/x86/hvm/deprivileged.c
>>> b/xen/arch/x86/hvm/deprivileged.c
>>> new file mode 100644
>>> index 0000000..071d900
>>> --- /dev/null
>>> +++ b/xen/arch/x86/hvm/deprivileged.c
>>> @@ -0,0 +1,441 @@
>>> +/*
>>> + * HVM deprivileged mode to provide support for running operations in
>>> + * user mode from Xen
>>> + */
>>> +#include <xen/lib.h>
>>> +#include <xen/mm.h>
>>> +#include <xen/domain_page.h>
>>> +#include <xen/config.h>
>>> +#include <xen/types.h>
>>> +#include <xen/sched.h>
>>> +#include <asm/paging.h>
>>> +#include <xen/compiler.h>
>>> +#include <asm/hap.h>
>>> +#include <asm/paging.h>
>>> +#include <asm-x86/page.h>
>>> +#include <public/domctl.h>
>>> +#include <xen/domain_page.h>
>>> +#include <asm/hvm/vmx/vmx.h>
>>> +#include <xen/hvm/deprivileged.h>
>>> +
>>> +void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e)
>>> +{
>>> +    int ret;
>>> +    void *p;
>>> +    unsigned long int size;
>>> +
>>> +    /* Copy the .text segment for deprivileged code */
>>
>> Why do you need to copy the deprivileged text section at all?  It is
>> read only and completely common.  All you should need to do is make
>> userspace mappings to it where it resides in the Xen linked area.
>>
> tbh: At the time, I was unsure if it could open a hole if, when Xen
> switched over to superpages, and the deprivileged code was not aligned
> on a 4KB page boundary, then mapping it in might prove problematic.
> Mainly as we'd map in some of Xen's ring 0 code by accident, which
> could prove useful to an attacker as they could then call this from
> userspace.

The monitor tables are only used in the context of HVM guests.  PV
guests run in their own pagetables (including a subset of the idle
tables), but have no chance of accidentally gaining these mappings. (A
64bit PV kernel will get legitimately irate if part of it was replaced
with some hvm depriv code.)

Creating user mappings of Xen's .text section does allow a PV guest to
execute the code, but only at the current privilege of the PV guest. 
Therefore, while a bug, doesn't present an increased attack surface. 
(It might be interesting as a side channel to analyse for RoP purposes,
but not as a direct attack surface.)

>
> Though, now I've thought about it more, I can ensure in the linker
> that it's aligned properly  and I think map_pages_to_xen() will do
> what I need.

Playing with alignment is easy.

>
>>
>>> +           (unsigned long int)&__hvm_deprivileged_text_start;
>>> +
>>> +    ret = hvm_deprivileged_copy_pages(d, l4e,
>>> +                              (unsigned long
>>> int)&__hvm_deprivileged_text_start,
>>> +                              (unsigned long
>>> int)HVM_DEPRIVILEGED_TEXT_ADDR,
>>> +                              size, 0 /* No write */);
>>> +
>>> +    if( ret )
>>> +    {
>>> +        printk("HVM: Error when initialising depriv .text. Code:
>>> %d", ret);
>>> +        domain_crash(d);
>>> +        return;
>>> +    }
>>> +
>>> +    /* Copy the .data segment for ring3 code */
>>> +    size = (unsigned long int)&__hvm_deprivileged_data_end -
>>> +           (unsigned long int)&__hvm_deprivileged_data_start;
>>> +
>>> +    ret = hvm_deprivileged_copy_pages(d, l4e,
>>> +                              (unsigned long
>>> int)&__hvm_deprivileged_data_start,
>>> +                              (unsigned long
>>> int)HVM_DEPRIVILEGED_DATA_ADDR,
>>> +                              size, _PAGE_NX | _PAGE_RW);
>>
>> What do you expect to end up in a data section like this?
>>
> It's for passing in configuration data at the start when we run the
> device (reduce the number of system calls needed) and for local data
> which the emulated devices may need. It can also be used to put the
> results of any system calls that the deprivileged code may call.

You can achieve the same thing by making the "stacks" larger and
segmenting them.  The API could pass a pointer to the data in one of the
registers, and it just happens to be sheer coincidence that the "stack"
and "data area" happen to be adjacent.

>
>>> +}
>>> +
>>> +/* Create a copy of the data at the specified virtual address.
>>> + * When writing to the source, it will walk the page table
>>> hierarchy, creating
>>> + * new levels as needed.
>>> + *
>>> + * If we find a leaf entry in a page table (one which holds the
>>> + * mfn of a 4KB, 2MB, etc. page frame) which has already been
>>> + * mapped in, then we bail as we have a collision and this likely
>>> + * means a bug or the memory configuration has been changed.
>>> + *
>>> + * Pages have PAGE_USER, PAGE_GLOBAL (if supported) and
>>> PAGE_PRESENT set by
>>> + * default. The extra l1_flags are used for extra control e.g.
>>> PAGE_RW.
>>> + * The PAGE_RW flag will be enabled for all page structure entries
>>> + * above the leaf page if that leaf page has PAGE_RW set. This is
>>> needed to
>>> + * permit the writes on the leaf pages. See the Intel manual 3A
>>> section 4.6.
>>> + *
>>> + * TODO: We proceed down to L1 4KB pages and then map these in. We
>>> should
>>> + * stop the recursion on L3/L2 for a 1GB or 2MB page which would
>>> mean faster
>>> + * page access. When we stop would depend on size (e.g. use 2MB
>>> pages for a
>>> + * few MBs)
>>> + */
>>> +int hvm_deprivileged_copy_pages(struct domain *d,
>>> +                            l4_pgentry_t *l4t_base,
>>> +                            unsigned long int src_start,
>>> +                            unsigned long int dst_start,
>>> +                            unsigned long int size,
>>> +                            unsigned int l1_flags)
>>
>> <large snip>
>>
>> Does map_pages_to_xen() not do everything you need here?  It certainly
>> should be fine for the .text section, and will keep the idle and monitor
>> tables up to date for you.
>>
> As it updates the idle tables I haven't used it. From my
> understanding: those idle tables are also used by the PV guests when
> the guest is created as the idle table is copied from FIRST_XEN_SLOT
> up to the end of the table. In config.h, the region where I'm mapping
> in my new mode has been allocated for PV guest-defined usage. I can
> edit init_guest_l4_table so that it doesn't copy this last slot
> across. I don't know if this would have further repercussions.
>
> What would you suggest? Thanks!

The idle pagetables are Xens primary set of pagetables.  Globally common
stuff such the depriv .text should reside in them.

PV guest pagetables are constructed by memcpy()ing certain slots from
the idle pagetables, the vast majority of the virtual address space is
owned by the guest, so is untouched by Xen.

In this case, your new depriv extensions live in an area which unused
for HVM monitor tables, but would belong to a PV guest kernel.


However, map_pages_to_xen() isn't quite appropriate for making a set of
Xen pagetables for mappings to be hooked into monitor tables, so you are
going to have to keep the current allocation functions.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.