
Re: [Xen-devel] [PATCH] x86: Meltdown band-aid against malicious 64-bit PV guests



>>> On 13.01.18 at 18:48, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 12/01/18 10:19, Jan Beulich wrote:
>> This is a very simplistic change limiting the amount of memory a running
>> 64-bit PV guest has mapped (and hence available for attacking): Only the
>> mappings of stack, IDT, and TSS are being cloned from the direct map
>> into per-CPU page tables. Guest controlled parts of the page tables are
>> being copied into those per-CPU page tables upon entry into the guest.
>> Cross-vCPU synchronization of top level page table entry changes is
>> being effected by forcing other active vCPU-s of the guest into the
>> hypervisor.
>>
>> The change to context_switch() isn't strictly necessary, but there's no
>> reason to keep switching page tables once a PV guest is being scheduled
>> out.
>>
>> There is certainly much room for improvement, especially of performance,
>> here - first and foremost suppressing all the negative effects on AMD
>> systems. But in the interest of backportability (including to really old
>> hypervisors, which may not even have alternative patching) any such
>> improvements are being left out here.
>>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> 
> I could do with the answer to my question at the end before completing a
> review.  In the meantime, some observations.
> 
>> ---
>> TBD: Is forcing an event check interrupt for synchronization purposes
>> enough? It may be necessary to actually wait for remote vCPU-s to have
>> entered the hypervisor, in which case a function-call-IPI should be
>> sent, with an empty handler (a flush-IPI with zero operation mask would
>> also do). Otoh, if the vCPU isn't already in hypervisor context,
>> delivery of the IPI should be almost instantaneous (as interrupts are
>> always enabled while in guest mode).
> 
> From a vcpu consistency point of view, once the hypercall making this
> change returns, no other vcpus should have executed an instruction with
> a stale view of the L4.
> 
> Therefore, I think you need to wait until the IPI has at least called
> into hypervisor context before releasing the current vcpu, safe in the
> knowledge that the update will be picked up on the way back out.

Okay, I'll switch to an empty-mask flush IPI then, unless you strongly
think the call-func one would be better.
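
Either way, the synchronous behaviour you're asking for would look roughly
like the sketch below - shown with the function-call variant from the TBD
note, since its waiting semantics are explicit (handler name hypothetical;
the empty-mask flush IPI would serve the same purpose):

    static void l4_sync_noop(void *unused)
    {
        /* Nothing to do: entering the hypervisor is the synchronization. */
    }

    ...
        if ( !cpumask_empty(mask) )
            /* wait == 1: don't return before remote CPUs ran the handler. */
            on_selected_cpus(mask, l4_sync_noop, NULL, 1);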

>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -3683,6 +3683,20 @@ long do_mmu_update(
>>                          break;
>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> +                    if ( !rc )
> 
> Perhaps && (d->max_vcpus > 1) ?

Could do, but the removal of the CPU we're running on (below) will
take care of avoiding the IPI anyway. I would prefer to not put in
redundant conditions, but if you strongly feel about this, I can add
it.

>> +                    {
>> +                        /*
>> +                         * Force other vCPU-s of the affected guest to pick up
>> +                         * the change (if any).
>> +                         */
>> +                        unsigned int cpu = smp_processor_id();
>> +                        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
>> +
>> +                        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask,
>> +                                       cpumask_of(cpu));
> 
> cpumask_copy() and __clear_bit(, cpu) is probably faster?

I'm not sure, and hence I'd prefer to keep it this way. But again,
if you feel strongly, I can make the change.
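
For completeness, the suggested alternative would be (sketch, operating on
the scratch mask's raw bitmap):

    cpumask_copy(mask, pt_owner->domain_dirty_cpumask);
    __clear_bit(cpu, mask->bits);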

>> +                        if ( !cpumask_empty(mask) )
>> +                            smp_send_event_check_mask(mask);
>> +                    }
> 
> In terms of performance, if this shadowing/sync algorithm is correct,
> then it would be better to defer the IPI until after the update loop. 
> We only need to force other vcpus once per mmu_update hypercall if there is
> an L4 update, rather than for each L4 update in the batch.

Oh, that's a very good point. Let me do that right away (together
with switching the IPI kind).
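
Roughly like this, I suppose (flag name hypothetical; the IPI at the end
would become the synchronous one discussed above):

    bool sync_guest = false;

    /* ... inside the update loop ... */
                    rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                      cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                    if ( !rc )
                        sync_guest = true;

    /* ... after the loop, one IPI covers all L4 updates in the batch ... */
    if ( sync_guest )
    {
        unsigned int cpu = smp_processor_id();
        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);

        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask, cpumask_of(cpu));
        if ( !cpumask_empty(mask) )
            smp_send_event_check_mask(mask); /* or the synchronous IPI above */
    }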

>> @@ -633,6 +636,181 @@ void cpu_exit_clear(unsigned int cpu)
>>      set_cpu_state(CPU_STATE_DEAD);
>>  }
>>  
>> +static bool clone_mapping(const void *ptr, root_pgentry_t *rpt)
> 
> Could we introduce these functions with ints and use -ENOMEM?

No problem.
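
I.e. something along these lines (sketch of the signature change only;
error propagation at the call sites adjusted accordingly):

    static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
    {
        /* ... */
        if ( linear < DIRECTMAP_VIRT_START )
            return 0;
        /* ... */
        l3t = alloc_xen_pagetable();
        if ( !l3t )
            return -ENOMEM;
        /* ... the other allocation failures likewise return -ENOMEM ... */
        return 0;
    }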

>> +{
>> +    unsigned long linear = (unsigned long)ptr, pfn;
>> +    unsigned int flags;
>> +    l3_pgentry_t *l3t = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]);
>> +    l2_pgentry_t *l2t;
>> +    l1_pgentry_t *l1t;
>> +
>> +    if ( linear < DIRECTMAP_VIRT_START )
>> +        return true;
>> +
>> +    flags = l3e_get_flags(l3t[l3_table_offset(linear)]);
>> +    ASSERT(flags & _PAGE_PRESENT);
>> +    if ( flags & _PAGE_PSE )
>> +    {
>> +        pfn = (l3e_get_pfn(l3t[l3_table_offset(linear)]) &
>> +               ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
>> +              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));
> 
> This logic would be easier to read by having an extra
> 
> l3_pgentry_t *l3e = &l3t[l3_table_offset(linear)];
> 
> broken out.  Conversely, I can't think of a cleaner way to express the
> pfn calculation, despite the fact it is very complicated.

Will do.
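
For the 1GB-superpage path that would come out roughly as (pointer name as
you suggested):

    l3_pgentry_t *l3e = &l3t[l3_table_offset(linear)];

    flags = l3e_get_flags(*l3e);
    ASSERT(flags & _PAGE_PRESENT);
    if ( flags & _PAGE_PSE )
    {
        /* Base frame of the 1GB mapping, plus the 4k page's offset in it. */
        pfn = (l3e_get_pfn(*l3e) & ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));
        flags &= ~_PAGE_PSE;
    }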

>> +        flags &= ~_PAGE_PSE;
> 
> I presume we don't care for shuffling caching attributes?  This should
> only really be called on WB memory.

None of the mappings the function is called for are anything other than
WB, and no domain (not even Dom0) can gain access to these pages in
order to fiddle with their cache attributes. To avoid making the code
less readable I've preferred to leave out the respective ASSERT()s.
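
For the record, the kind of check left out would be something like the
following (the exact set of cacheability flags to test is an assumption):

    ASSERT(!(flags & (_PAGE_PWT | _PAGE_PCD)));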

>> +    }
>> +    else
>> +    {
>> +        l2t = l3e_to_l2e(l3t[l3_table_offset(linear)]);
>> +        flags = l2e_get_flags(l2t[l2_table_offset(linear)]);
>> +        ASSERT(flags & _PAGE_PRESENT);
>> +        if ( flags & _PAGE_PSE )
>> +        {
>> +            pfn = (l2e_get_pfn(l2t[l2_table_offset(linear)]) &
>> +                   ~((1UL << PAGETABLE_ORDER) - 1)) |
>> +                  (PFN_DOWN(linear) & ((1UL << PAGETABLE_ORDER) - 1));
>> +            flags &= ~_PAGE_PSE;
>> +        }
>> +        else
>> +        {
>> +            l1t = l2e_to_l1e(l2t[l2_table_offset(linear)]);
>> +            flags = l1e_get_flags(l1t[l1_table_offset(linear)]);
>> +            if ( !(flags & _PAGE_PRESENT) )
>> +                return true;
>> +            pfn = l1e_get_pfn(l1t[l1_table_offset(linear)]);
>> +        }
>> +    }
>> +
>> +    if ( !(root_get_flags(rpt[root_table_offset(linear)]) & _PAGE_PRESENT) )
>> +    {
>> +        l3t = alloc_xen_pagetable();
>> +        if ( !l3t )
>> +            return false;
>> +        clear_page(l3t);
>> +        l4e_write(&rpt[root_table_offset(linear)],
>> +                  l4e_from_paddr(__pa(l3t), __PAGE_HYPERVISOR));
>> +    }
>> +    else
>> +        l3t = l4e_to_l3e(rpt[root_table_offset(linear)]);
>> +
>> +    if ( !(l3e_get_flags(l3t[l3_table_offset(linear)]) & _PAGE_PRESENT) )
>> +    {
>> +        l2t = alloc_xen_pagetable();
>> +        if ( !l2t )
>> +            return false;
>> +        clear_page(l2t);
>> +        l3e_write(&l3t[l3_table_offset(linear)],
>> +                  l3e_from_paddr(__pa(l2t), __PAGE_HYPERVISOR));
>> +    }
>> +    else
>> +    {
>> +        ASSERT(!(l3e_get_flags(l3t[l3_table_offset(linear)]) & _PAGE_PSE));
>> +        l2t = l3e_to_l2e(l3t[l3_table_offset(linear)]);
>> +    }
>> +
>> +    if ( !(l2e_get_flags(l2t[l2_table_offset(linear)]) & _PAGE_PRESENT) )
>> +    {
>> +        l1t = alloc_xen_pagetable();
>> +        if ( !l1t )
>> +            return false;
>> +        clear_page(l1t);
>> +        l2e_write(&l2t[l2_table_offset(linear)],
>> +                  l2e_from_paddr(__pa(l1t), __PAGE_HYPERVISOR));
>> +    }
>> +    else
>> +    {
>> +        ASSERT(!(l2e_get_flags(l2t[l2_table_offset(linear)]) & _PAGE_PSE));
>> +        l1t = l2e_to_l1e(l2t[l2_table_offset(linear)]);
>> +    }
>> +
>> +    if ( l1e_get_flags(l1t[l1_table_offset(linear)]) & _PAGE_PRESENT )
>> +    {
>> +        ASSERT(l1e_get_pfn(l1t[l1_table_offset(linear)]) == pfn);
>> +        ASSERT(l1e_get_flags(l1t[l1_table_offset(linear)]) == flags);
> 
> Calculate l1e_from_pfn(pfn, flags) first and ASSERT() that the full PTE
> matches?

If this wasn't about ASSERT()s, I'd probably agree. It being debugging
code only, I think there's an advantage to having two separate
ASSERT()s - you'll know at first glance whether it was the PFN or the
flags that were wrong.
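
(The combined form, for comparison, would be a single check along these
lines:)

    ASSERT(l1e_get_intpte(l1t[l1_table_offset(linear)]) ==
           l1e_get_intpte(l1e_from_pfn(pfn, flags)));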

>> +    }
>> +    else
>> +        l1e_write(&l1t[l1_table_offset(linear)], l1e_from_pfn(pfn, flags));
>> +
>> +    return true;
>> +}
>> +
>> +DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
>> +
>> +static bool setup_cpu_root_pgt(unsigned int cpu)
>> +{
>> +    root_pgentry_t *rpt = alloc_xen_pagetable();
> 
> As an observation, alloc_xen_pagetable() should zero internally.  There
> is no circumstance under which we want to forget the clear_page().

Indeed I had noticed this too, and was planning to create a follow-up
patch to that effect.
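
In the meantime a trivial wrapper would already avoid repeating the
clear_page() at the call sites (name hypothetical):

    static void *alloc_zeroed_xen_pagetable(void)
    {
        void *pg = alloc_xen_pagetable();

        if ( pg )
            clear_page(pg);

        return pg;
    }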

> Another issue which I attempted to address in my series is that
> alloc_xen_pagetable() isn't NUMA-local.  It's not worth adjusting for
> backports, but it is something we should consider moving forwards.

I agree.

>> +    unsigned int off;
>> +
>> +    if ( !rpt )
>> +        return false;
>> +
>> +    clear_page(rpt);
>> +    per_cpu(root_pgt, cpu) = rpt;
>> +
>> +    rpt[root_table_offset(RO_MPT_VIRT_START)] =
>> +        idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
>> +    /* SH_LINEAR_PT inserted together with guest mappings. */
>> +    /* PERDOMAIN inserted during context switch. */
>> +    rpt[root_table_offset(XEN_VIRT_START)] =
>> +        idle_pg_table[root_table_offset(XEN_VIRT_START)];
>> +
>> +    /* Install direct map page table entries for stack, IDT, and TSS. */
>> +    for ( off = 0; off < STACK_SIZE; off += PAGE_SIZE )
>> +        if ( !clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt) )
>> +            break;
>> +
>> +    return off == STACK_SIZE &&
>> +           clone_mapping(idt_tables[cpu], rpt) &&
>> +           clone_mapping(&per_cpu(init_tss, cpu), rpt);
> 
> Can we put an outer set of brackets in, so editors retain the
> indentation like this?

Grumble, grumble - well, if you really want me to. If I saw
outer parentheses here when reviewing someone else's patch,
I'd comment exactly the other way around.
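
For clarity, the requested form being:

    return ( off == STACK_SIZE &&
             clone_mapping(idt_tables[cpu], rpt) &&
             clone_mapping(&per_cpu(init_tss, cpu), rpt) );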

>> --- a/xen/include/asm-x86/current.h
>> +++ b/xen/include/asm-x86/current.h
>> @@ -41,6 +41,8 @@ struct cpu_info {
>>      struct vcpu *current_vcpu;
>>      unsigned long per_cpu_offset;
>>      unsigned long cr4;
>> +    unsigned long xen_cr3;
>> +    unsigned long pv_cr3;
> 
> These definitely need more description of how they work.

Will do.

> As far as I've reverse engineered, pv_cr3 is the paddr_t for the per-pCPU
> root pagetable, and is static after a cpu is up and operational.

It's intentionally _not_ a paddr_t, but a raw CR3 value. As you'll
note, both low and high bits are being stripped off of it before
using it for the copying.

> xen_cr3, if not 0, is a cr3 to restore on the way into the hypervisor?

With the caveat of it possibly being negative, which tells the
restore-to-Xen path to avoid restoring, while allowing entry
paths to still know what value to restore (the negated one
then). As agreed above, I'll try to put in a sensible comment.
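Roughly along these lines, I'd think (wording to be refined in the actual
patch):

    unsigned long cr4;
    /*
     * pv_cr3:  Raw CR3 value (not a paddr_t) referencing this CPU's
     *          per-CPU root page table; established once the CPU is up
     *          and then left alone. Its low and high control bits get
     *          stripped when the value is used to locate the page table
     *          for copying in the guest's L4 entries.
     * xen_cr3: If non-zero, the CR3 to switch to on the next entry into
     *          the hypervisor. A negative value records the same
     *          information while telling the restore-to-Xen path not to
     *          perform the switch; entry paths negate it to recover the
     *          value to load.
     */
    unsigned long xen_cr3;
    unsigned long pv_cr3;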

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

