[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] xen/arm: Convert runstate address during hypcall



On Fri, 31 Jul 2020, Bertrand Marquis wrote:
> Sorry missed some points in my previous answer.
> 
> > On 30 Jul 2020, at 22:50, Julien Grall <julien@xxxxxxx> wrote:
> > 
> > Hi Bertrand,
> > 
> > To avoid extra work on your side, I would recommend to wait a bit before 
> > sending a new version. It would be good to at least settle the conversation 
> > in v2 regarding the approach taken.
> > 
> > On 30/07/2020 11:24, Bertrand Marquis wrote:
> >> At the moment on Arm, a Linux guest running with KTPI enabled will
> >> cause the following error when a context switch happens in user mode:
> >> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0
> >> The error is caused by the virtual address for the runstate area
> >> registered by the guest only being accessible when the guest is running
> >> in kernel space when KPTI is enabled.
> >> To solve this issue, this patch is doing the translation from virtual
> >> address to physical address during the hypercall and mapping the
> >> required pages using vmap. This is removing the conversion from virtual
> >> to physical address during the context switch which is solving the
> >> problem with KPTI.
> > 
> > To echo what Jan said on the previous version, this is a change in a stable 
> > ABI and therefore may break existing guest. FAOD, I agree in principle with 
> > the idea. However, we want to explain why breaking the ABI is the *only* 
> > viable solution.
> > 
> > From my understanding, it is not possible to fix without an ABI breakage 
> > because the hypervisor doesn't know when the guest will switch back from 
> > userspace to kernel space. The risk is the information provided by the 
> > runstate wouldn't contain accurate information and could affect how the 
> > guest handle stolen time.
> > 
> > Additionally there are a few issues with the current interface:
> >   1) It is assuming the virtual address cannot be re-used by the userspace. 
> > Thanksfully Linux have a split address space. But this may change with KPTI 
> > in place.
> >   2) When update the page-tables, the guest has to go through an invalid 
> > mapping. So the translation may fail at any point.
> > 
> > IOW, the existing interface can lead to random memory corruption and 
> > inacurracy of the stolen time.
> > 
> >> This is done only on arm architecture, the behaviour on x86 is not
> >> modified by this patch and the address conversion is done as before
> >> during each context switch.
> >> This is introducing several limitations in comparison to the previous
> >> behaviour (on arm only):
> >> - if the guest is remapping the area at a different physical address Xen
> >> will continue to update the area at the previous physical address. As
> >> the area is in kernel space and usually defined as a global variable this
> >> is something which is believed not to happen. If this is required by a
> >> guest, it will have to call the hypercall with the new area (even if it
> >> is at the same virtual address).
> >> - the area needs to be mapped during the hypercall. For the same reasons
> >> as for the previous case, even if the area is registered for a different
> >> vcpu. It is believed that registering an area using a virtual address
> >> unmapped is not something done.
> > 
> > This is not clear whether the virtual address refer to the current vCPU or 
> > the vCPU you register the runstate for. From the past discussion, I think 
> > you refer to the former. It would be good to clarify.
> > 
> > Additionally, all the new restrictions should be documented in the public 
> > interface. So an OS developper can find the differences between the 
> > architectures.
> > 
> > To answer Jan's concern, we certainly don't know all the guest OSes 
> > existing, however we also need to balance the benefit for a large majority 
> > of the users.
> > 
> > From previous discussion, the current approach was deemed to be acceptable 
> > on Arm and, AFAICT, also x86 (see [1]).
> > 
> > TBH, I would rather see the approach to be common. For that, we would an 
> > agreement from Andrew and Jan in the approach here. Meanwhile, I think this 
> > is the best approach to address the concern from Arm users.
> > 
> >> inline functions in headers could not be used as the architecture
> >> domain.h is included before the global domain.h making it impossible
> >> to use the struct vcpu inside the architecture header.
> >> This should not have any performance impact as the hypercall is only
> >> called once per vcpu usually.
> >> Signed-off-by: Bertrand Marquis <bertrand.marquis@xxxxxxx>
> >> ---
> >>   Changes in v2
> >>     - use vmap to map the pages during the hypercall.
> >>     - reintroduce initial copy during hypercall.
> >>   Changes in v3
> >>     - Fix Coding style
> >>     - Fix vaddr printing on arm32
> >>     - use write_atomic to modify state_entry_time update bit (only
> >>     in guest structure as the bit is not used inside Xen copy)
> >> ---
> >>  xen/arch/arm/domain.c        | 161 ++++++++++++++++++++++++++++++-----
> >>  xen/arch/x86/domain.c        |  29 ++++++-
> >>  xen/arch/x86/x86_64/domain.c |   4 +-
> >>  xen/common/domain.c          |  19 ++---
> >>  xen/include/asm-arm/domain.h |   9 ++
> >>  xen/include/asm-x86/domain.h |  16 ++++
> >>  xen/include/xen/domain.h     |   5 ++
> >>  xen/include/xen/sched.h      |  16 +---
> >>  8 files changed, 206 insertions(+), 53 deletions(-)
> >> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> >> index 31169326b2..8b36946017 100644
> >> --- a/xen/arch/arm/domain.c
> >> +++ b/xen/arch/arm/domain.c
> >> @@ -19,6 +19,7 @@
> >>  #include <xen/sched.h>
> >>  #include <xen/softirq.h>
> >>  #include <xen/wait.h>
> >> +#include <xen/vmap.h>
> >>    #include <asm/alternative.h>
> >>  #include <asm/cpuerrata.h>
> >> @@ -275,36 +276,156 @@ static void ctxt_switch_to(struct vcpu *n)
> >>      virt_timer_restore(n);
> >>  }
> >>  -/* Update per-VCPU guest runstate shared memory area (if registered). */
> >> -static void update_runstate_area(struct vcpu *v)
> >> +static void cleanup_runstate_vcpu_locked(struct vcpu *v)
> >>  {
> >> -    void __user *guest_handle = NULL;
> >> +    if ( v->arch.runstate_guest )
> >> +    {
> >> +        vunmap((void *)((unsigned long)v->arch.runstate_guest & 
> >> PAGE_MASK));
> >> +
> >> +        put_page(v->arch.runstate_guest_page[0]);
> >> +
> >> +        if ( v->arch.runstate_guest_page[1] )
> >> +            put_page(v->arch.runstate_guest_page[1]);
> >> +
> >> +        v->arch.runstate_guest = NULL;
> >> +    }
> >> +}
> >> +
> >> +void arch_vcpu_cleanup_runstate(struct vcpu *v)
> >> +{
> >> +    spin_lock(&v->arch.runstate_guest_lock);
> >> +
> >> +    cleanup_runstate_vcpu_locked(v);
> >> +
> >> +    spin_unlock(&v->arch.runstate_guest_lock);
> >> +}
> >> +
> >> +static int setup_runstate_vcpu_locked(struct vcpu *v, vaddr_t vaddr)
> >> +{
> >> +    unsigned int offset;
> >> +    mfn_t mfn[2];
> >> +    struct page_info *page;
> >> +    unsigned int numpages;
> >>      struct vcpu_runstate_info runstate;
> >> +    void *p;
> >>  -    if ( guest_handle_is_null(runstate_guest(v)) )
> >> -        return;
> >> +    /* user can pass a NULL address to unregister a previous area */
> >> +    if ( vaddr == 0 )
> >> +        return 0;
> >> +
> >> +    offset = vaddr & ~PAGE_MASK;
> >> +
> >> +    /* provided address must be aligned to a 64bit */
> >> +    if ( offset % alignof(struct vcpu_runstate_info) )
> > 
> > This new restriction wants to be explained in the commit message and public 
> > header.
> 
> ok
> 
> > 
> >> +    {
> >> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 
> >> 0x%"PRIvaddr
> >> +                ": Invalid offset\n", vaddr);
> > 
> > We usually enforce 80 character per lines except for format string. So it 
> > is easier to grep them in the code.
> 
> Ok i will fix this one and the following ones.
> But here PRIvaddr would break any attempt to grep something in fact.
> 
> > 
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    page = get_page_from_gva(v, vaddr, GV2M_WRITE);
> >> +    if ( !page )
> >> +    {
> >> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 
> >> 0x%"PRIvaddr
> >> +                ": Page is not mapped\n", vaddr);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    mfn[0] = page_to_mfn(page);
> >> +    v->arch.runstate_guest_page[0] = page;
> >> +
> >> +    if ( offset > (PAGE_SIZE - sizeof(struct vcpu_runstate_info)) )
> >> +    {
> >> +        /* guest area is crossing pages */
> >> +        page = get_page_from_gva(v, vaddr + PAGE_SIZE, GV2M_WRITE);
> >> +        if ( !page )
> >> +        {
> >> +            put_page(v->arch.runstate_guest_page[0]);
> >> +            gprintk(XENLOG_WARNING,
> >> +                    "Cannot map runstate pointer at 0x%"PRIvaddr
> >> +                    ": 2nd Page is not mapped\n", vaddr);
> >> +            return -EINVAL;
> >> +        }
> >> +        mfn[1] = page_to_mfn(page);
> >> +        v->arch.runstate_guest_page[1] = page;
> >> +        numpages = 2;
> >> +    }
> >> +    else
> >> +    {
> >> +        v->arch.runstate_guest_page[1] = NULL;
> >> +        numpages = 1;
> >> +    }
> >>  -    memcpy(&runstate, &v->runstate, sizeof(runstate));
> >> +    p = vmap(mfn, numpages);
> >> +    if ( !p )
> >> +    {
> >> +        put_page(v->arch.runstate_guest_page[0]);
> >> +        if ( numpages == 2 )
> >> +            put_page(v->arch.runstate_guest_page[1]);
> >>  -    if ( VM_ASSIST(v->domain, runstate_update_flag) )
> >> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 
> >> 0x%"PRIvaddr
> >> +                ": vmap error\n", vaddr);
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    v->arch.runstate_guest = p + offset;
> >> +
> >> +    if (v == current)
> >> +        memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));
> >> +    else
> >>      {
> >> -        guest_handle = &v->runstate_guest.p->state_entry_time + 1;
> >> -        guest_handle--;
> >> -        runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
> >> -        __raw_copy_to_guest(guest_handle,
> >> -                            (void *)(&runstate.state_entry_time + 1) - 1, 
> >> 1);
> >> -        smp_wmb();
> >> +        vcpu_runstate_get(v, &runstate);
> >> +        memcpy(v->arch.runstate_guest, &runstate, sizeof(v->runstate));
> >>      }
> >>  -    __copy_to_guest(runstate_guest(v), &runstate, 1);
> >> +    return 0;
> >> +}
> >> +
> >> +int arch_vcpu_setup_runstate(struct vcpu *v,
> >> +                             struct vcpu_register_runstate_memory_area 
> >> area)
> >> +{
> >> +    int rc;
> >> +
> >> +    spin_lock(&v->arch.runstate_guest_lock);
> >> +
> >> +    /* cleanup if we are recalled */
> >> +    cleanup_runstate_vcpu_locked(v);
> >> +
> >> +    rc = setup_runstate_vcpu_locked(v, (vaddr_t)area.addr.v);
> >> +
> >> +    spin_unlock(&v->arch.runstate_guest_lock);
> >>  -    if ( guest_handle )
> >> +    return rc;
> >> +}
> >> +
> >> +
> >> +/* Update per-VCPU guest runstate shared memory area (if registered). */
> >> +static void update_runstate_area(struct vcpu *v)
> >> +{
> >> +    spin_lock(&v->arch.runstate_guest_lock);
> >> +
> >> +    if ( v->arch.runstate_guest )
> >>      {
> >> -        runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
> >> -        smp_wmb();
> >> -        __raw_copy_to_guest(guest_handle,
> >> -                            (void *)(&runstate.state_entry_time + 1) - 1, 
> >> 1);
> >> +        if ( VM_ASSIST(v->domain, runstate_update_flag) )
> >> +        {
> >> +            v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
> >> +            write_atomic(&(v->arch.runstate_guest->state_entry_time),
> >> +                    v->runstate.state_entry_time);
> > 
> > NIT: You want to indent v-> at the same level as the argument from the 
> > first line.
> 
> Ok
> 
> > 
> > Also, I think you are missing a smp_wmb() here.
> 
> The atomic operation itself would not need a barrier.
> I do not see why you think a barrier is needed here.
> For the internal structure ?

We need to make sure the other-end sees the XEN_RUNSTATE_UPDATE change
before other changes. Otherwise, due to cpu reordering, the writes could
be seen in reverse order. (Technically the reader would have to use a
read-barrier but that's a separate topic.)



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.