[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] xen/arm: Convert runstate address during hypcall


  • To: Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • From: Bertrand Marquis <Bertrand.Marquis@xxxxxxx>
  • Date: Fri, 14 Aug 2020 09:12:52 +0000
  • Accept-language: en-GB, en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=arm.com; dmarc=pass action=none header.from=arm.com; dkim=pass header.d=arm.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=XNVUvGeqL65h8hoc3SNuFA9F1wP7HAY0b/e9pmSGbEI=; b=j4Cw5Qi4/mf/ZcNabsHZKMTsgyUXutstAbnCJWTn4gDKTxSDtEm1MxF9RteDniW7tPeooEBNOUHuJ214/PGE+qs7hzZbjUaCRCMvBGhJHmSxVCR+maYNTIePksPaDP0KCUXfD9R5D9pY6KNMGSCRCK6z2J6mqhmXm/TfJB8J3ufX6OljDFSxQPgjdzzEKaPF0RSKBpFeDPiXISLtzp3mtsrdf1ZPmFkWQe1QLoMks5bxNyedbc0fBRn38pMgwaCVo5WdZA4uBZhBMsr6j4xsrQcX/SV5XSxjserXzkw1VbU/THcnM3Bad4UZdWhAEZAIZPqvAWe/ZY/gJG6eO0K7cQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=SYJ33kFu91RHA/IsWk2sOnQuDmcB+EuCXpYhWtz+KddtClwZ0NgwykZo79CunW1Z8lZqbzRcFrgzWuyceYZtjNxybWvI7f93qybnnJ6YOAMOePtMjADnssQBmz+7VQqw7KeklOYlWVBSGyv+k3AIPyw0gaCMBbjL4nf4RJLi7KVwmkUeF3mLdd4O5nXt+2Ptq+GTfep5sKrEIWejR/Xn2cqRGjbN/EyUFpQ3iBoIJLVNa0Y/KP2HIy5P3a47a68uvNkjMhSKjA70hfyDAVWTdJU07lkVDzLEzSf/ztmvZJeLZdzfbJxOf1lrfjMhJ9ZO3FrQSkKEqSQh78fhPKCpIQ==
  • Authentication-results-original: kernel.org; dkim=none (message not signed) header.d=none;kernel.org; dmarc=none action=none header.from=arm.com;
  • Cc: Julien Grall <julien@xxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, nd <nd@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Ian Jackson <ian.jackson@xxxxxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Fri, 14 Aug 2020 09:13:54 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Nodisclaimer: true
  • Original-authentication-results: kernel.org; dkim=none (message not signed) header.d=none;kernel.org; dmarc=none action=none header.from=arm.com;
  • Thread-index: AQHWZluWPRKsKyuZRkCQVDwO60tsvakgmVkAgAEWTwCAAKE8gIAVGJCA
  • Thread-topic: [PATCH v3] xen/arm: Convert runstate address during hypcall


> On 1 Aug 2020, at 00:03, Stefano Stabellini <sstabellini@xxxxxxxxxx> wrote:
> 
> On Fri, 31 Jul 2020, Bertrand Marquis wrote:
>> Sorry missed some points in my previous answer.
>> 
>>> On 30 Jul 2020, at 22:50, Julien Grall <julien@xxxxxxx> wrote:
>>> 
>>> Hi Bertrand,
>>> 
>>> To avoid extra work on your side, I would recommend to wait a bit before 
>>> sending a new version. It would be good to at least settle the conversation 
>>> in v2 regarding the approach taken.
>>> 
>>> On 30/07/2020 11:24, Bertrand Marquis wrote:
>>>> At the moment on Arm, a Linux guest running with KTPI enabled will
>>>> cause the following error when a context switch happens in user mode:
>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0
>>>> The error is caused by the virtual address for the runstate area
>>>> registered by the guest only being accessible when the guest is running
>>>> in kernel space when KPTI is enabled.
>>>> To solve this issue, this patch is doing the translation from virtual
>>>> address to physical address during the hypercall and mapping the
>>>> required pages using vmap. This is removing the conversion from virtual
>>>> to physical address during the context switch which is solving the
>>>> problem with KPTI.
>>> 
>>> To echo what Jan said on the previous version, this is a change in a stable 
>>> ABI and therefore may break existing guest. FAOD, I agree in principle with 
>>> the idea. However, we want to explain why breaking the ABI is the *only* 
>>> viable solution.
>>> 
>>> From my understanding, it is not possible to fix without an ABI breakage 
>>> because the hypervisor doesn't know when the guest will switch back from 
>>> userspace to kernel space. The risk is the information provided by the 
>>> runstate wouldn't contain accurate information and could affect how the 
>>> guest handle stolen time.
>>> 
>>> Additionally there are a few issues with the current interface:
>>>  1) It is assuming the virtual address cannot be re-used by the userspace. 
>>> Thanksfully Linux have a split address space. But this may change with KPTI 
>>> in place.
>>>  2) When update the page-tables, the guest has to go through an invalid 
>>> mapping. So the translation may fail at any point.
>>> 
>>> IOW, the existing interface can lead to random memory corruption and 
>>> inacurracy of the stolen time.
>>> 
>>>> This is done only on arm architecture, the behaviour on x86 is not
>>>> modified by this patch and the address conversion is done as before
>>>> during each context switch.
>>>> This is introducing several limitations in comparison to the previous
>>>> behaviour (on arm only):
>>>> - if the guest is remapping the area at a different physical address Xen
>>>> will continue to update the area at the previous physical address. As
>>>> the area is in kernel space and usually defined as a global variable this
>>>> is something which is believed not to happen. If this is required by a
>>>> guest, it will have to call the hypercall with the new area (even if it
>>>> is at the same virtual address).
>>>> - the area needs to be mapped during the hypercall. For the same reasons
>>>> as for the previous case, even if the area is registered for a different
>>>> vcpu. It is believed that registering an area using a virtual address
>>>> unmapped is not something done.
>>> 
>>> This is not clear whether the virtual address refer to the current vCPU or 
>>> the vCPU you register the runstate for. From the past discussion, I think 
>>> you refer to the former. It would be good to clarify.
>>> 
>>> Additionally, all the new restrictions should be documented in the public 
>>> interface. So an OS developper can find the differences between the 
>>> architectures.
>>> 
>>> To answer Jan's concern, we certainly don't know all the guest OSes 
>>> existing, however we also need to balance the benefit for a large majority 
>>> of the users.
>>> 
>>> From previous discussion, the current approach was deemed to be acceptable 
>>> on Arm and, AFAICT, also x86 (see [1]).
>>> 
>>> TBH, I would rather see the approach to be common. For that, we would an 
>>> agreement from Andrew and Jan in the approach here. Meanwhile, I think this 
>>> is the best approach to address the concern from Arm users.
>>> 
>>>> inline functions in headers could not be used as the architecture
>>>> domain.h is included before the global domain.h making it impossible
>>>> to use the struct vcpu inside the architecture header.
>>>> This should not have any performance impact as the hypercall is only
>>>> called once per vcpu usually.
>>>> Signed-off-by: Bertrand Marquis <bertrand.marquis@xxxxxxx>
>>>> ---
>>>>  Changes in v2
>>>>    - use vmap to map the pages during the hypercall.
>>>>    - reintroduce initial copy during hypercall.
>>>>  Changes in v3
>>>>    - Fix Coding style
>>>>    - Fix vaddr printing on arm32
>>>>    - use write_atomic to modify state_entry_time update bit (only
>>>>    in guest structure as the bit is not used inside Xen copy)
>>>> ---
>>>> xen/arch/arm/domain.c        | 161 ++++++++++++++++++++++++++++++-----
>>>> xen/arch/x86/domain.c        |  29 ++++++-
>>>> xen/arch/x86/x86_64/domain.c |   4 +-
>>>> xen/common/domain.c          |  19 ++---
>>>> xen/include/asm-arm/domain.h |   9 ++
>>>> xen/include/asm-x86/domain.h |  16 ++++
>>>> xen/include/xen/domain.h     |   5 ++
>>>> xen/include/xen/sched.h      |  16 +---
>>>> 8 files changed, 206 insertions(+), 53 deletions(-)
>>>> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
>>>> index 31169326b2..8b36946017 100644
>>>> --- a/xen/arch/arm/domain.c
>>>> +++ b/xen/arch/arm/domain.c
>>>> @@ -19,6 +19,7 @@
>>>> #include <xen/sched.h>
>>>> #include <xen/softirq.h>
>>>> #include <xen/wait.h>
>>>> +#include <xen/vmap.h>
>>>>   #include <asm/alternative.h>
>>>> #include <asm/cpuerrata.h>
>>>> @@ -275,36 +276,156 @@ static void ctxt_switch_to(struct vcpu *n)
>>>>     virt_timer_restore(n);
>>>> }
>>>> -/* Update per-VCPU guest runstate shared memory area (if registered). */
>>>> -static void update_runstate_area(struct vcpu *v)
>>>> +static void cleanup_runstate_vcpu_locked(struct vcpu *v)
>>>> {
>>>> -    void __user *guest_handle = NULL;
>>>> +    if ( v->arch.runstate_guest )
>>>> +    {
>>>> +        vunmap((void *)((unsigned long)v->arch.runstate_guest & 
>>>> PAGE_MASK));
>>>> +
>>>> +        put_page(v->arch.runstate_guest_page[0]);
>>>> +
>>>> +        if ( v->arch.runstate_guest_page[1] )
>>>> +            put_page(v->arch.runstate_guest_page[1]);
>>>> +
>>>> +        v->arch.runstate_guest = NULL;
>>>> +    }
>>>> +}
>>>> +
>>>> +void arch_vcpu_cleanup_runstate(struct vcpu *v)
>>>> +{
>>>> +    spin_lock(&v->arch.runstate_guest_lock);
>>>> +
>>>> +    cleanup_runstate_vcpu_locked(v);
>>>> +
>>>> +    spin_unlock(&v->arch.runstate_guest_lock);
>>>> +}
>>>> +
>>>> +static int setup_runstate_vcpu_locked(struct vcpu *v, vaddr_t vaddr)
>>>> +{
>>>> +    unsigned int offset;
>>>> +    mfn_t mfn[2];
>>>> +    struct page_info *page;
>>>> +    unsigned int numpages;
>>>>     struct vcpu_runstate_info runstate;
>>>> +    void *p;
>>>> -    if ( guest_handle_is_null(runstate_guest(v)) )
>>>> -        return;
>>>> +    /* user can pass a NULL address to unregister a previous area */
>>>> +    if ( vaddr == 0 )
>>>> +        return 0;
>>>> +
>>>> +    offset = vaddr & ~PAGE_MASK;
>>>> +
>>>> +    /* provided address must be aligned to a 64bit */
>>>> +    if ( offset % alignof(struct vcpu_runstate_info) )
>>> 
>>> This new restriction wants to be explained in the commit message and public 
>>> header.
>> 
>> ok
>> 
>>> 
>>>> +    {
>>>> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 
>>>> 0x%"PRIvaddr
>>>> +                ": Invalid offset\n", vaddr);
>>> 
>>> We usually enforce 80 character per lines except for format string. So it 
>>> is easier to grep them in the code.
>> 
>> Ok i will fix this one and the following ones.
>> But here PRIvaddr would break any attempt to grep something in fact.
>> 
>>> 
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    page = get_page_from_gva(v, vaddr, GV2M_WRITE);
>>>> +    if ( !page )
>>>> +    {
>>>> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 
>>>> 0x%"PRIvaddr
>>>> +                ": Page is not mapped\n", vaddr);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    mfn[0] = page_to_mfn(page);
>>>> +    v->arch.runstate_guest_page[0] = page;
>>>> +
>>>> +    if ( offset > (PAGE_SIZE - sizeof(struct vcpu_runstate_info)) )
>>>> +    {
>>>> +        /* guest area is crossing pages */
>>>> +        page = get_page_from_gva(v, vaddr + PAGE_SIZE, GV2M_WRITE);
>>>> +        if ( !page )
>>>> +        {
>>>> +            put_page(v->arch.runstate_guest_page[0]);
>>>> +            gprintk(XENLOG_WARNING,
>>>> +                    "Cannot map runstate pointer at 0x%"PRIvaddr
>>>> +                    ": 2nd Page is not mapped\n", vaddr);
>>>> +            return -EINVAL;
>>>> +        }
>>>> +        mfn[1] = page_to_mfn(page);
>>>> +        v->arch.runstate_guest_page[1] = page;
>>>> +        numpages = 2;
>>>> +    }
>>>> +    else
>>>> +    {
>>>> +        v->arch.runstate_guest_page[1] = NULL;
>>>> +        numpages = 1;
>>>> +    }
>>>> -    memcpy(&runstate, &v->runstate, sizeof(runstate));
>>>> +    p = vmap(mfn, numpages);
>>>> +    if ( !p )
>>>> +    {
>>>> +        put_page(v->arch.runstate_guest_page[0]);
>>>> +        if ( numpages == 2 )
>>>> +            put_page(v->arch.runstate_guest_page[1]);
>>>> -    if ( VM_ASSIST(v->domain, runstate_update_flag) )
>>>> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 
>>>> 0x%"PRIvaddr
>>>> +                ": vmap error\n", vaddr);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    v->arch.runstate_guest = p + offset;
>>>> +
>>>> +    if (v == current)
>>>> +        memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));
>>>> +    else
>>>>     {
>>>> -        guest_handle = &v->runstate_guest.p->state_entry_time + 1;
>>>> -        guest_handle--;
>>>> -        runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>>>> -        __raw_copy_to_guest(guest_handle,
>>>> -                            (void *)(&runstate.state_entry_time + 1) - 1, 
>>>> 1);
>>>> -        smp_wmb();
>>>> +        vcpu_runstate_get(v, &runstate);
>>>> +        memcpy(v->arch.runstate_guest, &runstate, sizeof(v->runstate));
>>>>     }
>>>> -    __copy_to_guest(runstate_guest(v), &runstate, 1);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int arch_vcpu_setup_runstate(struct vcpu *v,
>>>> +                             struct vcpu_register_runstate_memory_area 
>>>> area)
>>>> +{
>>>> +    int rc;
>>>> +
>>>> +    spin_lock(&v->arch.runstate_guest_lock);
>>>> +
>>>> +    /* cleanup if we are recalled */
>>>> +    cleanup_runstate_vcpu_locked(v);
>>>> +
>>>> +    rc = setup_runstate_vcpu_locked(v, (vaddr_t)area.addr.v);
>>>> +
>>>> +    spin_unlock(&v->arch.runstate_guest_lock);
>>>> -    if ( guest_handle )
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +
>>>> +/* Update per-VCPU guest runstate shared memory area (if registered). */
>>>> +static void update_runstate_area(struct vcpu *v)
>>>> +{
>>>> +    spin_lock(&v->arch.runstate_guest_lock);
>>>> +
>>>> +    if ( v->arch.runstate_guest )
>>>>     {
>>>> -        runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
>>>> -        smp_wmb();
>>>> -        __raw_copy_to_guest(guest_handle,
>>>> -                            (void *)(&runstate.state_entry_time + 1) - 1, 
>>>> 1);
>>>> +        if ( VM_ASSIST(v->domain, runstate_update_flag) )
>>>> +        {
>>>> +            v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>>>> +            write_atomic(&(v->arch.runstate_guest->state_entry_time),
>>>> +                    v->runstate.state_entry_time);
>>> 
>>> NIT: You want to indent v-> at the same level as the argument from the 
>>> first line.
>> 
>> Ok
>> 
>>> 
>>> Also, I think you are missing a smp_wmb() here.
>> 
>> The atomic operation itself would not need a barrier.
>> I do not see why you think a barrier is needed here.
>> For the internal structure ?
> 
> We need to make sure the other-end sees the XEN_RUNSTATE_UPDATE change
> before other changes. Otherwise, due to cpu reordering, the writes could
> be seen in reverse order. (Technically the reader would have to use a
> read-barrier but that's a separate topic.)

I will add a barrier before the atomic.

Cheers
Bertrand





 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.