
Re: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal


  • To: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Date: Fri, 19 Nov 2021 12:34:48 +0000
  • Accept-language: en-US
  • Cc: "julien@xxxxxxx" <julien@xxxxxxx>, "sstabellini@xxxxxxxxxx" <sstabellini@xxxxxxxxxx>, Oleksandr Tyshchenko <Oleksandr_Tyshchenko@xxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Artem Mygaiev <Artem_Mygaiev@xxxxxxxx>, "andrew.cooper3@xxxxxxxxxx" <andrew.cooper3@xxxxxxxxxx>, "george.dunlap@xxxxxxxxxx" <george.dunlap@xxxxxxxxxx>, "paul@xxxxxxx" <paul@xxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Oleksandr Andrushchenko <Oleksandr_Andrushchenko@xxxxxxxx>
  • Delivery-date: Fri, 19 Nov 2021 12:35:10 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v4 02/11] vpci: cancel pending map/unmap on vpci removal

Hi, Roger, Jan!

On 18.11.21 17:53, Jan Beulich wrote:
> On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
>> On 18.11.21 17:41, Jan Beulich wrote:
>>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>>>     For the moment I can't help thinking that draining would
>>>>> be preferable over canceling.
>>>> Given that cancellation is going to happen on error path or
>>>> on device de-assign/remove I think this can be acceptable.
>>>> Any reason why not?
>>> It would seem to me that the correctness of a draining approach is
>>> going to be easier to prove than that of a canceling one, where I
>>> expect races to be a bigger risk. Especially something that gets
>>> executed infrequently, if ever (error paths in particular), knowing
>>> things are well from testing isn't typically possible.
>> Could you please then give me a hint how to do that:
>> 1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
>> 2. We have de-assign/remove on vCPU1
>>
>> How do we drain that? Do you mean some atomic variable to be
>> used in vpci_process_pending to flag it is running and de-assign/remove
>> needs to wait and spinning checking that?
> First of all let's please keep remove and de-assign separate. I think we
> have largely reached agreement that remove may need handling differently,
> for being a Dom0-only operation.
>
> As to draining during de-assign: I did suggest before that removing the
> register handling hooks first would guarantee no new requests to appear.
> Then it should be merely a matter of using hypercall continuations until
> the respective domain has no pending requests anymore for the device in
> question. Some locking (or lock barrier) may of course be needed to
> make sure another CPU isn't just about to pend a new request.
>
> Jan
>
>
Too long, but please read.

Below is a simplified analysis of what is happening with respect to
deferred mapping. We start by looking at which hypercalls may run in
parallel with vpci_process_pending and which locks they hold:

1. do_physdev_op(PHYSDEVOP_pci_device_add): failure during PHYSDEVOP_pci_device_add
===============================================================================
   pci_physdev_op: <- no hypercall_create_continuation
     pci_add_device  <- pcidevs_lock()
       vpci_add_handlers
         init_bars
           cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
             modify_bars <- if cmd & PCI_COMMAND_MEMORY
               struct rangeset *mem = rangeset_new(NULL, NULL, 0);

               if ( system_state < SYS_STATE_active ) <- Dom0 is being created
                  return apply_map(pdev->domain, pdev, mem, cmd);

               defer_map(dev->domain, dev, mem, cmd, rom_only); <- Dom0 is running
                 curr->vpci.pdev = pdev;
                 curr->vpci.mem = mem;

       ret = iommu_add_device(pdev); <- FAIL
       if ( ret )
           vpci_remove_device
             remove vPCI register handlers
             xfree(pdev->vpci);
              pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                                    scheduled and yet to run

2. do_physdev_op(PHYSDEVOP_pci_device_remove)
===============================================================================
   pci_physdev_op: <- no hypercall_create_continuation
     pci_remove_device <- pcidevs_lock()
       vpci_remove_device
         pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                               scheduled and yet to run

3. iommu_do_pci_domctl(XEN_DOMCTL_assign_device)
===============================================================================
case XEN_DOMCTL_assign_device: <- pcidevs_lock();
   ret = assign_device(d, seg, bus, devfn, flags);
   if ( ret == -ERESTART )
     ret = hypercall_create_continuation(__HYPERVISOR_domctl, "h", u_domctl);


4. iommu_do_pci_domctl(XEN_DOMCTL_deassign_device) <- pcidevs_lock();
===============================================================================
case XEN_DOMCTL_deassign_device: <- no hypercall_create_continuation
   ret = deassign_device(d, seg, bus, devfn);


5. vPCI MMIO trap for PCI_COMMAND
===============================================================================
vpci_mmio_{read|write}
   vpci_ecam_{read|write}
     vpci_{read|write} <- NO locking yet
       pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
       spin_lock(&pdev->vpci->lock);
         cmd_write
           modify_bars
             defer_map

6. SoftIRQ processing
===============================================================================
hvm_do_resume
   vcpu_ioreq_handle_completion
     vpci_process_pending
       if ( v->vpci.mem )
         rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
         if ( rc == -ERESTART )
             return true; <- re-scheduling

=========================================================================
         spin_lock(&v->vpci.pdev->vpci->lock); <- v->vpci.pdev->vpci can be NULL
=========================================================================
         spin_unlock(&v->vpci.pdev->vpci->lock);
         v->vpci.mem = NULL;
       if ( rc ) <- rc is from rangeset_consume_ranges
          vpci_remove_device <- this is a BUG, as vpci_process_pending may be
                                running on another vCPU at the same time

So, from the above it is clear that a mapping triggered by a write to PCI_COMMAND
may be happening on another vCPU in parallel with one of these hypercalls.

Some analysis of the hypercalls with respect to which domains are eligible targets:
1. Dom0 (hardware domain) only: PHYSDEVOP_pci_device_add, PHYSDEVOP_pci_device_remove
2. Any domain: XEN_DOMCTL_assign_device, XEN_DOMCTL_deassign_device

Possible crash paths
===============================================================================
1. A failure in PHYSDEVOP_pci_device_add after defer_map may make
vpci_process_pending crash because pdev->vpci == NULL
2. vpci_process_pending may crash if it runs in parallel with itself on another
vCPU, because it may call vpci_remove_device
3. vpci_mmio_{read|write} may crash after PHYSDEVOP_pci_device_remove, as
vpci_remove_device makes pdev->vpci == NULL
4. Both XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device seem to be
unaffected.

Synchronization is needed between:
  - vpci_remove_device
  - vpci_process_pending
  - vpci_mmio_{read|write}

Possible locking and other work needed:
=======================================

1. pcidevs_{lock|unlock} is too heavy and is per-host
2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
3. We may want a dedicated per-domain rw lock to be implemented:

diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 28146ee404e6..ebf071893b21 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,8 @@ struct domain

  #ifdef CONFIG_HAS_PCI
      struct list_head pdev_list;
+    rwlock_t vpci_rwlock;
+    bool vpci_terminating; <- atomic?
  #endif
Then vpci_remove_device would be a writer (cold path), while vpci_process_pending
and vpci_mmio_{read|write} would be readers (hot path).
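
For illustration, a minimal sketch of what one of the readers could then look
like (the lock placement is only an assumption; the existing per-device
pdev->vpci->lock keeps protecting the register state, and vpci_{read|write}
would take the read lock around pci_get_pdev_by_domain and the handler dispatch
in the same way):

bool vpci_process_pending(struct vcpu *v)
{
    struct domain *d = v->domain;
    bool pending = false;

    read_lock(&d->vpci_rwlock);
    if ( v->vpci.mem && v->vpci.pdev->vpci )
    {
        /*
         * rangeset_consume_ranges()/map_range() and the final
         * spin_lock(&v->vpci.pdev->vpci->lock) section run here exactly
         * as today, but with pdev->vpci guaranteed to stay valid until
         * read_unlock(); set pending on -ERESTART to get re-scheduled.
         */
    }
    read_unlock(&d->vpci_rwlock);

    return pending;
}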

do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
support, so removal can be re-started if need be:

vpci_remove_device()
{
    d->vpci_terminating = true;
    remove vPCI register handlers <- this cuts off PCI_COMMAND emulation among others
    if ( !write_trylock(&d->vpci_rwlock) )
        return -ERESTART;
    xfree(pdev->vpci);
    pdev->vpci = NULL;
    write_unlock(&d->vpci_rwlock);
}
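
Correspondingly, a rough sketch of the pci_physdev_op side, assuming
pci_remove_device propagates the -ERESTART from vpci_remove_device (the exact
shape of the case is an assumption):

case PHYSDEVOP_pci_device_remove: {
    struct physdev_pci_device dev;

    ret = -EFAULT;
    if ( copy_from_guest(&dev, arg, 1) )
        break;

    ret = pci_remove_device(dev.seg, dev.bus, dev.devfn);
    if ( ret == -ERESTART )
        /* Re-start the removal later, mirroring XEN_DOMCTL_assign_device. */
        ret = hypercall_create_continuation(__HYPERVISOR_physdev_op,
                                            "ih", cmd, arg);
    break;
}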

This d->vpci_rwlock then becomes a dedicated per-domain vPCI lock for other
operations which may require it, e.g. the virtual bus topology code can use it
when assigning vSBDF etc.
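
The new lock would also need initializing next to pdev_list; a minimal sketch,
assuming the common domain-creation path is the right place:

#ifdef CONFIG_HAS_PCI
    INIT_LIST_HEAD(&d->pdev_list);
    rwlock_init(&d->vpci_rwlock);    /* proposed in the diff above */
#endif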

4. The vpci_remove_device call needs to be removed from vpci_process_pending,
which should instead do nothing for Dom0 and crash a DomU:

if ( rc )
{
   /*
    * FIXME: in case of failure remove the device from the domain.
    * Note that there might still be leftover mappings. While this is
    * safe for Dom0, for DomUs the domain needs to be killed in order
    * to avoid leaking stale p2m mappings on failure.
    */
   if ( !is_hardware_domain(v->domain) )
     domain_crash(v->domain);
}

I do hope we can finally come up with some decision which I can implement...

Thank you,
Oleksandr

 

