
Re: [PATCH v3 1/2] IOMMU/x86: disallow device assignment to PoD guests


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Wed, 2 Feb 2022 17:13:09 +0100
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Tamas K Lengyel <tamas@xxxxxxxxxxxxx>, Petre Pircalabu <ppircalabu@xxxxxxxxxxxxxxx>, Alexandru Isaila <aisaila@xxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 02 Feb 2022 16:13:47 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Jan 04, 2022 at 10:41:32AM +0100, Jan Beulich wrote:
> While it is okay for IOMMU page tables to get set up for guests starting
> in PoD mode, actual device assignment may only occur once all PoD
> entries have been removed from the P2M. So far this was enforced only
> for boot-time assignment, and only in the tool stack.
> 
> Also use the new function to replace p2m_pod_entry_count(): Its unlocked
> access to p2m->pod.entry_count wasn't really okay (irrespective of the
> result being stale by the time the caller gets to see it).
> 
> To allow the tool stack to see a consistent snapshot of PoD state, move
> the tail of XENMEM_{get,set}_pod_target handling into a function, adding
> proper locking there.
> 
> In libxl take the liberty to use the new local variable r also for a
> pre-existing call into libxc.
> 
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> If p2m->pod.entry_count == p2m->pod.count it is in principle possible to
> permit device assignment by actively resolving all remaining PoD entries.
> 
> Initially I thought this was introduced by f89f555827a6 ("remove late
> (on-demand) construction of IOMMU page tables"), but without
> arch_iommu_use_permitted() checking for PoD I think the issue has been
> there before that.
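
FWIW, I take it the part of the patch not visible in my quoting (or
else patch 2/2) extends arch_iommu_use_permitted() along these lines
(just my reading of the description, sketch untested):

  bool arch_iommu_use_permitted(const struct domain *d)
  {
      /* Prevent (new) device assignment while any PoD entries remain. */
      return d == dom_io ||
             (likely(!mem_sharing_enabled(d)) &&
              likely(!mem_paging_enabled(d)) &&
              likely(!p2m_pod_active(d)) &&
              likely(!p2m_get_hostp2m(d)->global_logdirty));
  }
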
> ---
> v3: In p2m_pod_set_mem_target() move check down.
> v2: New.
> 
> --- a/tools/libs/light/libxl_pci.c
> +++ b/tools/libs/light/libxl_pci.c
> @@ -1619,8 +1619,13 @@ void libxl__device_pci_add(libxl__egc *e
>      pas->callback = device_pci_add_stubdom_done;
>  
>      if (libxl__domain_type(gc, domid) == LIBXL_DOMAIN_TYPE_HVM) {
> -        rc = xc_test_assign_device(ctx->xch, domid, pci_encode_bdf(pci));
> -        if (rc) {
> +        int r;
> +        uint64_t cache, ents;
> +
> +        rc = ERROR_FAIL;
> +
> +        r = xc_test_assign_device(ctx->xch, domid, pci_encode_bdf(pci));
> +        if (r) {
>              LOGD(ERROR, domid,
>                   "PCI device %04x:%02x:%02x.%u %s?",
>                   pci->domain, pci->bus, pci->dev, pci->func,
> @@ -1628,6 +1633,22 @@ void libxl__device_pci_add(libxl__egc *e
>                   : "already assigned to a different guest");
>              goto out;
>          }
> +
> +        r = xc_domain_get_pod_target(ctx->xch, domid, NULL, &cache, &ents);
> +        if (r) {
> +            LOGED(ERROR, domid, "Cannot determine PoD status");
> +            goto out;
> +        }
> +        /*
> +         * In principle it is sufficient for the domain to have ballooned down
> +         * enough such that ents <= cache.  But any remaining entries would
> +         * need resolving first.  Until such time when this gets effected,
> +         * refuse assignment as long as any entries are left.
> +         */
> +        if (ents /* > cache */) {
> +            LOGD(ERROR, domid, "Cannot assign device with PoD still active");
> +            goto out;
> +        }
>      }
>  
>      rc = libxl__device_pci_setdefault(gc, domid, pci, !starting);
> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -20,6 +20,7 @@
>   */
>  
>  #include <xen/event.h>
> +#include <xen/iocap.h>
>  #include <xen/ioreq.h>
>  #include <xen/mm.h>
>  #include <xen/sched.h>
> @@ -359,7 +360,10 @@ p2m_pod_set_mem_target(struct domain *d,
>  
>      ASSERT( pod_target >= p2m->pod.count );
>  
> -    ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);
> +    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )

Is it possible to have cache flush allowed without any PCI device
assigned? AFAICT the iomem/ioport_caps would only get set up when
there are devices passed through?

TBH I would be fine if we just say that PoD cannot be used in
conjunction with an IOMMU, and just check for is_iommu_enabled(d) here.

I understand it's technically possible for PoD to be used together
with a domain that will later get a device passed through once PoD is
no longer in use, but I doubt there's much value in supporting that
use case, and I fear we might be introducing corner cases that could
create issues in the future. Overall I think it would be safer to just
disable PoD in conjunction with an IOMMU.
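
I.e. the check here (and the similar one in
guest_physmap_mark_populate_on_demand() further down) would then become
something like (just a sketch of what I'm suggesting, untested):

  if ( is_iommu_enabled(d) )
      /* No PoD population once the domain may get devices assigned. */
      ret = -ENOTEMPTY;
  else
      ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);

This would also avoid having to reason about whether cache flushing can
be permitted without any device actually assigned.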

> +        ret = -ENOTEMPTY;
> +    else
> +        ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/);
>  
>  out:
>      pod_unlock(p2m);
> @@ -367,6 +371,23 @@ out:
>      return ret;
>  }
>  
> +void p2m_pod_get_mem_target(const struct domain *d, xen_pod_target_t *target)
> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +
> +    ASSERT(is_hvm_domain(d));
> +
> +    pod_lock(p2m);
> +    lock_page_alloc(p2m);
> +
> +    target->tot_pages       = domain_tot_pages(d);
> +    target->pod_cache_pages = p2m->pod.count;
> +    target->pod_entries     = p2m->pod.entry_count;
> +
> +    unlock_page_alloc(p2m);
> +    pod_unlock(p2m);
> +}
> +
>  int p2m_pod_empty_cache(struct domain *d)
>  {
>      struct p2m_domain *p2m = p2m_get_hostp2m(d);
> @@ -1384,6 +1405,9 @@ guest_physmap_mark_populate_on_demand(st
>      if ( !paging_mode_translate(d) )
>          return -EINVAL;
>  
> +    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
> +        return -ENOTEMPTY;
> +
>      do {
>          rc = mark_populate_on_demand(d, gfn, chunk_order);
>  
> @@ -1405,3 +1429,20 @@ void p2m_pod_init(struct p2m_domain *p2m
>      for ( i = 0; i < ARRAY_SIZE(p2m->pod.mrp.list); ++i )
>          p2m->pod.mrp.list[i] = gfn_x(INVALID_GFN);
>  }
> +
> +bool p2m_pod_active(const struct domain *d)
> +{
> +    struct p2m_domain *p2m;
> +    bool res;
> +
> +    if ( !is_hvm_domain(d) )
> +        return false;
> +
> +    p2m = p2m_get_hostp2m(d);
> +
> +    pod_lock(p2m);
> +    res = p2m->pod.entry_count | p2m->pod.count;
> +    pod_unlock(p2m);
> +
> +    return res;
> +}
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4778,7 +4778,6 @@ long arch_memory_op(unsigned long cmd, X
>      {
>          xen_pod_target_t target;
>          struct domain *d;
> -        struct p2m_domain *p2m;
>  
>          if ( copy_from_guest(&target, arg, 1) )
>              return -EFAULT;
> @@ -4787,7 +4786,9 @@ long arch_memory_op(unsigned long cmd, X
>          if ( d == NULL )
>              return -ESRCH;
>  
> -        if ( cmd == XENMEM_set_pod_target )
> +        if ( !is_hvm_domain(d) )
> +            rc = -EINVAL;
> +        else if ( cmd == XENMEM_set_pod_target )
>              rc = xsm_set_pod_target(XSM_PRIV, d);
>          else
>              rc = xsm_get_pod_target(XSM_PRIV, d);
> @@ -4813,10 +4814,7 @@ long arch_memory_op(unsigned long cmd, X
>          }
>          else if ( rc >= 0 )
>          {
> -            p2m = p2m_get_hostp2m(d);
> -            target.tot_pages       = domain_tot_pages(d);
> -            target.pod_cache_pages = p2m->pod.count;
> -            target.pod_entries     = p2m->pod.entry_count;
> +            p2m_pod_get_mem_target(d, &target);
>  
>              if ( __copy_to_guest(arg, &target, 1) )
>              {
> --- a/xen/common/vm_event.c
> +++ b/xen/common/vm_event.c
> @@ -639,7 +639,7 @@ int vm_event_domctl(struct domain *d, st
>  
>              rc = -EXDEV;
>              /* Disallow paging in a PoD guest */
> -            if ( p2m_pod_entry_count(p2m_get_hostp2m(d)) )
> +            if ( p2m_pod_active(d) )

Isn't it fine to just check for entry_count, like you suggest in the
change to libxl? That's all p2m_pod_entry_count() actually looks at
(rather than entry_count | count).
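
I.e. keep the new locking, but look at entry_count only; along the
lines of (name and sketch just for illustration, untested):

  /* Like p2m_pod_active(), but ignoring the cache (pod.count). */
  static bool p2m_pod_has_entries(const struct domain *d)
  {
      struct p2m_domain *p2m;
      bool res;

      if ( !is_hvm_domain(d) )
          return false;

      p2m = p2m_get_hostp2m(d);

      pod_lock(p2m);
      res = p2m->pod.entry_count;
      pod_unlock(p2m);

      return res;
  }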

Thanks, Roger.
