
Re: [PATCH v4 06/21] IOMMU/x86: perform PV Dom0 mappings in batches


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 3 May 2022 16:49:50 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Wei Liu <wl@xxxxxxx>
  • Delivery-date: Tue, 03 May 2022 14:50:11 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
> For large page mappings to be easily usable (i.e. in particular without
> un-shattering of smaller page mappings) and for mapping operations to
> then also be more efficient, pass batches of Dom0 memory to iommu_map().
> In dom0_construct_pv() and its helpers (covering strict mode) this
> additionally requires establishing the type of those pages (albeit with
> zero type references).

I think it's possible I've already asked this.  Would it make sense to
add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
specific flag?

It would seem to me that doing it that way would also allow the
mappings to get established in blocks for domUs.

It would also be less error prone than having to match memory
allocations with iommu_memory_setup() calls in order for the pages to
be added to the IOMMU page tables.

> The earlier establishing of PGT_writable_page | PGT_validated requires
> the existing places where this gets done (through get_page_and_type())
> to be updated: For pages which actually have a mapping, the type
> refcount needs to be 1.
> 
> There is actually a related bug that gets fixed here as a side effect:
> Typically the last L1 table would get marked as such only after
> get_page_and_type(..., PGT_writable_page). While this is fine as far as
> refcounting goes, the page did remain mapped in the IOMMU in this case
> (when "iommu=dom0-strict").
> 
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> Subsequently p2m_add_identity_entry() may want to also gain an order
> parameter, for arch_iommu_hwdom_init() to use. While this only affects
> non-RAM regions, systems typically have 2-16Mb of reserved space
> immediately below 4Gb, which hence could be mapped more efficiently.

Indeed.

> The installing of zero-ref writable types has in fact shown (observed
> while putting together the change) that despite the intention by the
> XSA-288 changes (affecting DomU-s only) for Dom0 a number of
> sufficiently ordinary pages (at the very least initrd and P2M ones as
> well as pages that are part of the initial allocation but not part of
> the initial mapping) still have been starting out as PGT_none, meaning
> that they would have gained IOMMU mappings only the first time these
> pages would get mapped writably. Consequently an open question is
> whether iommu_memory_setup() should set the pages to PGT_writable_page
> independent of need_iommu_pt_sync().

I think I'm confused: doesn't the setting of PGT_writable_page happen
as a result of need_iommu_pt_sync() and of those pages being added to
the IOMMU page tables? (so they can be properly tracked, and the IOMMU
mappings are removed if the page is also removed)

If the pages are not added here (because dom0 is not running in strict
mode), then setting PGT_writable_page is not required?

> I didn't think I need to address the bug mentioned in the description in
> a separate (prereq) patch, but if others disagree I could certainly
> break out that part (needing to first use iommu_legacy_unmap() then).
> 
> Note that 4k P2M pages don't get (pre-)mapped in setup_pv_physmap():
> They'll end up mapped via the later get_page_and_type().
> 
> As to the way these refs get installed: I've chosen to avoid the more
> expensive {get,put}_page_and_type(), favoring to put in place the
> intended type directly. I guess I could be convinced to avoid this
> bypassing of the actual logic; I merely think it's unnecessarily
> expensive.

In a different piece of code I would have asked to avoid open-coding
the type changes.  But there are already open-coded type changes in
dom0_construct_pv(), so adding those doesn't make the current status
worse.

> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
> occurs only afterwards anyway, so nothing can have made it into TLBs
> just yet.

Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
away, as we must strictly do the hwdom init before enabling the iommu
itself.

I'm less convinced about the one in the dom0 build; I'd rather keep
it, just to be on the safe side in case we ever change the order of
IOMMU init and memory setup.  I would expect flushing an empty TLB
not to be very expensive?

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>  
>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>  {
> -    unsigned long i, top, max_pfn;
> -    unsigned int flush_flags = 0;
> +    unsigned long i, top, max_pfn, start, count;
> +    unsigned int flush_flags = 0, start_perms = 0;
>  
>      BUG_ON(!is_hardware_domain(d));
>  
> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>       * setting up potentially conflicting mappings here.
>       */
> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>  
> -    for ( ; i < top; i++ )
> +    for ( i = start, count = 0; i < top; )
>      {
>          unsigned long pfn = pdx_to_pfn(i);
>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>          if ( !perms )
>              rc = 0;
>          else if ( paging_mode_translate(d) )
> +        {
>              rc = p2m_add_identity_entry(d, pfn,
>                                          perms & IOMMUF_writable ? 
> p2m_access_rw
>                                                                  : 
> p2m_access_r,
>                                          0);
> +            if ( rc )
> +                printk(XENLOG_WARNING
> +                       "%pd: identity mapping of %lx failed: %d\n",
> +                       d, pfn, rc);
> +        }
> +        else if ( pfn != start + count || perms != start_perms )
> +        {
> +        commit:
> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
> +                           &flush_flags);
> +            if ( rc )
> +                printk(XENLOG_WARNING
> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
> +                       d, pfn, pfn + count, rc);
> +            SWAP(start, pfn);
> +            start_perms = perms;
> +            count = 1;
> +        }
>          else
> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
> -                           perms, &flush_flags);
> +        {
> +            ++count;
> +            rc = 0;

Seeing as we want to process this in blocks now, I wonder whether it
would make sense to take a different approach, and use a rangeset to
track which regions need to be mapped.  What gets added would be based
on the host e820 plus the options
iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
based on the logic in hwdom_iommu_map() and finally iterate over the
remaining regions using rangeset_consume_ranges().

Not that you strictly need to do it here, just think the end result
would be clearer.

Thanks, Roger.
