
Re: [PATCH v4 16/21] VT-d: free all-empty page tables


  • To: Jan Beulich <jbeulich@xxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 10 May 2022 16:30:17 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>
  • Delivery-date: Tue, 10 May 2022 14:30:33 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Apr 25, 2022 at 10:42:50AM +0200, Jan Beulich wrote:
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that while its output isn't used there yet,
> pt_update_contig_markers() right away needs to be called in all places
> where entries get updated, not just the one where entries get cleared.
> 
> Note further that while pt_update_contig_markers() updates perhaps
> several PTEs within the table, since these are changes to "avail" bits
> only I do not think that cache flushing would be needed afterwards. Such
> cache flushing (of entire pages, unless adding yet more logic to be more
> selective) would be quite noticeable performance-wise (very prominent
> during Dom0 boot).
> 
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> v4: Re-base over changes earlier in the series.
> v3: Properly bound loop. Re-base over changes earlier in the series.
> v2: New.
> ---
> The hang during boot on my Latitude E6410 (see the respective code
> comment) was pretty close after iommu_enable_translation(). No errors,
> no watchdog would kick in, just sometimes the first few pixel lines of
> the next log message's (XEN) prefix would have made it out to the screen
> (and there's no serial there). It's been a lot of experimenting until I
> figured out the workaround (which I consider ugly, but halfway acceptable).
> I've been trying hard to make sure the workaround wouldn't be masking a
> real issue, yet I'm still wary of it possibly doing so ... My best guess
> at this point is that on these old IOMMUs the ignored bits 52...61
> aren't really ignored for present entries, but also aren't "reserved"
> enough to trigger faults. This guess is from having tried to set other
> bits in this range (unconditionally, and with the workaround here in
> place), which yielded the same behavior.

Should we take Kevin's Reviewed-by as a heads-up that bits 52..61 are
not usable on some (older?) IOMMUs?

It would be good if we could get a more formal response on that, I
think.
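
For reference, the "ignored" bits in question are the software-available
PTE bits that the new pt-contig-markers.h logic re-purposes to track how
contiguous the entries below a given one are. A rough sketch of the idea
(made-up names, not the actual implementation):

    #include <stdint.h>

    /* Bits 52..61 of a VT-d second-level PTE are described above as
     * ignored by the hardware, so software can stash e.g. a small
     * counter there.  Sketch only, not the real pt-contig-markers.h
     * code. */
    #define PTE_AVAIL_SHIFT  52
    #define PTE_AVAIL_MASK   (0x3ffULL << PTE_AVAIL_SHIFT)

    static inline unsigned int pte_get_avail(uint64_t pte)
    {
        return (unsigned int)((pte & PTE_AVAIL_MASK) >> PTE_AVAIL_SHIFT);
    }

    static inline uint64_t pte_set_avail(uint64_t pte, unsigned int v)
    {
        return (pte & ~PTE_AVAIL_MASK) | ((uint64_t)v << PTE_AVAIL_SHIFT);
    }

The worry above would then be that, on the affected (older) IOMMUs,
these bits aren't truly ignored for present entries.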

> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -43,6 +43,9 @@
>  #include "vtd.h"
>  #include "../ats.h"
>  
> +#define CONTIG_MASK DMA_PTE_CONTIG_MASK
> +#include <asm/pt-contig-markers.h>
> +
>  /* dom_io is used as a sentinel for quarantined devices */
>  #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
>  #define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
> @@ -405,6 +408,9 @@ static uint64_t addr_to_dma_page_maddr(s
>  
>              write_atomic(&pte->val, new_pte.val);
>              iommu_sync_cache(pte, sizeof(struct dma_pte));
> +            pt_update_contig_markers(&parent->val,
> +                                     address_level_offset(addr, level),

I think (unless previous patches in the series have changed this)
there already is an 'offset' local variable that you could use.

> +                                     level, PTE_kind_table);
>          }
>  
>          if ( --level == target )
> @@ -837,9 +843,31 @@ static int dma_pte_clear_one(struct doma
>  
>      old = *pte;
>      dma_clear_pte(*pte);
> +    iommu_sync_cache(pte, sizeof(*pte));
> +
> +    while ( pt_update_contig_markers(&page->val,
> +                                     address_level_offset(addr, level),
> +                                     level, PTE_kind_null) &&
> +            ++level < min_pt_levels )
> +    {
> +        struct page_info *pg = maddr_to_page(pg_maddr);
> +
> +        unmap_vtd_domain_page(page);
> +
> +        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
> +                                          false);
> +        BUG_ON(pg_maddr < PAGE_SIZE);
> +
> +        page = map_vtd_domain_page(pg_maddr);
> +        pte = &page[address_level_offset(addr, level)];
> +        dma_clear_pte(*pte);
> +        iommu_sync_cache(pte, sizeof(*pte));
> +
> +        *flush_flags |= IOMMU_FLUSHF_all;
> +        iommu_queue_free_pgtable(hd, pg);
> +    }

I think I'm setting myself up for trouble, but do we need to sync the
cache for the lower level entries if higher level ones are to be
changed?

IOW, would it be fine to just flush the highest level modified PTE, as
the lower level ones won't be reachable anyway?
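
To make the question concrete, a purely illustrative model (made-up
helpers, not the actual VT-d code) of the ordering I'm asking about,
where only the topmost entry that got written is synced:

    #include <stddef.h>
    #include <stdint.h>

    struct pte { uint64_t val; };

    static void clear_pte(struct pte *p) { p->val = 0; }

    static void sync_cache(const void *p, size_t n)
    {
        (void)p; (void)n;                 /* stand-in for clflush/clwb */
    }

    /* Clear the leaf and every intermediate entry that became
     * all-empty, walking upwards, but sync only the highest level PTE
     * modified: once that one is non-present (and flushed), the lower
     * level tables can't be reached by the IOMMU anyway. */
    static void clear_and_collapse(struct pte *pte_at_level[],
                                   unsigned int from, unsigned int top)
    {
        unsigned int level;

        for ( level = from; level <= top; ++level )
            clear_pte(pte_at_level[level]);   /* no per-level sync */

        sync_cache(pte_at_level[top], sizeof(struct pte));
    }

Whether that ordering is actually safe w.r.t. the IOMMU's own caching
is of course exactly what I'm asking.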

>      spin_unlock(&hd->arch.mapping_lock);
> -    iommu_sync_cache(pte, sizeof(struct dma_pte));
>  
>      unmap_vtd_domain_page(page);
>  
> @@ -2182,8 +2210,21 @@ static int __must_check cf_check intel_i
>      }
>  
>      *pte = new;
> -
>      iommu_sync_cache(pte, sizeof(struct dma_pte));
> +
> +    /*
> +     * While the (ab)use of PTE_kind_table here allows to save some work in
> +     * the function, the main motivation for it is that it avoids a so far
> +     * unexplained hang during boot (while preparing Dom0) on a Westmere
> +     * based laptop.
> +     */
> +    pt_update_contig_markers(&page->val,
> +                             address_level_offset(dfn_to_daddr(dfn), level),
> +                             level,
> +                             (hd->platform_ops->page_sizes &
> +                              (1UL << level_to_offset_bits(level + 1))
> +                              ? PTE_kind_leaf : PTE_kind_table));

So this works because, on what we believe to be the affected models,
the only supported page size is 4K?

Do we want to do the same for AMD if we don't allow 512G superpages?
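
For clarity, the check in the hunk above asks whether the IOMMU could
express this level's whole range as a leaf one level up, assuming the
usual 12 + 9 * (level - 1) definition of level_to_offset_bits() (sketch
with made-up names):

    #include <stdint.h>

    /* level 1 -> bit 12 (4K), level 2 -> bit 21 (2M),
     * level 3 -> bit 30 (1G). */
    static unsigned int offset_bits(unsigned int level)
    {
        return 12 + (level - 1) * 9;
    }

    /* Models (hd->platform_ops->page_sizes &
     *         (1UL << level_to_offset_bits(level + 1))):
     * is a superpage covering this whole table supported? */
    static int superpage_possible(uint64_t page_sizes, unsigned int level)
    {
        return !!(page_sizes & (UINT64_C(1) << offset_bits(level + 1)));
    }

With only the 4K bit set in page_sizes this always yields false, i.e.
PTE_kind_table, which is what my question above hinges on.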

Thanks, Roger.



 

