
Re: [PATCH v2 02/18] VT-d: have callers specify the target level for page table walks


  • To: Roger Pau Monné <roger.pau@xxxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 27 Sep 2021 11:04:26 +0200
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>
  • Delivery-date: Mon, 27 Sep 2021 09:04:38 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 24.09.2021 16:45, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:42:13AM +0200, Jan Beulich wrote:
>> In order to be able to insert/remove super-pages we need to allow
>> callers of the walking function to specify at which point to stop the
>> walk.
>>
>> For intel_iommu_lookup_page() integrate the last level access into
>> the main walking function.
>>
>> dma_pte_clear_one() gets only partly adjusted for now: Error handling
>> and order parameter get put in place, but the order parameter remains
>> ignored (just like intel_iommu_map_page()'s order part of the flags).
>>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>> ---
>> I have to admit that I don't understand why domain_pgd_maddr() wants to
>> populate all page table levels for DFN 0.
> 
> I think it would be enough to create up to the level requested by the
> caller?
> 
> Seems like a lazy way to always assert that the level requested by the
> caller would be present.

The caller doesn't request any level here. What the caller passes in
is the number of levels the respective IOMMU can deal with (the
variation of which across all IOMMUs is somewhat odd anyway). Hence I
_guess_ that it would really be sufficient to install only as many
levels as are necessary for the loop at the end of the function to
complete successfully. Full depth population would then have happened
only because, until now, addr_to_dma_page_maddr() didn't have a way to
limit the number of levels. But then the comment is misleading. As
I'm merely guessing here, I'm still hoping for Kevin to have (and
share) some insight.

>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -264,63 +264,116 @@ static u64 bus_to_context_maddr(struct v
>>      return maddr;
>>  }
>>  
>> -static u64 addr_to_dma_page_maddr(struct domain *domain, u64 addr, int alloc)
>> +/*
>> + * This function walks (and if requested allocates) page tables to the
>> + * designated target level. It returns
>> + * - 0 when a non-present entry was encountered and no allocation was
>> + *   requested,
>> + * - a small positive value (the level, i.e. below PAGE_SIZE) upon allocation
>> + *   failure,
>> + * - for target > 0 the address of the page table holding the leaf PTE for
>                           ^ physical
> 
> I think it's clearer, as the return type could be ambiguous.

Added.

>> + *   the requested address,
>> + * - for target == 0 the full PTE.
> 
> Could this create confusion if for example one PTE maps physical page
> 0? As the caller when getting back a full PTE with address 0 and some of
> the low bits set could interpret the result as an error.
> 
> I think we already had this discussion on other occasions, but I would
> rather add a parameter to be used as a return placeholder (ie: a
> *dma_pte maybe?) and use the function return value just for errors
> because IMO it's clearer, but I know you don't usually like this
> approach, so I'm not going to insist.

MFN 0 is never used for anything. This in particular includes it not
getting used as a page table.
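
To illustrate why that makes the single return value unambiguous, here is a
standalone sketch (not the actual Xen caller; classify() is hypothetical):
since MFN 0 never backs a page table, every "real" result has address bits
above PAGE_SIZE set, so the three kinds of return value can't collide:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u

    /* hypothetical caller-side classification of the return value */
    static void classify(uint64_t ret)
    {
        if ( !ret )
            printf("non-present entry (no allocation requested)\n");
        else if ( ret < PAGE_SIZE )
            printf("allocation failure at level %u\n", (unsigned int)ret);
        else
            printf("page table maddr / full PTE: %#llx\n",
                   (unsigned long long)ret);
    }

    int main(void)
    {
        classify(0);                 /* not present */
        classify(3);                 /* allocation failed at level 3 */
        classify(0x12345000ULL | 3); /* full PTE: frame address plus access bits */
        return 0;
    }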

>> +static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr,
>> +                                       unsigned int target,
>> +                                       unsigned int *flush_flags, bool alloc)
>>  {
>>      struct domain_iommu *hd = dom_iommu(domain);
>>      int addr_width = agaw_to_width(hd->arch.vtd.agaw);
>>      struct dma_pte *parent, *pte = NULL;
>> -    int level = agaw_to_level(hd->arch.vtd.agaw);
>> -    int offset;
>> +    unsigned int level = agaw_to_level(hd->arch.vtd.agaw), offset;
>>      u64 pte_maddr = 0;
>>  
>>      addr &= (((u64)1) << addr_width) - 1;
>>      ASSERT(spin_is_locked(&hd->arch.mapping_lock));
>> +    ASSERT(target || !alloc);
> 
> Might be better to use an if with ASSERT_UNREACHABLE and return an
> error? (ie: level itself?)

I did consider this, but decided against it because neither of the two
error indicators properly expresses that case. If you're concerned about
hitting the case in a release build, I'd rather switch to BUG_ON().
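
For clarity, the shapes being discussed look roughly like this (sketch only,
in the context of the function being changed, not literal patch code):

    /* as in the patch: debug-build check only */
    ASSERT(target || !alloc);

    /* the suggested alternative: keep release builds going, return an error */
    if ( !target && alloc )
    {
        ASSERT_UNREACHABLE();
        return level; /* awkward: overloads the "allocation failure" indicator */
    }

    /* what I'd switch to if a release-build check is really wanted */
    BUG_ON(!target && alloc);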

>> +
>>      if ( !hd->arch.vtd.pgd_maddr )
>>      {
>>          struct page_info *pg;
>>  
>> -        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
>> +        if ( !alloc )
>> +            goto out;
>> +
>> +        pte_maddr = level;
>> +        if ( !(pg = iommu_alloc_pgtable(domain)) )
>>              goto out;
>>  
>>          hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
>>      }
>>  
>> -    parent = (struct dma_pte *)map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
>> -    while ( level > 1 )
>> +    pte_maddr = hd->arch.vtd.pgd_maddr;
>> +    parent = map_vtd_domain_page(pte_maddr);
>> +    while ( level > target )
>>      {
>>          offset = address_level_offset(addr, level);
>>          pte = &parent[offset];
>>  
>>          pte_maddr = dma_pte_addr(*pte);
>> -        if ( !pte_maddr )
>> +        if ( !dma_pte_present(*pte) || (level > 1 && dma_pte_superpage(*pte)) )
>>          {
>>              struct page_info *pg;
>> +            /*
>> +             * Higher level tables always set r/w, last level page table
>> +             * controls read/write.
>> +             */
>> +            struct dma_pte new_pte = { DMA_PTE_PROT };
>>  
>>              if ( !alloc )
>> -                break;
>> +            {
>> +                pte_maddr = 0;
>> +                if ( !dma_pte_present(*pte) )
>> +                    break;
>> +
>> +                /*
>> +                 * When the leaf entry was requested, pass back the full PTE,
>> +                 * with the address adjusted to account for the residual of
>> +                 * the walk.
>> +                 */
>> +                pte_maddr = pte->val +
> 
> Wouldn't it be better to use dma_pte_addr(*pte) rather than accessing
> pte->val, and then you could drop the PAGE_MASK?
> 
> Or is the addr parameter not guaranteed to be page aligned?

addr is page aligned, but may not be superpage aligned. Yet that's not
the point here. As per the comment at the top of the function (and as
per the needs of intel_iommu_lookup_page()) we want to return a proper
(even if fake) PTE here, i.e. in particular including the access
control bits. Is "full" in the comment not sufficient to express this?
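
To make the adjustment concrete, here is a standalone illustration (not Xen
code; the helpers below merely mirror, as I read them, the VT-d definitions):
for a 2M superpage found at level 2, the bits of addr below the superpage
size but above the page size get folded into the returned value, so the
caller gets back the PTE for the exact 4k frame, access bits included:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT   12
    #define PAGE_MASK    (~((UINT64_C(1) << PAGE_SHIFT) - 1))
    #define LEVEL_STRIDE 9
    /* level 1 -> 12 bits, level 2 -> 21 bits, level 3 -> 30 bits, ... */
    #define level_to_offset_bits(l) (PAGE_SHIFT + ((l) - 1) * LEVEL_STRIDE)

    int main(void)
    {
        uint64_t pte_val = UINT64_C(0x80000000) | 3; /* 2M superpage at 2G, r/w set */
        unsigned int level = 2;                      /* walk stopped at the 2M level */
        uint64_t addr = UINT64_C(0x80145000);        /* page-aligned address looked up */

        /* bits of addr below the superpage size, but above the page size */
        uint64_t residual = addr & ((UINT64_C(1) << level_to_offset_bits(level)) - 1)
                                 & PAGE_MASK;
        uint64_t full_pte = pte_val + residual;

        printf("residual %#" PRIx64 " -> full PTE %#" PRIx64 "\n", residual, full_pte);
        /* prints: residual 0x145000 -> full PTE 0x80145003 */
        return 0;
    }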

>> +                    (addr & ((1UL << level_to_offset_bits(level)) - 1) &
>> +                     PAGE_MASK);
>> +                if ( !target )
>> +                    break;
> 
> I'm confused by the conditional break here, why would you calculate
> pte_maddr unconditionally to get overwritten just the line below if
> target != 0?

That's simply for style reasons - calculating ahead of the if() allows
for less indentation and hence a better flow of the overall expression.
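
I.e. (simplified, taking the two lines in question out of context):

    /* as in the patch: the long expression stays at lower indentation ... */
    pte_maddr = pte->val +
        (addr & ((1UL << level_to_offset_bits(level)) - 1) &
         PAGE_MASK);
    if ( !target )
        break;

    /* ... rather than being wrapped one level deeper inside the if() */
    if ( !target )
    {
        pte_maddr = pte->val +
            (addr & ((1UL << level_to_offset_bits(level)) - 1) &
             PAGE_MASK);
        break;
    }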

Jan
