[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0



On Thu, 16 Sep 2021, Oleksandr wrote:
> > On Wed, 15 Sep 2021, Oleksandr wrote:
> > > > On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
> > > > > From: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>
> > > > > 
> > > > > The extended region (safe range) is a region of guest physical
> > > > > address space which is unused and could be safely used to create
> > > > > grant/foreign mappings instead of wasting real RAM pages from
> > > > > the domain memory for establishing these mappings.
> > > > > 
> > > > > The extended regions are chosen at the domain creation time and
> > > > > advertised to it via "reg" property under hypervisor node in
> > > > > the guest device-tree. As region 0 is reserved for grant table
> > > > > space (always present), the indexes for extended regions are 1...N.
> > > > > If extended regions could not be allocated for some reason,
> > > > > Xen doesn't fail and behaves as usual, so only inserts region 0.
> > > > > 
> > > > > Please note the following limitations:
> > > > > - The extended region feature is only supported for 64-bit domain.
> > > > > - The ACPI case is not covered.
> > > > > 
> > > > > ***
> > > > > 
> > > > > As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
> > > > > the algorithm to choose extended regions for it is different
> > > > > in comparison with the algorithm for non-direct mapped DomU.
> > > > > What is more, that extended regions should be chosen differently
> > > > > whether IOMMU is enabled or not.
> > > > > 
> > > > > Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
> > > > > holes found in host device-tree if otherwise. Make sure that
> > > > > extended regions are 2MB-aligned and located within maximum possible
> > > > > addressable physical memory range. The maximum number of extended
> > > > > regions is 128.
> > > > > 
> > > > > Suggested-by: Julien Grall <jgrall@xxxxxxxxxx>
> > > > > Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>
> > > > > ---
> > > > > Changes since RFC:
> > > > >      - update patch description
> > > > >      - drop uneeded "extended-region" DT property
> > > > > ---
> > > > > 
> > > > >    xen/arch/arm/domain_build.c | 226
> > > > > +++++++++++++++++++++++++++++++++++++++++++-
> > > > >    1 file changed, 224 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> > > > > index 206038d..070ec27 100644
> > > > > --- a/xen/arch/arm/domain_build.c
> > > > > +++ b/xen/arch/arm/domain_build.c
> > > > > @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct
> > > > > domain *d,
> > > > >        return res;
> > > > >    }
> > > > >    +static int __init add_ext_regions(unsigned long s, unsigned long
> > > > > e,
> > > > > void *data)
> > > > > +{
> > > > > +    struct meminfo *ext_regions = data;
> > > > > +    paddr_t start, size;
> > > > > +
> > > > > +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
> > > > > +        return 0;
> > > > > +
> > > > > +    /* Both start and size of the extended region should be 2MB
> > > > > aligned
> > > > > */
> > > > > +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
> > > > > +    if ( start > e )
> > > > > +        return 0;
> > > > > +
> > > > > +    size = (e - start + 1) & ~(SZ_2M - 1);
> > > > > +    if ( !size )
> > > > > +        return 0;
> > > > Can't you align size as well?
> > > > 
> > > >     size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
> > > I am sorry, I don't entirely get what you really meant here. We get both
> > > start
> > > and size 2MB-aligned by the calculations above
> > > (when calculating an alignment, we need to make sure that "start_passed <=
> > > start_aligned && size_aligned <= size_passed").
> > > If I add the proposing string after, I will reduce the already aligned
> > > size by
> > > 2MB.
> > > If I replace the size calculation with the following, I will get the
> > > reduced
> > > size even if the passed region is initially 2MB-aligned, so doesn't need
> > > to be
> > > adjusted.
> > > size = e - s + 1;
> > > size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
> > Sorry I misread your original code, I think it was working as intended
> > except for the "+1". I think it should be:
> > 
> >    size = (e - start) & ~(SZ_2M - 1);
> But why without "+1"? Isn't "e" here the *last address* of passed range?
> Without "+1" I get non entirely correct calculations, last valid 2MB is
> missed.

You are right: the "+1" should not be needed if this was "end",
following the normal definition of end. However, add_ext_regions is
called by rangeset_report_ranges, so end here is not actually "end", it
is "end-1".

For clarity, I would ask you to rewrite it like this:

/* 
 * e is actually "end-1" because it is called by rangeset functions
 * which are inclusive of the last address.
 */
e += 1;
size = (e - start) & ~(SZ_2M - 1);


> [snip]
> (XEN) Extended region 14: 0x580000000->0x5ffe00000
> (XEN) Extended region 15: 0x680000000->0x6ffe00000
> (XEN) Extended region 16: 0x780000000->0xffffe00000
> 
> But should get:
> 
> [snip]
> (XEN) Extended region 15: 0x580000000->0x600000000
> (XEN) Extended region 16: 0x680000000->0x700000000
> (XEN) Extended region 17: 0x780000000->0x10000000000
> 
> Let's consider how a hole between (for example) RAM bank 1 and bank 2 is
> calculated:
> (XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1 with size
> 0x80000000
> (XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2 with size
> 0x80000000
> So the hole size should also be 0x80000000.
> If we pass these RAM banks to rangeset_remove_range() one by one:
> 1: s = 0x500000000 e = 0x57FFFFFFF
> 2. s = 0x600000000 e = 0x67FFFFFFF
> we get s = 0x580000000 e = 0x5FFFFFFFF in add_ext_regions(), where "e" is the
> last address of the hole (not the first address out of the hole), so I think,
> that for proper size calculation we need to add 1 to "e - s". Or I really
> missed something?
> 
> 
> > 
> > > > > + */
> > > > > +#define EXT_REGION_START   0x40000000ULL
> > > > > +#define EXT_REGION_END     0x80003fffffffULL
> > > > > +
> > > > > +static int __init find_unallocated_memory(const struct kernel_info
> > > > > *kinfo,
> > > > > +                                          struct meminfo
> > > > > *ext_regions)
> > > > > +{
> > > > > +    const struct meminfo *assign_mem = &kinfo->mem;
> > > > > +    struct rangeset *unalloc_mem;
> > > > > +    paddr_t start, end;
> > > > > +    unsigned int i;
> > > > > +    int res;
> > > > > +
> > > > > +    dt_dprintk("Find unallocated memory for extended regions\n");
> > > > > +
> > > > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > > > +    if ( !unalloc_mem )
> > > > > +        return -ENOMEM;
> > > > > +
> > > > > +    /* Start with all available RAM */
> > > > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > > > +    {
> > > > > +        start = bootinfo.mem.bank[i].start;
> > > > > +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size
> > > > > - 1;
> > > > Is the -1 needed? Isn't it going to screw up the size calculation later?
> > > I thought, it was needed. The calculation seems correct.
> > I think that normally for an example MMIO region:
> > 
> > start = 0x48000000
> > size  = 0x40000000
> > end   = 0x88000000
> > 
> > So end = start + size and points to the first address out of the range.
> > In other words, 0x88000000 doesn't actually belong to the MMIO region in
> > the example.
> > 
> > But here you are passing addresses to rangeset_add_range and other
> > rangeset functions and I think rangeset takes *inclusive* addresses as
> > input. So you need to pass start and end-1 because end-1 is the last
> > address of the MMIO region.
> > 
> > In fact you can see for instance in map_range_to_domain:
> > 
> >          res = iomem_permit_access(d, paddr_to_pfn(addr),
> >                  paddr_to_pfn(PAGE_ALIGN(addr + len - 1)));
> > 
> > Where iomem_permit_access is based on rangeset. So for clarity, I would
> > do:
> > 
> > start = assign_mem->bank[i].start;
> > end = assign_mem->bank[i].start + assign_mem->bank[i].size;
> > res = rangeset_remove_range(unalloc_mem, start, end - 1);
> > 
> > So that we don't get confused on the meaning of "end" which everywhere
> > else means the first address not in range.
> 
> I got your point, I will update the code if it much cleaner.
> 
> 
> > > > > +        res = rangeset_add_range(unalloc_mem, start, end);
> > > > > +        if ( res )
> > > > > +        {
> > > > > +            printk(XENLOG_ERR "Failed to add:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +                   start, end);
> > > > > +            goto out;
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    /* Remove RAM assigned to Dom0 */
> > > > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > > > +    {
> > > > > +        start = assign_mem->bank[i].start;
> > > > > +        end = assign_mem->bank[i].start + assign_mem->bank[i].size -
> > > > > 1;
> > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > +        if ( res )
> > > > > +        {
> > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +                   start, end);
> > > > > +            goto out;
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    /* Remove reserved-memory regions */
> > > > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > > > +    {
> > > > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > +        if ( res )
> > > > > +        {
> > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +                   start, end);
> > > > > +            goto out;
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    /* Remove grant table region */
> > > > > +    start = kinfo->gnttab_start;
> > > > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > +    if ( res )
> > > > > +    {
> > > > > +        printk(XENLOG_ERR "Failed to remove:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +               start, end);
> > > > > +        goto out;
> > > > > +    }
> > > > > +
> > > > > +    start = EXT_REGION_START;
> > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > > > +                                 add_ext_regions, ext_regions);
> > > > > +    if ( res )
> > > > > +        ext_regions->nr_banks = 0;
> > > > > +    else if ( !ext_regions->nr_banks )
> > > > > +        res = -ENOENT;
> > > > > +
> > > > > +out:
> > > > > +    rangeset_destroy(unalloc_mem);
> > > > > +
> > > > > +    return res;
> > > > > +}
> > > > > +
> > > > > +static int __init find_memory_holes(const struct kernel_info *kinfo,
> > > > > +                                    struct meminfo *ext_regions)
> > > > > +{
> > > > > +    struct dt_device_node *np;
> > > > > +    struct rangeset *mem_holes;
> > > > > +    paddr_t start, end;
> > > > > +    unsigned int i;
> > > > > +    int res;
> > > > > +
> > > > > +    dt_dprintk("Find memory holes for extended regions\n");
> > > > > +
> > > > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > > > +    if ( !mem_holes )
> > > > > +        return -ENOMEM;
> > > > > +
> > > > > +    /* Start with maximum possible addressable physical memory range
> > > > > */
> > > > > +    start = EXT_REGION_START;
> > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > +    res = rangeset_add_range(mem_holes, start, end);
> > > > > +    if ( res )
> > > > > +    {
> > > > > +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > > > > +               start, end);
> > > > > +        goto out;
> > > > > +    }
> > > > > +
> > > > > +    /* Remove all regions described by "reg" property (MMIO, RAM,
> > > > > etc) */
> > > > > +    dt_for_each_device_node( dt_host, np )
> > > > Don't you need something like device_tree_for_each_node ?
> > > > dt_for_each_device_node won't go down any deeper in the tree?
> > > Thank you for pointing this out, I will investigate and update the patch.
> > > 
> > > 
> > > > Alternatively, maybe we could simply record the highest possible address
> > > > of any memory/device/anything as we scan the device tree with
> > > > handle_node. Then we can use that as the starting point here.
> > > I also don't like the idea to scan the DT much, but I failed to find an
> > > effective solution how to avoid that.
> > > Yes, we can record the highest possible address, but I am afraid, I didn't
> > > entirely get a suggestion. Is the suggestion to provide a single region
> > > starting from highest possible address + 1 and up to the EXT_REGION_END
> > > suitably aligned? Could you please clarify?
> > Yes, that is what I was suggesting as a possible alternative: start from
> > the highest possible address in DT + 1 and up to the EXT_REGION_END
> > suitably aligned. But that wouldn't solve the <4GB issue.
> > 
> > > > > +                goto out;
> > > > > +            }
> > > > > +
> > > > > +            start = addr & PAGE_MASK;
> > > > > +            end = PAGE_ALIGN(addr + size) - 1;
> > > > > +            res = rangeset_remove_range(mem_holes, start, end);
> > > > > +            if ( res )
> > > > > +            {
> > > > > +                printk(XENLOG_ERR "Failed to remove:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +                       start, end);
> > > > > +                goto out;
> > > > > +            }
> > > > > +        }
> > > > > +    }
> > > > As is, it will result in a myriad of small ranges which is unuseful and
> > > > slow to parse. I suggest to simplify it by removing a larger region than
> > > > strictly necessary. For instance, you could remove a 1GB-aligned and
> > > > 1GB-multiple region for each range. That way, you are going to get fewer
> > > > large free ranges instance of many small ones which we don't need.
> > > I agree with you that a lot of small ranges increase the bookkeeping in
> > > Dom0
> > > and there is also a theoretical (?) possibility
> > > that small ranges occupy all space we provide for extended regions
> > > (NR_MEM_BANKS)...
> > > But, let's consider my setup as an example again, but when the IOMMU is
> > > enabled for Dom0 ("holes found in DT").
> > > 
> > > --- The RAM configuration is the same:
> > > 
> > > (XEN) RAM: 0000000048000000 - 00000000bfffffff <--- RAM bank 0
> > > (XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1
> > > (XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2
> > > (XEN) RAM: 0000000700000000 - 000000077fffffff <--- RAM bank 3
> > > 
> > > --- There are a lot of various platform devices with reg property
> > > described in
> > > DT, I will probably not post all IO ranges here, just say that mostly all
> > > of
> > > them to be mapped at 0xE0000000-0xFFFFFFFF.
> > > 
> > > --- As we only pick up ranges with size >= 2MB, the calculated extended
> > > regions are (based on 40-bit IPA):
> > > 
> > > (XEN) Extended region 0: 0x40000000->0x47e00000
> > > (XEN) Extended region 1: 0xc0000000->0xe6000000
> > > (XEN) Extended region 2: 0xe7000000->0xe7200000
> > > (XEN) Extended region 3: 0xe7400000->0xe7600000
> > > (XEN) Extended region 4: 0xe7800000->0xec000000
> > > (XEN) Extended region 5: 0xec200000->0xec400000
> > > (XEN) Extended region 6: 0xec800000->0xee000000
> > > (XEN) Extended region 7: 0xee600000->0xee800000
> > > (XEN) Extended region 8: 0xeea00000->0xf1000000
> > > (XEN) Extended region 9: 0xf1200000->0xfd000000
> > > (XEN) Extended region 10: 0xfd200000->0xfd800000
> > > (XEN) Extended region 11: 0xfda00000->0xfe000000
> > > (XEN) Extended region 12: 0xfe200000->0xfe600000
> > > (XEN) Extended region 13: 0xfec00000->0xff800000
> > > (XEN) Extended region 14: 0x100000000->0x500000000
> > > (XEN) Extended region 15: 0x580000000->0x600000000
> > > (XEN) Extended region 16: 0x680000000->0x700000000
> > > (XEN) Extended region 17: 0x780000000->0x10000000000
> > > 
> > > So, if I *correctly* understood your idea about removing 1GB-aligned
> > > 1GB-multiple region for each range we would get the following:
> > > 
> > > (XEN) Extended region 0: 0x100000000->0x500000000
> > > (XEN) Extended region 1: 0x580000000->0x600000000
> > > (XEN) Extended region 2: 0x680000000->0x700000000
> > > (XEN) Extended region 3: 0x780000000->0x10000000000
> > > 
> > > As you can see there are no extended regions below 4GB at all. I assume,
> > > it
> > > would be good to provide them for 1:1 mapped Dom0 (for 32-bit DMA
> > > devices?)
> > > What else worries me is that IPA size could be 36 or even 32. So, I am
> > > afraid,
> > > we might even fail to find extended regions above 4GB.
> > > 
> > > 
> > > I think, if 2MB is considered small enough to bother with, probably we
> > > should
> > > go with something in between (16MB, 32MB, 64MB).
> > > For example, we can take into the account ranges with size >= 16MB:
> > > 
> > > (XEN) Extended region 0: 0x40000000->0x47e00000
> > > (XEN) Extended region 1: 0xc0000000->0xe6000000
> > > (XEN) Extended region 2: 0xe7800000->0xec000000
> > > (XEN) Extended region 3: 0xec800000->0xee000000
> > > (XEN) Extended region 4: 0xeea00000->0xf1000000
> > > (XEN) Extended region 5: 0xf1200000->0xfd000000
> > > (XEN) Extended region 6: 0x100000000->0x500000000
> > > (XEN) Extended region 7: 0x580000000->0x600000000
> > > (XEN) Extended region 8: 0x680000000->0x700000000
> > > (XEN) Extended region 9: 0x780000000->0x10000000000
> > > 
> > > Any thoughts?
> > Yeah maybe an intermediate value would be best. I'd go with 64MB.
> 
> I completely agree.
> 
> So what I got on my setup with that value.
> 
> 1. IOMMU is enabled for Dom0:
> 
> (XEN) Extended region 0: 0x40000000->0x47e00000
> (XEN) Extended region 1: 0xc0000000->0xe6000000
> (XEN) Extended region 2: 0xe7800000->0xec000000
> (XEN) Extended region 3: 0xf1200000->0xfd000000
> (XEN) Extended region 4: 0x100000000->0x500000000
> (XEN) Extended region 5: 0x580000000->0x600000000
> (XEN) Extended region 6: 0x680000000->0x700000000
> (XEN) Extended region 7: 0x780000000->0x10000000000
> 
> 2. IOMMU is disabled for Dom0:
> 
> (XEN) Extended region 0: 0x48000000->0x54000000
> (XEN) Extended region 1: 0x57000000->0x60000000
> (XEN) Extended region 2: 0x70000000->0x78000000
> (XEN) Extended region 3: 0x78200000->0xc0000000
> (XEN) Extended region 4: 0x500000000->0x580000000
> (XEN) Extended region 5: 0x600000000->0x680000000
> (XEN) Extended region 6: 0x700000000->0x780000000
> 
> Which is not bad.

Yeah I think that's good.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.