[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0




On 17.09.21 00:30, Stefano Stabellini wrote:

Hi Stefano

On Thu, 16 Sep 2021, Oleksandr wrote:
On Wed, 15 Sep 2021, Oleksandr wrote:
On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
From: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>

The extended region (safe range) is a region of guest physical
address space which is unused and could be safely used to create
grant/foreign mappings instead of wasting real RAM pages from
the domain memory for establishing these mappings.

The extended regions are chosen at the domain creation time and
advertised to it via "reg" property under hypervisor node in
the guest device-tree. As region 0 is reserved for grant table
space (always present), the indexes for extended regions are 1...N.
If extended regions could not be allocated for some reason,
Xen doesn't fail and behaves as usual, so only inserts region 0.

Please note the following limitations:
- The extended region feature is only supported for 64-bit domain.
- The ACPI case is not covered.

***

As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
the algorithm to choose extended regions for it is different
in comparison with the algorithm for non-direct mapped DomU.
What is more, that extended regions should be chosen differently
whether IOMMU is enabled or not.

Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
holes found in host device-tree if otherwise. Make sure that
extended regions are 2MB-aligned and located within maximum possible
addressable physical memory range. The maximum number of extended
regions is 128.

Suggested-by: Julien Grall <jgrall@xxxxxxxxxx>
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>
---
Changes since RFC:
      - update patch description
      - drop uneeded "extended-region" DT property
---

    xen/arch/arm/domain_build.c | 226
+++++++++++++++++++++++++++++++++++++++++++-
    1 file changed, 224 insertions(+), 2 deletions(-)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 206038d..070ec27 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -724,6 +724,196 @@ static int __init make_memory_node(const struct
domain *d,
        return res;
    }
    +static int __init add_ext_regions(unsigned long s, unsigned long
e,
void *data)
+{
+    struct meminfo *ext_regions = data;
+    paddr_t start, size;
+
+    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
+        return 0;
+
+    /* Both start and size of the extended region should be 2MB
aligned
*/
+    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
+    if ( start > e )
+        return 0;
+
+    size = (e - start + 1) & ~(SZ_2M - 1);
+    if ( !size )
+        return 0;
Can't you align size as well?

     size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
I am sorry, I don't entirely get what you really meant here. We get both
start
and size 2MB-aligned by the calculations above
(when calculating an alignment, we need to make sure that "start_passed <=
start_aligned && size_aligned <= size_passed").
If I add the proposing string after, I will reduce the already aligned
size by
2MB.
If I replace the size calculation with the following, I will get the
reduced
size even if the passed region is initially 2MB-aligned, so doesn't need
to be
adjusted.
size = e - s + 1;
size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
Sorry I misread your original code, I think it was working as intended
except for the "+1". I think it should be:

    size = (e - start) & ~(SZ_2M - 1);
But why without "+1"? Isn't "e" here the *last address* of passed range?
Without "+1" I get non entirely correct calculations, last valid 2MB is
missed.
You are right: the "+1" should not be needed if this was "end",
following the normal definition of end. However, add_ext_regions is
called by rangeset_report_ranges, so end here is not actually "end", it
is "end-1".

Yes.



For clarity, I would ask you to rewrite it like this:

/*
  * e is actually "end-1" because it is called by rangeset functions
  * which are inclusive of the last address.
  */
e += 1;
size = (e - start) & ~(SZ_2M - 1);

Ack, will do.




[snip]
(XEN) Extended region 14: 0x580000000->0x5ffe00000
(XEN) Extended region 15: 0x680000000->0x6ffe00000
(XEN) Extended region 16: 0x780000000->0xffffe00000

But should get:

[snip]
(XEN) Extended region 15: 0x580000000->0x600000000
(XEN) Extended region 16: 0x680000000->0x700000000
(XEN) Extended region 17: 0x780000000->0x10000000000

Let's consider how a hole between (for example) RAM bank 1 and bank 2 is
calculated:
(XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1 with size
0x80000000
(XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2 with size
0x80000000
So the hole size should also be 0x80000000.
If we pass these RAM banks to rangeset_remove_range() one by one:
1: s = 0x500000000 e = 0x57FFFFFFF
2. s = 0x600000000 e = 0x67FFFFFFF
we get s = 0x580000000 e = 0x5FFFFFFFF in add_ext_regions(), where "e" is the
last address of the hole (not the first address out of the hole), so I think,
that for proper size calculation we need to add 1 to "e - s". Or I really
missed something?


+ */
+#define EXT_REGION_START   0x40000000ULL
+#define EXT_REGION_END     0x80003fffffffULL
+
+static int __init find_unallocated_memory(const struct kernel_info
*kinfo,
+                                          struct meminfo
*ext_regions)
+{
+    const struct meminfo *assign_mem = &kinfo->mem;
+    struct rangeset *unalloc_mem;
+    paddr_t start, end;
+    unsigned int i;
+    int res;
+
+    dt_dprintk("Find unallocated memory for extended regions\n");
+
+    unalloc_mem = rangeset_new(NULL, NULL, 0);
+    if ( !unalloc_mem )
+        return -ENOMEM;
+
+    /* Start with all available RAM */
+    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
+    {
+        start = bootinfo.mem.bank[i].start;
+        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size
- 1;
Is the -1 needed? Isn't it going to screw up the size calculation later?
I thought, it was needed. The calculation seems correct.
I think that normally for an example MMIO region:

start = 0x48000000
size  = 0x40000000
end   = 0x88000000

So end = start + size and points to the first address out of the range.
In other words, 0x88000000 doesn't actually belong to the MMIO region in
the example.

But here you are passing addresses to rangeset_add_range and other
rangeset functions and I think rangeset takes *inclusive* addresses as
input. So you need to pass start and end-1 because end-1 is the last
address of the MMIO region.

In fact you can see for instance in map_range_to_domain:

          res = iomem_permit_access(d, paddr_to_pfn(addr),
                  paddr_to_pfn(PAGE_ALIGN(addr + len - 1)));

Where iomem_permit_access is based on rangeset. So for clarity, I would
do:

start = assign_mem->bank[i].start;
end = assign_mem->bank[i].start + assign_mem->bank[i].size;
res = rangeset_remove_range(unalloc_mem, start, end - 1);

So that we don't get confused on the meaning of "end" which everywhere
else means the first address not in range.
I got your point, I will update the code if it much cleaner.


+        res = rangeset_add_range(unalloc_mem, start, end);
+        if ( res )
+        {
+            printk(XENLOG_ERR "Failed to add:
%#"PRIx64"->%#"PRIx64"\n",
+                   start, end);
+            goto out;
+        }
+    }
+
+    /* Remove RAM assigned to Dom0 */
+    for ( i = 0; i < assign_mem->nr_banks; i++ )
+    {
+        start = assign_mem->bank[i].start;
+        end = assign_mem->bank[i].start + assign_mem->bank[i].size -
1;
+        res = rangeset_remove_range(unalloc_mem, start, end);
+        if ( res )
+        {
+            printk(XENLOG_ERR "Failed to remove:
%#"PRIx64"->%#"PRIx64"\n",
+                   start, end);
+            goto out;
+        }
+    }
+
+    /* Remove reserved-memory regions */
+    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
+    {
+        start = bootinfo.reserved_mem.bank[i].start;
+        end = bootinfo.reserved_mem.bank[i].start +
+            bootinfo.reserved_mem.bank[i].size - 1;
+        res = rangeset_remove_range(unalloc_mem, start, end);
+        if ( res )
+        {
+            printk(XENLOG_ERR "Failed to remove:
%#"PRIx64"->%#"PRIx64"\n",
+                   start, end);
+            goto out;
+        }
+    }
+
+    /* Remove grant table region */
+    start = kinfo->gnttab_start;
+    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
+    res = rangeset_remove_range(unalloc_mem, start, end);
+    if ( res )
+    {
+        printk(XENLOG_ERR "Failed to remove:
%#"PRIx64"->%#"PRIx64"\n",
+               start, end);
+        goto out;
+    }
+
+    start = EXT_REGION_START;
+    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
+    res = rangeset_report_ranges(unalloc_mem, start, end,
+                                 add_ext_regions, ext_regions);
+    if ( res )
+        ext_regions->nr_banks = 0;
+    else if ( !ext_regions->nr_banks )
+        res = -ENOENT;
+
+out:
+    rangeset_destroy(unalloc_mem);
+
+    return res;
+}
+
+static int __init find_memory_holes(const struct kernel_info *kinfo,
+                                    struct meminfo *ext_regions)
+{
+    struct dt_device_node *np;
+    struct rangeset *mem_holes;
+    paddr_t start, end;
+    unsigned int i;
+    int res;
+
+    dt_dprintk("Find memory holes for extended regions\n");
+
+    mem_holes = rangeset_new(NULL, NULL, 0);
+    if ( !mem_holes )
+        return -ENOMEM;
+
+    /* Start with maximum possible addressable physical memory range
*/
+    start = EXT_REGION_START;
+    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
+    res = rangeset_add_range(mem_holes, start, end);
+    if ( res )
+    {
+        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
+               start, end);
+        goto out;
+    }
+
+    /* Remove all regions described by "reg" property (MMIO, RAM,
etc) */
+    dt_for_each_device_node( dt_host, np )
Don't you need something like device_tree_for_each_node ?
dt_for_each_device_node won't go down any deeper in the tree?
Thank you for pointing this out, I will investigate and update the patch.


Alternatively, maybe we could simply record the highest possible address
of any memory/device/anything as we scan the device tree with
handle_node. Then we can use that as the starting point here.
I also don't like the idea to scan the DT much, but I failed to find an
effective solution how to avoid that.
Yes, we can record the highest possible address, but I am afraid, I didn't
entirely get a suggestion. Is the suggestion to provide a single region
starting from highest possible address + 1 and up to the EXT_REGION_END
suitably aligned? Could you please clarify?
Yes, that is what I was suggesting as a possible alternative: start from
the highest possible address in DT + 1 and up to the EXT_REGION_END
suitably aligned. But that wouldn't solve the <4GB issue.

+                goto out;
+            }
+
+            start = addr & PAGE_MASK;
+            end = PAGE_ALIGN(addr + size) - 1;
+            res = rangeset_remove_range(mem_holes, start, end);
+            if ( res )
+            {
+                printk(XENLOG_ERR "Failed to remove:
%#"PRIx64"->%#"PRIx64"\n",
+                       start, end);
+                goto out;
+            }
+        }
+    }
As is, it will result in a myriad of small ranges which is unuseful and
slow to parse. I suggest to simplify it by removing a larger region than
strictly necessary. For instance, you could remove a 1GB-aligned and
1GB-multiple region for each range. That way, you are going to get fewer
large free ranges instance of many small ones which we don't need.
I agree with you that a lot of small ranges increase the bookkeeping in
Dom0
and there is also a theoretical (?) possibility
that small ranges occupy all space we provide for extended regions
(NR_MEM_BANKS)...
But, let's consider my setup as an example again, but when the IOMMU is
enabled for Dom0 ("holes found in DT").

--- The RAM configuration is the same:

(XEN) RAM: 0000000048000000 - 00000000bfffffff <--- RAM bank 0
(XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1
(XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2
(XEN) RAM: 0000000700000000 - 000000077fffffff <--- RAM bank 3

--- There are a lot of various platform devices with reg property
described in
DT, I will probably not post all IO ranges here, just say that mostly all
of
them to be mapped at 0xE0000000-0xFFFFFFFF.

--- As we only pick up ranges with size >= 2MB, the calculated extended
regions are (based on 40-bit IPA):

(XEN) Extended region 0: 0x40000000->0x47e00000
(XEN) Extended region 1: 0xc0000000->0xe6000000
(XEN) Extended region 2: 0xe7000000->0xe7200000
(XEN) Extended region 3: 0xe7400000->0xe7600000
(XEN) Extended region 4: 0xe7800000->0xec000000
(XEN) Extended region 5: 0xec200000->0xec400000
(XEN) Extended region 6: 0xec800000->0xee000000
(XEN) Extended region 7: 0xee600000->0xee800000
(XEN) Extended region 8: 0xeea00000->0xf1000000
(XEN) Extended region 9: 0xf1200000->0xfd000000
(XEN) Extended region 10: 0xfd200000->0xfd800000
(XEN) Extended region 11: 0xfda00000->0xfe000000
(XEN) Extended region 12: 0xfe200000->0xfe600000
(XEN) Extended region 13: 0xfec00000->0xff800000
(XEN) Extended region 14: 0x100000000->0x500000000
(XEN) Extended region 15: 0x580000000->0x600000000
(XEN) Extended region 16: 0x680000000->0x700000000
(XEN) Extended region 17: 0x780000000->0x10000000000

So, if I *correctly* understood your idea about removing 1GB-aligned
1GB-multiple region for each range we would get the following:

(XEN) Extended region 0: 0x100000000->0x500000000
(XEN) Extended region 1: 0x580000000->0x600000000
(XEN) Extended region 2: 0x680000000->0x700000000
(XEN) Extended region 3: 0x780000000->0x10000000000

As you can see there are no extended regions below 4GB at all. I assume,
it
would be good to provide them for 1:1 mapped Dom0 (for 32-bit DMA
devices?)
What else worries me is that IPA size could be 36 or even 32. So, I am
afraid,
we might even fail to find extended regions above 4GB.


I think, if 2MB is considered small enough to bother with, probably we
should
go with something in between (16MB, 32MB, 64MB).
For example, we can take into the account ranges with size >= 16MB:

(XEN) Extended region 0: 0x40000000->0x47e00000
(XEN) Extended region 1: 0xc0000000->0xe6000000
(XEN) Extended region 2: 0xe7800000->0xec000000
(XEN) Extended region 3: 0xec800000->0xee000000
(XEN) Extended region 4: 0xeea00000->0xf1000000
(XEN) Extended region 5: 0xf1200000->0xfd000000
(XEN) Extended region 6: 0x100000000->0x500000000
(XEN) Extended region 7: 0x580000000->0x600000000
(XEN) Extended region 8: 0x680000000->0x700000000
(XEN) Extended region 9: 0x780000000->0x10000000000

Any thoughts?
Yeah maybe an intermediate value would be best. I'd go with 64MB.
I completely agree.

So what I got on my setup with that value.

1. IOMMU is enabled for Dom0:

(XEN) Extended region 0: 0x40000000->0x47e00000
(XEN) Extended region 1: 0xc0000000->0xe6000000
(XEN) Extended region 2: 0xe7800000->0xec000000
(XEN) Extended region 3: 0xf1200000->0xfd000000
(XEN) Extended region 4: 0x100000000->0x500000000
(XEN) Extended region 5: 0x580000000->0x600000000
(XEN) Extended region 6: 0x680000000->0x700000000
(XEN) Extended region 7: 0x780000000->0x10000000000

2. IOMMU is disabled for Dom0:

(XEN) Extended region 0: 0x48000000->0x54000000
(XEN) Extended region 1: 0x57000000->0x60000000
(XEN) Extended region 2: 0x70000000->0x78000000
(XEN) Extended region 3: 0x78200000->0xc0000000
(XEN) Extended region 4: 0x500000000->0x580000000
(XEN) Extended region 5: 0x600000000->0x680000000
(XEN) Extended region 6: 0x700000000->0x780000000

Which is not bad.
Yeah I think that's good.

--
Regards,

Oleksandr Tyshchenko




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.