
Re: [Xen-devel] [RFC PATCH v2 07/26] ARM: GICv3 ITS: introduce host LPI array



Hi,

On 05/01/17 18:56, Stefano Stabellini wrote:
> On Thu, 22 Dec 2016, Andre Przywara wrote:
>> The number of LPIs on a host can be potentially huge (millions),
>> although in practice it will be mostly reasonable. So prematurely allocating
>> an array of struct irq_desc's for each LPI is not an option.
>> However Xen itself does not care about LPIs, as every LPI will be injected
>> into a guest (Dom0 for now).
>> Create a dense data structure (8 Bytes) for each LPI which holds just
>> enough information to determine the virtual IRQ number and the VCPU into
>> which the LPI needs to be injected.
>> Also to not artificially limit the number of LPIs, we create a 2-level
>> table for holding those structures.
>> This patch introduces functions to initialize these tables and to
>> create, lookup and destroy entries for a given LPI.
>> We allocate and access LPI information in a way that does not require
>> a lock.
>>
>> Signed-off-by: Andre Przywara <andre.przywara@xxxxxxx>
>> ---
>>  xen/arch/arm/gic-its.c        | 233 +++++++++++++++++++++++++++++++++++++++++-
>>  xen/include/asm-arm/gic-its.h |   1 +
>>  2 files changed, 233 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/arm/gic-its.c b/xen/arch/arm/gic-its.c
>> index e157c6b..e7ddd90 100644
>> --- a/xen/arch/arm/gic-its.c
>> +++ b/xen/arch/arm/gic-its.c
>> @@ -18,21 +18,36 @@
>>  
>>  #include <xen/config.h>
>>  #include <xen/lib.h>
>> +#include <xen/sched.h>
>>  #include <xen/err.h>
>>  #include <xen/device_tree.h>
>>  #include <xen/libfdt/libfdt.h>
>>  #include <xen/sched.h>
>>  #include <xen/sizes.h>
>>  #include <asm/p2m.h>
>> +#include <asm/domain.h>
>>  #include <asm/io.h>
>>  #include <asm/gic.h>
>>  #include <asm/gic_v3_defs.h>
>>  #include <asm/gic-its.h>
>>  
>> +/* LPIs on the host always go to a guest, so no struct irq_desc for them. */
>> +union host_lpi {
>> +    uint64_t data;
>> +    struct {
>> +        uint64_t virt_lpi:32;
>> +        uint64_t dom_id:16;
>> +        uint64_t vcpu_id:16;
>> +    };
>> +};
> 
> Just go with a regular struct
> 
>     struct host_lpi {
>         uint32_t virt_lpi;
>         uint16_t dom_id;
>         uint16_t vcpu_id;
>     };
> 
> The aarch64 C ABI guarantees the alignments of the fields.

Yes, I will get rid of the bitfields. But the actual purpose of the
union is to allow lock-free atomic access. I just see now that I failed
to document that, sorry!

We can't afford to take a lock around the actual data here, so the idea
was to use the naturally atomic access that a native data type gives us.
When we want to write multiple members, we assemble them in a local
copy and then write the whole uint64_t into the actual location.
Similarly for reading. Single members can still be updated through the
struct members directly.
Since the architecture guarantees atomicity for an aligned memory
access to/from a GPR, I think this is safe.
I am not sure we need to use the atomic_<type>_{read,write} accessors
here. I tried them: the resulting assembly is identical, and the source
doesn't look too bad either, so I guess I will switch over to them, just
to be safe.

Please note that this is not about access ordering due to concurrent
accesses (so no barriers are needed); we just need to guarantee that a
host_lpi's state is consistent with respect to its members.
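
For illustration, a minimal sketch of that access pattern (hypothetical
helper names; read_u64_atomic/write_u64_atomic stand in for whatever
accessor we end up using, be it Xen's atomic helpers or a plain aligned
uint64_t access, which the architecture already makes single-copy
atomic):

    /* Writer: assemble a consistent value locally, publish it in one store. */
    static void host_lpi_set(union host_lpi *entry, uint32_t virt_lpi,
                             uint16_t dom_id, uint16_t vcpu_id)
    {
        union host_lpi hlpi;

        hlpi.virt_lpi = virt_lpi;
        hlpi.dom_id = dom_id;
        hlpi.vcpu_id = vcpu_id;

        write_u64_atomic(&entry->data, hlpi.data);
    }

    /* Reader: snapshot all members with one load, then only use the copy. */
    static union host_lpi host_lpi_get(union host_lpi *entry)
    {
        union host_lpi hlpi;

        hlpi.data = read_u64_atomic(&entry->data);

        return hlpi;
    }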

>>  /* Global state */
>>  static struct {
>>      uint8_t *lpi_property;
>> +    union host_lpi **host_lpis;
>>      unsigned int host_lpi_bits;
>> +    /* Protects allocation and deallocation of host LPIs, but not the access */
>> +    spinlock_t host_lpis_lock;
>>  } lpi_data;
>>  
>>  /* Physical redistributor address */
>> @@ -43,6 +58,19 @@ static DEFINE_PER_CPU(uint64_t, rdist_id);
>>  static DEFINE_PER_CPU(void *, pending_table);
>>  
>>  #define MAX_PHYS_LPIS   (BIT_ULL(lpi_data.host_lpi_bits) - 8192)
>> +#define HOST_LPIS_PER_PAGE      (PAGE_SIZE / sizeof(union host_lpi))
>> +
>> +static union host_lpi *gic_get_host_lpi(uint32_t plpi)
>> +{
>> +    if ( plpi < 8192 || plpi >= MAX_PHYS_LPIS + 8192 )
>> +        return NULL;
>> +
>> +    plpi -= 8192;
>> +    if ( !lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE] )
>> +        return NULL;
>> +
>> +    return &lpi_data.host_lpis[plpi / HOST_LPIS_PER_PAGE][plpi % HOST_LPIS_PER_PAGE];
>> +}
>>  
>>  #define ITS_CMD_QUEUE_SZ                                SZ_64K
>>  
>> @@ -96,6 +124,20 @@ static int its_send_cmd_sync(struct host_its *its, int cpu)
>>      return its_send_command(its, cmd);
>>  }
>>  
>> +static int its_send_cmd_mapti(struct host_its *its,
>> +                              uint32_t deviceid, uint32_t eventid,
>> +                              uint32_t pintid, uint16_t icid)
>> +{
>> +    uint64_t cmd[4];
>> +
>> +    cmd[0] = GITS_CMD_MAPTI | ((uint64_t)deviceid << 32);
>> +    cmd[1] = eventid | ((uint64_t)pintid << 32);
>> +    cmd[2] = icid;
>> +    cmd[3] = 0x00;
>> +
>> +    return its_send_command(its, cmd);
>> +}
>> +
>>  static int its_send_cmd_mapc(struct host_its *its, int collection_id, int cpu)
>>  {
>>      uint64_t cmd[4];
>> @@ -124,6 +166,19 @@ static int its_send_cmd_mapd(struct host_its *its, uint32_t deviceid,
>>      return its_send_command(its, cmd);
>>  }
>>  
>> +static int its_send_cmd_inv(struct host_its *its,
>> +                            uint32_t deviceid, uint32_t eventid)
>> +{
>> +    uint64_t cmd[4];
>> +
>> +    cmd[0] = GITS_CMD_INV | ((uint64_t)deviceid << 32);
>> +    cmd[1] = eventid;
>> +    cmd[2] = 0x00;
>> +    cmd[3] = 0x00;
>> +
>> +    return its_send_command(its, cmd);
>> +}
>> +
>>  /* Set up the (1:1) collection mapping for the given host CPU. */
>>  void gicv3_its_setup_collection(int cpu)
>>  {
>> @@ -366,21 +421,181 @@ uint64_t gicv3_lpi_get_proptable()
>>  static unsigned int max_lpi_bits = CONFIG_MAX_HOST_LPI_BITS;
>>  integer_param("max_lpi_bits", max_lpi_bits);
>>  
>> +/* Allocate the 2nd level array for host LPIs. This one holds pointers
>> + * to the page with the actual "union host_lpi" entries. Our LPI limit
>> + * avoids excessive memory usage.
>> + */
>>  int gicv3_lpi_init_host_lpis(unsigned int hw_lpi_bits)
>>  {
>> +    int nr_lpi_ptrs;
>> +
>>      lpi_data.host_lpi_bits = min(hw_lpi_bits, max_lpi_bits);
>>  
>> +    spin_lock_init(&lpi_data.host_lpis_lock);
>> +
>> +    nr_lpi_ptrs = MAX_PHYS_LPIS / (PAGE_SIZE / sizeof(union host_lpi));
>> +    lpi_data.host_lpis = xzalloc_array(union host_lpi *, nr_lpi_ptrs);
>> +    if ( !lpi_data.host_lpis )
>> +        return -ENOMEM;
>> +
>>      printk("GICv3: using at most %lld LPIs on the host.\n", MAX_PHYS_LPIS);
>>  
>>      return 0;
>>  }
>>  
>> +#define INVALID_DOMID ((uint16_t)~0)
>> +#define LPI_BLOCK       32
> 
> Why 32? What is the reason behind it? It should be explained in the
> commit message and in the code.

There are multiple reasons for 32:
1) It is the maximum number of MSIs per device for (legacy) PCI devices.
2) It is the smallest reasonable ITT size (256-byte alignment with
typically 8-byte sized entries).
3) It is the number of ITT entries the Linux kernel normally allocates.

The previous version worked without blocks, by allocating single LPIs.
Now our early device and LPI mapping mandates that a number of events be
allocated up front, which we happen to set to 32. Ideally this number
would be passed in by a guest or Dom0, but that would require changing
the (hypercall) interfaces. We may do this later, but for now 32 seems
to be good enough.
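
So the in-code comment could look something like this (a sketch of the
documentation only, to go along with the existing define):

    /*
     * Allocate LPIs in blocks of 32: the maximum number of MSIs a
     * (legacy) PCI device can use, the number of events covered by the
     * smallest reasonable ITT (256-byte alignment with 8-byte entries),
     * and what the Linux kernel normally allocates per device.
     */
    #define LPI_BLOCK       32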

> Please add some numbers of memory consumption and average number of
> iterations required to setup lpis on an example server system. It is
> important to understand if the current algorithm and data structs are
> sufficient for our needs.

OK, can do.

> With the new host_lpis data struct below, the memory consumption is
> 
>     ROUNDUP(nr_plpis * 8, 4096) +
>     (ROUNDUP(nr_plpis * 8, 4096) / 4096) * 8 +
>     (nr_events / 32) * 4

I am not sure I get the meaning of the lines here, since you mix
physical LPIs (on the host) with events (per mapped device in each
domain).

For physical LPIs:
- One top-level table, with one pointer per (4K / sizeof(union
host_lpi)) LPIs, so one pointer for every 512 LPIs. The number of LPIs
is a compile-time constant, possibly overridden via the command line.
The default value is 20 bits, so the top-level table takes 16 KB.
- For each actually allocated contiguous block of 512 LPIs, another 4K
page to hold the actual union host_lpi entries. This is densely
allocated, so one 4K page covers 16 passed-through devices (with 32
LPIs of 8 bytes each), regardless of their device IDs or virtual LPIs.
So 128 (PCI) devices take up 32 KB, for instance; the default maximum of
1024 devices needs 256 KB (if every device is actually used).

This covers the data structures for our internal physical LPI
forwarding information.
Additionally the host ITS requires some memory: one ITT per used device
(32 events (=MSIs) * 8 bytes = 256 bytes) and nr_devices (=1024 by
default) * 8 bytes for the device table. That adds up to 256 KB (worst
case) + 8 KB.

I guess I will either add the broken-down memory requirements into some
documentation file or into the design document.

> Which, assuming nr_events = nr_plpis = 1024, means 8192 + 16 + 128, right?
> 
> 
>> +/* Must be called with host_lpis_lock held. */
>> +static int find_unused_host_lpi(int start, uint32_t *index)
>> +{
>> +    int chunk;
>> +    uint32_t i = *index;
>> +
>> +    for ( chunk = start; chunk < MAX_PHYS_LPIS / HOST_LPIS_PER_PAGE; chunk++ )
>> +    {
>> +        /* If we hit an unallocated chunk, use entry 0 in that one. */
>> +        if ( !lpi_data.host_lpis[chunk] )
>> +        {
>> +            *index = 0;
>> +            return chunk;
>> +        }
>> +
>> +        /* Find an unallocated entry in this chunk. */
>> +        for ( ; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
>> +        {
>> +            if ( lpi_data.host_lpis[chunk][i].dom_id == INVALID_DOMID )
>> +            {
>> +                *index = i;
>> +                return chunk;
>> +            }
> 
> This assumes that plpis are always allocated 32 at a time, by not
> checking within a block.

Yes, this is fine, as the only function that allocates blocks (below)
always does so in chunks of LPI_BLOCK.

>> +        }
>> +        i = 0;
>> +    }
>> +
>> +    return -1;
>> +}
>> +
>> +/*
>> + * Allocate a block of 32 LPIs on the given host ITS for device "devid",
>> + * starting with "eventid". Put them into the respective ITT by issuing a
>> + * MAPTI command for each of them.
>> + */
>> +static int gicv3_allocate_host_lpi_block(struct host_its *its, struct domain *d,
>> +                                         uint32_t host_devid, uint32_t eventid)
>> +{
>> +    static uint32_t next_lpi = 0;
>> +    uint32_t lpi, lpi_idx = next_lpi % HOST_LPIS_PER_PAGE;
>> +    int chunk;
>> +    int i;
>> +
>> +    spin_lock(&lpi_data.host_lpis_lock);
>> +    chunk = find_unused_host_lpi(next_lpi / HOST_LPIS_PER_PAGE, &lpi_idx);
>> +
>> +    if ( chunk == -1 )           /* rescan for a hole from the beginning */
>> +    {
>> +        lpi_idx = 0;
>> +        chunk = find_unused_host_lpi(0, &lpi_idx);
>> +        if ( chunk == -1 )
>> +        {
>> +            spin_unlock(&lpi_data.host_lpis_lock);
>> +            return -ENOSPC;
>> +        }
>> +    }
>> +
>> +    /* If we hit an unallocated chunk, we initialize it and use entry 0. */
>> +    if ( !lpi_data.host_lpis[chunk] )
>> +    {
>> +        union host_lpi *new_chunk;
>> +
>> +        new_chunk = alloc_xenheap_pages(0, 0);
>> +        if ( !new_chunk )
>> +        {
>> +            spin_unlock(&lpi_data.host_lpis_lock);
>> +            return -ENOMEM;
>> +        }
>> +
>> +        for ( i = 0; i < HOST_LPIS_PER_PAGE; i += LPI_BLOCK )
>> +            new_chunk[i].dom_id = INVALID_DOMID;
> 
> Don't we need to properly initialize each entry in the block? I would
> use memset for initialization.

If the domain ID is invalid, the other two fields will always be
ignored. Only in the case of a valid domain ID do we ever look at the
LPI number and the VCPU ID.
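
To illustrate, a consumer would check the domain ID before trusting the
other fields anyway (a sketch, reusing the hypothetical host_lpi_get()
from above):

    union host_lpi *hlpip = gic_get_host_lpi(plpi);
    union host_lpi hlpi;

    if ( !hlpip )
        return;

    hlpi = host_lpi_get(hlpip);
    if ( hlpi.dom_id == INVALID_DOMID )
        return;    /* free entry: virt_lpi and vcpu_id are never read */

    /* Only now do we look at hlpi.virt_lpi and hlpi.vcpu_id. */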

>> +        lpi_data.host_lpis[chunk] = new_chunk;
>> +        lpi_idx = 0;
>> +    }
>> +
>> +    lpi = chunk * HOST_LPIS_PER_PAGE + lpi_idx;
>> +
>> +    for ( i = 0; i < LPI_BLOCK; i++ )
>> +    {
>> +        union host_lpi hlpi;
>> +
>> +        /*
>> +         * Mark this host LPI as belonging to the domain, but don't assign
>> +         * any virtual LPI or a VCPU yet.
>> +         */
>> +        hlpi.virt_lpi = 0;
> 
>  #define 0 to invalid vlpi
> 
> 
>> +        hlpi.dom_id = d->domain_id;
>> +        hlpi.vcpu_id = ~0;
> 
>  #define ~0 to invalid vcpu

Sure.

>> +        lpi_data.host_lpis[chunk][lpi_idx + i].data = hlpi.data;
>> +
>> +        /*
>> +         * Enable this host LPI, so we don't have to do this during the
>> +         * guest's runtime.
>> +         */
>> +        lpi_data.lpi_property[lpi + i] |= LPI_PROP_ENABLED;
>> +    }
>> +
>> +    /* We have allocated and initialized the host LPI entries, so it's safe
>> +     * to drop the lock now. Access to the structures can be done concurrently
>> +     * as it involves only an atomic uint64_t access.
>> +     */
>> +    spin_unlock(&lpi_data.host_lpis_lock);
>> +
>> +    __flush_dcache_area(&lpi_data.lpi_property[lpi], LPI_BLOCK);
>> +
>> +    /* Tell the redistributor about the changed enabled bits and map the LPIs */
>> +    for ( i = 0; i < LPI_BLOCK; i++ )
>> +    {
>> +        its_send_cmd_mapti(its, host_devid, eventid + i, lpi + i + 8192, 0);
>> +        its_send_cmd_inv(its, host_devid, eventid + i);
>> +    }
>> +
>> +    its_send_cmd_sync(its, 0);
>> +
>> +    next_lpi = lpi + LPI_BLOCK;
>> +    return lpi + 8192;
>> +}
>> +
>> +static int gicv3_free_host_lpi_block(struct host_its *its, uint32_t lpi)
>> +{
>> +    union host_lpi *hlpi, empty_lpi = { .dom_id = INVALID_DOMID };
>> +    int i;
>> +
>> +    hlpi = gic_get_host_lpi(lpi);
>> +    if ( !hlpi )
>> +        return -ENOENT;
>> +
>> +    spin_lock(&lpi_data.host_lpis_lock);
>> +
>> +    for ( i = 0; i < LPI_BLOCK; i++ )
>> +        hlpi[i].data = empty_lpi.data;
>> +
>> +    spin_unlock(&lpi_data.host_lpis_lock);
>> +
>> +    return 0;
>> +}
>> +
>>  static void remove_mapped_guest_device(struct its_devices *dev)
>>  {
>> +    int i;
>> +
>>      if ( dev->hw_its )
>>          its_send_cmd_mapd(dev->hw_its, dev->host_devid, 0, 0, false);
>>  
>> +    for ( i = 0; i < dev->eventids / 32; i++ )
>> +        gicv3_free_host_lpi_block(dev->hw_its, dev->host_lpis[i]);
>> +
>>      xfree(dev->itt_addr);
>> +    xfree(dev->host_lpis);
>>      xfree(dev);
>>  }
>>  
>> @@ -390,7 +605,7 @@ int gicv3_its_map_device(struct domain *d, int host_devid, int guest_devid,
>>      void *itt_addr = NULL;
>>      struct its_devices *dev, *temp;
>>      struct host_its *hw_its;
>> -    int ret;
>> +    int ret, i;
>>  
>>      /* check for already existing mappings */
>>      spin_lock(&d->arch.vgic.its_devices_lock);
>> @@ -435,9 +650,18 @@ int gicv3_its_map_device(struct domain *d, int host_devid, int guest_devid,
>>          return -ENOMEM;
>>      }
>>  
>> +    dev->host_lpis = xzalloc_array(uint32_t, BIT(bits) / 32);
>> +    if ( !dev->host_lpis )
>> +    {
>> +        xfree(dev);
>> +        xfree(itt_addr);
>> +        return -ENOMEM;
>> +    }
>> +
>>      ret = its_send_cmd_mapd(hw_its, host_devid, bits - 1,
>>                              virt_to_maddr(itt_addr), true);
>>      if (ret) {
>> +        xfree(dev->host_lpis);
>>          xfree(itt_addr);
>>          xfree(dev);
>>          return ret;
>> @@ -453,6 +677,13 @@ int gicv3_its_map_device(struct domain *d, int host_devid, int guest_devid,
>>      list_add_tail(&dev->entry, &d->arch.vgic.its_devices);
>>      spin_unlock(&d->arch.vgic.its_devices_lock);
>>  
>> +    /* Map all host LPIs within this device already. We can't afford to queue
>> +     * any host ITS commands later on during the guest's runtime.
>> +     */
>> +    for ( i = 0; i < BIT(bits) / 32; i++ )
>> +        dev->host_lpis[i] = gicv3_allocate_host_lpi_block(hw_its, d, host_devid,
>> +                                                          i * 32);
> 
> Shouldn't we check for errors returned by gicv3_allocate_host_lpi_block?

I only now see that passing fewer than 5 bits will skip this loop
entirely :-( Will fix it (this also applies to the allocation a bit
earlier).

> If BIT(bits) < 32, it is possible to allocate more plpis than necessary.

Yes, but this is not a problem. They won't have a virtual LPI assigned
(=0), so any incoming LPI will simply be ignored.

> It is also possible to end up calling mapti with an inexistent eventid
> for host_devid. Could that be a problem?

Not at all. Actually there is no such thing as a "nonexistent event
ID", because the event ID is whatever the device writes as the payload
to the MSI doorbell address, having learned it from the driver. So if we
provision an ITTE with an event ID which the device will never send,
that LPI will just never fire.
Since Xen (in contrast to the driver in the domain) has no idea how many
and which MSIs the device will use, we just allocate a bunch of them.
The upper limit (32 atm) is something we probably still need to think
about, though.
I tried to derive a limit from Linux ("nvecs" in its_create_device()
seems to be the source), but couldn't find anything useful other than 32.
We will learn about exceeding this limit as soon as a domain tries to
map a virtual LPI with an event ID higher than 31, but by then it is too
late to fix. We could warn when this happens, to learn during our
testing whether any device ever does this and to gather some heuristic
data.
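
Such a warning could be as simple as this sketch in the (future) vLPI
mapping path (illustrative only; names like guest_devid follow the
existing code):

    if ( eventid >= LPI_BLOCK )
        printk(XENLOG_G_WARNING
               "d%d: event ID %u for device %u exceeds the %u provisioned events\n",
               d->domain_id, eventid, guest_devid, LPI_BLOCK);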

Eventually this all boils down to Xen getting more information from
Dom0 about the required number of MSIs. We could then even limit the
allocation to fewer than 32, if that helps.

Cheers,
Andre.

> 
>>      return 0;
>>  }
>>  
>> diff --git a/xen/include/asm-arm/gic-its.h b/xen/include/asm-arm/gic-its.h
>> index 525a29d..d1ebc19 100644
>> --- a/xen/include/asm-arm/gic-its.h
>> +++ b/xen/include/asm-arm/gic-its.h
>> @@ -104,6 +104,7 @@ struct its_devices {
>>      uint32_t guest_devid;
>>      uint32_t host_devid;
>>      uint32_t eventids;
>> +    uint32_t *host_lpis;
>>  };
>>  
>>  extern struct list_head host_its_list;
>> -- 
>> 2.9.0
>>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

