Re: [Xen-devel] [Draft D] Xen on ARM vITS Handling



Hi Ian,

On 04/06/15 14:54, Ian Campbell wrote:
> ### Device Identifiers
> 
> Each device using the ITS is associated with a unique "Device
> Identifier".
> 
> The device IDs are properties of the implementaiton and are typically

implementation

> described via system firmware, e.g. the ACPI IORT table or via device
> tree.
> 
> The number of device ids in a system depends on the implementation and
> can be discovered via `GITS_TYPER.Devbits`. This field allows an ITS
> to have up to 2^32 devices.

[..]

> # Scope
> 
> The ITS is rather complicated, especially when combined with
> virtualisation. To simplify things we initially omit the following
> functionality:
> 
> - Interrupt -> vCPU -> pCPU affinity. The management of physical vs
>   virtual Collections is a feature of GICv4, thus is omitted in this
>   design for GICv3. Physical interrupts which occur on a pCPU where
>   the target vCPU is not already resident will be forwarded (via IPI)
>   to the correct pCPU for injection via the existing
>   `vgic_vcpu_inject_irq` mechanism (extended to handle LPI injection
>   correctly).
> - Clearing of the pending state of an LPI under various circumstances
>   (`MOVI`, `DISCARD`, `CLEAR` commands) is not done. This will result
>   in guests seeing some perhaps spurious interrupts.
> - vITS functionality will only be available on 64-bit ARM hosts,
>   avoiding the need to worry about fast access to guest owned data
>   structures (64-bit uses a direct map). (NB: 32-bit guests on 64-bit
>   hosts can be considered to have access)
> 
> XXX Can we assume that `GITS_TYPER.Devbits` will be sane,
> i.e. requiring support for the full 2^32 device ids would require a
> 32GB device table even for native, which is improbable except on
> systems with RAM measured in TB. So we can probably assume that
> Devbits will be appropriate to the size of the system. _Note_: We
> require per guest device tables, so size of the native Device Table is
> not the only factor here.

As we control the vBDF, we also control the vDevID. If we expose a single
virtual PCI bus, the number of device IDs won't be too high.
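
For instance, with a single virtual segment/bus the vDevID could simply be
the virtual Requester ID. A minimal sketch (the exact encoding is up to us):

    /* Sketch: derive the vDevID from the virtual Requester ID (vBDF).
     * Assumes a single virtual PCI segment, so segment bits are omitted. */
    static inline uint32_t vdevid_from_vbdf(uint8_t vbus, uint8_t vdevfn)
    {
        return ((uint32_t)vbus << 8) | vdevfn;
    }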

> XXX Likewise can we assume that `GITS_TYPER.IDbits` will be sane?

The GITS_TYPER.IDbits of what? The physical ITS?

> i.e. that the required ITT table size will be reasonable?

> # Unresolved Issues
> 
> Various parts are marked with XXX. Most are minor, but there is one
> more or less major one, which we may or may not be able to live with
> for a first implementation:
> 
> 1. When handling Virtual LPI Configuration Table writes we do not have
>    a Device ID, so we cannot consult the virtual Device Table, ITT etc
>    to determine if the LPI is actually mapped. This means that the
>    physical LPI enable/disable is decoupled from the validity of the
>    virtual ITT. Possibly resulting in spurious LPIs which must be
>    ignored.

> This issue is discussed further in the relevant places in the text,
> marked with `XXX UI1`.
> 
> # pITS
> 
> ## Assumptions
> 
> It is assumed that `GITS_TYPER.IDbits` is large enough that there are
> sufficient LPIs available to cover the sum of the number of possible
> events generated by each device in the system (that is the sum of the
> actual events for each bit of hardware, rather than the notional
> per-device maximum from `GITS_TYPER.Idbits`).
> 
> This assumption avoids the need to do memory allocations and interrupt
> routing at run time, e.g. during command processing by allowing us to
> setup everything up front.
> 
> ## Driver
> 
> The physical driver will provide functions for enabling, disabling
> routing etc a specified interrupt, via the usual Xen APIs for doing
> such things.
> 
> This will likely involve interacting with the physical ITS command
> queue etc. In this document such interactions are considered internal
> to the driver (i.e. we care that the API to enable an interrupt
> exists, not how it is implemented).
> 
> ## Device Table
> 
> The `pITS` device table will be allocated and given to the pITS at
> start of day.

We don't really care about this. It is part of the memory provisioned at
initialization based on GITS_BASER*.

Furthermore, the ITS may already have the table in memory, in which case
this allocation is not necessary.

> 
> ## Collections
> 
> The `pITS` will be configured at start of day with 1 Collection mapped
> to each physical processor, using the `MAPC` command on the physical
> ITS.
> 
> ## Per Device Information
> 
> Each physical device in the system which can be used together with an
> ITS (whether using passthrough or not) will have associated with it a
> data structure:
> 
>     struct its_device {
>         uintNN_t phys_device_id;
>         uintNN_t virt_device_id;
>         unsigned int *events;
>         unsigned int nr_events;
>         struct page_info *pitt;
>         unsigned int nr_pitt_pages

You need a pointer to the associated pITS here.
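
Something along these lines (the type/field names are only a suggestion):

    struct its_device {
        struct pits *pits;     /* physical ITS this device is attached to */
        /* ... other fields as in the draft ... */
    };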

>     };
> 
> Where:
> 
> - `phys_device_id`: The physical device ID of the physical device
> - `virt_device_id`: The virtual device ID if the device is accessible
>   to a domain
> - `events`: An array mapping a per-device event number into a physical
>   LPI.
> - `nr_events`: The number of events which this device is able to
>   generate.
> - `pitt`, `nr_pitt_pages`: Records allocation of pages for physical
>   ITT (not directly accessible).
> 
> During its lifetime this structure may be referenced by several
> different mappings (e.g. physical and virtual device id maps, virtual
> collection device id).
> 
> ## Device Discovery/Registration and Configuration
> 
> Per device information will be discovered based on firmware tables (DT
> or ACPI) and information provided by dom0 (e.g. registration via
> PHYSDEVOP_pci_device_add or new custom hypercalls).
> 
> This information shall include at least:
> 
> - The Device ID of the device.
> - The maximum number of Events which the device is capable of
>   generating.

Well, the maximum number of Events doesn't need to come through a new
hypercall. We can retrieve it directly from the PCI config space.
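
For MSI-X capable devices the Table Size field already tells us this. A
rough sketch, assuming the existing pci_find_cap_offset()/pci_conf_read16()
style helpers (treat the exact signatures as an assumption):

    /* Sketch: derive the maximum number of Events from PCI config space
     * instead of a new hypercall. MSI-X Table Size is encoded as N-1. */
    static unsigned int pci_max_events(u16 seg, u8 bus, u8 dev, u8 func)
    {
        unsigned int pos = pci_find_cap_offset(seg, bus, dev, func,
                                               PCI_CAP_ID_MSIX);

        if ( pos )
            return (pci_conf_read16(seg, bus, dev, func,
                                    pos + PCI_MSIX_FLAGS)
                    & PCI_MSIX_FLAGS_QSIZE) + 1;

        /* Plain MSI: Multiple Message Capable is log2 of the vector count. */
        pos = pci_find_cap_offset(seg, bus, dev, func, PCI_CAP_ID_MSI);
        if ( pos )
            return 1u << ((pci_conf_read16(seg, bus, dev, func,
                                           pos + PCI_MSI_FLAGS)
                           & PCI_MSI_FLAGS_QMASK) >> 1);

        return 1;
    }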

> 
> When a device is discovered/registered (i.e. when all necessary
> information is available) then:
> 
> - `struct its_device` and the embedded `events` array will be
>   allocated (the latter with `nr_events` elements).
> - The `struct its_device` will be inserted into a mapping (possibly an
>   R-B tree) from its physical Device ID to the `struct its`.
> - `nr_events` physical LPIs will be allocated and recorded in the
>   `events` array.
> - An ITT table will be allocated for the device and the appropriate
>   `MAPD` command will be issued to the physical ITS. The location will
>   be recorded in `struct its_device.pitt`.
> - Each Event which the device may generate will be mapped to the
>   corresponding LPI in the `events` array and a collection, by issuing
>   a series of `MAPVI` commands. Events will be assigned to physical
>   collections in a round-robin fashion.
> 
> This setup must occur for a given device before any ITS interrupts may
> be configured for the device and certainly before a device is passed
> through to a guest. This implies that dom0 cannot use MSIs on a PCI
> device before having called `PHYSDEVOP_pci_device_add`.

Sounds sensible.

> 
> # Device Assignment
> 
> Each domain will have an associated mapping from virtual device ids
> into a data structure describing the physical device, including a
> reference to the relevant `struct its_device`.
> 
> The number of possible device IDs may be large so a simple array or
> list is likely unsuitable. A tree (e.g. Red-Black) may be a suitable
> data structure. Currently we do not need to perform lookups in this
> tree on any hot paths.

Even so, the lookup would be quick.
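
E.g. with the standard rbtree helpers, keyed on the virtual device ID, the
lookup is a handful of compares. Sketch only (the per-domain tree root and
the rb_node member embedded in struct its_device are hypothetical):

    static struct its_device *vdevid_to_device(struct domain *d,
                                               uint32_t vdevid)
    {
        struct rb_node *n = d->arch.vits_devices.rb_node;

        while ( n )
        {
            struct its_device *dev = rb_entry(n, struct its_device, rb);

            if ( vdevid < dev->virt_device_id )
                n = n->rb_left;
            else if ( vdevid > dev->virt_device_id )
                n = n->rb_right;
            else
                return dev;
        }

        return NULL;
    }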

> 
> _Note_: In the context of virtualised device ids (especially for domU)
> it may be possible to arrange for the upper bound on the number of
> device IDs to be lower allowing a more efficient data structure to be
> used. This is left for a future improvement.
> 
> When a device is assigned to a domain (including to domain 0) the
> mapping for the new virtual device ID will be entered into the tree.
> 
> During assignment all LPIs associated with the device will be routed
> to the guest (i.e. `route_irq_to_guest` will be called for each LPI in
> the `struct its_device.events` array).
> 
> # vITS
> 
> A guest domain which is allowed to use ITS functionality (i.e. has
> been assigned pass-through devices which can generate MSIs) will be
> presented with a virtualised ITS.
> 
> Accesses to the vITS registers will trap to Xen and be emulated and a
> virtualised Command Queue will be provided.
> 
> Commands entered onto the virtual Command Queue will be translated
> into physical commands, as described later in this document.
> 
> There are other aspects to virtualising the ITS (LPI collection
> management, assignment of LPI ranges to guests, device
> management). However these are only considered here to the extent
> needed for describing the vITS emulation.
> 
> ## Xen interaction with guest OS provisioned vITS memory
> 
> Memory which the guest provisions to the vITS (ITT via `MAPD` or other
> tables via `GITS_BASERn`) needs careful handling in Xen.
> 
> Since Xen cannot trust data in data structures contained in such
> memory if a guest can trample over it at will. Therefore Xen either
> must take great care when accessing data structures stored in such
> memory to validate the contents e.g. not trust that values are within
> the required limits or it must take steps to restrict guest access to
> the memory when it is provisioned. Since the data structures are
> simple and most accessors need to do bounds check anyway it is
> considered sufficient to simply do the necessary checks on access.
> 
> Most data structures stored in this shared memory are accessed on the
> hot interrupt injection path and must therefore be quickly accessbile

accessible

> from within Xen. Since we have restricted vits support to 64-bit hosts
> only `map_domain_page` is fast enough to be used on the fly and
> therefore we do not need to be concerned about unbounded amounts of
> permanently mapped memory consumed by each `MAPD` command.
> 
> Although `map_domain_page` is fast, `p2m_lookup` (translation from IPA
> to PA) is not necessarily so. For now we accept this, as a future
> extension a sparse mapping of the guest device table in vmap space
> could be considered, with limits on the total amount of vmap space which
> we allow each domain to consume.
> 
> ## vITS properties
> 
> The vITS implementation shall have:
> 
> - `GITS_TYPER.HCC == nr_vcpus + 1`.
> - `GITS_TYPER.PTA == 0`. Target addresses are linear processor numbers.
> - `GITS_TYPER.Devbits == See below`.
> - `GITS_TYPER.IDbits == See below`.
> - `GITS_TYPER.ITT Entry Size == 7`, meaning 8 bytes, which is the size
>   of `struct vitt` (defined below).
> 
> `GITS_TYPER.Devbits` and `GITS_TYPER.Idbits` will need to be chosen to
> reflect the host and guest configurations (number of LPIs, maximum
> device ID etc).
> 
> Other fields (not mentioned here) will be set to some sensible (or
> mandated) value.
> 
> The `GITS_BASER0` will be setup to request sufficient memory for a
> device table consisting of entries of:
> 
>     struct vdevice_table {
>         uint64_t vitt_ipa;
>         uint32_t vitt_size;
>         uint32_t padding;
>     };
>     BUILD_BUG_ON(sizeof(struct vdevice_table) != 16);
> 
> On write to `GITS_BASE0` the relevant details of the Device Table

GITS_BASER0

> (IPA, size, cache attributes to use when mapping) will be recorded in
> `struct domain`.

map_domain_page assumes that the memory is WB/WT. What happens if the
guest decides to use a different attribute?

> 
> All other `GITS_BASERn.Valid == 0`.
> 
> ## vITS to pITS mapping
> 
> A physical system may have multiple physical ITSs.
> 
> With the simplified vits command model presented here only a single
> `vits` is required.

What about dom0? We would have to go through the firmware tables (ACPI or
DT) and replace the ITS parent.

> 
> In the future a more complex arrangement may be desired. Since the
> choice of model is internal to the hypervisor/tools and is
> communicated to the guest via firmware tables we are not tied to this
> model as an ABI if we decide to change.
> 
> ## LPI Configuration Table Virtualisation
> 
> A guest's write accesses to its LPI Configuration Table (which is just
> an area of guest RAM which the guest has nominated) will be trapped to
> the hypervisor, using stage 2 MMU permissions, in order for changes to
> be propagated into the host interrupt configuration.
> 
> On write `bit[0]` of the written byte is the enable/disable state for
> the irq and is handled thus:
> 
>     lpi = (addr - table_base);
>     if ( byte & 1 )
>         enable_irq(lpi);
>     else
>         disable_irq(lpi);
> 
> Note that in the context of this emulation we do not have access to a
> Device ID, and therefore cannot make decisions based on whether the
> LPI/event has been `MAPD`d etc. In any case we have an `lpi` in our
> hand and not an `event`, IOW we would need to do a _reverse_ lookup in
> the ITT.

I'm struggling to see how this would work. After enabling/disabling an
IRQ you need to send an INV, which requires the devID and the eventID.

Also, until now you haven't explained how the vLPI will be mapped to the
pLPI. If you have this mapping, you are able to retrieve the device ID and
event ID.
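
With such a mapping the emulation sketched in the draft could become
something like this (rough sketch; vlpi_to_desc() and its_send_inv() are
made-up helper names, enable_irq/disable_irq are used in the same loose
sense as in the draft's pseudo-code):

    vlpi = (addr - table_base) + 8192;      /* LPIs start at INTID 8192 */
    desc = vlpi_to_desc(d, vlpi);           /* vLPI -> physical irq_desc */

    if ( desc )
    {
        struct irq_guest *info = irq_get_guest_info(desc);

        if ( byte & 1 )
            enable_irq(desc->irq);
        else
            disable_irq(desc->irq);

        /* INV requires the *physical* devID and eventID */
        its_send_inv(info->its_device->pits,
                     info->its_device->phys_device_id, info->virq);
    }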

> 
> LPI priority (the remaining bits in the written byte) is currently
> ignored.
> 
> ## LPI Pending Table Virtualisation
> 
> XXX Can we simply ignore this? 4.8.5 suggests it is not necessarily in
> sync and the mechanism to force a sync is `IMPLEMENTATION DEFINED`.
> 
> ## Device Table Virtualisation
> 
> The IPA, size and cacheability attributes of the guest device table
> will be recorded in `struct domain` upon write to `GITS_BASER0`.
>
> In order to lookup an entry for `device`:
> 
>     define {get,set}_vdevice_entry(domain, device, struct device_table 
> *entry):
>         offset = device*sizeof(struct vdevice_table)
>         if offset > <DT size>: error
> 
>         dt_entry = <DT base IPA> + device*sizeof(struct vdevice_table)
>         page = p2m_lookup(domain, dt_entry, p2m_ram)
>         if !page: error
>         /* nb: non-RAM pages, e.g. grant mappings,
>          * are rejected by this lookup */

The guest is allowed to pass device memory here. See Device-nGnRnE.

> 
>         dt_mapping = map_domain_page(page)
> 
>         if (set)
>              dt_mapping[<appropriate page offset from device>] = *entry;
>         else
>              *entry = dt_mapping[<appropriate page offset>];
> 
>         unmap_domain_page(dt_mapping)
> 
> Since everything is based upon IPA (guest addresses) a malicious guest
> can only reference its own RAM here.
> 
> ## ITT Virtualisation
> 
> The location of a VITS will have been recorded in the domain Device
> Table by a `MAPI` or `MAPVI` command and is looked up as above.
> 
> The `vitt` is a `struct vitt`:
> 
>     struct vitt {
>         uint16_t valid:1;
>         uint16_t pad:15;
>         uint16_t collection;
>         uint32_t vpli;
>     };
>     BUILD_BUG_ON(sizeof(struct vitt) != 8);
> 
> A lookup occurs similar to for a device table, the offset is range
> checked against the `vitt_size` from the device table. To lookup
> `event` on `device`:
> 
>     define {get,set}_vitt_entry(domain, device, event, struct vitt *entry):
>         get_vdevice_entry(domain, device, &dt)
> 
>         offset = device*sizeof(struct vitt);

s/device/event/ ?

>         if offset > dt->vitt_size: error
> 
>         vitt_entry = dt->vitt_ipa + event*sizeof(struct vitt)
>         page = p2m_lookup(domain, vitt_entry, p2m_ram)
>         if !page: error
>         /* nb: non-RAM pages, e.g. grant mappings,
>          * are rejected by this lookup */
> 
>         vitt_mapping = map_domain_page(page)
> 
>         if (set)
>              vitt_mapping[<appropriate page offset from event>] = *entry;
>         else
>              *entry = vitt_mapping[<appropriate page offset>];
> 
>         unmap_domain_page(vitt_mapping)
> 
> Again since this is IPA based a malicious guest can only point things
> to its own ram.
> 
> ## Collection Table Virtualisation
> 
> A pointer to a dynamically allocated array `its_collections` mapping
> collection ID to vcpu ID will be added to `struct domain`. The array
> shall have `nr_vcpus + 1` entries and resets to ~0 (or another
>   explicitly invalid vcpu nr).
> 
> ## Virtual LPI injection
> 
> As discussed above the `vgic_vcpu_inject_irq` functionality will need
> to be extended to cover this new case, most likely via a new
> `vgic_vcpu_inject_lpi` frontend function.
> 
> `vgic_vcpu_inject_lpi` receives a `struct domain *` and a virtual
> interrupt number (corresponding to a vLPI) and needs to figure out
> which vcpu this should map to.
> 
> To do this it must look up the Collection ID associated (via the vITS)
> with that LPI.
> 
> Proposal: Add a new `its_device` field to `struct irq_guest`, a
> pointer to the associated `struct its_device`. The existing `struct
> irq_guest.virq` field contains the event ID (perhaps use a `union`
> to give a more appropriate name) and _not_ the virtual LPI. Injection
> then consists of:
> 
>         d = irq_guest->domain
>         virq = irq_guest->virq
>         its_device = irq_guest->its_device
> 
>         get_vitt_entry(d, its_device->virt_device_id, virq, &vitt)
>         vcpu = d->its_collections[vitt.collection]
>         vgic_vcpu_inject_irq(d, &d->vcpus[vcpu])

Shouldn't you pass at least the vLPI?

We would also need to ensure that vitt.vpli is actually an LPI. Otherwise
the guest may provide an invalid value (such as 1023) which could
potentially crash the platform.
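
A minimal sketch of such a check (the upper bound field is made up; it
would be derived from the IDbits we expose to the guest):

    /* LPIs occupy INTIDs 8192 and above; anything below (SGIs, PPIs, SPIs
     * and the special IDs 1020-1023) must be rejected before injection. */
    #define LPI_BASE_INTID 8192

    if ( vitt.vpli < LPI_BASE_INTID ||
         vitt.vpli >= LPI_BASE_INTID + d->arch.vits.nr_vlpis )
        return;    /* malformed ITT entry: ignore the event */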

> In the event that the ITT is not `MAPD`d, or the Event has not been
> `MAPI`/`MAPVI`d or the collection is not `MAPC`d here the interrupt is
> simply ignored. Note that this can happen because LPI mapping is
> decoupled from LPI enablement. In particular writes to the LPI
> Configuration Table do not include a Device ID and therefore cannot
> make decisions based on the ITT.
> 
> XXX UI1 if we could find a reliable way to reenable then could
> potentially disable LPI on error and reenable later (taking a spurious
> Xen interrupt for each possible vits misconfiguration). IOW if the
> interrupt is invalid for each of these reasons we can disable and
> reenable as described:
> 
> - Not `MAPD`d -- on `MAPD` enable all associate LPIs which are enabled
>   in LPI CFG Table.
> - Not `MAPI`/`MAPVI`d -- on `MAPI`/`MAPVI` enable LPI if enabled in
>   CFG Table.
> - Not `MAPC`d -- tricky. Need to know lists of LPIs associated with a
>   virtual collection. A `list_head` in `struct irq_guest` implies a
>   fair bit of overhead, number of LPIs per collection is the total
>   number of LPIs (guest could assign all to the same
>   collection). Could walk entire LPI CFG and reenable all set LPIs,
>   might involve walking over several KB of memory though. Could inject
>   unmapped collections to vcpu0, forcing the guest to deal with the
>   spurious interrupts?
>
> XXX Only the `MAPC` issue seems problematic. Is this a critical issue or
> can we get away with it?

This is not the only issue. With your solution, the guest is left with the
possibility of having one vLPI mapped to multiple (devID, eventID)/LPI
pairs. Even if we validate this when the MAPI/MAPVI command is sent, the
guest could later modify the ITT itself.

Furthermore, each virtual IRQ of a domain is associated with a struct
pending_irq. This structure contains an irq_desc, which is a pointer to
the physical IRQ. With a crafted ITT the guest could mess up those data
structures. Although, I think we can get back on our feet after the domain
is destroyed.

After reading this draft, I think we can avoid browsing the Device Table
and the ITT. As said in the previous paragraph, the pending_irq structure
is linked to an irq_desc. In your proposal, you suggested storing the
its_device in the irq_guest (part of irq_desc). If we make use of
pending_irq->desc to store the physical descriptor, we can have a mapping
vLPI <=> pLPI. Therefore, this would resolve UI1 and, AFAICT, the memory
usage in Xen would be the same as for the ITT/Device Table based solution.
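
To illustrate (rough sketch only; all field and helper names below are
illustrative, not existing code):

    /* Sketch: injection using the pending_irq <-> irq_desc link instead of
     * walking the guest's Device Table/ITT. */
    static void vgic_vcpu_inject_lpi(struct irq_desc *desc)
    {
        struct irq_guest *info = irq_get_guest_info(desc); /* domain + vLPI */
        struct domain *d = info->d;
        unsigned int vlpi = info->virq;                     /* here: the vLPI */
        struct pending_irq *p = lpi_to_pending(d, vlpi);    /* hypothetical */

        /* p->desc points back at the physical IRQ, giving the
         * vLPI <=> pLPI mapping in both directions (and hence UI1). */
        ASSERT(p->desc == desc);

        vgic_vcpu_inject_irq(d->vcpu[p->vcpu_id], vlpi);
    }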

Regards,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
