
Re: [Xen-devel] [Draft C] Xen on ARM vITS Handling



On Fri, 2015-05-29 at 14:40 +0100, Julien Grall wrote:
> Hi Ian,
> 
> NIT: You used my Linaro email which I think is de-activated now :).

I keep finding new address books with that address in them!

> > ## ITS Translation Table
> >
> > Message signalled interrupts are translated into an LPI via an ITS
> > translation table which must be configured for each device which can
> > generate an MSI.
> 
> I'm not sure what the ITS Table Table is. Did you mean Interrupt
> Translation Table?

I don't think I wrote Table Table anywhere.

I'm referring to the tables which are established by e.g. the MAPD
command and friends, e.g. the thing shown in "4.9.12 Notional ITS Table
Structure".

> > is _not_ guarenteed that a change to the LPI Configuration Table won't
> 
> s/guarenteed/guaranteed/? Or maybe the first use of this word was wrong?

guaranteed is correct, I can never remember it though.

> > XXX there are other aspects to virtualising the ITS (LPI collection
> > management, assignment of LPI ranges to guests, device
> > management). However these are not currently considered here. XXX
> > Should they be/do they need to be?
> 
> I think we began to cover these aspects with the section "command emulation".

Some aspects, yes. I went with:

        There are other aspects to virtualising the ITS (LPI collection
        management, assignment of LPI ranges to guests, device
        management). However these are only considered here to the extent
        needed for describing the vITS emulation.

> > XXX In the context of virtualised device ids this may not be the case,
> > e.g. we can arrange for (mostly) contiguous device ids and we know the
> > bound is significantly lower than 2^32
> 
> Well, the deviceID is computed from the BDF and some DMA alias. As the
> algorithm can't be tweaked, it's very likely that we will have
> non-contiguous Device ID. See pci_for_each_dma_alias in Linux
> (drivers/pci/search.c).

The implication here is that deviceID is fixed in hardware and is used
by driver domain software in contexts where we do not get the
opportunity to translate, is that right? What contexts are those?

Note that the BDF is also something which we could in principle
virtualise (we already do for domU). Perhaps that is infeasible for dom0
though?

That gives me two thoughts.

The first is that although device identifiers are not necessarily
contiguous, they are generally at least grouped and not allocated at
random through the 2^32 options. For example a PCI Host bridge typically
has a range of device ids associated with it and each device has a
device id derived from that.

I'm not sure if we can leverage that into a more useful data structure
than an R-B tree, or for example to arrange for the R-B tree to allow for the
translation of a device within a span into the parent span and from
there do the lookup. Specifically when looking up a device ID
corresponding to a PCI device we could arrange to find the PCI host
bridge and find the actual device from there. This would keep the RB
tree much smaller and therefore perhaps quicker? Of course that depends
on what the lookup from PCI host bridge to a device looked like.
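A minimal sketch of the two-level lookup suggested above. It assumes that each PCI host bridge owns one contiguous, non-overlapping span of device IDs, and that a device ID is the span base plus the requester ID (bus << 8 | devfn). DMA aliases are ignored. `struct devid_span`, `pci_devid()` and `devid_lookup()` are hypothetical names, not existing Xen code. The spans sit in a sorted array searched by binary search, so the search structure grows with the number of host bridges rather than the number of devices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-host-bridge span of device IDs (not existing Xen code). */
struct devid_span {
    uint32_t base;   /* first device ID owned by this host bridge */
    uint32_t count;  /* number of device IDs in the span */
    void **devices;  /* per-device state, indexed by (devid - base) */
};

/* Assumed derivation: span base plus the PCI requester ID (bus << 8 | devfn). */
static uint32_t pci_devid(uint32_t base, uint8_t bus, uint8_t dev, uint8_t fn)
{
    return base + (((uint32_t)bus << 8) | ((uint32_t)(dev & 0x1f) << 3) | (fn & 0x7u));
}

/* spans[] must be sorted by base and must not overlap. */
static void *devid_lookup(const struct devid_span *spans, size_t nr_spans,
                          uint32_t devid)
{
    size_t lo = 0, hi = nr_spans;
    const struct devid_span *s;

    assert(spans != NULL || nr_spans == 0);

    /* First level: afterwards lo == number of spans with base <= devid. */
    while ( lo < hi )
    {
        size_t mid = lo + (hi - lo) / 2;

        if ( spans[mid].base <= devid )
            lo = mid + 1;
        else
            hi = mid;
    }

    if ( lo == 0 )
        return NULL;                 /* below the first span */

    s = &spans[lo - 1];
    if ( devid - s->base >= s->count )
        return NULL;                 /* in a gap between spans */

    /* Second level: direct index within the host bridge's span. */
    return s->devices[devid - s->base];
}
```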

The second is that perhaps we can do something simpler for the domU
case, if we were willing to tolerate it being different from dom0.

> > Possible efficient data structures would be:
> >
> > 1. List: The lookup/deletion is in O(n) and the insertion will depend
> >     on whether the devices are kept sorted by their identifier. The
> >     memory overhead is 18 bytes per element.
> > 2. Red-black tree: All the operations are O(log(n)). The memory
> >     overhead is 24 bytes per element.
> >
> > A Red-black tree seems the more suitable for having fast deviceID
> > validation even though the memory overhead is a bit higher compared to
> > the list.
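Here is a sketch of insert and lookup keyed by device ID, for comparison with option 2. It uses a left-leaning red-black tree (Sedgewick's variant), chosen here only because it is short, and it omits deletion. `struct its_dev_node` and the function names are hypothetical. Real Xen code would more likely reuse an existing red-black tree library than carry its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vITS device node keyed by device ID. */
struct its_dev_node {
    uint32_t devid;
    bool red;                          /* colour of the link from the parent */
    struct its_dev_node *left, *right;
};

static bool is_red(const struct its_dev_node *n)
{
    return n != NULL && n->red;
}

static struct its_dev_node *rotate_left(struct its_dev_node *h)
{
    struct its_dev_node *x = h->right;
    h->right = x->left;
    x->left = h;
    x->red = h->red;
    h->red = true;
    return x;
}

static struct its_dev_node *rotate_right(struct its_dev_node *h)
{
    struct its_dev_node *x = h->left;
    h->left = x->right;
    x->right = h;
    x->red = h->red;
    h->red = true;
    return x;
}

/* Split a temporary 4-node: red children become black, h becomes red. */
static void flip_colours(struct its_dev_node *h)
{
    assert(h->left != NULL && h->right != NULL);
    h->red = true;
    h->left->red = false;
    h->right->red = false;
}

static struct its_dev_node *insert_node(struct its_dev_node *h,
                                        struct its_dev_node *n)
{
    if ( h == NULL )
    {
        n->left = n->right = NULL;
        n->red = true;
        return n;
    }

    if ( n->devid < h->devid )
        h->left = insert_node(h->left, n);
    else if ( n->devid > h->devid )
        h->right = insert_node(h->right, n);
    /* Equal device ID: already present, n is not linked in. */

    /* Restore the left-leaning red-black invariants on the way back up. */
    if ( is_red(h->right) && !is_red(h->left) )
        h = rotate_left(h);
    if ( is_red(h->left) && is_red(h->left->left) )
        h = rotate_right(h);
    if ( is_red(h->left) && is_red(h->right) )
        flip_colours(h);
    return h;
}

static void its_dev_insert(struct its_dev_node **root, struct its_dev_node *n)
{
    *root = insert_node(*root, n);
    (*root)->red = false;              /* the root is always black */
}

static struct its_dev_node *its_dev_lookup(struct its_dev_node *h, uint32_t devid)
{
    while ( h != NULL )
    {
        if ( devid < h->devid )
            h = h->left;
        else if ( devid > h->devid )
            h = h->right;
        else
            return h;
    }
    return NULL;
}
```

Inserting devices in increasing ID order is the worst case for an unbalanced binary tree. With this tree, that order still leaves a tree of logarithmic height.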
> >
> > ### Event ID (`vID`)
> >
> > This is the per-device Interrupt identifier (i.e. the MSI index). It
> > is configured by the device driver software.
> >
> > It is not necessary to translate a `vID`, however they may need to be
> > represented in various data structures given to the pITS.
> >
> > XXX is any of this true?
> 
> 
> Right, the vID will always be equal to the pID. Although you will need
> to associate a physical LPI for every pair (vID, DevID).

I think in the terms defined by this document that is (`ID`, `vID`) =>
an LPI. Right?

Have we considered how this mapping will be tracked?
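One possible way to track the (device ID, event ID) to physical LPI mapping uses a per-device array indexed by event ID. The device itself is reached through whichever device ID lookup is chosen. This sketch assumes each device's event IDs are dense in 0 .. nr_events - 1, bounded by the ITT size given in `MAPD`. `struct its_device`, `its_device_alloc()`, `its_map_event()` and `its_event_to_plpi()` are hypothetical names. LPI INTIDs start at 8192, so 0 serves as the "unmapped" marker.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LPI_BASE    8192u  /* GICv3 LPI INTIDs start at 8192 */
#define INVALID_LPI 0u     /* never a valid LPI INTID */

/* Hypothetical per-device mapping, one slot per event ID. */
struct its_device {
    uint32_t devid;
    uint32_t nr_events;    /* event IDs 0 .. nr_events - 1 */
    uint32_t plpi[];       /* physical LPI per event ID, INVALID_LPI if unmapped */
};

static struct its_device *its_device_alloc(uint32_t devid, uint32_t nr_events)
{
    /* calloc zero-fills, so every plpi[] slot starts as INVALID_LPI. */
    struct its_device *dev = calloc(1, sizeof(*dev) +
                                    (size_t)nr_events * sizeof(dev->plpi[0]));

    if ( dev != NULL )
    {
        dev->devid = devid;
        dev->nr_events = nr_events;
    }
    return dev;
}

static int its_map_event(struct its_device *dev, uint32_t event, uint32_t plpi)
{
    assert(dev != NULL);
    if ( event >= dev->nr_events || plpi < LPI_BASE )
        return -1;         /* out-of-range event ID or not an LPI */
    dev->plpi[event] = plpi;
    return 0;
}

static uint32_t its_event_to_plpi(const struct its_device *dev, uint32_t event)
{
    assert(dev != NULL);
    return event < dev->nr_events ? dev->plpi[event] : INVALID_LPI;
}
```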
 
> > ### Interrupt Collection (`vCID`)
> >
> > This parameter is used in commands which manage collections and
> > interrupt in order to move them from one CPU to another. The ITS is
> > only mandated to implement N + 1 collections where N is the number of
> > processors on the platform (i.e max number of VCPUs for a given
> > guest). Furthermore, the identifiers are always contiguous.
> >
> > If we decide to implement the strict minimum (i.e N + 1), an array is
> > enough and will allow operations in O(1).
> >
> > XXX Could forgo array and go straight to vcpu_info/domain_info.
> 
> Not really, the number of collections is always one higher than the
> number of VCPUs. How would you store the last collection?

In domain_info. What I meant was:

    if ( vcid == domain->nr_vcpus )
        return domain->interrupt_collection;
    else if ( vcid < domain->nr_vcpus )
        return domain->vcpus[vcid]->interrupt_collection;
    else
        return NULL; /* invalid vCID */

Similar to how SPI vs PPI interrupts are handled.
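A self-contained version of the lookup above follows. The real Xen structures are replaced by hypothetical stand-in types (`struct domain_sketch`, `struct vcpu_sketch`, `struct vits_collection`). Collections 0 .. nr_vcpus - 1 live with each vCPU, and collection nr_vcpus (the extra one) lives in the domain, so no separate array is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real Xen structures. */
struct vits_collection {
    unsigned int target_vcpu;              /* vCPU the collection targets */
};

struct vcpu_sketch {
    struct vits_collection interrupt_collection;   /* one per vCPU */
};

struct domain_sketch {
    unsigned int nr_vcpus;
    struct vcpu_sketch **vcpus;
    struct vits_collection interrupt_collection;   /* the extra (N+1th) one */
};

static struct vits_collection *vcid_to_collection(struct domain_sketch *d,
                                                  unsigned int vcid)
{
    assert(d != NULL);
    if ( vcid == d->nr_vcpus )
        return &d->interrupt_collection;
    else if ( vcid < d->nr_vcpus )
        return &d->vcpus[vcid]->interrupt_collection;
    else
        return NULL;                       /* invalid vCID */
}
```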

> > ## Command Translation
> >
> > Of the existing GICv3 ITS commands, `MAPC`, `MAPD`, `MAPVI`/`MAPI` are
> > potentially time consuming commands as these commands create entries in
> > the Xen ITS structures, which are used to validate other ITS commands.
> >
> > `INVALL` and `SYNC` are global and potentially disruptive to other
> > guests and so need consideration.
> 
> INVALL and SYNC are not global. They both take a parameter: vCID for
> INVALL and vTarget for SYNC.

By global I meant not associated with a specific device. I went with:

        `INVALL` and `SYNC` are not specific to a given device (they are per
        collection and per target respectively) and are therefore potentially
        disruptive to other guests and so need consideration.

> INVALL ensures that any interrupts in the specified collection are
> reloaded. SYNC ensures that all previous commands, and all outstanding
> physical actions relating to the specified re-distributor, are completed.

> 
> > All other ITS commands, such as `MOVI`, `DISCARD`, `INV`, `INT` and `CLEAR`,
> > just validate and generate a physical command.
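The three groups of commands described above can be summarised as a classifier. The enumerators are hypothetical labels, not the architectural command encodings, and the grouping simply follows the draft text.

```c
#include <assert.h>

/* Hypothetical labels; the real encodings come from the GICv3 ITS spec. */
enum its_cmd {
    CMD_MAPC, CMD_MAPD, CMD_MAPVI, CMD_MAPI, CMD_INVALL, CMD_SYNC,
    CMD_MOVI, CMD_DISCARD, CMD_INV, CMD_INT, CMD_CLEAR,
};

enum its_cmd_class {
    CMD_CLASS_UPDATES_XEN_STATE,  /* creates entries used to validate later commands */
    CMD_CLASS_NOT_PER_DEVICE,     /* per collection / per target, may affect others */
    CMD_CLASS_VALIDATE_FORWARD,   /* validate, then generate one physical command */
};

static enum its_cmd_class its_cmd_classify(enum its_cmd cmd)
{
    assert(cmd <= CMD_CLEAR);
    switch ( cmd )
    {
    case CMD_MAPC:
    case CMD_MAPD:
    case CMD_MAPVI:
    case CMD_MAPI:
        return CMD_CLASS_UPDATES_XEN_STATE;
    case CMD_INVALL:
    case CMD_SYNC:
        return CMD_CLASS_NOT_PER_DEVICE;
    default:
        return CMD_CLASS_VALIDATE_FORWARD;
    }
}
```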
> >
> > ### `MAPC` command translation
> >
> > Format: `MAPC vCID, vTA`
> >
> > - `MAPC pCID, pTA` physical ITS command is generated
> 
> We should not send any MAPC command to the physical ITS. The collection
> is already mapped during Xen boot.

What is the plan for this start of day mapping? One collection per pCPU
and ignore the rest?

It seems (section 4.9.2) that there are two potential kinds of
collections, ones internal to the ITS and others where data is held in
external memory. The numbers of both are limited by the hardware.

I suppose the internal ones will be faster.

Supposing that a guest is likely to use collections to map interrupts to
specific vcpus, and that the physical collections will be mapped to
pcpus, I suppose this means we will need to do some relatively expensive
remapping (corresponding to moving the IRQ to another collection) in
arch_move_irqs? Is that the best we can do?

> This command should only assign a pCID to the vCID.

Does it not also need to remap some interrupts to that new pCID?


> >
> > ### `MAPD` Command translation
> >
> > Format: `MAPD device, Valid, ITT IPA, ITT Size`
> >
> > `MAPD` is sent with `Valid` bit set if device needs to be added and reset
> > when device is removed.
> 
> Another case: The ITT is replaced. This use case needs more care because
> we need to ensure that all the interrupts are disabled before switching
> to the new ITT.

I've added a note since I think this is going to be a discussion in the
other sub thread.

> 
> > If `Valid` bit is set:
> >
> > - Allocate memory for `its_device` struct
> > - Validate ITT IPA & ITT size and update its_device struct
> > - Find number of vectors (nrvecs) for this device by querying PCI
> >    helper function
> > - Allocate nrvecs number of LPIs XXX nrvecs is a function of `ITT Size`?
> > - Allocate memory for `struct vlpi_map` for this device. This
> >    `vlpi_map` holds mapping of Virtual LPI to Physical LPI and ID.
> > - Find physical ITS node with which this device is associated
> 
> XXX: The MAPD command is using a virtual DevID which is different from
> the pDevID (because the BDF is not the same). How do you find the
> corresponding translation?

Not sure, do we need a per-domain thing mapping vBDF to $something? Do
we already have such a thing, e.g. in the SMMU code?

I've added a note.

> 
> > - Call `p2m_lookup` on ITT IPA addr and get physical ITT address
> > - Validate ITT Size
> > - Generate/format physical ITS command: `MAPD, ITT PA, ITT Size`
> 
> I had some thoughts about the other validation problem with the ITT. The
> region will be used by the ITS to store the mapping between the ID and
> the LPI as long as some others information.

ITYM "as well as some other information"?

> I guess that if the OS is playing with the ITT (such as writing to it)
> the ITS will behave badly. We have to ensure that the guest will never
> write to it and, at the same time, that the same region is not passed
> to 2 devices.

I don't think we will be exposing the physical ITT to the guest, will
we? That will live in memory which Xen owns and controls and doesn't
share with any guest.

In fact, I don't know that a vITS will need an ITT memory at all, i.e.
most of our GITS_BASERn will be unimplemented.

In theory we could use these registers to offload some of the data
structure storage requirements to the guest, but that would require
great care to validate any time we touched it (or perhaps just
p2m==r/o), I think it is probably not worth the stress if we can just
use regular hypervisor side data structures instead? (This stuff is just
there for the h/w ITS which doesn't have the luxury of xmalloc).

> 
> > Here the overhead is with memory allocation for `its_device` and `vlpi_map`
> >
> > XXX Suggestion was to preallocate some of those at device passthrough
> > setup time?
> 
> Some of the information can even be set up when the PCI device is added
> to Xen (such as the number of MSIs supported and the physical LPI chunk).

Yes, assuming there are sufficient LPIs to allocate in this way. That's
not clear though, is it?

> > If `Valid` bit is not set:
> >
> > - Validate if the device exists by checking vITS device list
> > - Clear all `vlpis` assigned for this device
> > - Remove this device from vITS list
> > - Free memory
> >
> > XXX If we preallocate, we presumably shouldn't free here either.
> 
> Right. We could use a field to say if the device is activated or not.
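This sketch follows Julien's suggestion and assumes per-device state is preallocated when the device is assigned. `MAPD` with `Valid` clear only drops the vLPI mappings and marks the device inactive instead of freeing memory. `struct vits_device`, `VITS_MAX_EVENTS`, `vits_mapd_map()` and `vits_mapd_unmap()` are hypothetical. A real per-device limit would come from the device's MSI capability rather than a fixed constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VITS_MAX_EVENTS 32u

/* Hypothetical preallocated per-device state, set up at assignment time. */
struct vits_device {
    bool active;                      /* toggled by MAPD Valid=1 / Valid=0 */
    uint32_t vlpi[VITS_MAX_EVENTS];   /* vLPI per event ID, 0 if unmapped */
};

/* MAPD with Valid=0: drop the guest-visible mappings but keep the memory. */
static int vits_mapd_unmap(struct vits_device *dev)
{
    assert(dev != NULL);
    if ( !dev->active )
        return -1;                    /* device is not currently mapped */
    memset(dev->vlpi, 0, sizeof(dev->vlpi));
    dev->active = false;
    return 0;
}

/* MAPD with Valid=1 on a preallocated device: mark it usable again. */
static int vits_mapd_map(struct vits_device *dev)
{
    assert(dev != NULL);
    if ( dev->active )
        return -1;                    /* ITT replacement needs separate care */
    dev->active = true;
    return 0;
}
```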
> 
> >
> > ### `MAPVI`/`MAPI` Command translation
> >
> > Format: `MAPVI device, ID, vID, vCID`
> 
> Actually the 2 commands are completely different:
>       - MAPI maps a (DevID, ID) to a collection
>       - MAVI maps a (DevID, ID) to a collection and an LPI.

MAPVI for the second one I think?

The difference is that MAPI lacks the vID argument?

> The process described below is only about MAPVI.

OK. I've left a placeholder for `MAPI`.

> Also what about interrupt re-mapping?

I don't know, what about it?

> > - Validate vCID and get pCID by searching cid_map
> >
> > - if vID does not have entry in `vlpi_entries` of this device allocate
> >    a new pID from `vlpi_map` of this device and update `vlpi_entries`
> >    with new pID
> 
> What if the vID is already used by another

I think Vijay's updates already addressed this.

> > - Allocate irq descriptor and add to RB tree
> > - call `route_irq_to_guest()` for this pID
> > - Generate/format physical ITS command: `MAPVI device ID, pID, pCID`
> 
> 
> > Here the overhead is allocating a physical ID, allocating memory for the
> > irq descriptor and routing the interrupt.
> >
> > XXX Suggested to preallocate?
> 
> Right. We may also need to have a separate routing for LPIs as the
> current function is quite long to execute.
> 
> I was thinking of routing the interrupt at device assignment
> (assuming we allocate the pLPIs at that time), and only setting the mapping
> to vLPIs when MAPI is called.

Please propose concrete modifications to the text, since I can't figure
out what you mean to change here.

> 
> >
> > ### `INVALL` Command translation
> 
> The format of INVALL is INVALL collection
> 
> > A physical `INVALL` is only generated if the LPI dirty bitmap has any
> > bits set. Otherwise it is skipped.
> >
> > XXX Perhaps bitmap should just be a simple counter?
> 
> We would need to handle it per collection.

Hrm, this complicates things a bit. Don't we need to invalidate any pCID
which has a routing of an interrupt to vCID? i.e. potentially multiple
INVALL?
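This sketch of per-collection elision addresses Ian's point that one virtual `INVALL` may need several physical ones. It assumes Xen keeps a host-wide count, per pCID, of LPIs whose configuration changed. It also assumes Xen knows which pCIDs the vCID's interrupts are routed to, held here as a bitmask. `pcid_dirty`, `lpi_mark_dirty()` and `vits_invall()` are hypothetical. Resetting a pCID's count after its physical `INVALL` is safe for every guest sharing that pCID, because the physical command reloads the whole collection.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PCIDS 8u

/* Hypothetical host-wide count of dirty LPI configurations per pCID. */
static unsigned int pcid_dirty[NR_PCIDS];

/* Record that an LPI routed via pcid had its configuration changed. */
static void lpi_mark_dirty(unsigned int pcid)
{
    assert(pcid < NR_PCIDS);
    pcid_dirty[pcid]++;
}

/*
 * Emulate INVALL vCID.  pcid_mask has bit p set if any interrupt of the
 * vCID is routed to pCID p.  Returns the mask of pCIDs that need a
 * physical INVALL (the caller queues them) and marks those pCIDs clean.
 */
static uint32_t vits_invall(uint32_t pcid_mask)
{
    uint32_t to_issue = 0;

    for ( unsigned int p = 0; p < NR_PCIDS; p++ )
    {
        uint32_t bit = UINT32_C(1) << p;

        if ( (pcid_mask & bit) && pcid_dirty[p] != 0 )
        {
            to_issue |= bit;
            pcid_dirty[p] = 0;
        }
    }
    return to_issue;
}
```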

> > XXX bitmap is host global, a per-domain bitmap would allow us to elide
> > `INVALL` unless an LPI associated with the guest making the request
> > was dirty. Would also need some sort of "ITS INVALL" clock in order
> > that other guests can elide their own `INVALL` if one has already
> > happened. Complexity not worth it at this stage?
> 
> Given that I just discovered that INVALL is also taking a collection in
> parameter, it will likely be more complex.

Yes.

> 
> >
> > ### `SYNC` Command translation
> 
> The format of SYNC is SYNC target. It only ensures completion for a
> single re-distributor.
> However, the pseudo-code (see perform_sync in 5.13.22 in
> PRD03-GENC-010745 24.0) seems to say it waits for all re-distributors...
> I'm not sure which to trust.

Yes, it's confusing but the first sentence of 5.13.22 says:
        This command specifies that the ITS must wait for completion of
        internal effects of all previous commands, and all
        outstanding physical actions relating to the specified
        re-distributor.
        
So, by my reading, all redistributors need to have seen the effect of
any command issued to the given redistributor (not all commands given to
any redistributor).

Example: given command cA issued to redistributor rA and command cB
issued to redistributor rB and then issuing SYNC(rA) must ensure that cA
is visible to _both_ rA and rB, but doesn't say anything regarding cB at
all.

Ian.



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

