
Re: [Xen-devel] [Draft C] Xen on ARM vITS Handling



On Mon, 2015-06-01 at 16:29 +0100, Julien Grall wrote:
> On 01/06/15 14:12, Ian Campbell wrote:
> > On Fri, 2015-05-29 at 14:40 +0100, Julien Grall wrote:
> >> Hi Ian,
> 
> Hi Ian,
> 
> >> NIT: You used my Linaro email which I think is de-activated now :).
> > 
> > I keep finding new address books with that address in them!
> > 
> >>> ## ITS Translation Table
> >>>
> >>> Message signalled interrupts are translated into an LPI via an ITS
> >>> translation table which must be configured for each device which can
> >>> generate an MSI.
> >>
> >> I'm not sure what is the ITS Table Table. Did you mean Interrupt
> >> Translation Table?
> > 
> > I don't think I wrote Table Table anywhere.
> 
> Sorry I meant "ITS translation table"
> 
> > I'm referring to the tables which are established by e.g. the MAPD
> > command and friends, e.g. the thing shown in "4.9.12 Notional ITS Table
> > Structure".
> 
> In the previous paragraph you are referring specifically to the "Interrupt
> Translation Table". This is the only table that is configured per device.

I'm afraid I'm still not getting your point. Please quote the exact text
which you think is wrong and if possible suggest an alternative.

> [..]
> 
> >>> XXX there are other aspects to virtualising the ITS (LPI collection
> >>> management, assignment of LPI ranges to guests, device
> >>> management). However these are not currently considered here. XXX
> >>> Should they be/do they need to be?
> >>
> >> I think we began to cover these aspects with the section "command
> >> emulation".
> > 
> > Some aspects, yes. I went with:
> > 
> >         There are other aspects to virtualising the ITS (LPI collection
> >         management, assignment of LPI ranges to guests, device
> >         management). However these are only considered here to the extent
> >         needed for describing the vITS emulation.
> > 
> >>> XXX In the context of virtualised device ids this may not be the case,
> >>> e.g. we can arrange for (mostly) contiguous device ids and we know the
> >>> bound is significantly lower than 2^32
> >>
> >> Well, the deviceID is computed from the BDF and some DMA alias. As the
> >> algorithm can't be tweaked, it's very likely that we will have
> >> non-contiguous Device IDs. See pci_for_each_dma_alias in Linux
> >> (drivers/pci/search.c).
> > 
> > The implication here is that deviceID is fixed in hardware and is used
> > by driver domain software in contexts where we do not get the
> > opportunity to translate, is that right? What contexts are those?
> 
> No, the driver domain software will always use a virtual DeviceID (based
> on the vBDF and other things). The problem I wanted to raise is how to
> translate back the vDeviceID to a physical deviceID/BDF.

Right, so this goes back to my original point, which is that if we
completely control the translation from vDeviceID to pDeviceID/BDF then
the vDeviceId space need not be sparse and need not utilise the entire
2^32 space, at least for domU uses.

> > Note that the BDF is also something which we could in principal
> > virtualise (we already do for domU). Perhaps that is infeasible for dom0
> > though?
> 
> For DOM0 the virtual BDF is equal to the physical BDF. So both
> deviceIDs (physical and virtual) will be the same.
> 
> We may decide to do vBDF == pBDF for guests too in order to simplify the
> code.

It seems to me that choosing vBDF such that the vDeviceId space is to
our liking would be a good idea.
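
To illustrate what I mean (purely a sketch, all names invented): if we hand
out vBDFs, and therefore vDeviceIDs, sequentially at passthrough setup time,
the domU lookup could be a small dense array rather than anything fancy:

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_PT_DEVICES 32   /* arbitrary per-domain passthrough limit */

    struct pt_device {
        uint32_t pdevid;        /* physical deviceID, derived from the pBDF */
        /* ... per-device state ... */
    };

    struct vits_domain {
        /* vDeviceID == array index, assigned sequentially at passthrough
         * setup, so validation and translation are O(1). */
        struct pt_device *devs[MAX_PT_DEVICES];
        uint32_t nr_devs;
    };

    static struct pt_device *vdevid_to_device(struct vits_domain *d,
                                              uint32_t vdevid)
    {
        return ( vdevid < d->nr_devs ) ? d->devs[vdevid] : NULL;
    }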

> > That gives me two thoughts.
> > 
> > The first is that although device identifiers are not necessarily
> > contiguous, they are generally at least grouped and not allocated at
> > random through the 2^32 options. For example a PCI Host bridge typically
> > has a range of device ids associated with it and each device has a
> > device id derived from that.
> 
> Usually it's one per (device, function).

Yes, but my point is that they are generally grouped by bus. The bus is
assigned a (contiguous) range and individual (device,function) => device
id mappings are derived by a formula applied to the base of that range.

i.e. for a given PCI bus the device ids are in the range 1000..1000+N,
not N random numbers selected from the 2^32 space.
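
Something like the following (hypothetical, purely to illustrate the
grouping) is what I have in mind: each host bridge owns a contiguous
deviceID range and the per-device ID is derived from that base plus the
requester ID:

    #include <stdint.h>

    /* Hypothetical per-host-bridge descriptor: each bridge owns a
     * contiguous deviceID range rather than IDs scattered over 2^32. */
    struct host_bridge {
        uint32_t devid_base;    /* first deviceID of this bridge's range */
        uint32_t devid_count;   /* size of the range */
    };

    /* Requester ID: bus[15:8] device[7:3] function[2:0]. */
    static inline uint32_t bdf_to_devid(const struct host_bridge *hb,
                                        uint16_t bdf)
    {
        return hb->devid_base + bdf;
    }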

> 
> > 
> > I'm not sure if we can leverage that into a more useful data structure
> > than an R-B tree, or for example to arrange for the R-B to allow for the
> > translation of a device within a span into the parent span and from
> > there do the lookup. Specifically when looking up a device ID
> > corresponding to a PCI device we could arrange to find the PCI host
> > bridge and find the actual device from there. This would keep the RB
> > tree much smaller and therefore perhaps quicker? Of course that depends
> > on what the lookup from PCI host bridge to a device looked like.
> 
> I'm not sure why you are speaking about a PCI host bridge. AFAIK, the
> guest doesn't have a physical host bridge.

It has a virtual one provided by the pciif/pcifront+back thing. Any PCI
bus is behind some sort of host bridge, whether physical, virtual or
"notional".

> Although, this is an optimization that we can think about later. The
> R-B tree will already be fast enough for a first implementation. My main
> point was about the translation vDeviceID => pDeviceID.
> 
> > The second is that perhaps we can do something simpler for the domU
> > case, if we were willing to tolerate it being different from dom0.
> > 
> >>> Possible efficient data structures would be:
> >>>
> >>> 1. List: The lookup/deletion is O(n) and the insertion will depend
> >>>     on whether the devices are kept sorted by identifier. The
> >>>     memory overhead is 18 bytes per element.
> >>> 2. Red-black tree: All the operations are O(log(n)). The memory
> >>>     overhead is 24 bytes per element.
> >>>
> >>> A Red-black tree seems more suitable for fast deviceID
> >>> validation even though the memory overhead is a bit higher compared to
> >>> the list.
> >>>
> >>> ### Event ID (`vID`)
> >>>
> >>> This is the per-device Interrupt identifier (i.e. the MSI index). It
> >>> is configured by the device driver software.
> >>>
> >>> It is not necessary to translate a `vID`, however they may need to be
> >>> represented in various data structures given to the pITS.
> >>>
> >>> XXX is any of this true?
> >>
> >>
> >> Right, the vID will always be equal to the pID. Although you will need
> >> to associate a physical LPI with every pair (vID, DevID).
> > 
> > I think in the terms defined by this document that is (`ID`, `vID`) =>
> > an LPI. Right?
> 
> Well, by `ID` do you mean deviceID?

I think this section has suffered from multiple authors using
inconsistent terminology and from some of the naming being confusingly
similar. I'm going to do a pass and switch everything to consistently
use the names used in the spec, with a p- or v- prefix as necessary.

> > Have we considered how this mapping will be tracked?
> 
> I don't think so. We need to track the list of `event ID`s enabled for
> this deviceID and the associated `vLPI` and `LPI`.
> 
> Although, if we allocate a contiguous chunk of `LPI`s it won't be necessary
> to track the latter.

Is the event ID space for a given device linear?
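
For concreteness, the per-device tracking you describe might look something
like the sketch below (field names are made up; it assumes a contiguous
chunk of pLPIs reserved at assignment time, so only the vLPI needs storing
per event):

    #include <stdint.h>

    struct its_device {
        uint32_t vdevid;        /* virtual deviceID */
        uint32_t pdevid;        /* physical deviceID */
        uint32_t plpi_base;     /* first pLPI of the contiguous chunk */
        uint32_t nr_events;     /* number of events == size of the chunk */
        uint32_t *vlpi;         /* vlpi[event] => vLPI, 0 if unmapped */
    };

    static inline uint32_t event_to_plpi(const struct its_device *dev,
                                         uint32_t event)
    {
        /* The pLPI chunk is contiguous, so no per-event table is needed. */
        return dev->plpi_base + event;
    }
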

> 
> >>> ### Interrupt Collection (`vCID`)
> >>>
> >>> This parameter is used in commands which manage collections and
> >>> interrupts in order to move them from one CPU to another. The ITS is
> >>> only mandated to implement N + 1 collections where N is the number of
> >>> processors on the platform (i.e. max number of VCPUs for a given
> >>> guest). Furthermore, the identifiers are always contiguous.
> >>>
> >>> If we decide to implement the strict minimum (i.e N + 1), an array is
> >>> enough and will allow operations in O(1).
> >>>
> >>> XXX Could forgo array and go straight to vcpu_info/domain_info.
> >>
> >> Not really, the number of collections is always one higher than the
> >> number of VCPUs. How would you store the last collection?
> > 
> > In domain_info. What I meant was:
> > 
> >     if ( vcid == domain->nr_vcpus )
> >          return domain->interrupt_collection
> >     else if ( vcid < domain->nr_vcpus )
> >          return domain->vcpus[vcid]->interrupt_collection
> >     else
> >          invalid vcid.
> > 
> > Similar to how SPI vs PPI interrupts are handled.
> 
> Sorry, I didn't understand your suggestion like that.
> 
> I think that can work, although the resulting code may be difficult to
> read/understand because a collection can be moved from one vCPU to another.

Actually, I think the guest is allowed to point multiple collections at
the same vcpu (it's dumb, but allowed), so this scheme probably doesn't
work.
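
In which case I think we end up with a plain per-domain array of N + 1
collection slots anyway, which keeps the O(1) lookup while letting several
vCIDs target the same vCPU. Roughly (a sketch only, names invented):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    struct vits_collection {
        bool     mapped;        /* has the guest issued MAPC for this vCID? */
        uint32_t vcpu_id;       /* target vCPU, derived from the vTA */
    };

    struct vits {
        unsigned int nr_collections;          /* nr_vcpus + 1 */
        struct vits_collection *collections;  /* indexed by vCID */
    };

    static struct vits_collection *vcid_to_collection(struct vits *v,
                                                      uint32_t vcid)
    {
        if ( vcid >= v->nr_collections )
            return NULL;        /* invalid vCID */
        return &v->collections[vcid];
    }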

> > 
> >> INVALL ensures that any interrupts in the specified collection are
> >> re-loaded. SYNC ensures that all previous commands, and all outstanding
> >> physical actions relating to the specified re-distributor, are completed.
> > 
> >>
> >>> All other ITS commands like `MOVI`, `DISCARD`, `INV`, `INT`, `CLEAR`
> >>> just validate and generate a physical command.
> >>>
> >>> ### `MAPC` command translation
> >>>
> >>> Format: `MAPC vCID, vTA`
> >>>
> >>> - `MAPC pCID, pTA` physical ITS command is generated
> >>
> >> We should not send any MAPC command to the physical ITS. The collection
> >> is already mapped during Xen boot.
> > 
> > What is the plan for this start of day mapping? One collection per pCPU
> > and ignore the rest?
> 
> To map only one collection per pCPU. We can look at improvements later.

If you are making assumptions like this which are not written down then
please point out explicitly that an assumption needs to be documented.

> > Supposing that a guest is likely to use collections to map interrupts to
> > specific vcpus, and that the physical collections will be mapped to
> > pcpus, I suppose this means we will need to do some relatively expensive
> > remapping (corresponding to moving the IRQ to another collection) in
> > arch_move_irqs? Is that the best we can do?
> 
> If we use the same solution as SPI/PPIs (i.e. arch_move_irqs), yes. But
> I'm not in favor of this solution, which is already expensive for
> SPI/PPIs (after talking with Dario, vCPUs may move often between pCPUs).
> 
> We could decide to implement a lazy solution (i.e. moving the interrupt
> only when it fires) but that would require sending ITS commands from Xen.
> 
> Another idea is to never move the interrupt collection from one pCPU to
> another. The vCID would always map to the same pCID, but the
> interrupt would be injected into a different processor depending on the
> associated vTA.

I have an idea here, but I'm going to try and update the document for
the current feedback before proposing a radical change.


> > 
> >> I guess that if the OS is playing with the ITT (such as writing to it)
> >> the ITS will behave badly. We have to ensure that the guest will never
> >> write to it and, at the same time, that the same region is not passed
> >> to 2 devices.
> > 
> > I don't think we will be exposing the physical ITT to the guest, will
> > we? That will live in memory which Xen owns and controls and doesn't
> > share with any guest.
> 
> That needs to be clarified. The RFC sent by Vijay lets the guest
> feed the physical ITT.
>
> But I just thought that it won't work well: the ITT has to be contiguous
> in physical memory.
> 
> > In fact, I don't know that a vITS will need an ITT memory at all, i.e.
> > most of our GITS_BASERn will be unimplemented.
> > In theory we could use these registers to offload some of the data
> > structure storage requirements to the guest, but that would require
> > great care to validate any time we touched it (or perhaps just
> > p2m==r/o); I think it is probably not worth the stress if we can just
> > use regular hypervisor-side data structures instead? (This stuff is just
> > there for the h/w ITS which doesn't have the luxury of xmalloc).
> 
> GITS_BASERn is only used to feed internal ITS tables such as Collections,
> Devices, Virtual Processors.
> 
> For the ITT, the address is passed via MAPD and has to be allocated by
> the guest.
> 
> We could ignore the address provided by the guest but that would mean
> the ITT is allocated twice (once by the guest, unused, and once by Xen,
> used to feed the pITS). Although, as pointed out above, the ITT memory
> region allocated by the guest may not be contiguous in physical memory.

There is absolutely no way that we are going to be giving guest
controlled memory to the pITS. We absolutely have to allocate and use
Xen memory for the pITT.

The vITT needs to be "shadowed" into the pITT after translation and
verification.
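
i.e. something along these lines for MAPD emulation (a rough sketch; the
fields and helpers are invented, not existing Xen code):

    #include <stdint.h>
    #include <errno.h>

    /* Only the fields needed here; see the earlier its_device sketch. */
    struct its_device {
        uint32_t pdevid;        /* physical deviceID */
        uint64_t vitt_addr;     /* guest's ITT address, kept as shadow state */
        void    *pitt;          /* Xen-owned, physically contiguous ITT */
        unsigned int max_events;
    };

    /* Hypothetical helpers. */
    void *alloc_pitt(unsigned int nr_events);
    int its_send_mapd(uint32_t pdevid, void *pitt, unsigned int nr_events);

    static int vits_handle_mapd(struct its_device *dev,
                                uint64_t guest_itt_addr,
                                unsigned int nr_events)
    {
        if ( !dev || nr_events > dev->max_events )
            return -EINVAL;                  /* validate before shadowing */

        dev->vitt_addr = guest_itt_addr;     /* recorded, never given to pITS */

        if ( !dev->pitt )
            dev->pitt = alloc_pitt(nr_events);
        if ( !dev->pitt )
            return -ENOMEM;

        /* The physical MAPD always uses the Xen-allocated ITT. */
        return its_send_mapd(dev->pdevid, dev->pitt, nr_events);
    }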

> Furthermore, if we allocate the ITT in Xen, this command will be quick
> to emulate (we could send the MAPD command when the device is attached).
> 
> So there are fewer possible security issues.

Far less, yes.

> >>> Here the overhead is with memory allocation for `its_device` and 
> >>> `vlpi_map`
> >>>
> >>> XXX Suggestion was to preallocate some of those at device passthrough
> >>> setup time?
> >>
> >> Some of this information can even be set up when the PCI device is added
> >> to Xen (such as the number of MSIs supported and the physical LPI chunk).
> > 
> > Yes, assuming there are sufficient LPIs to allocate in this way. That's
> > not clear though, is it?
> 
> IMHO it's not an issue. If the ITS doesn't provide enough LPIs a
> baremetal OS will likely not be able to use the device.
> 
> FWIW, Linux always allocates the LPIs when the device is added, based
> on the PCI cfg.

The sort of thing I'm worried about is a device (i.e. some off the shelf
IP) supporting some number of MSIs, integrated into a platform where, for
whatever reason, the designer has decided that only a smaller set of
MSIs is worth supporting and has sized the ITS accordingly (e.g. deciding
to only support 8 queues in a NIC instead of the full 16).

In such a design sum(all devices' MSIs) > num LPIs.

Maybe we can defer worrying about this till later.
> > 
> >>> - Allocate irq descriptor and add to RB tree
> >>> - call `route_irq_to_guest()` for this pID
> >>> - Generate/format physical ITS command: `MAPVI device ID, pID, pCID`
> >>
> >>
> >>> Here the overhead is allocating the physical ID, allocating memory for
> >>> the irq descriptor and routing the interrupt.
> >>>
> >>> XXX Suggested to preallocate?
> >>
> >> Right. We may also need a separate routing path for LPIs as the
> >> current function takes quite a long time to execute.
> >>
> >> I was thinking of routing the interrupt at device assignment
> >> (assuming we allocate the pLPIs at that time). And only setting the
> >> mapping to vLPIs when MAPI is called.
> > 
> > Please propose concrete modifications to the text, since I can't figure
> > out what you mean to change here.
> 
> It's not yet fully clear in my mind.
> 
> If we assume that we have a chunk of LPIs allocated for the device
> when it's assigned to the guest, the only missing information is the
> corresponding vLPIs.
> 
> We could decide to set up the internal routing structure (allocation of
> the irq_guest) and defer the setting of the vLPI. This would work
> because the interrupts are disabled at boot time and won't be enabled
> until a MAPVI command is sent.

I'm going to try and write some introductory bits about these data
structures and early allocations to try and bring some clarity here so
we can decide what to actually do.

> >>> ### `INVALL` Command translation
> >>
> >> The format of INVALL is INVALL collection
> >>
> >>> A physical `INVALL` is only generated if the LPI dirty bitmap has any
> >>> bits set. Otherwise it is skipped.
> >>>
> >>> XXX Perhaps bitmap should just be a simple counter?
> >>
> >> We would need to handle it per collection.
> > 
> > Hrm, this complicates things a bit. Don't we need to invalidate any pCID
> > which has a routing of an interrupt to vCID? i.e. potentially multiple
> > INVALL?
> 
> Not necessarily. If we keep the interrupts of a vCID always in the same
> pCID, we would need to send only one INVALL.

I think I/we need to spec out the vCID->pCID mapping a bit more clearly
so we can see how this will fit together.
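
For reference, the "simple counter, handled per collection" idea might
reduce to something like this, assuming each vCID stays pinned to a single
pCID so one physical INVALL per dirty collection is enough (sketch only,
names invented):

    #include <stdint.h>

    /* As in the earlier collection sketch, plus per-collection dirty state. */
    struct vits_collection {
        uint32_t pcid;            /* fixed physical collection for this vCID */
        uint32_t nr_dirty_lpis;   /* counter instead of a dirty bitmap */
    };

    /* Hypothetical helper which queues a physical INVALL command. */
    int its_send_invall(uint32_t pcid);

    static int vits_handle_invall(struct vits_collection *col)
    {
        if ( col->nr_dirty_lpis == 0 )
            return 0;             /* nothing changed since last time, skip */

        col->nr_dirty_lpis = 0;
        return its_send_invall(col->pcid);
    }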

Ian.

