
Re: [Xen-devel] Xen/arm: Virtual ITS command queue handling

On Fri, May 15, 2015 at 4:58 PM, Ian Campbell <ian.campbell@xxxxxxxxxx> wrote:
> On Wed, 2015-05-13 at 21:57 +0530, Vijay Kilari wrote:
>> > * On receipt of an interrupt notification arising from Xen's own use
>> >   of `INT`; (see discussion under Completion)
>>     If the INT notification method is used, then I don't think there is a need
>> for pITS scheduling on CREADR read.
>> As we discussed in patch #13, the steps below should suffice to virtualize
>> the command queue.
>> 1) On each guest CWRITER update, read a batch ('m' commands) of commands,
>>     translate it and put it on the pITS schedule list. If there are more than
>>     'm' commands, create m/n entries in the schedule list. Append an INT
>>     command for each schedule list entry.
> How many INT commands do you mean here?

   One INT command (Xen's completion INT) per batch
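
To make the batch arithmetic concrete, here is a minimal sketch in C. It
assumes the number of schedule list entries is the guest command count
divided by 'm', rounded up, and that exactly one Xen completion INT follows
each entry. The names (`VITS_BATCH_CMDS`, `vits_nr_batches()` and so on) and
the value 4 are only for illustration; nothing here is from the patch series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch size 'm'; the thread does not fix a value. */
#define VITS_BATCH_CMDS 4

/* Schedule list entries needed for ncmds guest commands: ceil(ncmds / m). */
static size_t vits_nr_batches(size_t ncmds)
{
    return (ncmds + VITS_BATCH_CMDS - 1) / VITS_BATCH_CMDS;
}

/* Number of guest commands carried by schedule list entry i (0-based). */
static size_t vits_batch_len(size_t ncmds, size_t i)
{
    size_t start = i * VITS_BATCH_CMDS;

    assert(start < ncmds);      /* entry i must exist */
    return (ncmds - start < VITS_BATCH_CMDS) ? ncmds - start
                                             : VITS_BATCH_CMDS;
}

/*
 * Physical queue slots consumed in total: the translated commands plus
 * one completion INT appended by Xen per schedule list entry.
 */
static size_t vits_nr_pits_slots(size_t ncmds)
{
    return ncmds + vits_nr_batches(ncmds);
}
```

With m = 4, a guest that writes 10 commands before updating CWRITER would
produce three entries (4, 4 and 2 commands) and occupy 13 physical slots.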

>>      1a) If there is no ongoing command from this vITS on the physical queue,
>>            send to the physical queue.
>>      1b) If there is an ongoing command, return to the guest.
>> 2) On receiving the completion interrupt, update CREADR of the guest and post
>>     the next command from the schedule list to the physical queue.
>> With this,
>>    - There will be no overhead of translating commands in interrupt context,
>> which is quite heavy because translating an ITS command requires validating
>> and updating internal ITS structures.
> Can you give some examples of the heaviest translations please so I can
> get a feel for actually how expensive we are talking here.
    For example, to translate MAPVI device_ID, event_ID, vID, vCID:

    1) Read from the vITS command queue
    2) Validate that device_ID is valid by looking at the device list
       attached to that domain (vITS)
    3) Validate vCID (virtual Collection ID) by checking against the
       re-distributor address/cpu numbers of this domain
    4) Allocate a physical LPI for the vID (virtual LPI) from the lpi map
       of this device
           - Check if the virtual LPI is already allocated from this device.
           - If not, allocate it
           - Update the lpi entries for this device
    5) Allocate memory for the physical LPI descriptor (add a radix tree
       entry) and populate it
    6) Call route_irq_to_guest() for this LPI
    7) Format the physical ITS command and send it to the pITS
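
A rough sketch of steps 2, 3, 4 and 7 in C, to show where the cost comes
from. All structures and names below are toy stand-ins I made up for
illustration, not the real Xen ones. It assumes a fixed-size per-device LPI
map, that valid vCIDs are simply 0..nr_vcpus-1, and that the device ID and
collection ID are passed through unchanged. Steps 5 and 6 are left as a
comment because they depend on Xen internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LPI_BASE   8192u    /* LPI INTIDs start at 8192 */
#define MAX_EVENTS 8        /* toy per-device limit (assumption) */
#define NO_LPI     0u

struct vits_device {
    uint32_t device_id;
    uint32_t plpi_base;           /* first physical LPI of this device */
    uint32_t vlpi[MAX_EVENTS];    /* vlpi[i] -> plpi_base + i, 0 = free */
};

struct vits_domain {
    struct vits_device *devs;     /* devices attached to this domain */
    size_t nr_devs;
    uint32_t nr_vcpus;            /* assumed: valid vCIDs are 0..nr_vcpus-1 */
};

struct mapvi_cmd { uint32_t device_id, event_id, id, cid; };

static struct vits_device *find_device(struct vits_domain *d, uint32_t id)
{
    for (size_t i = 0; i < d->nr_devs; i++)
        if (d->devs[i].device_id == id)
            return &d->devs[i];
    return NULL;
}

/* Step 4: reuse the physical LPI if vlpi is already mapped, else allocate. */
static uint32_t alloc_plpi(struct vits_device *dev, uint32_t vlpi)
{
    size_t free_slot = MAX_EVENTS;

    for (size_t i = 0; i < MAX_EVENTS; i++) {
        if (dev->vlpi[i] == vlpi)
            return dev->plpi_base + (uint32_t)i;
        if (dev->vlpi[i] == 0 && free_slot == MAX_EVENTS)
            free_slot = i;
    }
    if (free_slot == MAX_EVENTS)
        return NO_LPI;
    dev->vlpi[free_slot] = vlpi;
    return dev->plpi_base + (uint32_t)free_slot;
}

/* Returns 0 and fills *p on success, -1 if the guest command is invalid. */
static int translate_mapvi(struct vits_domain *d, const struct mapvi_cmd *v,
                           struct mapvi_cmd *p)
{
    struct vits_device *dev;
    uint32_t plpi;

    assert(d && v && p);
    dev = find_device(d, v->device_id);               /* step 2 */
    if (!dev || v->cid >= d->nr_vcpus)                /* step 3 */
        return -1;
    if (v->id < LPI_BASE)                             /* not an LPI */
        return -1;
    plpi = alloc_plpi(dev, v->id);                    /* step 4 */
    if (plpi == NO_LPI)
        return -1;
    /* Steps 5 and 6 (LPI descriptor, route_irq_to_guest()) omitted. */
    p->device_id = v->device_id;                      /* step 7 */
    p->event_id = v->event_id;
    p->id = plpi;
    p->cid = v->cid;
    return 0;
}
```

Even in this reduced form each command costs a device lookup plus a scan of
the LPI map, before the allocation and routing work of steps 5 and 6.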

>>    - Always only one request from guest will be posted to physical queue
>>    - Even if a guest floods with a large number of commands, all the commands
>>      will be translated and queued in the schedule list and posted batch by batch
>>    - Scheduling pass is called only on CWRITER & completion INT.
> I think the main difference in what you propose here is that commands
> are queued in pre-translated form to be injected (cheaply) during
> scheduling as opposed to being left on the guest queue and translated
> directly into the pits queue.
> I think `INT` vs `CREADR` scheduling is largely orthogonal to that.
> Julien proposed moving scheduling to a softirq, which gets it out of IRQ
> context (good) but does not necessarily account the translation to the
> guest, which is a benefit of your approach. (I think things which happen
> in a softirq are implicitly accounted to current, whoever that may be)
   Would that be one softirq that looks at all the vITSes and posts the
commands to the pITS, or one softirq per vITS?

> On the downside pretranslation adds memory overhead and reintroduces the
> issue of a potentially long synchronous translation during `CWRITER`
> handling.

   Memory that is allocated is freed after completion of that batch.
The translation duration depends on how many commands the guest
writes before updating CWRITER.

> We could pretranslate a batch of commands into a s/w queue rather than
> into the pits queue, but then we are back to where do we refill that
> queue from.
> The first draft wasn't particularly clear on when translation occurs
> (although I intended it to be during scheduling). I shall add some
> treatment of that to the next draft.
>> > * On any interrupt injection arising from a guest's use of the `INT`
>> >   command; (XXX perhaps, see discussion under Completion)
>> >
>> > Each scheduling pass will:
>> >
>> > * Read the physical `CREADR`;
>> > * For each command between `pits.last_creadr` and the new `CREADR`
>> >   value process completion of that command and update the
>> >   corresponding `vits_cq.creadr`.
>> > * Attempt to refill the pITS Command Queue (see below).
>> >
>> > ### Filling the pITS Command Queue.
>> >
>> > Various algorithms could be used here. For now a simple proposal is
>> > to traverse the `pits.schedule_list` starting from where the last
>> > refill finished (i.e. not from the top of the list each time).
>> >
>> > If a `vits_cq` has no pending commands then it is removed from the
>> > list.
>> >
>> > If a `vits_cq` has some pending commands then `min(pits-free-slots,
>> > vits-outstanding, VITS_BATCH_SIZE)` will be taken from the vITS
>> > command queue, translated and placed onto the pITS
>> > queue. `vits_cq.progress` will be updated to reflect this.
>> >
>> > Each `vits_cq` is handled in turn in this way until the pITS Command
>> > Queue is full or there are no more outstanding commands.
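
The refill count described above can be sketched as follows. This is only an
illustration under a few assumptions: indices are counted in whole commands
rather than byte offsets, one ring slot is always left empty so that a full
ring can be told apart from an empty one, and `VITS_BATCH_SIZE` is 4 (the
draft suggests 4 or 8). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VITS_BATCH_SIZE 4u   /* draft says "small, TBD say 4 or 8" */

static uint32_t min3(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t m = a < b ? a : b;

    return m < c ? m : c;
}

/* Commands currently between reader and writer on a ring of nr_slots. */
static uint32_t ring_used(uint32_t reader, uint32_t writer, uint32_t nr_slots)
{
    assert(reader < nr_slots && writer < nr_slots);
    return (writer + nr_slots - reader) % nr_slots;
}

/* Free slots, keeping one slot empty to distinguish full from empty. */
static uint32_t ring_free(uint32_t reader, uint32_t writer, uint32_t nr_slots)
{
    return nr_slots - 1 - ring_used(reader, writer, nr_slots);
}

/* How many commands to take from one vits_cq during a refill. */
static uint32_t vits_refill_count(uint32_t pits_free, uint32_t vits_outstanding)
{
    return min3(pits_free, vits_outstanding, VITS_BATCH_SIZE);
}
```

For example, with 2 free physical slots and 10 outstanding guest commands
only 2 commands are taken, and the rest wait for the next pass.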
>> >
>> > There will likely need to be a data structure which shadows the pITS
>> > Command Queue slots with references to the `vits_cq` which has a
>> > command currently occupying that slot and the corresponding index into
>> > the virtual command queue, for use when completing a command.
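
One possible shape for such a shadow structure, together with the completion
walk from `pits.last_creadr` to the new `CREADR`, is sketched below. It is
an assumption-laden toy: the ring has 8 slots, indices are command indices
rather than byte offsets, a NULL owner marks a command issued by Xen itself,
and all the type and field names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PITS_SLOTS 8u   /* toy ring size (assumption) */

struct vits_cq {
    uint32_t creadr;    /* virtual CREADR as seen by the guest */
};

/* Shadow of one pITS slot. */
struct pits_shadow {
    struct vits_cq *owner;  /* NULL: command issued by Xen itself */
    uint32_t vnext;         /* virtual index just past this command */
};

struct pits {
    uint32_t last_creadr;
    struct pits_shadow shadow[PITS_SLOTS];
};

/*
 * Process every command completed since the previous pass, i.e. the
 * slots from last_creadr up to (not including) the new physical CREADR.
 * Returns the number of slots retired.
 */
static unsigned int pits_complete(struct pits *p, uint32_t creadr)
{
    unsigned int done = 0;

    assert(creadr < PITS_SLOTS);
    while (p->last_creadr != creadr) {
        struct pits_shadow *s = &p->shadow[p->last_creadr];

        if (s->owner) {
            s->owner->creadr = s->vnext;  /* guest now sees it as done */
            s->owner = NULL;
        }
        p->last_creadr = (p->last_creadr + 1) % PITS_SLOTS;
        done++;
    }
    return done;
}
```

The walk handles wrap-around of the physical ring and interleaved batches
from different guests, since each slot records its own owner.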
>> >
>> > `VITS_BATCH_SIZE` should be small, TBD say 4 or 8.
>> >
>> > Possible simplification: If we arrange that no guest ever has multiple
>> > batches in flight (which can occur if we wrap around the list several
>> > times) then we may be able to simplify the bookkeeping
>> > required. However this may need some careful thought wrt fairness for
>> > guests submitting frequent small batches of commands vs those sending
>> > large batches.
>>   If one LPI of the dummy device is assigned to each VM, then bookkeeping
>> per vITS becomes simple
> What dummy device do you mean? What simplifications does it imply?

  I mean a fake (non-existent) device used to generate the completion INT.
With a unique completion INT for every vITS, bookkeeping would be
simple, because it identifies the vITS on receiving a completion INT
(completion INT <=> vITS mapping).
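
A minimal sketch of that mapping, assuming the dummy device owns a small
contiguous range of LPIs and each vITS is handed one of them. The base LPI,
the table size and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COMPLETION_LPI_BASE 8192u   /* hypothetical first LPI of dummy device */
#define MAX_VITS            16u     /* hypothetical table size */

struct vits {
    int domid;
    uint32_t creadr;
};

/* completion_map[i] is the vITS bound to LPI COMPLETION_LPI_BASE + i. */
static struct vits *completion_map[MAX_VITS];

/* Bind a free completion LPI to v; returns the LPI, or 0 if none is left. */
static uint32_t vits_bind_completion_lpi(struct vits *v)
{
    assert(v != NULL);
    for (uint32_t i = 0; i < MAX_VITS; i++) {
        if (completion_map[i] == NULL) {
            completion_map[i] = v;
            return COMPLETION_LPI_BASE + i;
        }
    }
    return 0;
}

/* Called from the completion interrupt: which vITS does this LPI belong to? */
static struct vits *vits_from_completion_lpi(uint32_t lpi)
{
    if (lpi < COMPLETION_LPI_BASE || lpi - COMPLETION_LPI_BASE >= MAX_VITS)
        return NULL;
    return completion_map[lpi - COMPLETION_LPI_BASE];
}
```

On a completion INT the handler would then only need the LPI number to find
the vITS whose CREADR has to be updated.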

>> >
>> > ### Completion
>> >
>> > It is expected that commands will normally be completed (resulting in
>> > an update of the corresponding `vits_cq.creadr`) via guest read from
>> > `CREADR`. This will trigger a scheduling pass which will ensure the
>> > `vits_cq.creadr` value is up to date before it is returned.
>> >
>>     If the guest reads CREADR to learn of command completion, there is no
>> need for a scheduling pass if INT is used.
> We cannot know a priori which scheme a guest is going to use, nor do we
> have the freedom to mandate a particular scheme, or even that the guest
> uses the same scheme for every batch of commands.
> So we need to design a system which works whether all guests use only
> INT or all guests use only CREADR polling or anything in between.
> A scheduling pass is not needed on INT injection (either Xen's or the
> guests) in order to update `CREADR` (as you suggest), however it may be
> necessary in order to keep the pITS command queue moving by scheduling
> any outstanding commands. Consider the case of a guest which receives an
> INT but does not subsequently read `CREADR` (at all or in a timely
> manner).

  Scheduling outstanding commands and updating CREADR
is always done on Xen's completion INT.
So even if the guest does not read CREADR it does not matter.

One corner case I can think of is when the guest uses the INT method to learn
of command completion and the guest's INT is received before Xen's
completion INT arrives; in that case the guest might see an old CREADR.
To handle this scenario, we can insert Xen's completion INT just before the
guest's INT.
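
As a sketch of that ordering rule, assuming a translated batch is just an
array of commands and each entry records whether it came from Xen or from
the guest. The types and the `build_batch()` helper are invented for the
example, and it still appends the usual completion INT at the end of the
batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cmd_kind { CMD_OTHER, CMD_INT };

struct pcmd {
    enum cmd_kind kind;
    bool from_xen;      /* true for Xen's own completion INT */
};

/*
 * Copy n guest commands into out[], placing a Xen completion INT in
 * front of every guest INT and one more at the end of the batch.
 * Returns the number of entries written.
 */
static size_t build_batch(const enum cmd_kind *in, size_t n,
                          struct pcmd *out, size_t out_len)
{
    size_t o = 0;

    /* Worst case: every guest command is an INT, plus the final INT. */
    assert(out_len >= 2 * n + 1);
    for (size_t i = 0; i < n; i++) {
        if (in[i] == CMD_INT)
            out[o++] = (struct pcmd){ CMD_INT, true };
        out[o++] = (struct pcmd){ in[i], false };
    }
    out[o++] = (struct pcmd){ CMD_INT, true };
    return o;
}
```

The cost is visible in the output: each guest INT turns into two physical
interrupts.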

>> > A guest which does completion via the use of `INT` cannot observe
>> > `CREADR` without reading it, so updating on read from `CREADR`
>> > suffices from the point of view of the guest's observation of the
>> > state. (Of course we will inject the interrupt at the designated point
>> > and the guest may well then read `CREADR`)
>>    Insert Xen's completion INT before the guest's INT command, which
>> will update CREADR correctly before the guest receives its INT
> That means two interrupts. And there is no need because even with the
> guest's own completion INT it won't see things until it reads CREADR
> itself.
>> > However in order to keep the pITS Command Queue moving along we need
>> > to consider what happens if there are no `INT` based events nor reads
>> > from `CREADR` to drive completion and therefore refilling of the Queue
>> > with other outstanding commands.
>> >
>> > A guest which enqueues some commands and then never checks for
>> > completion cannot itself block things because any other guest which
>> > reads `CREADR` will drive completion. However if _no_ guest reads from
>> > `CREADR` then completion will not occur and this must be dealt with.
>> >
>>    Do you mean CREADR of guest should check all the vITS of other
>> guests to post pending commands?
> In the proposal `CREADR` kicks off a scheduling pass, which is
> independent of any particular vITS and operates only on the list of
> scheduled vits, decoupling the vits from the pits scheduling.
>> > Even if we include completion on `INT`-based interrupt injection then
>> > it is possible that the pITS queue may not contain any such
>> > interrupts, either because no guest is using them or because the
>> > batching means that none of them are enqueued on the active ring at
>> > the moment.
>> >
>> > So we need a fallback to ensure that queue keeps moving. There are
>> > several options:
>> >
>> > * A periodic timer in Xen which runs whenever there are outstanding
>> >   commands in the pITS. This is simple but pretty sucky.
>> > * Xen injects its own `INT` commands into the pITS ring. This requires
>> >   figuring out a device ID to use.
>> >
>> > The second option is likely to be preferable if the issue of selecting
>> > a device ID can be addressed.
>> >
>> > A secondary question is when these `INT` commands should be inserted
>> > into the command stream:
>> >
>> > * After each batch taken from a single `vits_cq`;
>> > * After each scheduling pass;
>> > * One active in the command stream at any given time;
>> >
>> > The last option should be sufficient: by arranging to insert an `INT` into
>> > the stream at the end of any scheduling pass which occurs while there
>> > is not a currently outstanding `INT` we have sufficient backstop to
>> > allow us to refill the ring.
>> >
>> > This assumes that there is no particular benefit to keeping the
>> > `CWRITER` rolling ahead of the pITS's actual processing. This is true
>> > because the ITS operates on commands in the order they appear in the
>> > queue, so there is no need to maintain a runway ahead of the ITS
>> > processing. (XXX If this is a concern perhaps the INT could be
>> > inserted at the head of the final batch of commands in a scheduling
>> > pass instead of the tail).
>> >
>> > Xen itself should never need to issue an associated `SYNC` command,
>> > since the individual guests would need to issue those themselves when
>> > they care. The `INT` only serves to allow Xen to enqueue new commands
>> > when there is space on the ring, it has no interest itself on the
>> > actual completion.
>> >
>> > ### Locking
>> >
>> > It may be preferable to use `atomic_t` types for various fields
>> > (e.g. `vits_cq.creadr`) in order to reduce the amount and scope of
>> > locking required.
>> >
>> > ### Multiple vITS instances in a single guest
>> >
>> > As described above each vITS maps to exactly one pITS (while each pITS
>> > serves multiple vITSs).
>> >
>>   IMO, one vITS per domain should be OK. For each command, based
>> on the device ID, the vITS will query the PCI framework to find the physical
>> ITS to which this device is attached, and the command will be sent to that
>> pITS.
>> There are some exceptions like SYNC and INVALL, which do not have a
>> device ID. In this case these commands are sent to all pITSes in the platform.
>> (XXX: If a command is sent to all pITSes, how do we identify whether the
>> command has been processed on all of them?)
> That's one potential issue. I mentioned a couple of others in my reply
> to Julien just now.
> Draft B will have more discussion of these cases, but so far no firm
> solution I think.
> Ian.
