
[Xen-devel] Xen on ARM vITS Handling Draft B (Was Re: Xen/arm: Virtual ITS command queue handling)



On Tue, 2015-05-12 at 16:02 +0100, Ian Campbell wrote:
> I've written up my thinking as a design doc below (it's pandoc and the
> pdf version is also at
> http://xenbits.xen.org/people/ianc/vits/draftA.pdf FWIW).

Here is a second draft based on the feedback so far. Also at
http://xenbits.xen.org/people/ianc/vits/draftB.{pdf,html}.

So far I think we are mostly at the stage of gathering open questions
and enumerating the issues rather than actually beginning to reach any
conclusions. That's OK (and part of the purpose).

Ian.
-----

% Xen on ARM vITS Handling
% Ian Campbell <ian.campbell@xxxxxxxxxx>
% Draft B

# Changelog

## Since Draft A

* Added discussion of when/where command translation occurs.
* Contention on scheduler lock, suggestion to use SOFTIRQ.
* Handling of domain shutdown.
* More detailed discussion of multiple vs single vITS pros/cons.

# Introduction

ARM systems containing a GIC version 3 or later may contain one or
more ITS logical blocks. An ITS is used to route Message Signalled
Interrupts from devices into LPI injections on the processor.

The following summarises the ITS hardware design and serves as a set
of assumptions for the vITS software design. (XXX it is entirely
possible I've horribly misunderstood how this stuff fits
together). For full details of the ITS see the "GIC Architecture
Specification".

Message signalled interrupts are translated into an LPI via a
translation table which must be configured for each device which can
generate an MSI. The ITS uses the device ID of the originating device
to look up the corresponding translation table. Device IDs are
typically described via system firmware, e.g. the ACPI IORT table or
via device tree.

The ITS is configured and managed, including establishing a
Translation Table for each device, via an in-memory ring shared
between the CPU and the ITS controller. The ring is managed via the
`GITS_CBASER` register and indexed by the `GITS_CWRITER` and
`GITS_CREADR` registers.

A processor adds commands to the shared ring and then updates
`GITS_CWRITER` to make them visible to the ITS controller.

The ITS controller processes commands from the ring and then updates
`GITS_CREADR` to indicate to the processor that the command has been
processed.

Commands are processed sequentially.
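
For illustration, a minimal sketch of how an OS might publish a
command, assuming Linux/Xen-style MMIO accessors (`readq`/`writeq`),
a CPU mapping of the ring (`cmd_ring`) and its size (`queue_size`),
all of which are assumed names rather than anything from the spec:

    /* Illustrative sketch only: enqueue one 32-byte ITS command and
       publish it by advancing GITS_CWRITER (offset is bits [19:5]). */
    struct its_cmd {
        uint64_t dw[4];   /* each command is four 64-bit doublewords */
    };

    static void its_enqueue(void *its_base, const struct its_cmd *cmd)
    {
        uint64_t cwriter = readq(its_base + GITS_CWRITER) & GENMASK(19, 5);

        memcpy(cmd_ring + cwriter, cmd, sizeof(*cmd));
        wmb();   /* command contents visible before the ITS sees CWRITER move */
        cwriter = (cwriter + sizeof(*cmd)) % queue_size;
        writeq(cwriter, its_base + GITS_CWRITER);
    }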

Commands sent on the ring include operational commands:

* Routing interrupts to processors;
* Generating interrupts;
* Clearing the pending state of interrupts;
* Synchronising the command queue;

and maintenance commands:

* Map device/collection/processor;
* Map virtual interrupt;
* Clean interrupts;
* Discard interrupts.

The ITS provides no specific completion notification
mechanism. Completion is monitored by a combination of a `SYNC`
command and either polling `GITS_CREADR` or notification via an
interrupt generated via the `INT` command.

Note that the interrupt generation via `INT` requires an originating
device ID to be supplied (which is then translated via the ITS into an
LPI). No specific device ID is defined for this purpose and so the OS
software is expected to fabricate one.

Possible ways of inventing such a device ID are:

* Enumerate all device ids in the system and pick another one;
* Use a PCI BDF associated with a non-existent device function (such
  as an unused one relating to the PCI root-bridge) and translate that
  (via firmware tables) into a suitable device id;
* ???

# vITS

A guest domain which is allowed to use ITS functionality (i.e. has
been assigned pass-through devices which can generate MSIs) will be
presented with a virtualised ITS.

Accesses to the vITS registers will trap to Xen and be emulated and a
virtualised Command Queue will be provided.

Commands entered onto the virtual Command Queue will be translated
into physical commands (this translation is described in the GIC
specification).

XXX there are other aspects to virtualising the ITS (LPI collection
management, assignment of LPI ranges to guests, device
management). However these are not currently considered here. XXX
Should they be/do they need to be?

## Requirements

Emulation should not block in the hypervisor for extended periods. In
particular Xen should not busy wait on the physical ITS. Doing so
blocks the physical CPU from doing anything else (such as scheduling
other VCPUs).

There may be multiple guests which have a vITS, all targeting the same
underlying pITS. A single guest VCPU should not be able to monopolise
the pITS via its vITS and all guests should be able to make forward
progress.

## Command Queue Virtualisation

The command queue of each vITS is represented by a data structure:

    struct vits_cq {
        struct list_head schedule_list; /* Queued onto pits.schedule_list */
        uint32_t creadr;                /* Virtual creadr */
        uint32_t cwriter;               /* Virtual cwriter */
        uint32_t progress;              /* Index of last command queued to pits */
        [ Reference to command queue memory ]
    };

Each pITS has an associated data structure:

    struct pits {
        struct list_head schedule_list; /* Contains list of vits_cq.schedule_lists */
        uint32_t last_creadr;
    };

On write to the virtual `CWRITER` the cwriter field is updated and,
if that results in there being new outstanding requests, the vits_cq
is enqueued onto the pITS's schedule_list (unless it is already
there).

On read from the virtual `CREADR`, iff the vits_cq has commands
outstanding, a scheduling pass is attempted (in order to update
`vits_cq.creadr`). The current value of `vits_cq.creadr` is then
returned.
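
A sketch of the two register handlers under this scheme (locking
elided; `vcq->pits` is an assumed back-pointer to the associated
pITS, and `pits_schedule()` is the scheduling pass described below --
the softirq deferral discussed under "pITS Scheduling" would replace
these direct calls):

    /* Sketch: emulate guest write to CWRITER and read from CREADR. */
    static void vits_cwriter_write(struct vits_cq *vcq, uint32_t val)
    {
        vcq->cwriter = val;
        /* New outstanding commands: make sure this vITS is queued. */
        if ( vcq->cwriter != vcq->progress &&
             list_empty(&vcq->schedule_list) )
            list_add_tail(&vcq->schedule_list, &vcq->pits->schedule_list);
    }

    static uint32_t vits_creadr_read(struct vits_cq *vcq)
    {
        /* Only attempt a pass if commands are outstanding. */
        if ( vcq->creadr != vcq->cwriter )
            pits_schedule(vcq->pits);   /* may update vcq->creadr */
        return vcq->creadr;
    }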

### Command translation

In order to virtualise the Command Queue each command must be
translated (this is described in the GIC spec).

Translation of certain commands can be expensive (XXX citation
needed).

Translation can be done in two places:

* During scheduling.
* On write to `CWRITER`, into a per `vits_cq` queue which the
  scheduler then propagates to the pits.

Doing the translation during scheduling means that potentially
expensive operations may be accounted to `current`, who may have
nothing to do with those operations (this is true whether it is in
IRQ context or SOFTIRQ context).

Doing the translation during `CWRITER` emulation accounts it to the
right place, but introduces a potentially long synchronous operation
which ties down a VCPU. Introducing batching here means we have
essentially the same issue wrt when to replenish the translated queue
as doing the translation during scheduling.

Translation during `CWRITER` also has memory overheads. It is unclear
whether these are at a problematic scale or not.

XXX need a solution for this.

XXX Can we arrange a scheme where a pretranslated queue is
replenished (in batches) only on return to a vcpu owned by that guest
(getting the accounting right)? This would involve some careful logic
to kick vcpus at particular times, and presumably some spurious wake
ups.

### pITS Scheduling

A pITS scheduling pass is attempted:

* On write to any virtual `CWRITER` iff that write results in there
  being new outstanding requests for that vITS;
* On read from a virtual `CREADR` iff there are commands outstanding
  on that vITS;
* On receipt of an interrupt notification arising from Xen's own use
  of `INT`; (see discussion under Completion)
* On any interrupt injection arising from a guest's use of the `INT`
  command; (XXX perhaps, see discussion under Completion)

This may result in lots of contention on the scheduler
locking. Therefore we consider that in each case all that happens is
the triggering of a softirq, which will be processed on return to the
guest, and just once even for multiple events.

Such deferral could be considered OK (XXX ???) for the `CREADR` case
because at worst the value read will be one cycle out of date. A guest
which receives an `INT` notification might reasonably expect a
subsequent read of `CREADR` to reflect that. However that should be
covered by the softirq processing which would occur on entry to the
guest to inject the `INT`.
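
A sketch of this deferral, assuming a new `VITS_SCHEDULE_SOFTIRQ`
number is added and using Xen's existing
`open_softirq()`/`raise_softirq()` interface (`this_pits()` is an
assumed helper):

    /* Sketch: all trigger sites just raise the softirq; multiple
       triggers before the next return to guest collapse into one pass. */
    static void vits_schedule_softirq(void)
    {
        pits_schedule(this_pits());   /* assumed helper to find the pITS */
    }

    void vits_init(void)
    {
        open_softirq(VITS_SCHEDULE_SOFTIRQ, vits_schedule_softirq);
    }

    /* Called from CWRITER/CREADR emulation and INT notification: */
    void vits_kick(void)
    {
        raise_softirq(VITS_SCHEDULE_SOFTIRQ);   /* runs on return to guest */
    }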

Each scheduling pass will:

* Read the physical `CREADR`;
* For each command between `pits.last_creadr` and the new `CREADR`
  value, process completion of that command and update the
  corresponding `vits_cq.creadr`;
* Attempt to refill the pITS Command Queue (see below).
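
A sketch of such a pass (locking elided; `pits->base` is an assumed
MMIO base field, `next_slot()`/`slot_index()` are assumed helpers
stepping through the ring in 32-byte commands, and `pits->shadow[]`
is the shadow structure described under "Filling the pITS Command
Queue" below):

    /* Sketch of one scheduling pass. */
    static void pits_schedule(struct pits *pits)
    {
        uint32_t creadr = readl(pits->base + GITS_CREADR);
        uint32_t i;

        /* Retire every command the hardware has consumed since last time. */
        for ( i = pits->last_creadr; i != creadr; i = next_slot(pits, i) )
        {
            struct pits_slot *slot = &pits->shadow[slot_index(pits, i)];

            /* Completion: advance the owning vITS's virtual CREADR. */
            slot->vcq->creadr = slot->vcmd_index;
        }
        pits->last_creadr = creadr;

        pits_refill(pits);   /* see "Filling the pITS Command Queue" */
    }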

### Domain Shutdown

We can't free a `vits_cq` while it has things on the physical command
queue, and we cannot cancel things which are on the command queue.

So we must wait.

Obviously don't enqueue anything new onto the pits if `d->is_dying`.

`domain_relinquish_resources()` waits (somehow, with suitable
continuations etc) for anything which the `vits_cq` has outstanding to
be completed, so that the data structures can be cleared.
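
A sketch of the relinquish path using the usual `-ERESTART`
continuation pattern (`d->arch.vits_cq` and `vits_cq_outstanding()`
are assumed names):

    /* Sketch: called from domain_relinquish_resources().  Nothing new
       is enqueued once d->is_dying, so outstanding work only shrinks. */
    static int vits_relinquish(struct domain *d)
    {
        struct vits_cq *vcq = d->arch.vits_cq;

        if ( vits_cq_outstanding(vcq) )
            return -ERESTART;   /* retried via hypercall continuation */

        list_del_init(&vcq->schedule_list);
        /* Safe to tear down the remaining vITS state now. */
        return 0;
    }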

### Filling the pITS Command Queue.

Various algorithms could be used here. For now a simple proposal is
to traverse the `pits.schedule_list` starting from where the last
refill finished (i.e. not from the top of the list each time).

If a `vits_cq` has no pending commands then it is removed from the
list.

If a `vits_cq` has some pending commands then `min(pits-free-slots,
vits-outstanding, VITS_BATCH_SIZE)` of them will be taken from the
vITS command queue, translated and placed onto the pITS
queue. `vits_cq.progress` will be updated to reflect this.

Each `vits_cq` is handled in turn in this way until the pITS Command
Queue is full or there are no more outstanding commands.

There will likely need to be a data structure which shadows the pITS
Command Queue slots with references to the `vits_cq` which has a
command currently occupying that slot, and the corresponding index
into the virtual command queue, for use when completing a command
(see the sketch below).

`VITS_BATCH_SIZE` should be small, TBD say 4 or 8.
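
A sketch of the shadow bookkeeping and the refill loop (assumed
names; ring-wrap arithmetic, locking and resuming the traversal from
where the previous refill stopped are all elided):

    /* Sketch: shadow entry for one physical Command Queue slot. */
    struct pits_slot {
        struct vits_cq *vcq;   /* owner of the command in this slot */
        uint32_t vcmd_index;   /* its index in the virtual queue */
    };

    static void pits_refill(struct pits *pits)
    {
        struct vits_cq *vcq, *tmp;

        list_for_each_entry_safe ( vcq, tmp, &pits->schedule_list,
                                   schedule_list )
        {
            uint32_t n;

            if ( vits_outstanding(vcq) == 0 )
            {
                /* Nothing pending: drop this vITS from the list. */
                list_del_init(&vcq->schedule_list);
                continue;
            }

            n = min_t(uint32_t, pits_free_slots(pits),
                      vits_outstanding(vcq));
            n = min_t(uint32_t, n, VITS_BATCH_SIZE);

            while ( n-- )
                pits_push_one(pits, vcq);   /* translate, shadow, enqueue */

            if ( pits_free_slots(pits) == 0 )
                break;
        }
    }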

Possible simplification: If we arrange that no guest ever has multiple
batches in flight (which can occur if we wrap around the list several
times) then we may be able to simplify the bookkeeping
required. However this may need some careful thought wrt fairness for
guests submitting frequent small batches of commands vs those sending
large batches.

XXX concern: Time spent filling the pITS queue could be significant if
guests are allowed to fill the ring completely.

### Completion

It is expected that commands will normally be completed (resulting in
an update of the corresponding `vits_cq.creadr`) via guest read from
`CREADR`. This will trigger a scheduling pass which will ensure the
`vits_cq.creadr` value is up to date before it is returned.

A guest which does completion via the use of `INT` cannot observe
`CREADR` without reading it, so updating on read from `CREADR`
suffices from the point of view of the guest's observation of the
state. (Of course we will inject the interrupt at the designated point
and the guest may well then read `CREADR`.)

However, in order to keep the pITS Command Queue moving along, we need
to consider what happens if there are no `INT`-based events nor reads
from `CREADR` to drive completion and therefore refilling of the Queue
with other outstanding commands.

A guest which enqueues some commands and then never checks for
completion cannot itself block things because any other guest which
reads `CREADR` will drive completion. However if _no_ guest reads from
`CREADR` then completion will not occur and this must be dealt with.

Even if we include completion on `INT`-based interrupt injection then
it is possible that the pITS queue may not contain any such
interrupts, either because no guest is using them or because the
batching means that none of them are enqueued on the active ring at
the moment.

So we need a fallback to ensure that the queue keeps moving. There are
several options:

* A periodic timer in Xen which runs whenever there are outstanding
  commands in the pITS. This is simple but pretty sucky.
* Xen injects its own `INT` commands into the pITS ring. This requires
  figuring out a device ID to use.

The second option is likely to be preferable if the issue of selecting
a device ID can be addressed.

A secondary question is when these `INT` commands should be inserted
into the command stream:

* After each batch taken from a single `vits_cq`;
* After each scheduling pass;
* One active in the command stream at any given time;

The latter should be sufficient: by arranging to insert an `INT` into
the stream at the end of any scheduling pass which occurs while there
is not a currently outstanding `INT`, we have a sufficient backstop to
allow us to refill the ring.
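
For illustration, a sketch of inserting such a backstop `INT` at the
end of a pass. The command encoding (opcode 0x03 in DW0[7:0],
DeviceID in DW0[63:32], EventID in DW1[31:0]) is per the GIC spec,
but `XEN_ITS_DEVID`/`XEN_ITS_EVID` are placeholders for whatever
device ID scheme is eventually chosen, and the flag and helper are
assumed names:

    /* Sketch: append Xen's own INT if none is currently in flight, so
       that its interrupt drives the next scheduling pass. */
    static void pits_insert_backstop_int(struct pits *pits)
    {
        struct its_cmd cmd = {};

        if ( pits->backstop_int_pending )   /* assumed flag */
            return;

        cmd.dw[0] = 0x03 | ((uint64_t)XEN_ITS_DEVID << 32);
        cmd.dw[1] = XEN_ITS_EVID;

        pits_push_raw(pits, &cmd);          /* assumed helper */
        pits->backstop_int_pending = true;
    }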

This assumes that there is no particular benefit to keeping the
`CWRITER` rolling ahead of the pITS's actual processing. This is true
because the ITS operates on commands in the order they appear in the
queue, so there is no need to maintain a runway ahead of the ITS
processing. (XXX If this is a concern perhaps the INT could be
inserted at the head of the final batch of commands in a scheduling
pass instead of the tail).

Xen itself should never need to issue an associated `SYNC` command,
since the individual guests would need to issue those themselves when
they care. The `INT` only serves to allow Xen to enqueue new commands
when there is space on the ring; it has no interest itself in the
actual completion.

### Locking

It may be preferable to use `atomic_t` types for various fields
(e.g. `vits_cq.creadr`) in order to reduce the amount and scope of
locking required.
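
For example, if `vits_cq.creadr` were an `atomic_t` then the `CREADR`
read path would need no lock against the completion path (a sketch,
with assumed function names):

    /* Sketch: lock-free publication of the virtual CREADR. */
    static uint32_t vits_creadr_peek(struct vits_cq *vcq)
    {
        return atomic_read(&vcq->creadr);   /* reader: trap handler */
    }

    static void vits_complete(struct vits_cq *vcq, uint32_t idx)
    {
        atomic_set(&vcq->creadr, idx);      /* writer: scheduling pass */
    }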

### Multiple vITS instances in a single guest

As described above each vITS maps to exactly one pITS (while each pITS
serves multiple vITSs).

Alternatively it would be possible to arrange for a single vITS to
enqueue commands to different pITSs depending on e.g. the device ID.

However each approach has issues.

In 1 vITS per pITS:

* Exposing one vITS per pITS means that we are exposing something about
  the underlying hardware to the guest.
* Adds complexity to the guest layout, which is right now static. How
  do you decide the number of vITS/root controllers exposed:
    * Hotplug is tricky
* Toolstack needs greater knowledge of the host layout
* Given that PCI passthrough doesn't allow migration, maybe we could
  use the layout of the hardware.

In 1 vITS for all pITS:

* What to do with global commands? Inject to all pITS and then
  synchronise on them all finishing.
* Handling of out of order completion of commands queued with
  different pITS, since the vITS must appear to complete in
  order. Apart from the bookkeeping question it makes scheduling more
  interesting:
    * What if you have a pITS with slots available, and the guest command
      queue contains commands which could go to that pITS, but behind ones
      which are targeting another pITS which has no slots?
    * What if one pITS is very busy and another is mostly idle and a
      guest submits one command to the busy one (contending with other
      guests) followed by a load of commands targeting the idle one? Those
      commands would be held up in this situation.
    * Reasoning about fairness may be harder.

XXX need a solution/decision here.

In addition the introduction of direct interrupt injection in version
4 GICs may imply a vITS per pITS. (Update: it seems not)

### vITS for purely software interrupts (e.g. event channels)

It has been proposed that it might be nice to inject event channels as
LPIs in the future. Whether or not that would involve any sort of vITS
is unclear, but if it did then it would likely be a separate emulation
from the vITS emulation used with a pITS and as such is not considered
further here.

# Glossary

* _MSI_: Message Signalled Interrupt
* _ITS_: Interrupt Translation Service
* _GIC_: Generic Interrupt Controller
* _LPI_: Locality-specific Peripheral Interrupt

# References

"GIC Architecture Specification" PRD03-GENC-010745 24.0


