
Re: [Xen-devel] [RFC] ARM: New (Xen) VGIC design document



Hi Christoffer,

On 12/10/17 13:05, Christoffer Dall wrote:
> Hi Andre,
> 
> On Wed, Oct 11, 2017 at 03:33:03PM +0100, Andre Przywara wrote:
>> Hi,
>>
>> (CC:ing some KVM/ARM folks involved in the VGIC)
> 
> Very nice writeup!
> 
> I added a bunch of comments, mostly for the writing and clarity, I hope
> it helps.

Thank you very much for the response and the comments! I really
appreciate your precise (academic) language here.
I held back my response since Stefano was the actual addressee of this
write-up, so sorry for the delay.

>> starting with the addition of the ITS support we were seeing more and
>> more issues with the current implementation of our ARM Generic Interrupt
>> Controller (GIC) emulation, the VGIC.
>> Among other approaches to fix those issues it was proposed to copy the
>> VGIC emulation used in KVM. The KVM VGIC had been suffering from very
>> similar issues, and a clean design from scratch led to a very robust and
>> capable re-implementation. Interestingly this implementation is fairly
>> self-contained, so it seems feasible to copy it. Hopefully we only need
>> minor adjustments, possibly we can even copy it verbatim with some
>> additional glue layer code.
>> Stefano asked for a design overview, to assess the feasibility of
>> copying the KVM code without having to review tons of code in the first
>> place.
>> So to follow Xen rules for new features, this design document below is
>> an attempt to describe the current KVM VGIC design - in a hypervisor
>> agnostic way. It is a bit of a retro-fit design description, as it
>> is not strictly forward-looking only, but actually describes the
>> existing implementation [1].
>>
>> Please have a look and let me know:
>> 1) if this document has the right scope
>> 2) if this document has the right level of detail
>> 3) if there are points missing from the document
>> 4) if the design in general is a fit
>>
>> Appreciate any feedback!
>>
>> Cheers,
>> Andre.
>>
>> ---------------------------------------
>>
>> VGIC design
>> ===========
>>
>> This document describes the design of an ARM Generic Interrupt Controller
>> (GIC) emulation. It is meant to emulate a GIC for a guest in a virtual
>> machine; the common name for that is VGIC (from "virtual GIC").
>>
>> This design was the result of a one-week-long design session with some
>> engineers in a room, triggered by ever-increasing difficulties in maintaining
>> the existing GIC emulation in the KVM hypervisor. The design eventually
>> materialised as an alternative VGIC implementation in the Linux kernel
>> (merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
>> was removed, so the new code is now the only VGIC implementation used by
>> Linux.
>> Although it is used in KVM, the actual design of this VGIC is rather
>> hypervisor agnostic and can be used by other hypervisors as well, in
>> particular by Xen.
>>
>> GIC hardware virtualization support
>> -----------------------------------
>>
>> The ARM Generic Interrupt Controller (since v2) supports the virtualization
>> extensions, which allow some parts of the interrupt life cycle to be handled
>> purely inside the guest without exiting into the hypervisor.
>> In the GICv2 and GICv3 architecture this covers mostly the "interrupt
>> acknowledgement", "priority drop" and "interrupt deactivate" actions.
>> So a guest can handle most of the interrupt processing code without
>> leaving EL1 and trapping into the hypervisor. To accomplish
>> this, the GIC holds so-called "list registers" (LRs), which shadow the
>> interrupt state for any virtual interrupt. Injecting an interrupt into a
>> guest involves setting up one LR with the interrupt number, its priority and
>> initial state (mostly "pending"), then entering the guest. Any EOI-related
>> action from within the guest just acts on those LRs; the hypervisor can
>> later update the virtual interrupt state when the guest exits the next time
>> (for whatever reason).
>> But despite the GIC hardware helping out here, the whole interrupt
>> configuration management is not virtualized at all and needs to be emulated
>> by the hypervisor - or another related software component, for instance a
>> userland emulator. This so-called "distributor" part of the GIC consists of
>> memory mapped registers, which can be trapped by the hypervisor, so any guest
>> access can be emulated in the usual way.
>>
>> VGIC design motivation
>> ----------------------
>>
>> A GIC emulation thus needs to take care of those bits:
>>
>> - trap GIC distributor MMIO accesses and shadow the configuration setup
>>   (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
>> - handle incoming hardware and virtual interrupt requests and inject the
>>   associated virtual interrupt by manipulating one of the list registers
>> - track the state of a virtual interrupt by inspecting the LRs after the
>>   guest has exited, possibly adjusting the shadowed virtual interrupt state
>>
>> Although the distributor MMIO register emulation is a sizeable chunk of
>> the emulation, it is actually not dominant when looking at the frequency at
>> which it is accessed. Normally the interrupt configuration is done at boot
>> time or upon initialising the device (driver), but rarely during the actual
>> run time of a system. Injecting and EOI-ing interrupts, however, happens
>> much more often. A good emulation approach should thus focus on tracking
>> the virtual interrupt state efficiently, allowing quick handling of
>> incoming and EOI-ed interrupts.
> 
> I would also say that the architecture for the GIC includes a relatively
> high number of corner cases and invariants that may not be violated.
> Being completely architecture compliant was our first requirement; the
> second requirement was to support efficient interrupt life cycle
> management and to be able to quickly tell which (if any) interrupts must
> be presented to a virtual CPU.

100% ACK. Will try to include this.

>> The actual interrupt state tracking can be quite tricky in parts. Interrupt
>> injections can happen independently of the guest entry/exit points, and MMIO
>> configuration accesses can be triggered by any VCPU at any point in time.
>> Changing the CPU affinity of an interrupt adds to the complication.
>> This leads to many code parts which could run in parallel and thus contain
>> race conditions, so proper locking becomes key to a good design.
>> But one has to consider that interrupts in general can be characterised
>> as a rare event - otherwise a guest would be busy handling interrupts and
>> could
> 
> (across all virtual CPUs)
> 
>> not process actual computation tasks.
>> That's why the interrupt state tracking should focus on a clear and race-free
> 
> nit: not sure it makes sense to talk about a race-free locking scheme.
> You have locking inherently because you have races; locking just makes
> sure that things that race and access data concurrently don't corrupt
> state and that the races become benign.

Indeed, good point.

>> locking scheme, without optimising needlessly in this respect.
>> Experience shows that this complicates the code and leads to undetected and
>> hard-to-debug race conditions, which affect the stability of the system in
>> possibly untested corner cases.
> 
> I think experience also shows that the expected performance bottlenecks
> really weren't there at all, and any optimization efforts should be
> driven by clear measurements of the pain points, falling back to clarity
> of implementation and ease of maintenance for all other parts of the
> implementation.

Yes, it seems obvious, but is indeed something that we need to remind
ourselves of from time to time.

>> VGIC design principles
>> ----------------------
>>
>> ### Data structure
>>
>> This VGIC design is based on the idea of having one structure per virtual
>> interrupt, protected by its own lock. 
> 
> Even more high level: This VGIC design was based around having a very
> clear data structure design, never duplicating state, and making it
> abundantly clear how things are structured.  One way of achieving that
> is to have a structure per interrupt, each having its own lock.
> 
>> In addition there is a list per VCPU,
>> which queues the interrupts which this VCPU should consider for injection.
> 
> nit: Should you introduce the AP list name here, and say that it's protected
> by the VCPU lock?

Yeah, good idea.
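
For the document I could then sketch the per-VCPU side roughly like this
(a minimal sketch only: whether the ap_list is protected by the VCPU lock
proper or by a dedicated list lock is an implementation choice, and the
field names here are purely illustrative):

    /* Per-VCPU VGIC state: the "ap_list" of pending/active interrupts. */
    struct vgic_cpu {
        spinlock_t ap_list_lock;        /* Protects ap_list_head (the "VCPU
                                         * lock" referred to in the text). */
        struct list_head ap_list_head;  /* vgic_irqs linked via their ap_list
                                         * member, kept sorted for LR
                                         * insertion. */
    };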

>> One interrupt can only be on one VCPU list at any given point in time.
> 
> nit, wording: Any interrupt can be on at most one AP list at any point
> in time.
> 
>> For private interrupts and SPIs a static allocation of this data structure
> 
> nit: PPIs and SPIs (or private and shared interrupts)

Well, with "private interrupts" I included SGIs as well, basically every
IRQ that is static and has a well bounded upper limit.

>> would be sufficient; however LPIs (triggered by a (virtual) ITS) have a very
>> dynamic and possibly very sparse allocation scheme, so we need to deal with
>> dynamic allocation and de-allocation of this struct. To accommodate this
>> there is an additional list header to link all LPIs.
>> Also the LPI mapping and unmapping can happen asynchronously, so we need to
> 
> asynchronously to what?

Asynchronously to the management of virtual interrupts, i.e. in the middle
of the hypervisor running or even during the actual injection code.

>> properly ref-count the structure (at least for LPIs), otherwise some code 
>> parts
> 
> nit, wording: reference count.
> 
>> would potentially end up with referencing an already freed pointer.
> 
> It's not only that, it's that you need to know when to free things.
> This is the basic idea of reference counting which I don't think you
> need to argue for in this document.

Agreed, and since Stefano obviously does not need convincing on this
front ;-), I can shorten the argument here.

>>
>> The central data structure is called `struct vgic_irq`, and, besides the
>> expected interrupt configuration data, contains at least the lock, a list
>> header (to be able to link it to a VCPU) and a refcount. It also contains
>> the interrupt number (to accommodate non-contiguous interrupt allocations,
>> for instance for LPIs).
>> Besides those essential elements it proves worthwhile to store (a reference
>> to) the VCPU this IRQ is associated with. This makes it easy to find the
>> respective VCPU list.
>>
>>     struct vgic_irq {
>>         spinlock_t irq_lock;            /* Protects the content of the struct */
>>         struct list_head lpi_list;      /* Used to link all LPIs together */
>>         struct list_head ap_list;
>>
>>         struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
>>                                          * SPIs and LPIs: The VCPU whose
>>                                          * ap_list this is queued on.
>>                                          */
>>
>>         struct vcpu *target_vcpu;       /* The VCPU that this interrupt
>>                                          * should be sent to, as a result
>>                                          * of the targets reg (v2) or the
>>                                          * affinity reg (v3).
>>                                          */
>>
>>         u32 intid;                      /* Guest visible INTID */
>>         bool line_level;                /* Level only */
>>         bool pending_latch;             /* The pending latch state used to
>>                                          * calculate the pending state for
>>                                          * both level and edge triggered
>>                                          * IRQs.
>>                                          */
>>
>>         bool active;                    /* not used for LPIs */
>>         bool enabled;
>>         bool hw;                        /* Tied to HW IRQ */
>>         struct kref refcount;           /* Used for LPIs */
>>         u32 hwintid;                    /* HW INTID number */
>>         union {
>>             u8 targets;                 /* GICv2 target VCPUs mask */
>>             u32 mpidr;                  /* GICv3 target VCPU */
>>         };
>>         u8 source;                      /* GICv2 SGIs only */
>>         u8 priority;
>>         enum vgic_irq_config config;    /* Level or edge */
>>     };
>>
>> ### VCPU list handling
>>
>> Initially a virtual interrupt just lives on its own. 
> 
> not sure what this means, see if you can clarify by being more concrete.

Yeah, I didn't find any better wording. Basically that struct vgic_irq
is not connected to anything, and is self-absorbed with playing with its
fields and the lock ;-)

>> Guest MMIO accesses to
>> the distributor will change the state information in this structure.
>> When an interrupt is actually made pending (either by an associated hardware
>> IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
> 
> I think the distinction of what causes an interrupt to be fired should
> be reworked in the document.  The important bit is that the VGIC has a 
> virtual interrupt input line, which can be raised and lowered, which the
> hypervisor can use to signal virtual interrupts.  These may or may not
> be tied to a physical interrupt, and they may therefore be marked as
> hw=true or hw=false, respectively.

OK, that's a good point. With the Xen focus on hw=1 interrupts, I got a
bit confused, but indeed it's still a virtual interrupt, which needs to
be triggered by the hypervisor. The connection to a physical IRQ is
something on top of this, for later in the IRQ life cycle.
Thanks for spelling this out!
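
Maybe the document should then introduce this with a single entry point,
along the lines of the sketch below (the signature is made up for
illustration only, it is neither KVM's nor Xen's actual API):

    /*
     * Raise (level = true) or lower (level = false) the virtual input line
     * of interrupt `intid` for the given domain. Whether the corresponding
     * vgic_irq is backed by a hardware IRQ (irq->hw) is a separate property
     * and does not change how the line is driven.
     */
    int vgic_inject_irq(struct domain *d, unsigned int intid, bool level);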

>> linked to the current target VCPU. The `vcpu` member in the structure will
>> be set to this VCPU. Any affinity change after this point will not affect
>> the current target VCPU anymore; it just updates the `target_vcpu` field in
>> the structure, which will be considered on the next injection.
> 
> I think this description is a little vague.  There are clear semantics
> associated with these two fields:
> 
>    vcpu: The VCPU whose ap_list this interrupt is queued on (which
>          happens to be immutable for SGIs and PPIs)
> 
>    target_vcpu: For SPIs and LPIs, the configured target VCPU for an
>                 interrupt.
> 
> Once this is clearly defined, there are some rules in terms of when the
> vcpu field can be changed; when queuing a virtual interrupt for delivery
> (because it's pending and/or active), the vcpu field points to the
> VCPU on which it is queued.  The target_vcpu field simply records the
> configuration, and can be changed by the hypervisor or the VM itself at
> any time, but only the VCPU on whose AP list the virtual interrupt is
> already queued, can change a non-NULL vcpu field to NULL or to a
> different value, i.e. migrate the virtual interrupt.
> 
> This is a requirement to ensure correct functionality; once you present
> an active interrupt to a VCPU, you cannot take it away behind its back,
> but you have to wait until the VCPU deactivates the interrupt.
> 
>> This per-VCPU list is called the `ap_list`, since it holds interrupts which
>> are in a pending and/or active state.
>>
>> ### Virtual IRQ references
>>
>> There is a function `vgic_get_irq()` which returns a reference to a virtual 
>> IRQ
>> given its number.
>> For private IRQs and SPIs it is expected that this just indexes a static 
>> array.
>> For LPIs (which are dynamically allocated at run time) this is expected to
>> iterate a data structure (like a linked list) to find the right structure.
> 
> (or allocate one?)

Mmh, we don't in our code: vgic_get_lpi() returns NULL if the LPI is not
found, and needs vgic_add_lpi() to explicitly allocate one.
I guess an implementation could choose to automatically allocate one,
but this would be an implementation detail, wouldn't it?
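
For the document, the lookup itself could be sketched like this (structure
and field names are invented for illustration; the refcounting is only
hinted at here and described below):

    struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu,
                                  unsigned int intid)
    {
        struct vgic_irq *irq;

        if (intid < VGIC_NR_PRIVATE_IRQS)               /* SGIs and PPIs */
            irq = &vcpu->vgic_cpu.private_irqs[intid];
        else if (intid < VGIC_NR_PRIVATE_IRQS + nr_spis(d))     /* SPIs */
            irq = &d->vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];
        else                                            /* LPIs: walk a list */
            irq = vgic_get_lpi(d, intid);               /* may return NULL */

        if (irq)
            vgic_get_irq_kref(irq);                     /* take a reference */

        return irq;
    }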

>> In any case a call to `vgic_get_irq` will increase a refcount, which will
>> prevent LPIs from being de-allocated while another part of the VGIC is still
>> holding a reference. Thus any caller to `vgic_get_irq` shall call
>> `vgic_put_irq()` after it is done with handling this interrupt.
> 
> Isn't this refcounting 101?  I assume it's already used in Xen and the
> rationale could be skipped here in the interest of focus.

Yes, but this "call vgic_put_irq() after it is done" has consequences
for the code: you have to do it explicitly and cannot just return a
pointer from a function without passing the responsibility of "putting
it" to the caller. I found this noteworthy, since the Xen VGIC code does
violate this principle at the moment (which is fine since Xen does not
need it).
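
Something like this tiny example is what I mean (the function and the
exact signature of vgic_get_irq() are made up for illustration):

    /* Typical pattern: every vgic_get_irq() is paired with a vgic_put_irq(). */
    static void vgic_set_irq_priority(struct domain *d, unsigned int intid,
                                      uint8_t prio)
    {
        struct vgic_irq *irq = vgic_get_irq(d, NULL, intid);    /* refcount++ */

        spin_lock(&irq->irq_lock);
        irq->priority = prio;
        spin_unlock(&irq->irq_lock);

        vgic_put_irq(d, irq);                                   /* refcount-- */
    }

Returning `irq` from such a function instead would silently hand the
"put" responsibility over to the caller, which is exactly the thing one
has to be careful about.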

>> An exception would be if the virtual IRQ is eventually injected into a VCPU.
>> In this case the VCPU holds that reference and it is kept as long as the
>> guest sees this virtual IRQ. The refcount is only decreased once the IRQ has
>> been EOIed by the guest and has been removed from the VCPU list.
> 
> Again, this seems to just explain an example of one of the references.
> Reference counting works by counting references, freeing the resource
> when the reference count reaches zero.  That's about it.

OK, I guess I can shorten the lengthy CS lecture here, just adding that
being on an ap_list requires us to keep the reference, even if we leave
the function and enter a VCPU.
Since mechanically one would pair a put() with a get() in a (leaf)
function, the deviating need to spread gets and puts over different
code parts seems worth mentioning here explicitly.

>> ### Locking
>>
>> To keep the `vgic_irq` structure consistent and to avoid races between
>> different parts of the VGIC, locking is essential whenever accessing a member
> 
> nit: again a race cannot be avoided completely, but they can be made
> benign...

Right.

>> of this structure. It is expected that this lock is almost never contended,
>> and is only held for brief periods of time, so taking it is considered cheap.
>> To keep the code clean and avoid nasty corner cases, there are no tricks
>> trying to be lockless here.
>> If for any reason the code needs to hold the locks for two virtual IRQs, the
>> one with the lower IRQ number is to be taken first, to avoid deadlocks.
>>
>> Another lock to consider is the VCPU lock, which at first glance protects
>> the virtual CPU's list structure, but also synchronises additions to and
>> removals of IRQs from a VCPU's list. To add an IRQ to a list, both the VCPU
>> and the per-IRQ lock need to be held. To avoid deadlocks, there is a strict
>> locking order:
>>
>>> The VCPU lock needs to be taken first, the per-IRQ lock after this.
>>
>> Some operations (like migrating IRQs between two VCPUs) require two VCPU
>> locks to be held; in this case the lock for the VCPU with the smaller VCPU ID
>> is to be taken first.
>>
>> There are occasions where the locking order (VCPU first) is hard to observe,
>> because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
>> list. In this case the IRQ lock needs to be dropped, the respective VCPU
>> lock should be taken, then the per-IRQ lock needs to be re-taken.
>> After both locks are held, we need to check whether the conditions which
>> originally mandated the list addition (or removal) are still true. This is
>> needed because the IRQ lock could have been taken by another entity meanwhile
>> and the state of this interrupt could have been changed. Examples are if the
>> interrupt is no longer pending, got disabled or changed its CPU affinity.
>> Some of those changes might render the current action obsolete (no longer
>> pending), others will lead to a retry of the re-locking scheme described
>> above.
>> This re-locking scheme shall be implemented in a well-documented function.
> 
> Do we have this documentation on the KVM side that you could link to
> here for people to have an understanding of how this can be explained?
> It's not that bad when you look at it really.

Well, there is this function with 80% comments ;-)
But I found it a bit cumbersome to refer to some existing implementation
in a design document. Maybe I could add some pseudo code here?
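
Something along these lines, maybe (heavily condensed from what KVM's
vgic_queue_irq_unlock() does; the lock, helper and field names here are
just illustrative):

    /*
     * Try to queue an IRQ on its target VCPU's ap_list.
     * Must be called with irq->irq_lock held; drops all locks before returning.
     */
    void vgic_queue_irq_unlock(struct domain *d, struct vgic_irq *irq)
    {
        struct vcpu *vcpu;

    retry:
        vcpu = vgic_target_oracle(irq);    /* which VCPU should get this IRQ? */
        if (irq->vcpu || !vcpu) {
            /* Already queued somewhere, or no valid target: nothing to do. */
            spin_unlock(&irq->irq_lock);
            return;
        }

        /* We hold the "wrong" lock first, so drop it to honour the ordering. */
        spin_unlock(&irq->irq_lock);

        spin_lock(&vcpu->vcpu_lock);       /* VCPU lock first ...     */
        spin_lock(&irq->irq_lock);         /* ... per-IRQ lock second */

        /*
         * The IRQ was briefly unlocked, so its state may have changed under
         * our feet: re-check that this VCPU is still the right target and
         * that nobody queued the IRQ meanwhile. If the check fails, unlock
         * everything, re-take the IRQ lock and start over.
         */
        if (irq->vcpu || vcpu != vgic_target_oracle(irq)) {
            spin_unlock(&irq->irq_lock);
            spin_unlock(&vcpu->vcpu_lock);

            spin_lock(&irq->irq_lock);
            goto retry;
        }

        /* The ap_list now holds its own reference to this IRQ (not shown). */
        list_add_tail(&irq->ap_list, &vcpu->vgic_cpu.ap_list_head);
        irq->vcpu = vcpu;

        spin_unlock(&irq->irq_lock);
        spin_unlock(&vcpu->vcpu_lock);
    }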

>>
>> ### Level and edge triggered interrupts
>>
>> The GIC knows about two kinds of signalling interrupts:
>>
>> - Edge triggered interrupts are triggered by a device once; their life cycle
>> ends when the guest has EOIed them, at which point we remove the pending
>> state, clear the LR and return the `vgic_irq` structure to a quiescent state.
> 
> For non-HW interrupts, you have the added potential complexity of
> PENDING+ACTIVE.

Well, I wanted to introduce level and edge triggered IRQs here, not
completely alienate the reader ;-)
I guess I'll shorten this to the textbook definition of level vs. edge,
and go into the details of those cases later.

>> - Level triggered interrupts are triggered when a device raises its interrupt
>> line; they stay pending as long as this line is held high. At some point the
>> driver in the guest is expected to program the device to explicitly or
>> implicitly lower this interrupt line. That means that we have to store the
>> state of the virtual interrupt line, which is only controlled by the
>> (virtual) device. This is done in the `line_level` member of `struct
>> vgic_irq`.
>>
>> To assert the interrupt condition, a (virtual) device calls a function
>> exported by the VGIC, which allows it to raise or lower an interrupt line.
>> Lowering the line for an edge triggered IRQ is ignored (and so is optional).
>> Raising the line asserts the pending state and potentially injects this
>> virtual IRQ. Any subsequent "raising" call might inject another IRQ, if the
>> previous one has at least been activated by the guest already; otherwise it
>> is ignored.
>>
>> For level triggered interrupts this function stores the new state into the
>> `line_level` variable, potentially injecting the interrupt if that line
>> changes from false to true. If the line is lowered before the guest has
>> seen it, this particular interrupt instance will be discarded. Successive
>> "raising" calls will not lead to multiple interrupts if the line has not
>> been lowered in between.
> 
> This is confusing:

Even for me, apparently ;-)

> Lowering or raising the line for a level triggered
> interrupt doesn't make any difference.  The point is that as long as the
> line is high, if you deactivate that interrupt, a new interrupt will hit
> immediately again, unless the line has been lowered in the meantime.

Yeah, it seems I need to seriously rework this paragraph.
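
Maybe the reworked text could be accompanied by a stripped-down sketch
like this (the helper and constant names are made up, and the real code
has more cases to consider):

    /* The guest-visible pending state combines both possible sources. */
    static bool irq_is_pending(struct vgic_irq *irq)
    {
        if (irq->config == VGIC_CONFIG_EDGE)
            return irq->pending_latch;
        else
            return irq->pending_latch || irq->line_level;
    }

    /* Called by a (virtual) device to drive the interrupt input line. */
    void vgic_update_irq_line(struct domain *d, struct vgic_irq *irq, bool level)
    {
        spin_lock(&irq->irq_lock);

        if (irq->config == VGIC_CONFIG_EDGE) {
            if (level)                      /* only the rising edge matters */
                irq->pending_latch = true;
        } else {
            irq->line_level = level;        /* lowering just clears the line */
        }

        if (irq_is_pending(irq))
            vgic_queue_irq_unlock(d, irq);  /* may queue it; drops irq_lock */
        else
            spin_unlock(&irq->irq_lock);
    }

The crucial property is then that after a deactivation the interrupt is
considered pending again as long as line_level is still true.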

>>
>> ### Software triggered interrupts
>>
>> Besides the naturally software-triggered inter-processor interrupts
>> (SGIs in GIC speak), there is another way of letting software raise an
>> interrupt condition.
> 
> These three lines appear to belong to the heading...
> 
>> The GIC distributor allows setting or clearing both the pending and active
>> state of any interrupt via MMIO registers. This isn't widely used by many
>> operating systems, but is useful when saving and restoring the state of a
>> machine. So emulating these functions is required for being architecture
>> compliant, however the implementation might not need to be very efficient
>> given its rare usage. In fact supporting the set-pending and clear-pending
>> registers is relatively straightforward, as long as one keeps this state
>> separate from the emulated interrupt line. `pending_latch` stores this state
>> in `vgic_irq`.
>>
>> The set-active and clear-active registers are much harder to emulate, though,
>> as normally the active state is of little concern to the GIC emulation. In
>> a normal interrupt life cycle the active state isn't even visible to the
>> hypervisor, as it might be set and cleared again entirely within the guest
>> in the list register, without exiting to the hypervisor.
>> So manipulating the active state via the MMIO registers requires some heavy
>> lifting: if this interrupt is currently injected into a running VCPU, this
>> VCPU must exit, the active state must be set or cleared in the LR, then
>> execution can continue. While this is expensive, as mentioned above it
>> should not happen too often, and the system probably isn't very performance
>> sensitive when using this feature for save and restore anyway.
> 
> These two paragraphs not so much, they seem to belong to MMIO emulation,
> and should probably follow the paragraph below.

Yes, indeed it makes sense to move them.

>> ### MMIO emulation
>>
>> As mentioned before, the distributor and redistributor parts of the VGIC need
>> to be fully emulated. Those parts are characterised by a range of MMIO
>> registers. The implementation shall provide a dispatcher function, which
>> takes the faulted address, relative to the beginning of the MMIO range, and
>> works out which actual register is affected. It then looks up the
>> respective handler function and calls it. Those functions are expected to
>> be listed in a struct initialiser, which connects the actual register
>> offset and its size to a particular handler. Having handler functions for
>> a register range seems beneficial over handling registers in a switch/case,
>> because it's easier to read and simplifies code sharing, for instance
>> between GICv2 distributor, GICv3 distributor and GICv3 redistributor
>> registers with the same semantics.
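
To make the table idea a bit more tangible, it could look roughly like
this (a sketch only; the handler names, the exact prototypes and the
range lengths are purely illustrative):

    /* Connects one register (range) to its read/write handlers. */
    struct vgic_register_region {
        unsigned int reg_offset;    /* offset from the frame base           */
        unsigned int len;           /* length of the covered register range */
        unsigned long (*read)(struct vcpu *vcpu, paddr_t addr,
                              unsigned int len);
        void (*write)(struct vcpu *vcpu, paddr_t addr,
                      unsigned int len, unsigned long val);
    };

    static const struct vgic_register_region vgic_v2_dist_registers[] = {
        { GICD_CTLR,      0x04, vgic_mmio_read_ctlr,   vgic_mmio_write_ctlr    },
        { GICD_ISENABLER, 0x80, vgic_mmio_read_enable, vgic_mmio_write_senable },
        { GICD_ICENABLER, 0x80, vgic_mmio_read_enable, vgic_mmio_write_cenable },
        /* ... one entry per register range, handlers shared where possible */
    };

The dispatcher then walks such a table with the faulted offset and calls
the matching handler; offsets not covered by any entry would typically be
treated as RAZ/WI.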
>>
>> ### List register management
>>
>> A list register (LR) holds the state of a virtual interrupt, which will
>> be used by the GIC hardware to simulate an IRQ life cycle for a guest.
>> Each GIC hardware implementation can choose to implement a number of LRs;
>> four seems to be a common value. This design here does not try to manage
>> the LRs very cleverly; instead on every guest exit every LR in use will be
>> synced to the emulated state, then cleared.
> 
> In fact I think we came up with counter-examples for every model of
> being clever with not reading back the LRs, because you simply have to
> observe any change in state that happened in hardware while the guest is
> running, to be able to properly emulate compliant functionality of being
> able to inject interrupts or not.

Adding another one:
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/arm/gic.c;hb=HEAD#l622

Sorry, Stefano, couldn't resist ;-)

>> Upon guest entry
>> the top priority virtual IRQs will be inserted into the LRs. If there are
>> more pending or active IRQs than list registers, the GIC maintenance IRQ
>> will be configured to notify the hypervisor of a free LR (once the guest
>> has EOIed one IRQ). This will trigger a normal exit, which will go through
>> the normal cleanup/repopulate scheme, possibly now queuing the leftover
>> interrupt(s).
>> To facilitate quick guest exit and entry times, the VGIC maintains the list
>> of pending or active interrupts (ap\_list) sorted by their priority. Active
>> interrupts always go first on the list, since a guest and the hardware GIC
>> expect those to stay until they have been explicitly deactivated. Failure
>> to keep active IRQs around will result in error conditions in the GIC.
>> The second sort criterion for the ap\_list is the priority, so higher
>> priority pending interrupts always go first into the LRs.
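
The sort order described here boils down to a small comparison function,
roughly like this (a simplified sketch; the real thing also has to handle
interrupts that turn out not to be pending at all):

    /*
     * Ordering for the ap_list before populating the LRs:
     *  1. active interrupts first (they must stay visible to the guest),
     *  2. then by priority (lower value == higher priority on the GIC).
     * Returns < 0 if a should be placed before b.
     */
    static int vgic_irq_cmp(struct vgic_irq *a, struct vgic_irq *b)
    {
        if (a->active != b->active)
            return a->active ? -1 : 1;

        return (int)a->priority - (int)b->priority;
    }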
> 
> 
> Otherwise, as I said, this is a really nice writeup.

Thanks, much appreciated!

Cheers,
Andre.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

