
Re: [Xen-devel] [RFC] ARM: New (Xen) VGIC design document



Hi Christoffer,

On 12/10/17 13:05, Christoffer Dall wrote:
> Hi Andre,
> 
> On Wed, Oct 11, 2017 at 03:33:03PM +0100, Andre Przywara wrote:
>> Hi,
>>
>> (CC:ing some KVM/ARM folks involved in the VGIC)
> 
> Very nice writeup!
> 
> I added a bunch of comments, mostly for the writing and clarity, I hope
> it helps.

Thank you very much for the response and the comments! I really
appreciate your precise (academic) language here.
I held back my response since Stefano was the actual addressee of this
write-up, so sorry for the delay.

>> starting with the addition of the ITS support we were seeing more and
>> more issues with the current implementation of our ARM Generic Interrupt
>> Controller (GIC) emulation, the VGIC.
>> Among other approaches to fix those issues it was proposed to copy the
>> VGIC emulation used in KVM. The KVM VGIC had been suffering from very
>> similar issues, and a clean design from scratch led to a very robust and
>> capable re-implementation. Interestingly this implementation is fairly
>> self-contained, so it seems feasible to copy it. Hopefully we only need
>> minor adjustments, possibly we can even copy it verbatim with some
>> additional glue layer code.
>> Stefano asked for a design overview, to assess the feasibility of
>> copying the KVM code without having to review tons of code in the first
>> place.
>> So to follow Xen rules for new features, this design document below is
>> an attempt to describe the current KVM VGIC design - in a hypervisor
>> agnostic way. It is a bit of a retro-fit design description, as it
>> is not strictly forward-looking only, but actually describes the
>> existing implementation [1].
>>
>> Please have a look and let me know:
>> 1) if this document has the right scope
>> 2) if this document has the right level of detail
>> 3) if there are points missing from the document
>> 4) if the design in general is a fit
>>
>> Appreciate any feedback!
>>
>> Cheers,
>> Andre.
>>
>> ---------------------------------------
>>
>> VGIC design
>> ===========
>>
>> This document describes the design of an ARM Generic Interrupt Controller
>> (GIC) emulation. It is meant to emulate a GIC for a guest in a virtual
>> machine; the common name for that is VGIC (from "virtual GIC").
>>
>> This design was the result of a one-week-long design session with some
>> engineers in a room, triggered by ever-increasing difficulties in maintaining
>> the existing GIC emulation in the KVM hypervisor. The design eventually
>> materialised as an alternative VGIC implementation in the Linux kernel
>> (merged into Linux v4.7). As of Linux v4.8 the previous VGIC implementation
>> was removed, so the new code is now the only VGIC implementation used by
>> Linux.
>> Although it is used in KVM, the actual design of this VGIC is rather
>> hypervisor agnostic and can be used by other hypervisors as well, in
>> particular by Xen.
>>
>> GIC hardware virtualization support
>> -----------------------------------
>>
>> The ARM Generic Interrupt Controller (since v2) supports the virtualization
>> extensions, which allow some parts of the interrupt life cycle to be handled
>> purely inside the guest without exiting into the hypervisor.
>> In the GICv2 and GICv3 architecture this covers mostly the "interrupt
>> acknowledgement", "priority drop" and "interrupt deactivate" actions.
>> So a guest can handle most of the interrupt processing code without
>> leaving EL1 and trapping into the hypervisor. To accomplish
>> this, the GIC holds so-called "list registers" (LRs), which shadow the
>> interrupt state for any virtual interrupt. Injecting an interrupt into a
>> guest involves setting up one LR with the interrupt number, its priority and
>> initial state (mostly "pending"), then entering the guest. Any EOI-related
>> action from within the guest just acts on those LRs; the hypervisor can
>> later update the virtual interrupt state when the guest exits the next time
>> (for whatever reason).
>> But despite the GIC hardware helping out here, the whole interrupt
>> configuration management is not virtualized at all and needs to be emulated
>> by the hypervisor - or another related software component, for instance a
>> userland emulator. This so-called "distributor" part of the GIC consists of
>> memory mapped registers, which can be trapped by the hypervisor, so any guest
>> access can be emulated in the usual way.
>>
>> VGIC design motivation
>> ----------------------
>>
>> A GIC emulation thus needs to take care of those bits:
>>
>> - trap GIC distributor MMIO accesses and shadow the configuration setup
>>   (enabled/disabled, level/edge, priority, affinity) for virtual interrupts
>> - handle incoming hardware and virtual interrupt requests and inject the
>>   associated virtual interrupt by manipulating one of the list registers
>> - track the state of a virtual interrupt by inspecting the LRs after the
>>   guest has exited, possibly adjusting the shadowed virtual interrupt state
>>
>> Although the distributor MMIO register emulation is a sizeable chunk of
>> the emulation, it is actually not dominant when looking at the frequency at
>> which it is accessed. Normally the interrupt configuration is done at boot
>> time or upon initialising the device (driver), but rarely during the actual
>> run time of a system. Injecting and EOI-ing interrupts, however, happens
>> much more often. A good emulation approach should thus focus on tracking
>> the virtual interrupt state efficiently, allowing quick handling of
>> incoming and EOI-ed interrupts.
> 
> I would also say that the architecture for the GIC includes a relatively
> high number of corner cases and invariants that may not be violated.
> Being completely architecture compliant was our first requirement; the
> second requirement was to support efficient interrupt life cycle
> management and to be able to quickly tell which (if any) interrupts must
> be presented to a virtual CPU.

100% ACK. Will try to include this.

>> The actual interrupt state tracking can be quite tricky in parts. Interrupt
>> injections can happen independently of the guest entry/exit points, and MMIO
>> configuration accesses can be triggered by any VCPU at any point in time.
>> Changing the CPU affinity of an interrupt adds to the complication.
>> This leads to many code parts which could run in parallel and thus contain
>> race conditions, so proper locking becomes key to a good design.
>> But one has to consider that interrupts in general can be characterised
>> as a rare event - otherwise a guest would be busy handling interrupts and
>> could
> 
> (across all virtual CPUs)
> 
>> not process actual computation tasks.
>> That's why the interrupt state tracking should focus on a clear and race-free
> 
> nit: not sure it makes sense to talk about a race-free locking scheme.
> You have locking inherently because you have races; locking just makes
> sure that things that race and access data concurrently don't corrupt
> state and that the races become benign.

Indeed, good point.

>> locking scheme, without optimising needlessly in this respect.
>> Experience shows that this complicates the code and leads to undetected and
>> hard-to-debug race conditions, which affect the stability of the system in
>> possibly untested corner cases.
> 
> I think experience also shows that the expected performance bottlenecks
> really weren't there at all, and any optimization efforts should be
> driven by clear measurements of the pain points, falling back to clarity
> of implementation and ease of maintenance for all other parts of the
> implementation.

Yes, it seems obvious, but is indeed something that we need to remind
ourselves of from time to time.

>> VGIC design principles
>> ----------------------
>>
>> ### Data structure
>>
>> This VGIC design is based on the idea of having one structure per virtual
>> interrupt, protected by its own lock. 
> 
> Even more high level: This VGIC design was based around having a very
> clear data structure design, never duplicating state, and making it
> abundantly clear how things are structured.  One way of achieving that
> is to have a structure per interrupt, each having its own lock.
> 
>> In addition there is a list per VCPU,
>> which queues the interrupts which this VCPU should consider for injection.
> 
> nit: Should you introduce the AP list name here, and say that it's protected
> by the VCPU lock?

Yeah, good idea.
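
For the document I could then sketch the per-VCPU side roughly like this
(a minimal sketch only: whether the ap_list is protected by the VCPU lock
proper or by a dedicated list lock is an implementation choice, and the
field names here are purely illustrative):

    /* Per-VCPU VGIC state: the "ap_list" of pending/active interrupts. */
    struct vgic_cpu {
        spinlock_t ap_list_lock;        /* Protects ap_list_head (the "VCPU
                                         * lock" referred to in the text). */
        struct list_head ap_list_head;  /* vgic_irqs linked via their ap_list
                                         * member, kept sorted for LR
                                         * insertion. */
    };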

>> One interrupt can only be on one VCPU list at any given point in time.
> 
> nit, wording: Any interrupt can be on at most one AP list at any point
> in time.
> 
>> For private interrupts and SPIs a static allocation of this data structure
> 
> nit: PPIs and SPIs (or private and shared interrupts)

Well, with "private interrupts" I included SGIs as well, basically every
IRQ that is static and has a well bounded upper limit.

>> would be sufficient; however LPIs (triggered by a (virtual) ITS) have a very
>> dynamic and possibly very sparse allocation scheme, so we need to deal with
>> dynamic allocation and de-allocation of this struct. To accommodate this
>> there is an additional list header to link all LPIs.
>> Also the LPI mapping and unmapping can happen asynchronously, so we need to
> 
> asynchronously to what?

Asynchronously to the management of virtual interrupts, i.e. in the middle
of the hypervisor running or even during the actual injection code.

>> properly ref-count the structure (at least for LPIs), otherwise some code 
>> parts
> 
> nit, wording: reference count.
> 
>> would potentially end up with referencing an already freed pointer.
> 
> It's not only that, it's that you need to know when to free things.
> This is the basic idea of reference counting which I don't think you
> need to argue for in this document.

Agreed, and since Stefano obviously does not need convincing on this
front ;-), I can shorten the argument here.

>>
>> The central data structure is called `struct vgic_irq`, and, besides the
>> expected interrupt configuration data, contains at least the lock, a list
>> header (to be able to link it to a VCPU) and a refcount. It also contains
>> the interrupt number (to accommodate non-contiguous interrupt allocations,
>> for instance for LPIs).
>> Besides those essential elements it proves worthwhile to store (a reference
>> to) the VCPU this IRQ is associated with. This makes it easy to find the
>> respective VCPU list.
>>
>>     struct vgic_irq {
>>         spinlock_t irq_lock;            /* Protects the content of the struct */
>>         struct list_head lpi_list;      /* Used to link all LPIs together */
>>         struct list_head ap_list;
>>
>>         struct vcpu *vcpu;              /* SGIs and PPIs: The VCPU
>>                                          * SPIs and LPIs: The VCPU whose
>>                                          * ap_list this is queued on.
>>                                          */
>>
>>         struct vcpu *target_vcpu;       /* The VCPU that this interrupt
>>                                          * should be sent to, as a result
>>                                          * of the targets reg (v2) or the
>>                                          * affinity reg (v3).
>>                                          */
>>
>>         u32 intid;                      /* Guest visible INTID */
>>         bool line_level;                /* Level only */
>>         bool pending_latch;             /* The pending latch state used to
>>                                          * calculate the pending state for
>>                                          * both level and edge triggered
>>                                          * IRQs.
>>                                          */
>>
>>         bool active;                    /* not used for LPIs */
>>         bool enabled;
>>         bool hw;                        /* Tied to HW IRQ */
>>         struct kref refcount;           /* Used for LPIs */
>>         u32 hwintid;                    /* HW INTID number */
>>         union {
>>             u8 targets;                 /* GICv2 target VCPUs mask */
>>             u32 mpidr;                  /* GICv3 target VCPU */
>>         };
>>         u8 source;                      /* GICv2 SGIs only */
>>         u8 priority;
>>         enum vgic_irq_config config;    /* Level or edge */
>>     };
>>
>> ### VCPU list handling
>>
>> Initially a virtual interrupt just lives on its own. 
> 
> not sure what this means, see if you can clarify by being more concrete.

Yeah, I didn't find any better wording. Basically that struct vgic_irq
is not connected to anything, and is self-absorbed with playing with its
fields and the lock ;-)

>> Guest MMIO accesses to
>> the distributor will change the state information in this structure.
>> When an interrupt is actually made pending (either by an associated hardware
>> IRQ firing or by a virtual IRQ trigger), the `vgic_irq` structure will be
> 
> I think the distinction of what causes an interrupt to be fired should
> be reworked in the document.  The important bit is that the VGIC has a 
> virtual interrupt input line, which can be raised and lowered, which the
> hypervisor can use to signal virtual interrupts.  These may or may not
> be tied to a physical interrupt, and they may therefore be marked as
> hw=true or hw=false, respectively.

OK, that's a good point. With the Xen focus on hw=1 interrupts, I got a
bit confused, but indeed it's still a virtual interrupt, which needs to
be triggered by the hypervisor. The connection to a physical IRQ is
something on top of this, for later in the IRQ life cycle.
Thanks for spelling this out!
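
Maybe the document should then introduce this with a single entry point,
along the lines of the sketch below (the signature is made up for
illustration only, it is neither KVM's nor Xen's actual API):

    /*
     * Raise (level = true) or lower (level = false) the virtual input line
     * of interrupt `intid` for the given domain. Whether the corresponding
     * vgic_irq is backed by a hardware IRQ (irq->hw) is a separate property
     * and does not change how the line is driven.
     */
    int vgic_inject_irq(struct domain *d, unsigned int intid, bool level);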

>> linked to the current target VCPU. The `vcpu` member in the structure will
>> be set to this VCPU. Any affinity change after this point will not affect
>> the current target VCPU anymore; it just updates the `target_vcpu` field in
>> the structure, which will be considered on the next injection.
> 
> I think this description is a little vague.  There are clear semantics
> associated with these two fields:
> 
>    vcpu: The VCPU whose ap_list this interrupt is queued on (which
>          happens to be immutable for SGIs and PPIs)
> 
>    target_vcpu: For SPIs and LPIs, the configured target VCPU for an
>                 interrupt.
> 
> Once this is clearly defined, there are some rules in terms of when the
> vcpu field can be changed; when queuing a virtual interrupt for delivery
> (because it's pending and/or active), the vcpu field points to the
> VCPU on which it is queued.  The target_vcpu field simply records the
> configuration, and can be changed by the hypervisor or the VM itself at
> any time, but only the VCPU on whose AP list the virtual interrupt is
> already queued, can change a non-NULL vcpu field to NULL or to a
> different value, i.e. migrate the virtual interrupt.
> 
> This is a requirement to ensure correct functionality; once you present
> an active interrupt to a VCPU, you cannot take it away behind its back,
> but you have to wait until the VCPU deactivates the interrupt.
> 
>> This per-VCPU list is called the `ap_list`, since it holds interrupts which
>> are in a pending and/or active state.
>>
>> ### Virtual IRQ references
>>
>> There is a function `vgic_get_irq()` which returns a reference to a virtual 
>> IRQ
>> given its number.
>> For private IRQs and SPIs it is expected that this just indexes a static 
>> array.
>> For LPIs (which are dynamically allocated at run time) this is expected to
>> iterate a data structure (like a linked list) to find the right structure.
> 
> (or allocate one?)

Mmh, we don't in our code: vgic_get_lpi() returns NULL if the LPI is not
found, and needs vgic_add_lpi() to explicitly allocate one.
I guess an implementation could choose to automatically allocate one,
but this would be an implementation detail, wouldn't it?
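
For the document, the lookup itself could be sketched like this (structure
and field names are invented for illustration; the refcounting is only
hinted at here and described below):

    struct vgic_irq *vgic_get_irq(struct domain *d, struct vcpu *vcpu,
                                  unsigned int intid)
    {
        struct vgic_irq *irq;

        if (intid < VGIC_NR_PRIVATE_IRQS)               /* SGIs and PPIs */
            irq = &vcpu->vgic_cpu.private_irqs[intid];
        else if (intid < VGIC_NR_PRIVATE_IRQS + nr_spis(d))     /* SPIs */
            irq = &d->vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];
        else                                            /* LPIs: walk a list */
            irq = vgic_get_lpi(d, intid);               /* may return NULL */

        if (irq)
            vgic_get_irq_kref(irq);                     /* take a reference */

        return irq;
    }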

>> In any case a call to `vgic_get_irq` will increase a refcount, which will
>> prevent LPIs from being de-allocated while another part of the VGIC is still
>> holding a reference. Thus any caller to `vgic_get_irq` shall call
>> `vgic_put_irq()` after it is done with handling this interrupt.
> 
> Isn't this refcounting 101?  I assume it's already used in Xen and the
> rationale could be skipped here in the interest of focus.

Yes, but this "call vgic_put_irq() after it is done" has consequences
for the code: you have to do it explicitly and cannot just return a
pointer from a function without passing the responsibility of "putting
it" to the caller. I found this noteworthy, since the Xen VGIC code does
violate this principle at the moment (which is fine since Xen does not
need it).
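
Something like this tiny example is what I mean (the function and the
exact signature of vgic_get_irq() are made up for illustration):

    /* Typical pattern: every vgic_get_irq() is paired with a vgic_put_irq(). */
    static void vgic_set_irq_priority(struct domain *d, unsigned int intid,
                                      uint8_t prio)
    {
        struct vgic_irq *irq = vgic_get_irq(d, NULL, intid);    /* refcount++ */

        spin_lock(&irq->irq_lock);
        irq->priority = prio;
        spin_unlock(&irq->irq_lock);

        vgic_put_irq(d, irq);                                   /* refcount-- */
    }

Returning `irq` from such a function instead would silently hand the
"put" responsibility over to the caller, which is exactly the thing one
has to be careful about.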

>> An exception would be if the virtual IRQ is eventually injected into a VCPU.
>> In this case the VCPU holds that reference and it is kept as long as the
>> guest sees this virtual IRQ. The refcount is only decreased once the IRQ has
>> been EOIed by the guest and has been removed from the VCPU list.
> 
> Again, this seems to just explain an example of one of the references.
> Reference counting works by counting references, freeing the resource
> when the reference count reaches zero.  That's about it.

OK, I guess I can shorten the lengthy CS lecture here, just adding that
being on an ap_list requires us to keep the reference, even if we leave
the function and enter a VCPU.
Since mechanically one would pair a put() with a get() in a (leaf)
function, the deviating need to spread gets and puts over different
code parts seems worth mentioning here explicitly.

>> ### Locking
>>
>> To keep the `vgic_irq` structure consistent and to avoid races between
>> different parts of the VGIC, locking is essential whenever accessing a member
> 
> nit: again a race cannot be avoided completely, but they can be made
> benign...

Right.

>> of this structure. It is expected that this lock is almost never contended,
>> and is only held for brief periods of time, so taking it is considered cheap.
>> To keep the code clean and avoid nasty corner cases, there are no tricks
>> trying to be lockless here.
>> If for any reason the code needs to hold the locks for two virtual IRQs, the
>> one with the lower IRQ number is to be taken first, to avoid deadlocks.
>>
>> Another lock to consider is the VCPU lock, which at first glance protects
>> the virtual CPU's list structure, but also synchronises additions to and
>> removals of IRQs from a VCPU's list. To add an IRQ to a list, both the VCPU
>> and the per-IRQ lock need to be held. To avoid deadlocks, there is a strict
>> locking order:
>>
>>> The VCPU lock needs to be taken first, the per-IRQ lock after this.
>>
>> Some operations (like migrating IRQs between two VCPUs) require two VCPU
>> locks to be held; in this case the lock for the VCPU with the smaller VCPU ID
>> is to be taken first.
>>
>> There are occasions where the locking order (VCPU first) is hard to observe,
>> because the per-IRQ lock is already held, but this IRQ needs to go on a VCPU
>> list. In this case the IRQ lock needs to be dropped, the respective VCPU
>> lock should be taken, then the per-IRQ lock needs to be re-taken.
>> After both locks are held, we need to check whether the conditions which
>> originally mandated the list addition (or removal) are still true. This is
>> needed because the IRQ lock could have been taken by another entity meanwhile
>> and the state of this interrupt could have been changed. Examples are if the
>> interrupt is no longer pending, got disabled or changed its CPU affinity.
>> Some of those changes might render the current action obsolete (no longer
>> pending), others will lead to a retry of the re-locking scheme described
>> above.
>> This re-locking scheme shall be implemented in a well-documented function.
> 
> Do we have this documentation on the KVM side that you could link to
> here for people to have an understanding of how this can be explained?
> It's not that bad when you look at it really.

Well, there is this function with 80% comments ;-)
But I found it a bit cumbersome to refer to some existing implementation
in a design document. Maybe I could add some pseudo code here?
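
Something along these lines, maybe (heavily condensed from what KVM's
vgic_queue_irq_unlock() does; the lock, helper and field names here are
just illustrative):

    /*
     * Try to queue an IRQ on its target VCPU's ap_list.
     * Must be called with irq->irq_lock held; drops all locks before returning.
     */
    void vgic_queue_irq_unlock(struct domain *d, struct vgic_irq *irq)
    {
        struct vcpu *vcpu;

    retry:
        vcpu = vgic_target_oracle(irq);    /* which VCPU should get this IRQ? */
        if (irq->vcpu || !vcpu) {
            /* Already queued somewhere, or no valid target: nothing to do. */
            spin_unlock(&irq->irq_lock);
            return;
        }

        /* We hold the "wrong" lock first, so drop it to honour the ordering. */
        spin_unlock(&irq->irq_lock);

        spin_lock(&vcpu->vcpu_lock);       /* VCPU lock first ...     */
        spin_lock(&irq->irq_lock);         /* ... per-IRQ lock second */

        /*
         * The IRQ was briefly unlocked, so its state may have changed under
         * our feet: re-check that this VCPU is still the right target and
         * that nobody queued the IRQ meanwhile. If the check fails, unlock
         * everything, re-take the IRQ lock and start over.
         */
        if (irq->vcpu || vcpu != vgic_target_oracle(irq)) {
            spin_unlock(&irq->irq_lock);
            spin_unlock(&vcpu->vcpu_lock);

            spin_lock(&irq->irq_lock);
            goto retry;
        }

        /* The ap_list now holds its own reference to this IRQ (not shown). */
        list_add_tail(&irq->ap_list, &vcpu->vgic_cpu.ap_list_head);
        irq->vcpu = vcpu;

        spin_unlock(&irq->irq_lock);
        spin_unlock(&vcpu->vcpu_lock);
    }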

>>
>> ### Level and edge triggered interrupts
>>
>> The GIC knows about two kinds of signalling interrupts:
>>
>> - Edge triggered interrupts are triggered by a device once; their life cycle
>> ends when the guest has EOIed them, at which point we remove the pending
>> state, clear the LR and return the `vgic_irq` structure to a quiescent state.
> 
> For non-HW interrupts, you have the added potential complexity of
> PENDING+ACTIVE.

Well, I wanted to introduce level and edge triggered IRQs here, not
completely alienate the reader ;-)
I guess I'll shorten this to the textbook definition of level vs. edge,
and go into the details of those cases later.

>> - Level triggered interrupts are triggered when a device raises its interrupt
>> line; they stay pending as long as this line is held high. At some point the
>> driver in the guest is expected to program the device to explicitly or
>> implicitly lower this interrupt line. That means that we have to store the
>> state of the virtual interrupt line, which is only controlled by the
>> (virtual) device. This is done in the `line_level` member of `struct
>> vgic_irq`.
>>
>> To assert the interrupt condition, a (virtual) device calls a function
>> exported by the VGIC, which allows it to raise or lower an interrupt line.
>> Lowering the line for an edge triggered IRQ is ignored (and so is optional).
>> Raising the line asserts the pending state and potentially injects this
>> virtual IRQ. Any subsequent "raising" call might inject another IRQ, if the
>> previous one has at least been activated by the guest already; otherwise it
>> is ignored.
>>
>> For level triggered interrupts this function stores the new state into the
>> `line_level` variable, potentially injecting the interrupt if that line
>> changes from false to true. If the line is lowered before the guest has
>> seen it, this particular interrupt instance will be discarded. Successive
>> "raising" calls will not lead to multiple interrupts if the line has not
>> been lowered in between.
> 
> This is confusing:

Even for me, apparently ;-)

> Lowering or raising the line for a level triggered
> interrupt doesn't make any difference.  The point is that as long as the
> line is high, if you deactivate that interrupt, a new interrupt will hit
> immediately again, unless the line has been lowered in the meantime.

Yeah, it seems I need to seriously rework this paragraph.
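
Maybe the reworked text could be accompanied by a stripped-down sketch
like this (the helper and constant names are made up, and the real code
has more cases to consider):

    /* The guest-visible pending state combines both possible sources. */
    static bool irq_is_pending(struct vgic_irq *irq)
    {
        if (irq->config == VGIC_CONFIG_EDGE)
            return irq->pending_latch;
        else
            return irq->pending_latch || irq->line_level;
    }

    /* Called by a (virtual) device to drive the interrupt input line. */
    void vgic_update_irq_line(struct domain *d, struct vgic_irq *irq, bool level)
    {
        spin_lock(&irq->irq_lock);

        if (irq->config == VGIC_CONFIG_EDGE) {
            if (level)                      /* only the rising edge matters */
                irq->pending_latch = true;
        } else {
            irq->line_level = level;        /* lowering just clears the line */
        }

        if (irq_is_pending(irq))
            vgic_queue_irq_unlock(d, irq);  /* may queue it; drops irq_lock */
        else
            spin_unlock(&irq->irq_lock);
    }

The crucial property is then that after a deactivation the interrupt is
considered pending again as long as line_level is still true.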

>>
>> ### Software triggered interrupts
>>
>> Besides the naturally software-triggered inter-processor interrupts
>> (SGIs in GIC speak), there is another way of letting software raise an
>> interrupt condition.
> 
> These three lines appear to belong to the heading...
> 
>> The GIC distributor allows setting or clearing both the pending and active
>> state of any interrupt via MMIO registers. This isn't widely used by many
>> operating systems, but is useful when saving and restoring the state of a
>> machine. So emulating these functions is required for being architecture
>> compliant, however the implementation might not need to be very efficient
>> given its rare usage. In fact supporting the set-pending and clear-pending
>> registers is relatively straightforward, as long as one keeps this state
>> separate from the emulated interrupt line. `pending_latch` stores this state
>> in `vgic_irq`.
>>
>> The set-active and clear-active registers are much harder to emulate, though,
>> as normally the active state is of little concern to the GIC emulation. In
>> a normal interrupt life cycle the active state isn't even visible to the
>> hypervisor, as it might be set and cleared again entirely within the guest
>> in the list register, without exiting to the hypervisor.
>> So manipulating the active state via the MMIO registers requires some heavy
>> lifting: if this interrupt is currently injected into a running VCPU, this
>> VCPU must exit, the active state must be set or cleared in the LR, then
>> execution can continue. While this is expensive, as mentioned above it
>> should not happen too often, and the system probably isn't very performance
>> sensitive when using this feature for save and restore anyway.
> 
> These two paragraphs not so much, they seem to belong to MMIO emulation,
> and should probably follow the paragraph below.

Yes, indeed it makes sense to move them.

>> ### MMIO emulation
>>
>> As mentioned before, the distributor and redistributor parts of the VGIC need
>> to be fully emulated. Those parts are characterised by a range of MMIO
>> registers. The implementation shall provide a dispatcher function, which
>> takes the faulted address, relative to the beginning of the MMIO range, and
>> works out which actual register is affected. It then looks up the
>> respective handler function and calls it. Those functions are expected to
>> be listed in a struct initialiser, which connects the actual register
>> offset and its size to a particular handler. Having handler functions for
>> a register range seems beneficial over handling registers in a switch/case,
>> because it's easier to read and simplifies code sharing, for instance
>> between GICv2 distributor, GICv3 distributor and GICv3 redistributor
>> registers with the same semantics.
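
To make the table idea a bit more tangible, it could look roughly like
this (a sketch only; the handler names, the exact prototypes and the
range lengths are purely illustrative):

    /* Connects one register (range) to its read/write handlers. */
    struct vgic_register_region {
        unsigned int reg_offset;    /* offset from the frame base           */
        unsigned int len;           /* length of the covered register range */
        unsigned long (*read)(struct vcpu *vcpu, paddr_t addr,
                              unsigned int len);
        void (*write)(struct vcpu *vcpu, paddr_t addr,
                      unsigned int len, unsigned long val);
    };

    static const struct vgic_register_region vgic_v2_dist_registers[] = {
        { GICD_CTLR,      0x04, vgic_mmio_read_ctlr,   vgic_mmio_write_ctlr    },
        { GICD_ISENABLER, 0x80, vgic_mmio_read_enable, vgic_mmio_write_senable },
        { GICD_ICENABLER, 0x80, vgic_mmio_read_enable, vgic_mmio_write_cenable },
        /* ... one entry per register range, handlers shared where possible */
    };

The dispatcher then walks such a table with the faulted offset and calls
the matching handler; offsets not covered by any entry would typically be
treated as RAZ/WI.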
>>
>> ### List register management
>>
>> A list register (LR) holds the state of a virtual interrupt, which will
>> be used by the GIC hardware to simulate an IRQ life cycle for a guest.
>> Each GIC hardware implementation can choose to implement a number of LRs;
>> four seems to be a common value. This design here does not try to manage
>> the LRs very cleverly; instead on every guest exit every LR in use will be
>> synced to the emulated state, then cleared.
> 
> In fact I think we came up with counter-examples for every model of
> being clever with not reading back the LRs, because you simply have to
> observe any change in state that happened in hardware while the guest is
> running, to be able to properly emulate compliant functionality of being
> able to inject interrupts or not.

Adding another one:
https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/arm/gic.c;hb=HEAD#l622

Sorry, Stefano, couldn't resist ;-)

>> Upon guest entry
>> the top priority virtual IRQs will be inserted into the LRs. If there are
>> more pending or active IRQs than list registers, the GIC maintenance IRQ
>> will be configured to notify the hypervisor of a free LR (once the guest
>> has EOIed one IRQ). This will trigger a normal exit, which will go through
>> the normal cleanup/repopulate scheme, possibly now queuing the leftover
>> interrupt(s).
>> To facilitate quick guest exit and entry times, the VGIC maintains the list
>> of pending or active interrupts (ap\_list) sorted by their priority. Active
>> interrupts always go first on the list, since a guest and the hardware GIC
>> expect those to stay until they have been explicitly deactivated. Failure
>> to keep active IRQs around will result in error conditions in the GIC.
>> The second sort criterion for the ap\_list is the priority, so higher
>> priority pending interrupts always go first into the LRs.
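
The sort order described here boils down to a small comparison function,
roughly like this (a simplified sketch; the real thing also has to handle
interrupts that turn out not to be pending at all):

    /*
     * Ordering for the ap_list before populating the LRs:
     *  1. active interrupts first (they must stay visible to the guest),
     *  2. then by priority (lower value == higher priority on the GIC).
     * Returns < 0 if a should be placed before b.
     */
    static int vgic_irq_cmp(struct vgic_irq *a, struct vgic_irq *b)
    {
        if (a->active != b->active)
            return a->active ? -1 : 1;

        return (int)a->priority - (int)b->priority;
    }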
> 
> 
> Otherwise, as I said, this is a really nice writeup.

Thanks, much appreciated!

Cheers,
Andre.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

