Re: [PATCH 00/20] Add SMMUv3 Stage 1 Support for XEN guests
- To: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Julien Grall <julien@xxxxxxx>
- From: Milan Djokic <milan_djokic@xxxxxxxx>
- Date: Mon, 3 Nov 2025 14:16:40 +0100
- Cc: Julien Grall <julien.grall.oss@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Bertrand Marquis <bertrand.marquis@xxxxxxx>, Rahul Singh <rahul.singh@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Nick Rosbrook <enr0n@xxxxxxxxxx>, George Dunlap <gwd@xxxxxxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
- Delivery-date: Mon, 03 Nov 2025 13:17:01 +0000
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
On 9/1/25 13:06, Milan Djokic wrote:
Hi Volodymyr,
On 8/29/25 18:27, Volodymyr Babchuk wrote:
Hi Milan,
Thanks, "Security Considerations" sections looks really good. But I have
more questions.
Milan Djokic <milan_djokic@xxxxxxxx> writes:
Hello Julien, Volodymyr
On 8/27/25 01:28, Volodymyr Babchuk wrote:
Hi Milan,
Milan Djokic <milan_djokic@xxxxxxxx> writes:
Hello Julien,
On 8/13/25 14:11, Julien Grall wrote:
On 13/08/2025 11:04, Milan Djokic wrote:
Hello Julien,
Hi Milan,
We have prepared a design document and it will be part of the updated
patch series (added in docs/design). I'll also extend the cover letter
with details on the implementation structure to make review easier.
I would suggest to just iterate on the design document for now.
Following is the design document content which will be provided in the
updated patch series:
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================
Author: Milan Djokic <milan_djokic@xxxxxxxx>
Date: 2025-08-07
Status: Draft
Introduction
------------
The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically
translated from VA to IPA in stage 1, then the IPA is input to stage 2
which translates the IPA to the output PA. Stage 1 translation support
is required to provide isolation between different devices within the OS.
Xen already supports Stage 2 translation but there is no support for
Stage 1 translation. This design proposal outlines the introduction of
Stage-1 SMMUv3 support in Xen for ARM guests.
Motivation
----------
ARM systems utilizing SMMUv3 require Stage-1 address translation to
ensure correct and secure DMA behavior inside guests.
Can you clarify what you mean by "correct"? DMA would still work
without
stage-1.
Correct in terms of working with guest managed I/O space. I'll
rephrase this statement; it seems ambiguous.
This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation
Design Overview
---------------
These changes provide emulated SMMUv3 support:
- SMMUv3 Stage-1 Translation: stage-1 and nested translation support in
SMMUv3 driver
- vIOMMU Abstraction: virtual IOMMU framework for guest Stage-1 handling
So what are you planning to expose to a guest? Is it one vIOMMU per
pIOMMU? Or a single one?
Single vIOMMU model is used in this design.
Have you considered the pros/cons for both?
- Register/Command Emulation: SMMUv3 register emulation and command
queue handling
That's a point for consideration.
Single vIOMMU prevails in terms of a less complex implementation and a
simple guest IOMMU model - single vIOMMU node, one interrupt path, one
event queue, a single set of trap handlers for emulation, etc.
Cons for a single vIOMMU model could be less accurate hw
representation and a potential bottleneck with one emulated queue and
interrupt path.
On the other hand, vIOMMU per pIOMMU provides more accurate hw
modeling and offers better scalability in case of many IOMMUs in the
system, but this comes with more complex emulation logic and device
tree, also handling multiple vIOMMUs on guest side.
IMO, single vIOMMU model seems like a better option mostly because
it's less complex, easier to maintain and debug. Of course, this
decision can and should be discussed.
Well, I am not sure that this is possible, because of StreamID
allocation. The biggest offender is of course PCI, as each Root PCI
bridge will require its own SMMU instance with its own StreamID space.
But even without PCI you'll need some mechanism to map vStreamID to
<pSMMU, pStreamID>, because there will be overlaps in SID space.
Actually, PCI/vPCI with vSMMU is its own can of worms...
For each pSMMU, we have a single command queue that will receive
commands from all the guests. How do you plan to prevent a guest
hogging the command queue?
In addition to that, AFAIU, the size of the virtual command queue is
fixed by the guest rather than Xen. If a guest is filling up the queue
with commands before notifying Xen, how do you plan to ensure we don't
spend too much time in Xen (which is not preemptible)?
We'll have to do a detailed analysis of these scenarios; they are not
covered by the design (as are some others, which is clear after
your comments). I'll come back with an updated design.
I think that can be handled akin to hypercall continuation, which is
used in similar places, like P2M code
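Something along these lines, perhaps (a minimal sketch; all
identifiers are illustrative rather than taken from the series)::

    /*
     * Hedged sketch of the continuation idea: process a bounded number
     * of vIOMMU commands per entry into Xen and record progress in the
     * emulated consumer index, so the operation can be restarted later,
     * akin to hypercall continuation in the P2M code.
     */
    #define VIOMMU_CMDS_PER_PASS 32

    static int viommu_process_cmdqueue(struct viommu *v)
    {
        unsigned int done = 0;

        while ( v->cmdq.cons != v->cmdq.prod )
        {
            int rc;

            if ( done++ >= VIOMMU_CMDS_PER_PASS )
                return -ERESTART; /* re-entered later; progress in cons */

            rc = viommu_handle_one_cmd(v, v->cmdq.cons);
            if ( rc )
                return rc;        /* malformed command aborts the batch */

            v->cmdq.cons = (v->cmdq.cons + 1) & v->cmdq.mask;
        }

        return 0;
    }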
[...]
I have updated vIOMMU design document with additional security topics
covered and performance impact results. Also added some additional
explanations for vIOMMU components following your comments.
Updated document content:
==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================
:Author: Milan Djokic <milan_djokic@xxxxxxxx>
:Date: 2025-08-07
:Status: Draft
Introduction
============
The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically
translated from VA to IPA in stage 1, then the IPA is input to stage 2
which translates the IPA to the output PA. Stage 1 translation support
is required to provide isolation between different devices within the
OS. Xen already supports Stage 2 translation but there is no support
for Stage 1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3
support in Xen for ARM guests.
Motivation
==========
ARM systems utilizing SMMUv3 require stage-1 address translation to
ensure secure DMA and guest managed I/O memory mappings.
It is unclear to me what you mean by "guest managed IO memory mappings",
could you please provide an example?
Basically enabling stage-1 translation means that the guest is
responsible for managing IOVA to IPA mappings through its own IOMMU
driver. Guest manages its own stage-1 page tables and TLB.
For example, when a guest driver wants to perform DMA mapping (e.g. with
dma_map_single()), it will request mapping of its buffer physical
address to an IOVA through the guest IOMMU driver. The guest IOMMU
driver will further issue mapping commands, emulated by Xen, which
translates them into stage-2 mappings.
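As a minimal guest-side illustration (the driver context is made up;
dma_map_single()/dma_unmap_single() are the standard Linux DMA API)::

    /*
     * Illustrative Linux guest fragment (not from the series): the IOVA
     * returned by dma_map_single() is a stage-1 mapping set up by the
     * guest's IOMMU driver; the guest's map/invalidate commands are
     * emulated by Xen, which keeps the stage-2 side consistent.
     */
    #include <linux/dma-mapping.h>

    static int example_dma_tx(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t iova = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, iova))
            return -ENOMEM;

        /* ... program the device with 'iova'; it only sees IOVAs ... */

        dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);
        return 0;
    }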
This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough under secure memory translation
As I see it, ARM specs use "secure" mostly when referring to Secure mode
(S-EL1, S-EL2, EL3) and associated secure counterparts of architectural
devices, like secure GIC, secure Timer, etc. So I'd probably don't use
this word here to reduce confusion
Sure, secure in terms of isolation is the topic here. I'll rephrase this
Design Overview
===============
These changes provide emulated SMMUv3 support:
- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation
support in SMMUv3 driver.
"Nested translation" as in "nested virtualization"? Or is this something else?
No, this refers to the 2-stage translation IOVA->IPA->PA as nested
translation. Although with this feature, nested virtualization is also
enabled, since a guest can emulate its own IOMMU, e.g. when KVM is run
in the guest.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
handling.
I think, this is the big topic. You see, apart from SMMU, there is
at least Renesas IP-MMU, which uses completely different API. And
probably there are other IO-MMU implementations possible. Right now
vIOMMU framework handles only SMMU, which is okay, but probably we
should design it in a such way, that other IO-MMUs will be supported as
well. Maybe even IO-MMUs for other architectures (RISC V maybe?).
I think that it is already designed in such a manner. We have a generic
vIOMMU framework and a backend implementation for the target IOMMU as
separate components. And the backend implements the supported
commands/mechanisms which are specific to the target IOMMU type. At
this point, only SMMUv3 is supported, but it is possible to implement
support for other IOMMU types under the same generic framework. AFAIK,
RISC-V IOMMU stage-2 is still in an early development stage, but I do
believe that it will also be compatible with the vIOMMU framework.
- **Register/Command Emulation**: SMMUv3 register emulation and
command queue handling.
Continuing previous paragraph: what about other IO-MMUs? For example, if
platform provides only Renesas IO-MMU, will vIOMMU framework still
emulate SMMUv3 registers and queue handling?
Yes, this is not supported in the current implementation. To support an
IOMMU other than SMMUv3, a stage-1 emulation backend needs to be
implemented for the target IOMMU, and the Xen driver for the target
IOMMU probably has to be updated to handle stage-1 configuration. I
will elaborate on this part in the design, to make clear that we have a
generic vIOMMU framework, but only the SMMUv3 backend exists atm.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes
to device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
dynamic enablement.
vIOMMU is exposed to guest as a single device with predefined
capabilities and commands supported. Single vIOMMU model abstracts the
details of an actual IOMMU hardware, simplifying usage from the guest
point of view. Guest OS handles only a single IOMMU, even if multiple
IOMMU units are available on the host system.
In the previous email I asked how are you planning to handle potential
SID overlaps, especially in PCI use case. I want to return to this
topic. I am not saying that this is impossible, but I'd like to see this
covered in the design document.
Sorry, I've missed this part in the previous mail. This is a valid
point; SID overlapping would be an issue for a single vIOMMU model. To
prevent it, the design will have to be extended with SID namespace
virtualization, introducing a remapping layer which will make sure that
guest virtual SIDs are unique and maintain proper mappings of vSIDs to
pSIDs (a rough sketch follows below).
For the PCI case, we need extended remapping logic where the iommu-map
property will also be patched in the guest device tree, since we need a
range of unique vSIDs for every RC assigned to the guest.
An alternative approach would be to switch to a vIOMMU per pIOMMU model.
Since both approaches require major updates, I'll have to do a detailed
analysis and come back with an updated design which would address this
issue.
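As a rough illustration of the remapping idea (all names are
hypothetical)::

    /*
     * Hypothetical remapping layer for the single-vIOMMU variant
     * discussed above: each guest-visible stream ID maps to a physical
     * SMMU instance and a stream ID on that SMMU, keeping vSIDs unique
     * per guest.
     */
    struct vsid_map_entry {
        uint32_t vsid;                 /* guest-visible stream ID */
        struct arm_smmu_device *smmu;  /* owning physical SMMU */
        uint32_t psid;                 /* stream ID on that SMMU */
    };

    static const struct vsid_map_entry *
    vsid_lookup(const struct domain_viommu *dv, uint32_t vsid)
    {
        unsigned int i;

        for ( i = 0; i < dv->nr_vsids; i++ )
            if ( dv->vsid_map[i].vsid == vsid )
                return &dv->vsid_map[i];

        return NULL;  /* unknown vSID: reject the guest command */
    }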
Security Considerations
=======================
**viommu security benefits:**
- Stage-1 translation ensures guest devices cannot perform unauthorized DMA.
- Emulated IOMMU removes guest dependency on IOMMU hardware while
maintaining domains isolation.
I am not sure that I got this paragraph.
The first one refers to guest controlled DMA access. Only IOVA->IPA
mappings created by the guest are usable by the device when stage-1 is
enabled. On the other hand, with only stage-2 enabled, the device could
access the complete IOVA->PA mapping created by Xen for the guest.
Since the guest has no control over device IOVA accesses, a malicious
guest kernel could potentially access memory regions it shouldn't be
allowed to, e.g. if stage-2 mappings are stale. With stage-1 enabled,
the guest device driver has to explicitly map IOVAs, and this request
is propagated through the emulated IOMMU, making sure that IOVA
mappings are valid all the time.
The second claim means that with an emulated IOMMU, guests don't need
direct access to physical IOMMU hardware. The hypervisor emulates IOMMU
behavior for the guest, while still ensuring that memory access by
devices remains properly isolated between guests, just like it would
with real IOMMU hardware.
1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data
structures (`s1_cfg` alongside `s2_cfg`) and logic to write both
Stage-1 and Stage-2 entries in the Stream Table Entry (STE), including
an `abort` field to handle partial configuration states.
**Risk:**
Without proper handling, a partially applied Stage-1 configuration
might leave guest DMA mappings in an inconsistent state, potentially
enabling unauthorized access or causing cross-domain interference.
**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg`
to the STE and manages the `abort` field, only considering the Stage-1
configuration if it is fully attached. This ensures incomplete or
invalid guest configurations are safely ignored by the hypervisor.
2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidation needs
forwarding to SMMUv3 hardware to maintain coherence.
**Risk:**
Failing to propagate cache invalidation could allow stale mappings,
enabling access to old mappings and possibly data leakage or
misrouting.
**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware, preserving IOMMU coherency.
3. Observation:
---------------
This design introduces substantial new functionality, including the
`vIOMMU` framework, virtual SMMUv3 devices (`vsmmuv3`), command
queues, event queues, domain management, and Device Tree modifications
(e.g., `iommus` nodes and `libxl` integration).
**Risk:**
Large feature expansions increase the attack surface—potential for
race conditions, unchecked command inputs, or Device Tree-based
misconfigurations.
**Mitigation:**
- Sanity checks and error-handling improvements have been introduced
in this feature.
- Further audits have to be performed for this feature and its
dependencies in this area. Currently, the feature is marked as *Tech
Preview* and is self-contained, reducing the risk to unrelated
components.
4. Observation:
---------------
The code includes transformations to handle nested translation versus
standard modes and uses guest-configured command queues (e.g.,
`CMD_CFGI_STE`) and event notifications.
**Risk:**
Malicious or malformed queue commands from guests could bypass
validation, manipulate SMMUv3 state, or cause Dom0 instability.
Only Dom0?
This is a mistake, the whole system could be affected. I'll fix this.
**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization
mechanisms ensure only permitted configurations are applied. This is
supported via additions in `vsmmuv3` and `cmdqueue` handling code.
5. Observation:
---------------
Device Tree modifications enable device assignment and
configuration—guest DT fragments (e.g., `iommus`) are added via
`libxl`.
**Risk:**
Erroneous or malicious Device Tree injection could result in device
misbinding or guest access to unauthorized hardware.
**Mitigation:**
- `libxl` performs checks of the guest configuration and parses only
predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the
guest Device Tree (DT) fragments.
6. Observation:
---------------
Introducing optional per-guest enabled features (`viommu` argument in
xl guest config) means some guests may opt-out.
**Risk:**
Differences between guests with and without `viommu` may cause
unexpected behavior or privilege drift.
**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing
support doesn't cause security issues. Additional audits on emulation
paths and domains interference need to be performed in a multi-guest
environment.
7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands like cache
invalidation, stream table entries configuration, etc. An adversarial
guest may issue a high volume of commands in rapid succession.
**Risk**
Excessive command requests can cause high hypervisor CPU consumption
and disrupt scheduling, leading to degraded system responsiveness and
potential denial-of-service scenarios.
**Mitigation**
- Xen credit scheduler limits guest vCPU execution time, securing
basic guest rate-limiting.
I don't think that this feature is available only in the credit
scheduler; AFAIK, all schedulers except the null scheduler will limit
vCPU execution time.
I was not aware of that. I'll rephrase this part.
- Batch multiple commands of the same type to reduce overhead on the
virtual SMMUv3 hardware emulation.
- Implement vIOMMU commands execution restart and continuation support
So, something like "hypercall continuation"?
Yes
8. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the
pIOMMU command queue (e.g. TLB invalidate). For each pIOMMU, only one
command queue is available for all domains.
**Risk**
Excessive command requests from an abusive guest can flood the
physical IOMMU command queue, leading to degraded pIOMMU responsiveness
for commands issued from other guests.
**Mitigation**
- Xen credit scheduler limits guest vCPU execution time, securing
basic guest rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command
queue and enable support for batch execution pause/continuation
- If possible, implement domain penalization by adding a per-domain
cost counter for vIOMMU/pIOMMU usage.
9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU
events to the guest (e.g. translation faults, invalid stream IDs,
permission errors). A malicious guest can misconfigure its SMMU state
or intentionally trigger faults at high frequency.
**Risk**
High-frequency IOMMU events can cause Xen to flood the event queue and
disrupt scheduling with high hypervisor CPU load for event handling.
**Mitigation**
- Implement a fail-safe state by disabling event forwarding when
faults occur at high frequency and are not processed by the guest.
- Batch multiple events of the same type to reduce overhead on the
virtual SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests
Performance Impact
==================
With IOMMU stage-1 and nested translation included, performance
overhead is introduced compared to the existing stage-2-only usage in
Xen.
Once mappings are established, translations should not introduce
significant overhead.
Emulated paths may introduce moderate overhead, primarily affecting
device initialization and event handling.
Performance impact highly depends on target CPU capabilities. Testing
is performed on a Cortex-A53 based platform.
Which platform exactly? While QEMU emulates SMMU to some extent, we are
observing somewhat different SMMU behavior on real HW platforms (mostly
due to cache coherence problems). Also, according to MMU-600 errata, it
can have lower than expected performance in some use-cases.
Performance measurements are done on a QEMU-emulated Renesas platform.
I'll add some details for this.
Performance is mostly impacted by emulated vIOMMU operations; results
are shown in the following table.
+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+
With a vIOMMU exposed to the guest, the guest OS has to initialize the
IOMMU device and configure stage-1 mappings for the devices attached
to it.
The following table shows the initialization stages which impact
stage-1 enabled guest boot time and compares them with a stage-1
disabled guest.
NOTE: Device probe execution time varies significantly depending on
device complexity. virtio-gpu was selected as a test case due to its
extensive use of dynamic DMA allocations and IOMMU mappings, making it
a suitable candidate for benchmarking stage-1 vIOMMU behavior.
+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+
For devices configured with dynamic DMA mappings, the performance of
DMA allocate/map/unmap operations is also impacted on stage-1 enabled
guests.
Dynamic DMA mapping operations trigger emulated IOMMU functions like
MMIO write/read and TLB invalidations.
As a reference, the following table shows performance results for
runtime DMA operations for the virtio-gpu device.
+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+
Testing
============
- QEMU-based ARM system tests for Stage-1 translation and nested
virtualization.
- Actual hardware validation on platforms such as Renesas to ensure
compatibility with real SMMUv3 implementations.
- Unit/Functional tests validating correct translations (not implemented).
Migration and Compatibility
===========================
This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.
BR,
Milan
Hello Volodymyr, Julien
Sorry for the delayed follow-up on this topic.
We have changed the vIOMMU design from a 1-N to an N-N mapping between
vIOMMUs and pIOMMUs. Considering the single vIOMMU model limitation
pointed out by Volodymyr (SID overlaps), the vIOMMU-per-pIOMMU model
turned out to be the only proper solution.
Following is the updated design document.
I have added additional details to the design and performance impact
sections, and also indicated future improvements. The security
considerations section is unchanged apart from some minor details
according to review comments.
Let me know what you think about the updated design. Once approved, I
will send the updated vIOMMU patch series.
==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================
:Author: Milan Djokic <milan_djokic@xxxxxxxx>
:Date: 2025-11-03
:Status: Draft
Introduction
============
The SMMUv3 supports two stages of translation. Each stage of translation
can be independently enabled. An incoming address is logically
translated from VA to IPA in stage 1, then the IPA is input to stage 2
which translates the IPA to the output PA. Stage 1 translation support
is required to provide isolation between different devices within the
OS. Xen already supports Stage 2 translation but there is no support
for Stage 1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3 support
in Xen for ARM guests.
Motivation
==========
ARM systems utilizing SMMUv3 require stage-1 address translation to
ensure secure DMA and guest-managed I/O memory mappings.
With stage-1 enabled, the guest manages IOVA to IPA mappings through
its own IOMMU driver.
This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough with per-device address translation table
Design Overview
===============
These changes provide emulated SMMUv3 support:
- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command
queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
device trees for dom0 and dom0less scenarios.
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
dynamic enablement.
A separate vIOMMU device is exposed to the guest for every physical
IOMMU in the system.
The vIOMMU feature is designed to provide a generic vIOMMU framework
and a backend implementation for the target IOMMU as separate
components.
The backend implementation contains the IOMMU-specific structures and
command handling (only SMMUv3 is currently supported).
This structure allows potential reuse of the stage-1 feature for other
IOMMU types, as sketched below.
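As a rough illustration of this split (all identifiers are
hypothetical; the actual ops in the series may differ)::

    /* Sketch of the framework/backend split described above. */
    struct viommu_ops {
        int (*domain_init)(struct domain *d);
        int (*mmio_read)(struct vcpu *v, paddr_t addr, uint64_t *val);
        int (*mmio_write)(struct vcpu *v, paddr_t addr, uint64_t val);
        int (*handle_cmd)(struct domain *d, const uint64_t *cmd);
    };

    /* SMMUv3 is the only backend today; another IOMMU (e.g. Renesas
     * IPMMU or a RISC-V IOMMU) would provide its own ops without
     * touching the generic framework. */
    static const struct viommu_ops vsmmuv3_ops = {
        .domain_init = vsmmuv3_domain_init,
        .mmio_read   = vsmmuv3_mmio_read,
        .mmio_write  = vsmmuv3_mmio_write,
        .handle_cmd  = vsmmuv3_handle_cmd,
    };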
Security Considerations
=======================
**viommu security benefits:**
- Stage-1 translation ensures guest devices cannot perform unauthorized
DMA (device I/O address mappings are managed by the guest).
- Emulated IOMMU removes the guest's direct dependency on IOMMU
hardware, while maintaining domain isolation.
1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`)
and logic to write both Stage-1 and Stage-2 entries in the Stream Table
Entry (STE), including an `abort`
field to handle partial configuration states.
**Risk:**
Without proper handling, a partially applied Stage-1 configuration might
leave guest DMA mappings in an
inconsistent state, potentially enabling unauthorized access or causing
cross-domain interference.
**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg`
to the STE and manages the `abort` field, only considering the Stage-1
configuration if it is fully attached. This ensures incomplete or
invalid guest configurations are safely ignored by the hypervisor.
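A simplified sketch of that rule (structure and field names are
hypothetical)::

    /*
     * Stage-1 bits are merged into the STE only once the guest's
     * s1_cfg is complete; a partially configured stage 1 makes the
     * STE abort transactions instead.
     */
    static void write_ste(struct ste *ste, const struct s1_cfg *s1,
                          const struct s2_cfg *s2)
    {
        ste_set_stage2(ste, s2);      /* Xen-owned stage 2, always valid */

        if ( !s1 )
            ste_set_s1_bypass(ste);   /* stage-2-only domain */
        else if ( s1->fully_attached )
            ste_set_stage1(ste, s1);  /* complete guest stage-1 config */
        else
            ste->abort = true;        /* partial config: block all DMA */
    }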
2. Observation:
---------------
Guests can now invalidate Stage-1 caches; invalidation needs forwarding
to SMMUv3 hardware to maintain coherence.
**Risk:**
Failing to propagate cache invalidation could allow stale mappings,
enabling access to old mappings and possibly
data leakage or misrouting.
**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware,
preserving IOMMU coherency.
3. Observation:
---------------
This design introduces substantial new functionality, including the
`vIOMMU` framework, virtual SMMUv3
devices (`vsmmuv3`), command queues, event queues, domain management,
and Device Tree
modifications (e.g., `iommus` nodes and `libxl` integration).
**Risk:**
Large feature expansions increase the attack surface: potential for
race conditions, unchecked command inputs, or Device Tree-based
misconfigurations.
**Mitigation:**
- Sanity checks and error-handling improvements have been introduced in
this feature.
- Further audits have to be performed for this feature and its
dependencies in this area.
4. Observation:
---------------
The code includes transformations to handle nested translation versus
standard modes and uses guest-configured
command queues (e.g., `CMD_CFGI_STE`) and event notifications.
**Risk:**
Malicious or malformed queue commands from guests could bypass
validation, manipulate SMMUv3 state,
or cause system instability.
**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms
ensure only permitted configurations
are applied. This is supported via additions in `vsmmuv3` and `cmdqueue`
handling code.
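A rough sketch of such validation (the command names follow the SMMUv3
spec; the helpers and mask are hypothetical)::

    /*
     * Hypothetical command sanitization: only a small set of commands
     * is accepted, and operands are checked against the guest's own
     * resources before any emulated or physical state is touched.
     */
    static int vsmmuv3_handle_cmd(struct domain *d, const uint64_t *cmd)
    {
        switch ( cmd[0] & CMDQ_0_OP )            /* opcode field */
        {
        case CMDQ_OP_CFGI_STE:
        {
            uint32_t vsid = cfgi_ste_sid(cmd);

            if ( !vsid_owned_by_domain(d, vsid) )
                return -EPERM;   /* SID not assigned to this guest */
            return vsmmuv3_cfgi_ste(d, vsid);
        }
        case CMDQ_OP_TLBI_NH_ASID:
            return vsmmuv3_tlbi_asid(d, cmd);
        default:
            return -EOPNOTSUPP;  /* unknown commands are rejected */
        }
    }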
5. Observation:
---------------
Device Tree modifications enable device assignment and configuration;
guest DT fragments (e.g., `iommus`) are added via `libxl`.
**Risk:**
Erroneous or malicious Device Tree injection could result in device
misbinding or guest access to unauthorized
hardware.
**Mitigation:**
- `libxl` performs checks of the guest configuration and parses only
predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the
guest Device Tree (DT) fragments.
6. Observation:
---------------
Introducing optional per-guest enabled features (`viommu` argument in xl
guest config) means some guests
may opt-out.
**Risk:**
Differences between guests with and without `viommu` may cause
unexpected behavior or privilege drift.
**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing
support doesn't cause security issues.
Additional audits on emulation paths and domain interference need to be
performed in a multi-guest environment.
7. Observation:
---------------
Guests have the ability to issue Stage-1 IOMMU commands like cache
invalidation, stream table entries
configuration, etc. An adversarial guest may issue a high volume of
commands in rapid succession.
**Risk:**
Excessive command requests can cause high hypervisor CPU consumption
and disrupt scheduling, leading to degraded system responsiveness and
potential denial-of-service scenarios.
**Mitigation:**
- Xen schedulers limit guest vCPU execution time, securing basic guest
rate-limiting.
- Batch multiple commands of the same type to reduce overhead on the
virtual SMMUv3 hardware emulation (see the coalescing sketch after
this list).
- Implement vIOMMU command execution restart and continuation support
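A sketch of the batching idea for TLB invalidations (illustrative
only, not taken from the series)::

    /*
     * Consecutive TLB invalidations with contiguous IOVA ranges are
     * merged, so one emulated (or physical) invalidation covers what
     * the guest issued as many commands.
     */
    struct tlbi_range {
        uint64_t iova;
        uint64_t size;
    };

    static unsigned int coalesce_tlbi(const struct tlbi_range *in,
                                      unsigned int n,
                                      struct tlbi_range *out)
    {
        unsigned int i, m = 0;

        for ( i = 0; i < n; i++ )
        {
            if ( m && out[m - 1].iova + out[m - 1].size == in[i].iova )
                out[m - 1].size += in[i].size;  /* extend prior range */
            else
                out[m++] = in[i];               /* start a new range */
        }

        return m;  /* number of invalidations actually issued */
    }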
8. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the
pIOMMU command queue (e.g. TLB invalidate).
**Risk:**
Excessive command requests from an abusive guest can flood the
physical IOMMU command queue, leading to degraded pIOMMU responsiveness
for commands issued from other guests.
**Mitigation:**
- Xen schedulers limit guest vCPU execution time, securing basic guest
rate-limiting.
- Batch commands which should be propagated towards the pIOMMU command
queue and enable support for batch execution pause/continuation.
- If possible, implement domain penalization by adding a per-domain
cost counter for vIOMMU/pIOMMU usage (see the sketch after this list).
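A sketch of the penalization idea (the budget value and all names are
hypothetical)::

    /*
     * Charge each domain for the pIOMMU queue slots it consumes within
     * an accounting period and defer further commands once it exceeds
     * its budget.
     */
    #define PIOMMU_BUDGET_PER_PERIOD 256

    struct viommu_account {
        unsigned int charged;   /* pIOMMU commands used this period */
    };

    static bool viommu_charge(struct viommu_account *acct, unsigned int n)
    {
        if ( acct->charged + n > PIOMMU_BUDGET_PER_PERIOD )
            return false;       /* over budget: defer, e.g. via -ERESTART */

        acct->charged += n;     /* within budget: forward to pIOMMU queue */
        return true;
    }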
9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU
events to the guest (e.g. translation faults, invalid stream IDs,
permission errors).
A malicious guest can misconfigure its SMMU state or intentionally
trigger faults at high frequency.
**Risk:**
High-frequency IOMMU events can cause Xen to flood the event queue and
disrupt scheduling with high hypervisor CPU load for event handling.
**Mitigation:**
- Implement a fail-safe state by disabling event forwarding when
faults occur at high frequency and are not processed by the guest (see
the sketch after this list).
- Batch multiple events of the same type to reduce overhead on the
virtual SMMUv3 hardware emulation.
- Consider disabling the event queue for untrusted guests
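A sketch of the fail-safe gating (the threshold and all names are
hypothetical)::

    /*
     * If a guest leaves too many events unconsumed, forwarding is
     * gated until the guest drains its event queue.
     */
    #define EVTQ_STALL_LIMIT 128

    static void viommu_forward_event(struct viommu *v,
                                     const struct viommu_event *evt)
    {
        if ( v->evtq_pending >= EVTQ_STALL_LIMIT )
        {
            v->evtq_gated = true;   /* stop forwarding for this guest */
            return;
        }

        viommu_evtq_push(v, evt);   /* copy into guest-visible queue */
        v->evtq_pending++;
        viommu_inject_irq(v);       /* raise the vIOMMU event interrupt */
    }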
Performance Impact
==================
With IOMMU stage-1 and nested translation included, performance
overhead is introduced compared to the existing stage-2-only usage in
Xen. Once mappings are established, translations should not introduce
significant overhead.
Emulated paths may introduce moderate overhead, primarily affecting
device initialization and event handling.
Performance impact highly depends on target CPU capabilities.
Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated)
platforms.
Performance is mostly impacted by emulated vIOMMU operations; results
are shown in the following table.
+-------------------------------+---------------------------------+
| vIOMMU Operation              | Execution time in guest         |
+===============================+=================================+
| Reg read                      | median: 30μs, worst-case: 250μs |
+-------------------------------+---------------------------------+
| Reg write                     | median: 35μs, worst-case: 280μs |
+-------------------------------+---------------------------------+
| Invalidate TLB                | median: 90μs, worst-case: 1ms+  |
+-------------------------------+---------------------------------+
| Invalidate STE                | median: 450μs, worst-case: 7ms+ |
+-------------------------------+---------------------------------+
With a vIOMMU exposed to the guest, the guest OS has to initialize the
IOMMU device and configure stage-1 mappings for the devices attached
to it.
The following table shows the initialization stages which impact
stage-1 enabled guest boot time and compares them with a stage-1
disabled guest.
NOTE: Device probe execution time varies significantly depending on
device complexity. virtio-gpu was selected as a test case due to its
extensive use of dynamic DMA allocations and IOMMU mappings, making it
a suitable candidate for benchmarking stage-1 vIOMMU behavior.
+---------------------+-----------------------+------------------------+
| Stage               | Stage-1 Enabled Guest | Stage-1 Disabled Guest |
+=====================+=======================+========================+
| IOMMU Init          | ~25ms                 | /                      |
+---------------------+-----------------------+------------------------+
| Dev Attach / Mapping| ~220ms                | ~200ms                 |
+---------------------+-----------------------+------------------------+
For devices configured with dynamic DMA mappings, the performance of
DMA allocate/map/unmap operations is also impacted on stage-1 enabled
guests.
Dynamic DMA mapping operations trigger emulated IOMMU functions like
MMIO write/read and TLB invalidations.
As a reference, the following table shows performance results for
runtime DMA operations for the virtio-gpu device.
+---------------+-------------------------+----------------------------+
| DMA Op        | Stage-1 Enabled Guest   | Stage-1 Disabled Guest     |
+===============+=========================+============================+
| dma_alloc     | median: 27μs, worst: 7ms| median: 2.5μs, worst: 360μs|
+---------------+-------------------------+----------------------------+
| dma_free      | median: 1ms, worst: 14ms| median: 2.2μs, worst: 85μs |
+---------------+-------------------------+----------------------------+
| dma_map       | median: 25μs, worst: 7ms| median: 1.5μs, worst: 336μs|
+---------------+-------------------------+----------------------------+
| dma_unmap     | median: 1ms, worst: 13ms| median: 1.3μs, worst: 65μs |
+---------------+-------------------------+----------------------------+
Testing
=======
- QEMU-based ARM system tests for Stage-1 translation.
- Actual hardware validation to ensure compatibility with real SMMUv3
implementations.
- Unit/Functional tests validating correct translations (not implemented).
Migration and Compatibility
===========================
This optional feature defaults to disabled (`viommu=""`) for backward
compatibility.
Future improvements
===================
- Implement the proposed mitigations to address security risks that
are not covered by the current design (event batching, command
execution continuation)
- Support for other IOMMU HW (Renesas, RISC-V, etc.)
- Due to the static definition of SPIs and MMIO regions for emulated
devices, the current implementation statically defines SPIs and MMIO
regions for up to 16 vIOMMUs per guest. Future improvements would
include a configurable number of vIOMMUs or automatic runtime
resolution for the target platform.
References
==========
- Original feature implemented by Rahul Singh:
https://patchwork.kernel.org/project/xen-devel/cover/cover.1669888522.git.rahul.singh@xxxxxxx/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns
BR,
Milan