
Re: [openxt-dev] VirtIO-Argo initial development proposal



On Dec 17, 2020, at 07:13, Jean-Philippe Ouellet <jpo@xxxxxx> wrote:

On Wed, Dec 16, 2020 at 2:37 PM Christopher Clark
<christopher.w.clark@xxxxxxxxx> wrote:
Hi all,

I have written a page for the OpenXT wiki describing a proposal for
initial development towards the VirtIO-Argo transport driver, and the
related system components to support it, destined for OpenXT and
upstream projects:

https://openxt.atlassian.net/wiki/spaces/~cclark/pages/1696169985/VirtIO-Argo+Development+Phase+1

Please review ahead of tomorrow's OpenXT Community Call.

I would draw your attention to the Comparison of Argo interface options section:

https://openxt.atlassian.net/wiki/spaces/~cclark/pages/1696169985/VirtIO-Argo+Development+Phase+1#Comparison-of-Argo-interface-options

where further input to the table would be valuable;
and would also appreciate input on the IOREQ project section:

https://openxt.atlassian.net/wiki/spaces/~cclark/pages/1696169985/VirtIO-Argo+Development+Phase+1#Project:-IOREQ-for-VirtIO-Argo

in particular, whether an IOREQ implementation to support the
provision of devices to the frontends can replace the need for any
userspace software to interact with an Argo kernel interface for the
VirtIO-Argo implementation.

thanks,
Christopher

Hi,

Really excited to see this happening, and disappointed that I'm not
able to contribute at this time. I don't think I'll be able to join
the call, but wanted to share some initial thoughts from my
middle-of-the-night review anyway.

Super rough notes in raw unedited notes-to-self form:

main point of feedback is: I love the desire to get a non-shared-mem
transport backend for virtio standardized. It moves us closer to an
HMX-only world. BUT: virtio is relevant to many hypervisors beyond
Xen, and not all of them share the same view of how policy enforcement
should be done; some prefer capability-oriented models over
type-enforcement / MAC models. It would be nice if any
labeling encoded into the actual specs / guest-boundary protocols
would be strictly a mechanism, and be policy-agnostic, in particular
not making implicit assumptions about XSM / SELinux / similar. I don't
have specific suggestions at this point, but would love to discuss.

thoughts on how to handle device enumeration? hotplug notifications?
- can't rely on xenstore
- need some internal argo messaging for this? (strawman sketched below)
- name service w/ well-known names? starts to look like xenstore
pretty quickly...
- granular disaggregation of backend device-model providers desirable
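
As a strawman only (every name below is hypothetical; nothing like this
exists in the VirtIO spec or in Argo today), an enumeration/hotplug
message over a well-known Argo control port might look something like:

/* Strawman wire format for backend device announcements over Argo.
 * Purely illustrative; no such control port or structure is defined
 * anywhere today. */
#include <stdint.h>

#define VIRTIO_ARGO_CTRL_PORT  0x10000u  /* assumed well-known Argo port */

enum virtio_argo_ctrl_op {
    VIRTIO_ARGO_DEV_ADD    = 1,  /* backend offers a new device */
    VIRTIO_ARGO_DEV_REMOVE = 2,  /* backend withdraws a device (hot-unplug) */
};

struct virtio_argo_announce {
    uint32_t op;             /* enum virtio_argo_ctrl_op */
    uint32_t virtio_dev_id;  /* VirtIO device type, e.g. 2 = block */
    uint16_t backend_domid;  /* domain running the device model */
    uint16_t pad;
    uint32_t data_port;      /* Argo port the device's queues will use */
};

The open question stands, though: once you add discovery of which
backends exist and what they offer, this starts converging on a name
service anyway.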

how does resource accounting work? each side pays for its own delivery ring?
- init in already-guest-mapped mem & simply register? (rough sketch below)
- how does it compare to grant tables?
 - do you need to go through a linux driver to alloc (e.g. xengntalloc),
or is there a way to share arbitrary, otherwise not-special userspace
pages (e.g. u2mfn, with all its issues (pinning, reloc, etc.))?
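
To make the contrast with grant tables concrete, here is a minimal
sketch of the "receiver pays for its own ring" model, assuming a
hypothetical argo_register_ring() wrapper (not the real Xen hypercall
signature):

/* Sketch: the receiver allocates ordinary guest memory for its ring and
 * registers it with the hypervisor; no mapping is ever granted to the
 * peer.  argo_register_ring() is a hypothetical wrapper. */
#include <stdint.h>
#include <stdlib.h>

#define RING_LEN (64 * 1024)   /* assumed ring size */

extern int argo_register_ring(uint16_t partner_domid, uint32_t port,
                              void *ring, size_t len);

int setup_rx_ring(uint16_t partner_domid, uint32_t port)
{
    /* Plain, already-guest-mapped memory, accounted to the receiving
     * domain. */
    void *ring = aligned_alloc(4096, RING_LEN);
    if (!ring)
        return -1;

    /* The hypervisor copies inbound sendv() payloads into this ring;
     * the sender never maps or even sees the memory. */
    return argo_register_ring(partner_domid, port, ring, RING_LEN);
}

Unlike grant tables, no foreign mapping exists at any point, so there is
nothing to pin, revoke, or clean up on the peer's side at teardown.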

ioreq is tangled with grant refs, evt chans, the generic vmexit
dispatcher, the instruction decoder, etc., none of which seems desirable
if we're trying to move towards a world with strictly safer guest
interfaces exposed (e.g. HMX-only)
- there's no io to trap/decode here, it's explicitly and exclusively via
hypercall to HMX, no?
- also, do we want the argo sendv hypercall to be always blocking & synchronous?
 - or perhaps async notify & background copy to the other vm's addr space?
 - possibly better scaling?
 - accounting of in-flight io requests gets complicated
(see recent XSA)
 - PCI-like completion request semantics? (argo as a cross-domain
software dma engine w/ some basic protocol enforcement? rough sketch below)
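
To illustrate the async/completion direction (purely a sketch of the
"software dma engine" framing, not a proposed ABI; none of these
structures exist in Xen or in the VirtIO-Argo proposal):

/* Illustrative only: an asynchronous sendv-style interface modelled on
 * DMA-engine / completion-queue semantics. */
#include <stdint.h>

struct argo_iov {
    uint64_t addr;        /* guest buffer address */
    uint32_t len;
    uint32_t pad;
};

struct argo_send_desc {           /* submitted by the sending guest */
    uint64_t cookie;              /* echoed back in the completion */
    uint16_t dst_domid;
    uint16_t pad;
    uint32_t dst_port;
    uint32_t niov;
    struct argo_iov iov[4];       /* small fixed iov, just for the sketch */
};

struct argo_send_comp {           /* posted when the copy has completed */
    uint64_t cookie;
    int32_t  status;              /* 0 on success, negative error otherwise */
    uint32_t bytes_copied;
};

The accounting concern then becomes: how many of these descriptors may
be in flight per source domain at once, and who pays for holding them.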

"port" v4v driver => argo:
- yes please! something without all the confidence-inspiring
DEBUG_{APPLE,ORANGE,BANANA} indicators of production-worthy code would
be great ;)
- seems like you may want to redo the argo hypercall interface too? (at
least the syscall interface...)
 - targeting synchronous blocking sendv()? (one possible layering is
sketched below)
 - or some async queue/completion thing too? (like PF_RING, but with
*iov entries?)
 - both could count as HMX, both could enforce no double-write racing
games at dest ring, etc.
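
For the synchronous option, one plausible layering (hypothetical wrapper
names; the real Argo sendv/notify ops are only paraphrased here) is a
blocking send built on a nonblocking sendv plus a space-available
notification:

#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical wrappers: the nonblocking sendv returns -EAGAIN when the
 * destination ring is full; wait_for_space sleeps until the peer has
 * consumed at least 'needed' bytes of ring space. */
extern int argo_sendv_nonblock(uint16_t dst_domid, uint32_t dst_port,
                               const struct iovec *iov, unsigned int niov);
extern int argo_wait_for_space(uint16_t dst_domid, uint32_t dst_port,
                               size_t needed);

int argo_sendv_blocking(uint16_t dst_domid, uint32_t dst_port,
                        const struct iovec *iov, unsigned int niov,
                        size_t total_len)
{
    for (;;) {
        int rc = argo_sendv_nonblock(dst_domid, dst_port, iov, niov);
        if (rc != -EAGAIN)
            return rc;   /* success (bytes sent) or a hard error */
        /* Ring full: block until the peer has made room, then retry. */
        rc = argo_wait_for_space(dst_domid, dst_port, total_len);
        if (rc < 0)
            return rc;
    }
}

Either way the hypervisor performs the copy into the destination ring,
which is what gives you the no-double-write property at the receiver.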

re v4vchar & doing similar for argo:
- we may prefer "can write N bytes? -> yes/no" or "how many bytes can I
write? -> N" over "try to write N bytes -> only wrote M, EAGAIN"
- the latter can be implemented over the former, but not the other way
around (see the sketch after this list)
- starts to matter when you want to be able to implement in userspace
& provide backpressure to peer userspace without additional buffering
& potentially lying about the durability of writes
- breaks cross-domain EPIPE boundary correctness
- Qubes ran into the same issues when initially porting vchan from Xen
to KVM via vsock
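
To show why the direction matters (a sketch over a hypothetical
interface, not any existing driver API): given a "how many bytes can I
write?" query plus a commit primitive, the EAGAIN-style write falls out
directly; the reverse does not work without buffering:

#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical primitives exposed by the transport:
 *   chan_space()  - bytes the peer's ring can accept right now
 *   chan_commit() - copy exactly 'len' bytes; caller ensures they fit */
extern size_t chan_space(int chan);
extern void chan_commit(int chan, const void *buf, size_t len);

/* "try to write N -> wrote M, or EAGAIN" built on the query primitive */
ssize_t chan_write(int chan, const void *buf, size_t len)
{
    size_t space = chan_space(chan);

    if (space == 0)
        return -EAGAIN;      /* genuine backpressure, nothing buffered */

    size_t n = len < space ? len : space;
    chan_commit(chan, buf, n);
    return (ssize_t)n;       /* short writes stay visible to the caller */
}

Going the other way, the only way to answer "can I write N?" from a
write-style call is to actually consume ring space or buffer the data
locally, which is exactly where the durability lying and the EPIPE
correctness problems come from.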

some virtio drivers explicitly use shared mem for more than just
communication rings:
- e.g. virtio-fs, which can map pages as DAX-like fs backing to share page cache
- e.g. virtio-gpu, virtio-wayland, virtio-video, which deal in framebuffers
- needs thought about how best to map these semantics to (or at least
interoperate cleanly & safely with) an HMX-{only,mostly} world
 - the performance of shared mem can actually matter meaningfully for
e.g. large framebuffers in particular, due to fundamental memory
bandwidth constraints (rough numbers below)
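
To put a rough number on the framebuffer case (assuming 4K RGBA at 60 Hz):

  3840 x 2160 x 4 bytes x 60/s ≈ 1.99 GB/s

so every extra copy per frame costs on the order of 2 GB/s of memory
bandwidth on top of what the producer and consumer already spend, which
is why zero-copy sharing is hard to give up for that class of device.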

what is the mentioned PX hypervisor? presumably short for PicoXen? any
public information?

Not much at the moment, but there is prior public work.  PX is an OSS L0 "Protection Hypervisor" in the Hardened Access Terminal (HAT) architecture presented by Daniel Smith at the 2020 Xen Summit: https://youtube.com/watch?v=Wt-SBhFnDZY&t=3m48s

PX is intended to build on lessons learned from IBM Ultravisor, HP/Bromium AX and AIS Bareflank L0 hypervisors:


Dec 2019 meeting in Cambridge; the Day 2 discussion included the L0 nesting hypervisor, UUID semantics, Argo, and communication between nested hypervisors: https://lists.archive.carbon60.com/xen/devel/577800

Xen Summit 2020 design session notes: https://lists.archive.carbon60.com/xen/devel/591509

In the long term, efficient hypervisor nesting will require close cooperation with silicon and firmware vendors. Note that Intel is introducing TDX (Trust Domain Extensions).


There are also a couple of recent papers from Shanghai Jiao Tong University, on using hardware instructions to accelerate inter-domain HMX.

March 2019: https://ipads.se.sjtu.edu.cn/_media/publications/skybridge-eurosys19.pdf

> we present SkyBridge, a new communication facility designed and optimized for synchronous IPC in microkernels. SkyBridge requires no involvement of kernels during communication and allows a process to directly switch to the virtual address space of the target process and invoke the target function. SkyBridge retains the traditional virtual address space isolation and thus can be easily integrated into existing microkernels. The key idea of SkyBridge is to leverage a commodity hardware feature for virtualization (i.e., [Intel EPT] VMFUNC) to achieve efficient IPC. To leverage the hardware feature, SkyBridge inserts a tiny virtualization layer (Rootkernel) beneath the original microkernel (Subkernel). The Rootkernel is carefully designed to eliminate most virtualization overheads. SkyBridge also integrates a series of techniques to guarantee the security properties of IPC. We have implemented SkyBridge on three popular open-source microkernels (seL4, Fiasco.OC, and Google Zircon). The evaluation results show that SkyBridge improves the speed of IPC by 1.49x to 19.6x for microbenchmarks. For real-world applications (e.g., SQLite3 database), SkyBridge improves the throughput by 81.9%, 1.44x and 9.59x for the three microkernels on average.

July 2020: https://ipads.se.sjtu.edu.cn/_media/publications/guatc20.pdf

> a redesign of traditional microkernel OSes to harmonize the tension between messaging performance and isolation. UnderBridge moves the OS components of a microkernel between user space and kernel space at runtime while enforcing consistent isolation. It retrofits Intel Memory Protection Key for Userspace (PKU) in kernel space to achieve such isolation efficiently and design a fast IPC mechanism across those OS components. Thanks to PKU’s extremely low overhead, the inter-process communication (IPC) roundtrip cost in UnderBridge can be as low as 109 cycles. We have designed and implemented a new microkernel called ChCore based on UnderBridge and have also ported UnderBridge to three mainstream microkernels, i.e., seL4, Google Zircon, and Fiasco.OC. Evaluations show that UnderBridge speeds up the IPC by 3.0× compared with the state-of-the-art (e.g., SkyBridge) and improves the performance of IPC-intensive applications by up to 13.1× for the above three microkernels



For those interested in Argo and VirtIO, there will be a conference call on Thursday, Jan 14th 2021, at 1600 UTC.

Rich

 

