
Xen Summit 2025 - Design Discussion Notes - Xen ABI, second session



Xen ABIs/APIs second session notes  (apologies for errors, etc.)

Andrew Cooper (AC): There are two classes of hypercall:

"privileged": ops such as: set trap table, PV load IDT, that must only
ever be issued by a kernel, and others such as grant and event
channel ops, because they act on per-domain resources, so mediation by the
kernel is required.

"logical": ones for userspace to invoke with general audit by the kernel, but
the kernel mostly just checks that the buffer passing is safe.

so: classify all the hypercalls as input to the new design.
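
A minimal sketch of what such a classification could produce, assuming
a table keyed by the real __HYPERVISOR_* numbers from
xen/include/public/xen.h (the enum, table name, and class assignments
are invented for illustration, restating the examples above):

    #include <public/xen.h>  /* __HYPERVISOR_* (in-hypervisor path) */

    /* Hypothetical classification of hypercall ops. */
    enum hypercall_class {
        HC_PRIVILEGED, /* kernel-only: trap tables, per-domain resources */
        HC_LOGICAL,    /* userspace-invokable; kernel audits buffer safety */
    };

    static const enum hypercall_class hypercall_classes[] = {
        [__HYPERVISOR_set_trap_table]   = HC_PRIVILEGED,
        [__HYPERVISOR_event_channel_op] = HC_PRIVILEGED,
        [__HYPERVISOR_grant_table_op]   = HC_PRIVILEGED,
        [__HYPERVISOR_xen_version]      = HC_LOGICAL,
    };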

Example design issue: separating event channel ops and domctls used during
domain construction from others due to the difference in privilege required.
Plan: split hypercall ops that are currently combined.

Longer term plan: a Flask policy, and in general one entity will be
more privileged than another.
Many domctls don't apply to self due to vCPU pause semantics.
Kernels can't be required to understand hypercall parameters just to
know that a particular op should not be issued.

So: need to evaluate what each (existing) hypercall op can do.

Yann Sionneau (YS): Q: before changing privcmd, does the new ABI need
to be implemented first?

AC: Yes. XenServer has a filter driver in its kernel patch queue as a
stopgap for now; it is not being upstreamed, since the aim is to land
the new ABI changes first, and progress slowed in upstream discussion.

Want to get to a plan, for example: XenServer will do ... , Vates will do ...
Have a stopgap for lockdown mode and a plan for upstream.

Christopher Clark (CC): Q: for background to this session: it was said
previously that XenServer's motivator for the new ABI work is adding
support for UEFI Secure Boot, and that Vates is interested for the
Encrypted VM implementation:
are there other known major feature motivators?

AC: Edera's Rust toolstack is affected by interface stability.

Within XenServer, the affected pieces are mostly xenops, performance
tooling, and the "xenguest" tool implementing save, restore, etc.

QEMU makes hypercalls, and distros can't ship one QEMU compatible with
arbitrary toolstacks, although we also want to support PVH guests with no QEMU.

Tapdisk runs in userspace and uses /dev/xen/evtchn.

Windows on PVH: there is a plan!
An ACPI table, the WPBT, can be used to put initial drivers into
Windows, and Windows will want those drivers to be signed.

Daniel Smith (DS): "Windows Platform Binary Table"

AC: Windows runs as gen2 VMs under Hyper-V, which has VMBus with PV
disk and network, i.e. a pure virtual environment with nothing else,
so there is a plan for a fully virtual Windows guest.

Alexander Merrit (AM): do we want to write a document covering all the
hypercalls, since we stumble over things as we work on them?
DS: there are headers!
AC: the previous documentation effort avoided doxygen.
Plan: kernel-doc with the Sphinx plugin.

AM: would like more than just the call and its arguments documented;
also how to use the call to achieve a goal.

AC: an example doc is the "lifecycle of a domid".  It is a precursor
to "how to build a domain".

A large amount of the perceived complexity is due to not knowing the
bootloader or firmware options; it is a lot simpler than people think,
e.g. build memory, populate images, ...
Sometimes we put in hvmloader or OVMF, or run pygrub in dom0 to get
the kernel off the disk and put it in.

The hypercalls used to build a domain are not documented. libxenctrl.
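
For reference, a rough and non-exhaustive outline of the sequence,
using the current XEN_DOMCTL_* / XENMEM_* op names from the public
headers (the ordering here is illustrative):

    1. XEN_DOMCTL_createdomain   - allocate a new domid
    2. XEN_DOMCTL_max_vcpus      - set the number of vCPUs
    3. XEN_DOMCTL_max_mem        - set the memory limit
    4. XENMEM_populate_physmap   - build the guest's memory
    5. load kernel/firmware images into that memory
    6. XEN_DOMCTL_setvcpucontext - set the boot vCPU's registers
    7. XEN_DOMCTL_unpausedomain  - start the guest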

AM: valuable, as it is meant to be a stable interface, so put effort into the document.

AC: docs are missing the higher level "how to start a domain".

Also relevant: Alejandro's plan, sent a while ago as an RFC to the
list: a somewhat long, language-agnostic description of the structs.

The complexity makes it difficult to generate C that you would want to
consume: if the output is not identical to the existing headers, or
has complexity such as compat support, it is difficult.

Alejandro Garcia Vallejo (AGV): still want to generate in a language-independent
fashion, but it is hard due to compat.

AC: so it can be done only for the new ABI.

Result: for all hypercalls, there will be several changes to how each
operation works.

Discussed describing the ABI in something other than C: struct
handling by the compiler is complex, and the compiler can and will
make changes, so the canonical description must not be C.
Ideally text, from which C and Rust and Go, etc. are generated.

DS: like Intel and AMD table documents?

AGV: do not want to implement a new description language, but it is ok
to describe fields one after another with a preamble saying, for
example: no padding, no holes, etc.,
so you don't need to be so precise for each item.
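
As a purely hypothetical illustration of that style, a field-by-field
text description with such a preamble, and the C it might generate
(both invented for this example):

    /*
     * Hypothetical input (invented syntax):
     *
     *   preamble: little-endian, no padding, no holes
     *   struct evtchn_send_v2:
     *     port  : u32   # local port to signal
     *     flags : u32   # reserved, must be zero
     *
     * One possible generated C output:
     */
    #include <stdint.h>

    struct evtchn_send_v2 {
        uint32_t port;   /* local port to signal */
        uint32_t flags;  /* reserved, must be zero */
    };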

AC: unions are a headache for any language that is not C,
so we want to avoid them in memory structures.

?: ban them?
AC: they are useful in C, e.g. domctl: header, command, union.
So we want individual ops instead, and can then drop the existing
unions in favour of a different structure.
It is reasonable to say "no to unions", but we don't want to outlaw
them, since we will likely need the option to use them.

AGV: the problem is not the union itself, it's the absence of the type tag.

AC: sched_op has a union with the tag inside the union,
so you have to read into the data before you know how to interpret it.
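
A sketch of the two shapes under discussion (struct and field names
invented; the first mirrors the domctl/sched_op pattern rather than
the exact definitions):

    #include <stdint.h>

    /* Today's shape: the tag selecting the union member lives next
     * to -- or, as in sched_op, inside -- the union, so a consumer
     * must inspect the data before knowing its type. */
    struct mux_op {
        uint32_t cmd;                         /* selects the member */
        union {
            struct { uint32_t port; } op_a;
            struct { uint64_t gfn;  } op_b;
        } u;
    };

    /* Proposed shape: one flat struct per operation, each its own
     * sub-op, so no union or discriminant is needed in memory. */
    struct op_a { uint32_t port; };
    struct op_b { uint64_t gfn;  };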

Because Xen was initially an academic project, a notable Xen book was
published containing an exercise on how to write a scheduler; sched_op
was attempting to be a variadic call over multiple schedulers.

XSM op: a hypercall that may or may not be compiled into the
hypervisor. This is an accident of the separation of the XSM and
Flask implementations. A flask_op should be created.

DS: XSM implementations: dummy, silo, flask (plus possibly
another soon). Was written originally as flask_op though.

AC: and we also want a way to ask which XSM is in place.

DS: do you want a hypercall per XSM module?
or to properly multiplex the xsm_op?

AC: Would like to avoid multiplexing, as it is complicated for bindings.

DS: but flask is unique due to what it is being asked to do.

... (discussion)
AC: no known buffer

DS: theoretically, a capability-based XSM versus a SID-based one.
If there is no multiplexing, then you have conditional ops.

AGV: with lockdown, there is a limit on the payload of ops: all
direct and indirect references must be marshalled ahead of the payload
for secure boot. They cannot be fully opaque.

DS: today you send in a label and get back the SID.
Either you implement multiplexing or you need two ops.

AM: would it make sense to define a set of principles for defining
hypercalls, to avoid arguing over each hypercall?

AC: yes! This is how we (CC + AC) have structured the documents that
we are creating for this. The third document is about Principles for
Improvement.

Teddy Astie (TA): We want the new ABI to be somewhat stable, but
how do we introduce new operations to it?

Jan Beulich (JB): adding subops will always work.

AC: domctl and sysctl: we frequently tweak those.
domctl has an example of buffer sizing: you get an invalid response if
there is a mismatch, which allows extending an existing op.
This can be done in some cases; in others it may be better to add a
new sub-op.
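
A minimal sketch of that buffer-sizing pattern, with invented names
(not a real op):

    #include <errno.h>
    #include <stdint.h>

    /* The caller states the struct size it was compiled against, so
     * the hypervisor can detect a mismatch, and fields can be
     * appended to the op in later versions. */
    struct example_op {
        uint32_t size;   /* caller sets to sizeof(struct example_op) */
        uint32_t value;
        /* future fields are appended here */
    };

    static int handle_example_op(const struct example_op *op)
    {
        if ( op->size != sizeof(*op) )
            return -EINVAL; /* caller built against another version */
        /* ... act on op->value ... */
        return 0;
    }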

AGV: See what Linux does: if an interface doesn't work, they
add another op.

DS: Could you consider a TLV structure for adding new items?
It is common in Linux.
AC: An encoder and a decoder are needed.
AGV: a problem to avoid?
DS: It is a pretty straightforward format; just a suggestion to consider.
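
For reference, the general TLV shape being suggested (a generic
sketch, not an existing Xen format):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Type-length-value record: a reader can skip record types it
     * does not understand by their length, which is what makes the
     * format extensible without renumbering ops. */
    struct tlv_hdr {
        uint16_t type;
        uint16_t len;    /* bytes of payload following the header */
    };

    static void tlv_walk(const uint8_t *buf, size_t size)
    {
        while ( size >= sizeof(struct tlv_hdr) )
        {
            struct tlv_hdr h;

            memcpy(&h, buf, sizeof(h));
            if ( sizeof(h) + h.len > size )
                break;               /* truncated/malformed record */
            /* dispatch on h.type; unknown types are simply skipped */
            buf  += sizeof(h) + h.len;
            size -= sizeof(h) + h.len;
        }
    }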

AM: Has the NOVA hypervisor been looked at? Their hypercalls?
They only have 5 or 6, plus a capability-based system.
The calls are low-level and very simple, so there is not much churn on
hypercalls. Can we take from that curated approach?
But they don't do backwards compatibility.

AC: We care about backwards compatibility, so that's a problem.

AM: Are the concepts in the design useful?
AC: Don't know.
AGV: Their IPC path is optimized to not take a long time. Xen took a
different view, e.g. page-table walks.
e.g. get/set CPU context: many bytes of vCPU state. It could instead
be done by a get/set of each index, which is extensible but would need
many calls.
AM: Could we create a multi-op call?
AC: We have multicall.
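
For reference, the multicall entry (simplified from
xen/include/public/xen.h) and a hedged sketch of batching with it; the
constants come from xen/include/public/{xen,event_channel}.h, and
send_a/send_b are struct evtchn_send arguments assumed declared
elsewhere:

    /* Simplified from xen/include/public/xen.h: each entry encodes
     * one hypercall, and __HYPERVISOR_multicall runs the batch. */
    struct multicall_entry {
        unsigned long op;      /* __HYPERVISOR_* number */
        unsigned long result;  /* per-call return value, set by Xen */
        unsigned long args[6];
    };

    /* Sketch: two event-channel notifications in one batch instead
     * of two separate hypercalls. */
    struct multicall_entry batch[] = {
        { .op   = __HYPERVISOR_event_channel_op,
          .args = { EVTCHNOP_send, (unsigned long)&send_a } },
        { .op   = __HYPERVISOR_event_channel_op,
          .args = { EVTCHNOP_send, (unsigned long)&send_b } },
    };
    /* then: HYPERVISOR_multicall(batch, 2); via the hypercall stubs */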
~end of discussion



 

