
Re: [PATCH for-4.18] docs/sphinx: Lifecycle of a domid


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Juergen Gross <jgross@xxxxxxxx>
  • Date: Tue, 17 Oct 2023 07:24:39 +0200
  • Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Henry Wang <Henry.Wang@xxxxxxx>
  • Delivery-date: Tue, 17 Oct 2023 05:24:54 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 16.10.23 18:24, Andrew Cooper wrote:
Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
---
CC: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
CC: Jan Beulich <JBeulich@xxxxxxxx>
CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
CC: Wei Liu <wl@xxxxxxx>
CC: Julien Grall <julien@xxxxxxx>
CC: Roger Pau Monné <roger.pau@xxxxxxxxxx>
CC: Juergen Gross <jgross@xxxxxxxx>
CC: Henry Wang <Henry.Wang@xxxxxxx>

Rendered form:
   
https://andrewcoop-xen.readthedocs.io/en/docs-devel/hypervisor-guide/domid-lifecycle.html

I'm not sure why it's using the alabaster theme and not the RTD theme, but I
don't have time to debug that further at this point.

This was written mostly while sat waiting for flights in Nanjing and Beijing.

If while reading this you spot a hole, congratulations.  There are holes which
need fixing...
---
  docs/glossary.rst                         |   9 ++
  docs/hypervisor-guide/domid-lifecycle.rst | 164 ++++++++++++++++++++++
  docs/hypervisor-guide/index.rst           |   1 +
  3 files changed, 174 insertions(+)
  create mode 100644 docs/hypervisor-guide/domid-lifecycle.rst

diff --git a/docs/glossary.rst b/docs/glossary.rst
index 8ddbdab160a1..1fd1de0f0e97 100644
--- a/docs/glossary.rst
+++ b/docs/glossary.rst
@@ -50,3 +50,12 @@ Glossary
By default it gets all devices, including all disks and network cards, so
       is responsible for multiplexing guest I/O.
+
+   system domain
+     Abstractions within Xen that are modelled in a similar way to regular
+     :term:`domains<domain>`.  E.g. When there's no work to do, Xen schedules
+     ``DOMID_IDLE`` to put the CPU into a lower power state.
+
+     System domains have :term:`domids<domid>` and are referenced by
+     privileged software for certain control operations, but they do not run
+     guest code.
diff --git a/docs/hypervisor-guide/domid-lifecycle.rst b/docs/hypervisor-guide/domid-lifecycle.rst
new file mode 100644
index 000000000000..d405a321f3c7
--- /dev/null
+++ b/docs/hypervisor-guide/domid-lifecycle.rst
@@ -0,0 +1,164 @@
+.. SPDX-License-Identifier: CC-BY-4.0
+
+Lifecycle of a domid
+====================
+
+Overview
+--------
+
+A :term:`domid` is Xen's numeric identifier for a :term:`domain`.  In any
+operational Xen system, there are one or more domains running.
+
+Domids are 16-bit integers.  Regular domids start from 0, but there are some
+special identifiers, e.g. ``DOMID_SELF``, and :term:`system domains<system
+domain>`, e.g. ``DOMID_IDLE`` starting from 0x7ff0.  Therefore, a Xen system
+can run a maximum of 32k domains concurrently.
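For readers who want to cross-reference: the reserved identifiers come from
xen/include/public/xen.h.  From memory (the header is authoritative) they look
roughly like:

    #define DOMID_FIRST_RESERVED (0x7FF0U) /* >= this is not an ordinary domain */
    #define DOMID_SELF           (0x7FF0U) /* "the calling domain" in hypercalls */
    #define DOMID_IDLE           (0x7FFFU) /* the idle system domain */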
+
+.. note::
+
+   Despite being exposed in the domid ABI, the system domains are internal to
+   Xen and do not have lifecycles like regular domains.  Therefore, they are
+   not discussed further in this document.
+
+At system boot, Xen will construct one or more domains.  Kernels and
+configuration for these domains must be provided by the bootloader, or at
+Xen's compile time for more highly integrated solutions.
+
+Correct functioning of the domain lifecycle involves ``xenstored``, and some
+privileged entity which has bound the ``VIRQ_DOM_EXC`` global event channel.
+
+.. note::
+
+   While not a strict requirement for these to be the same entity, it is
+   ``xenstored`` which typically has ``VIRQ_DOM_EXC`` bound.  This document is
+   written assuming the common case.
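As a minimal sketch of what "binding VIRQ_DOM_EXC" means in practice, assuming
libxenevtchn (error handling trimmed, and xenstored's real loop is of course
integrated with its main event loop):

    #include <stdio.h>
    #include <xenevtchn.h>
    #include <xen/xen.h>            /* VIRQ_DOM_EXC */

    int main(void)
    {
        xenevtchn_handle *xce = xenevtchn_open(NULL, 0);
        xenevtchn_port_or_error_t port, p;

        if ( !xce || (port = xenevtchn_bind_virq(xce, VIRQ_DOM_EXC)) < 0 )
            return 1;

        for ( ;; )
        {
            p = xenevtchn_pending(xce);     /* blocks for the next event */
            if ( p != port )
                continue;

            printf("VIRQ_DOM_EXC fired - rescan the domain list here\n");
            xenevtchn_unmask(xce, p);
        }
    }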
+
+Creation
+--------
+
+Within Xen, the ``domain_create()`` function is used to allocate and perform
+bare minimum construction of a domain.  The :term:`control domain` accesses
+this functionality via the ``DOMCTL_createdomain`` hypercall.
+
+The final action that ``domain_create()`` performs before returning
+successfully is to enter the new domain into the domlist.  This makes the
+domain "visible" within Xen, allowing the new domid to be successfully
+referenced by other hypercalls.
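Toolstack-side, that first step is xc_domain_create().  A rough sketch (the
exact signature and the fields of struct xen_domctl_createdomain have changed
across releases, so treat this as illustrative only):

    #include <xenctrl.h>

    /* Ask Xen to allocate a new, minimally-constructed, paused domain. */
    static int create_empty_domain(xc_interface *xch, uint32_t *domid)
    {
        struct xen_domctl_createdomain config = {
            .max_vcpus = 1,
            /* ssidref, flags, arch-specific settings, ... */
        };

        *domid = 0;     /* 0 asks Xen to pick a free domid for us */

        /* On success, *domid holds the identifier of the new domain. */
        return xc_domain_create(xch, domid, &config);
    }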
+
+At this point, the domain exists as far as Xen is concerned, but not usefully
+as a VM yet.  The toolstack performs further construction activities;
+allocating vCPUs, RAM, copying in the initial executable code, etc.  Domains
+are automatically created with one "pause" reference count held, meaning that
+it is not eligible for scheduling.
+
+When the toolstack has finished VM construction, it send an ``XS_INTRODUCE``

s/send/sends/

+command to ``xenstored``.  This instructs ``xenstored`` to connect to the
+guest's xenstore ring, and fire the ``@introduceDomain`` watch.  The firing of
+this watch is the signal to all other components which care that a new VM has
+appeared and is about to start running.

A note should be added that the control domain is introduced implicitly by
xenstored, so no XS_INTRODUCE command is needed and no @introduceDomain watch
is sent for the control domain.

All components interested in the @introduceDomain watch have to find out for
themselves which new domain has appeared, as the watch event doesn't contain
the domid of the new domain.
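To illustrate that last point, a consumer of @introduceDomain ends up doing
something along these lines (sketch, assuming libxenstore; how the "diff
against what we already knew" is done depends on the component):

    #include <stdlib.h>
    #include <xenstore.h>

    static void watch_new_domains(void)
    {
        struct xs_handle *xs = xs_open(0);
        char **event;
        unsigned int num;

        if ( !xs || !xs_watch(xs, "@introduceDomain", "intro") )
            exit(1);

        for ( ;; )
        {
            event = xs_read_watch(xs, &num);    /* blocks for the next event */
            if ( !event )
                continue;

            /* event[XS_WATCH_PATH] is "@introduceDomain" and carries no domid,
             * so rescan (e.g. list /local/domain) and compare against the set
             * of domids seen on the previous scan. */
            free(event);
        }
    }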

+
+When the ``XS_INTRODUCE`` command returns successfully, the final action the
+toolstack performs is to unpause the guest, using the ``DOMCTL_unpausedomain``
+hypercall.  This drops the "pause" reference the domain was originally created
+with, meaning that the vCPU(s) are eligible for scheduling and the domain will
+start executing its first instruction.
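The tail end of construction, from the toolstack's point of view, is roughly
(sketch assuming libxenstore/libxenctrl; store_mfn and store_evtchn come out
of the earlier construction steps):

    #include <xenctrl.h>
    #include <xenstore.h>

    static int introduce_and_unpause(xc_interface *xch, struct xs_handle *xs,
                                     uint32_t domid, unsigned long store_mfn,
                                     unsigned int store_evtchn)
    {
        /* XS_INTRODUCE: connect xenstored to the guest's xenstore ring. */
        if ( !xs_introduce_domain(xs, domid, store_mfn, store_evtchn) )
            return -1;

        /* DOMCTL_unpausedomain: drop the creation-time pause reference. */
        return xc_domain_unpause(xch, domid);
    }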
+
+.. note::
+
+   It is common for vCPUs other than 0 to be left in an offline state, to be
+   started by actions within the VM.
+
+Termination
+-----------
+
+The VM runs for a period of time, but eventually stops.  It can stop for a
+number of reasons, including:
+
+ * Directly at the guest kernel's request, via the ``SCHEDOP_shutdown``
+   hypercall.  The hypercall also includes the reason for the shutdown,
+   e.g. ``poweroff``, ``reboot`` or ``crash``.
+
+ * Indirectly from certain states.  E.g. executing a ``HLT`` instruction with
+   interrupts disabled is interpreted as a shutdown request as it is a common
+   code pattern for fatal error handling when no better options are available.
+
+ * Indirectly from fatal exceptions.  In some states, execution is unable to
+   continue, e.g. Triple Fault on x86.
+
+ * Directly from the device model, via the ``DMOP_remote_shutdown`` hypercall.
+   E.g. On x86, the 0xcf9 IO port is commonly used to perform platform
+   poweroff, reset or sleep transitions.
+
+ * Directly from the toolstack.  The toolstack is capable of initiating
+   cleanup directly, e.g. ``xl destroy``.  This is typically an administration
+   action of last resort to clean up a domain which has malfunctioned but not
+   terminated properly.
+
+ * Directly from Xen.  Some error handling ends up using ``domain_crash()``
+   when Xen doesn't think it can safely continue running the VM.
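For the SCHEDOP_shutdown case in the first bullet, the guest side is a single
hypercall once the kernel has a wrapper for it.  In Linux-style notation
(sketch; the HYPERVISOR_sched_op wrapper is per-OS):

    #include <xen/interface/sched.h>    /* struct sched_shutdown, SHUTDOWN_* */

    static void guest_poweroff(void)
    {
        struct sched_shutdown arg = { .reason = SHUTDOWN_poweroff };

        HYPERVISOR_sched_op(SCHEDOP_shutdown, &arg);
        /* Not expected to return. */
    }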
+
+Whatever the reason for termination, Xen ends up calling ``domain_shutdown()``
+to set the shutdown reason and deschedule all vCPUs.  Xen also fires the
+``VIRQ_DOM_EXC`` event channel, which is a signal to ``xenstored``.
+
+Upon receiving ``VIRQ_DOM_EXC``, ``xenstored`` re-scans all domains using the
+``SYSCTL_getdomaininfolist`` hypercall.  If any domain has changed state from
+running to shut down, ``xenstored`` fires the ``@releaseDomain`` watch.  The
+firing of this watch is the signal to all other components which care that a
+VM has stopped.

The same as above applies: all components receiving the @releaseDomain watch
event have to find out for themselves which domain has stopped.
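That rescan boils down to something like the following (sketch using
libxenctrl; xenstored additionally remembers the state seen on the previous
scan so that it only reacts to transitions):

    #include <xenctrl.h>

    #define MAX_DOMS 1024

    static void scan_for_shutdown(xc_interface *xch)
    {
        xc_domaininfo_t info[MAX_DOMS];
        int i, n = xc_domain_getinfolist(xch, 0, MAX_DOMS, info);

        for ( i = 0; i < n; i++ )
        {
            if ( info[i].flags & XEN_DOMINF_shutdown )
            {
                /* Domain info[i].domain has shut down.  If that is new since
                 * the last scan, fire the @releaseDomain watch. */
            }
        }
    }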

+
+.. note::
+
+   Xen does not treat reboot differently to poweroff; both statuses are
+   forwarded to the toolstack.  It is up to the toolstack to restart the VM,
+   which is typically done by constructing a new domain.
+
+.. note::
+
+   Some shutdowns may not result in the cleanup of a domain.  ``suspend`` for
+   example can be used for snapshotting, and the VM resumes execution in the
+   same domain/domid.  Therefore, a domain can cycle several times between
+   running and "shut down" before moving into the destruction phase.
+
+Destruction
+-----------
+
+The domain object in Xen is reference counted, and survives until all
+references are dropped.
+
+The ``@releaseDomain`` watch is to inform all entities that hold a reference
+on the domain to clean up.  This may include:
+
+ * Paravirtual driver backends having a grant map of the shared ring with the
+   frontend.
+ * A device model with a map of the IOREQ page(s).
+
+The toolstack also has work to do in response to ``@releaseDomain``.  It must
+issue the ``DOMCTL_destroydomain`` hypercall.  This hypercall can take minutes
+of wall-clock time to complete for large domains as, amongst other things, it
+is freeing the domain's RAM back to the system.
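The toolstack side of that is essentially the following (sketch; libxl wraps
it in rather more machinery, e.g. for device teardown ordering):

    #include <xenctrl.h>

    /* DOMCTL_destroydomain: release the domain's resources.  A single call
     * can take minutes of wall-clock time for large domains, as Xen hands
     * the RAM back to the heap. */
    static int destroy_domain(xc_interface *xch, uint32_t domid)
    {
        return xc_domain_destroy(xch, domid);
    }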
+
+The actions triggered by the ``@releaseDomain`` watch are asynchronous.  There
+is no guarantee as to the order in which actions start, or which action is the
+final one to complete.  However, the toolstack can achieve some ordering by
+delaying the ``DOMCTL_destroydomain`` hypercall if necessary.
+
+Freeing
+-------
+
+When the final reference on the domain object is dropped, Xen will remove the
+domain from the domlist.  This means the domid is no longer visible in Xen,
+and no longer able to be referenced by other hypercalls.
+
+Xen then schedules the object for deletion at some point after any concurrent
+hypercalls referencing the domain have completed.
+
+When the object is finally cleaned up, Xen fires the ``VIRQ_DOM_EXC`` event
+channel again, causing ``xenstored`` to rescan an notice that the domain has

s/an/and/

+ceased to exist.  It fires the ``@releaseDomain`` watch a second time to
+signal to any components which care that the domain has gone away.
+
+E.g. The second ``@releaseDomain`` is commonly used by paravirtual driver
+backends to shut themselves down.

There is no guarantee that @releaseDomain will always be fired twice for a
domain ceasing to exist, and multiple domains disappearing might result in
only one @releaseDomain watch being fired. This means that any component
receiving this watch event has to find out not only the domid(s) of the
domains changing state, but also whether those domains have merely shut down
or are completely gone.
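One way a backend might make that distinction when its @releaseDomain handler
runs (sketch; checking xenstored's view via xs_is_domain_introduced() is one
option, comparing against SYSCTL_getdomaininfolist is another):

    #include <xenstore.h>

    /* Called on @releaseDomain for a frontend domain this backend serves. */
    static void check_frontend(struct xs_handle *xs, unsigned int domid)
    {
        if ( !xs_is_domain_introduced(xs, domid) )
        {
            /* The domain is gone as far as xenstored is concerned: drop
             * grant maps, event channels, etc. so Xen can free the domain
             * and the domid can eventually be reused. */
        }
        else
        {
            /* The domain may merely have shut down (or the watch was for a
             * different domain) - keep the references for now. */
        }
    }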

+
+At this point, the toolstack can reuse the domid for a new domain.
diff --git a/docs/hypervisor-guide/index.rst b/docs/hypervisor-guide/index.rst
index e4393b06975b..af88bcef8313 100644
--- a/docs/hypervisor-guide/index.rst
+++ b/docs/hypervisor-guide/index.rst
@@ -6,6 +6,7 @@ Hypervisor documentation
  .. toctree::
     :maxdepth: 2
+   domid-lifecycle
    code-coverage
    x86/index

base-commit: dc9d9aa62ddeb14abd5672690d30789829f58f7e
prerequisite-patch-id: 832bdc9a23500d426b4fe11237ae7f6614f2369c


Juergen
