
Re: [Xen-devel] Design session report: Live-Updating Xen


  • To: "Foerster, Leonard" <foersleo@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Wed, 17 Jul 2019 00:51:51 +0100
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15/07/2019 19:57, Foerster, Leonard wrote:
> Here is the summary/notes from the Xen Live-Update Design session last week.
> I tried to tie together the different topics we talked about into some 
> sections.
>
> https://cryptpad.fr/pad/#/2/pad/edit/fCwXg1GmSXXG8bc4ridHAsnR/
>
> --
> Leonard
>
> LIVE UPDATING XEN - DESIGN SESSION
>
> Brief project overview:
>       -> We want to build Xen Live-update
>       -> early prototyping phase
>       IDEA: change running hypervisor to new one without guest disruptions
>       -> Reasons:
>               * Security - we might need an updated version for 
> vulnerability mitigation

I know I'm going to regret saying this, but livepatches are probably a
better bet in most cases for targeted security fixes.

>               * Development cycle acceleration - fast switch to hypervisor 
> during development
>               * Maintainability - reduce version diversity in the fleet

:) I don't expect you to admit anything concrete on xen-devel, but I do
hope the divergence is at least a little better under control than the
last time I got an answer to this question.

>       -> We are currently eyeing a combination of guest transparent live 
> migration
>               and kexec into a new xen build
>       -> For more details: 
> https://xensummit19.sched.com/event/PFVQ/live-updating-xen-amit-shah-david-woodhouse-amazon
>
> Terminology:
>       Running Xen -> The xen running on the host before update (Source)
>       Target Xen -> The xen we are updating *to*
>
> Design discussions:
>
> Live-update ties into multiple other projects currently done in the 
> Xen-project:
>
>       * Secret free Xen: reduce the footprint of guest relevant data in Xen
>               -> less state we might have to handle in the live update case

I don't immediately see how this is related.  Secret-free Xen is to do
with having fewer things mapped by default.  It doesn't fundamentally
change the data that Xen needs to hold about guests, nor how this gets
arranged in memory.

>       * dom0less: bootstrap domains without the involvement of dom0
>               -> this might come in handy to at least setup and continue dom0 
> on target xen
>               -> If we have this, it might also enable us to de-serialize 
> the state for
>                       other guest-domains in xen and not have to wait for 
> dom0 to do this

Reconstruction of dom0 is something which Xen will definitely need to
do.  With the memory still in place, it's just a fairly small amount of
register state which needs restoring.

That said, reconstruction of the typerefs will be an issue.  Walking
over a fully populated L4 tree can (in theory) take minutes, and it's
not safe to just start executing without reconstruction.

Depending on how bad it is in practice, one option might be to do a
demand validate of %rip and %rsp, along with a hybrid shadow mode which
turns faults into typerefs, which would allow the gross cost of
revalidation to be amortised while the vcpus were executing.  We would
definitely want some kind of logic to aggressively typeref outstanding
pagetables so the shadow mode could be turned off.
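
To make that concrete, the fault path might look something like the
below (rough sketch only; apart from get_page_type(), every helper is
invented rather than a real Xen interface):

    static int hybrid_shadow_fault(struct vcpu *v, unsigned long gla)
    {
        unsigned int level;
        mfn_t mfn = walk_to_pagetable(v, gla, &level);  /* invented */

        if ( !mfn_fully_validated(mfn) )                /* invented */
        {
            /* Turn the fault into a typeref: validate just this one
             * table, with the PGT_l*_page_table type for its level. */
            if ( !get_page_type(mfn_to_page(mfn), level_to_pgt(level)) )
                return -EINVAL;
            mark_validated(mfn);                        /* invented */
        }

        /* A background worker aggressively typerefs the remaining
         * pagetables, so the shadow mode can eventually be turned off. */
        kick_revalidation_worker(v->domain);

        return 0;
    }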

> We want to just keep domain and hardware state
>       -> Xen itself is supposed to be completely exchanged
>       -> We have to keep around the IOMMU page tables and do not touch them
>               -> this might also come in handy for some newer UEFI boot 
> related issues?

This is for Pre-DXE DMA protection, which IIRC is part of the UEFI 2.7
spec.  It basically means that the IOMMU is set up and inhibiting DMA
before any firmware starts using RAM.

In both cases, it involves Xen's IOMMU driver being capable of
initialising with the IOMMU already active, and in a way which keeps DMA
and interrupt remapping safe.

This is a chunk of work which should probably be split out into an
independent prerequisite.
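
Concretely, the init path would want to distinguish the two cases,
something like this (sketch; all helper names invented):

    static int __init iommu_init_one(struct vtd_unit *drhd)
    {
        if ( iommu_already_enabled(drhd) )
        {
            /* Inherit the root table left by the previous Xen (or by
             * firmware), rather than clobbering it, so DMA and
             * interrupt remapping stay safe throughout. */
            drhd->root_maddr = read_root_table_addr(drhd);
            return validate_inherited_tables(drhd);
        }

        /* Cold boot: the usual path. */
        drhd->root_maddr = alloc_root_table();
        program_root_table(drhd);
        enable_translation(drhd);

        return 0;
    }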

>               -> We might have to go and re-inject certain interrupts

What hardware are you targeting here?  IvyBridge and later has a posted
interrupt descriptor which can accumulate pending interrupts (at least
manually), and newer versions (Broadwell?) can accumulate interrupts
directly from hardware.
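
For the re-injection itself, I'd expect something along these lines
(sketch; field and helper names are illustrative, not Xen's exact API):

    static void reinject_pending_vector(struct vcpu *v, uint8_t vector)
    {
        struct pi_desc *pi = vcpu_pi_desc(v);   /* invented accessor */

        /* Accumulate the vector in the Posted Interrupt Request field. */
        set_bit(vector, pi->pir);

        if ( !test_and_set_bit(PI_ON_BIT, &pi->control) )
            /* The notification makes the vector visible to the guest
             * at the next VM entry. */
            send_posted_intr_notification(v);
    }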

>       -> do we need to dis-aggregate xenheap and domheap here?
>               -> We are currently trying to avoid this

I don't think this will be necessary, or indeed a useful thing to
consider.  There should be an absolute minimal amount of dependency
between the two versions of Xen, to allow for the maximum flexibility in
upgrade scenarios.

>
> A key cornerstone for Live-update is guest transparent live migration
>       -> This means we are using a well defined ABI for saving/restoring 
> domain state
>               -> We rely only on domain state and not on internal xen state

Absolutely.  One issue I discussed with David a while ago is that even
across an upgrade of Xen, the format of the EPT/NPT pagetables might
change, at least in terms of the layout of software bits.  (Especially
for EPT where we slowly lose software bits to new hardware features we
wish to use.)
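
That argues for an explicit ABI version on the save records, with the
destination translating entries rather than assuming bit-compatibility.
As a sketch (names and bit positions invented purely for illustration):

    static uint64_t translate_ept_entry(uint64_t src, unsigned int src_abi)
    {
        uint64_t dst = src & EPT_HW_BITS_MASK;  /* hardware bits are stable */

        switch ( src_abi )
        {
        case LU_ABI_V1:
            /* Suppose v1 kept its recalc flag in bit 52, and the
             * current layout has moved it to bit 58 to free bit 52
             * for a new hardware feature. */
            if ( src & (1ULL << 52) )
                dst |= 1ULL << 58;
            break;

        default:
            ASSERT_UNREACHABLE();
        }

        return dst;
    }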

>       -> The idea is to migrate the guest not from one machine to another (in 
> space)
>               but on the same machine from one hypervisor to another (in time)
>       -> In addition we want to keep as much as possible in memory unchanged 
> and feed
>               this back to the target domain in order to save time
>       -> This means we will need additional info on those memory areas and 
> have to
>               be super careful not to stomp over them while starting the 
> target xen
>       -> for live migration: domid is a problem in this case
>               -> randomize and pray does not work on smaller fleets
>               -> this is not a problem for live-update
>               -> BUT: as a community we should make this restriction go away
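
On the "additional info on those memory areas" point above: presumably
that info ends up as a handoff structure which the target xen parses
before it touches the heap.  A sketch, with the layout entirely
invented:

    struct lu_mem_range {
        uint64_t start;     /* physical start address */
        uint64_t nr_pages;
        uint16_t domid;     /* owning domain */
        uint16_t type;      /* plain RAM, pagetable, IOMMU table, ... */
        uint32_t flags;
    };

    struct lu_mem_info {
        uint32_t version;
        uint32_t nr_ranges;
        /* The target must not allocate from any range listed here. */
        struct lu_mem_range ranges[];
    };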
>
> Exchanging the Hypervisor using kexec
>       -> We have patches on upstream kexec-tools merged that enable 
> multiboot2 for Xen
>       -> We can now load the target xen binary to the crashdump region to not 
> stomp
>               over any valuable data we might need later
>       -> But using the crashdump region for this has drawbacks when it comes 
> to debugging
>               and we might want to think about this later
>               -> What happens when live-update goes wrong?
>               -> Option: Increase Crashdump region size and partition it or 
> have a separate
>                       reserved live-update region to load the target xen into 
>               -> Separate region or partitioned region is not a priority for 
> V1 but should
>                       be on the road map for future versions

In terms of things needing physical contiguity, there is the Xen image
itself (a few MB) and various driver data structures (the IOMMU
interrupt remapping tables in particular, although in practice we can
probably scale their size by the number of vectors behind them, rather
than always making an order 7 (or 8?) allocation to cover all 64k
possible handles).  I think some of the directmap setup also expects to
be able to find free 2M superpages.
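
For the interrupt remapping table specifically, sizing by the vectors
actually in use rather than the architectural maximum might look like
this (get_order_from_bytes() and alloc_xenheap_pages() are real;
nr_vectors_in_use() is invented):

    /* Never less than 256 entries, but far short of the 64k maximum. */
    unsigned int nr_irtes = max(nr_vectors_in_use(), 256u);
    unsigned int order =
        get_order_from_bytes(nr_irtes * sizeof(struct iremap_entry));

    /* Typically a low order, instead of order 8 for a full table. */
    struct iremap_entry *irt = alloc_xenheap_pages(order, 0);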

>
> Who serializes and deserializes domain state?
>       -> dom0: This should work fine, but who does this for dom0 itself?
>       -> Xen: This will need some more work, but might be covered mostly by the 
> dom0less effort on the arm side
>               -> this will need some work for x86, but Stefano does not 
> consider this a lot of work
>       -> This would mean: serialize domain state into multiboot module and 
> set domains
>               up after kexecing xen in the dom0less manner
>               -> make multiboot module general enough so we can tag it as 
> boot/resume/create/etc.
>                       -> this will also enable us to do per-guest feature 
> enablement

What is the intent here?

>                       -> finer granular than specifying on cmdline
>                       -> cmdline stuff is mostly broken, needs to be fixed 
> for nested either way
>                       -> domain create flags is a mess

There is going to have to be some kind of translation from old state to
new settings.  In the past, lots of Xen was based on global settings,
and this is slowly being fixed into concrete per-domain settings.
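
Which suggests the multiboot module wants to be a tagged, versioned
stream which a newer Xen can translate record by record.  As a sketch
(format entirely invented):

    struct lu_module_hdr {
        uint32_t magic;      /* identifies a live update module */
        uint32_t version;    /* bumped on any layout change */
        uint32_t tag;        /* LU_TAG_BOOT / LU_TAG_RESUME / LU_TAG_CREATE */
        uint32_t nr_records;
    };

    struct lu_domain_rec {
        uint32_t len;        /* payload length, so unknown record types
                              * can be skipped rather than rejected */
        uint16_t type;       /* vcpu state, p2m info, event channels, ... */
        uint16_t domid;
        uint8_t  payload[];
    };

This is also where per-guest feature enablement would live: a record
per domain, rather than global command line settings.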

>
> Live update instead of crashdump?
>       -> Can we use such capabilities to recover from a crash by "restarting" 
> xen on a crash?
>               -> live updating into (the same) xen on crash
>       -> crashing is a good mechanism because it happens if something is 
> really broken and
>               most likely not recoverable
>       -> Live update should be a conscious process and not something you do 
> as reaction to a crash
>               -> something is really broken if we crash
>               -> we should not proactively restart xen on crash
>                       -> we might run into crash loops
>       -> maybe this can be done in the future, but it is not changing 
> anything for the design
>               -> if anybody wants to wire this up once live update is there, 
> that should not be too hard
>               -> then you want to think about: scattering the domains to 
> multiple other hosts to not keep
>                       them on broken machines
>
> We should use this opportunity to clean up certain parts of the code base:
>       -> interface for domain information is a mess
>               -> HVM and PV have some shared data but completely different 
> ways of accessing it
>
> Volume of patches:
>       -> Live update: still developing, we do not know yet
>       -> guest transparent live migration:
>               -> We have roughly 100 patches over time
>               -> we believe most of this just has to be cleaned up/squashed and
>                       will land us at a much lower, more reasonable number
>               -> this also needs 2-3 dom0 kernel patches
>
> Summary of action items:
>       -> coordinate with dom0less effort on what we can use and contribute 
> there
>       -> fix the domid clash problem
>       -> Decision on usage of crash kernel area
>       -> fix live migration patch set to include yet unsupported backends
>               -> clean up the patch set
>               -> upstream it
>
> Longer term vision:
>
> * Have a tiny hypervisor between Guest and Xen that handles the common cases
>       -> this enables (almost) zero downtime for the guest
>       -> the tiny hypervisor will maintain the guest while the underlying xen 
> is kexecing into new build
>
> * Somebody someday will want to get rid of the long tail of old xen versions 
> in a fleet
>       -> live patch old running versions with live update capability?
>       -> crashdumping into a new hypervisor?
>               -> "crazy idea" but this will likely come up at some point

How much do you need to patch an old Xen to have kexec take over
cleanly?  Almost all of the complexity is on the destination side
AFAICT, which is good from a development point of view.

~Andrew
