[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [PATCH v2 1/5] xen/common: introduce a new framework for save/restore of 'domain' context



> -----Original Message-----
> From: Julien Grall <julien@xxxxxxx>
> Sent: 20 April 2020 18:21
> To: Paul Durrant <paul@xxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: Paul Durrant <pdurrant@xxxxxxxxxx>; Andrew Cooper 
> <andrew.cooper3@xxxxxxxxxx>; George Dunlap
> <george.dunlap@xxxxxxxxxx>; Ian Jackson <ian.jackson@xxxxxxxxxxxxx>; Jan 
> Beulich <jbeulich@xxxxxxxx>;
> Stefano Stabellini <sstabellini@xxxxxxxxxx>; Wei Liu <wl@xxxxxxx>; Volodymyr 
> Babchuk
> <Volodymyr_Babchuk@xxxxxxxx>; Roger Pau Monné <roger.pau@xxxxxxxxxx>
> Subject: Re: [PATCH v2 1/5] xen/common: introduce a new framework for 
> save/restore of 'domain' context
> 
> Hi Paul,
> 
> On 07/04/2020 18:38, Paul Durrant wrote:
> > To allow enlightened HVM guests (i.e. those that have PV drivers) to be
> > migrated without their co-operation it will be necessary to transfer 'PV'
> > state such as event channel state, grant entry state, etc.
> >
> > Currently there is a framework (entered via the hvm_save/load() functions)
> > that allows a domain's 'HVM' (architectural) state to be transferred but
> > 'PV' state is also common with pure PV guests and so this framework is not
> > really suitable.
> >
> > This patch adds the new public header and low level implementation of a new
> > common framework, entered via the domain_save/load() functions. Subsequent
> > patches will introduce other parts of the framework, and code that will
> > make use of it within the current version of the libxc migration stream.
> >
> > This patch also marks the HVM-only framework as deprecated in favour of the
> > new framework.
> >
> > Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > Cc: George Dunlap <george.dunlap@xxxxxxxxxx>
> > Cc: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
> > Cc: Jan Beulich <jbeulich@xxxxxxxx>
> > Cc: Julien Grall <julien@xxxxxxx>
> > Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>
> > Cc: Wei Liu <wl@xxxxxxx>
> > Cc: Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>
> > Cc: "Roger Pau Monné" <roger.pau@xxxxxxxxxx>
> >
> > v2:
> >   - Allow multi-stage save/load to avoid the need to double-buffer
> >   - Get rid of the masks and add an 'ignore' flag instead
> >   - Create copy function union to preserve const save buffer
> >   - Deprecate HVM-only framework
> > ---
> >   xen/common/Makefile                    |   1 +
> >   xen/common/save.c                      | 329 +++++++++++++++++++++++++
> >   xen/include/public/arch-arm/hvm/save.h |   5 +
> >   xen/include/public/arch-x86/hvm/save.h |   5 +
> >   xen/include/public/save.h              |  84 +++++++
> >   xen/include/xen/save.h                 | 152 ++++++++++++
> >   6 files changed, 576 insertions(+)
> >   create mode 100644 xen/common/save.c
> >   create mode 100644 xen/include/public/save.h
> >   create mode 100644 xen/include/xen/save.h
> >
> > diff --git a/xen/common/Makefile b/xen/common/Makefile
> > index e8cde65370..90553ba5d7 100644
> > --- a/xen/common/Makefile
> > +++ b/xen/common/Makefile
> > @@ -37,6 +37,7 @@ obj-y += radix-tree.o
> >   obj-y += rbtree.o
> >   obj-y += rcupdate.o
> >   obj-y += rwlock.o
> > +obj-y += save.o
> >   obj-y += shutdown.o
> >   obj-y += softirq.o
> >   obj-y += sort.o
> > diff --git a/xen/common/save.c b/xen/common/save.c
> > new file mode 100644
> > index 0000000000..6cdac3785b
> > --- /dev/null
> > +++ b/xen/common/save.c
> > @@ -0,0 +1,329 @@
> > +/*
> > + * save.c: Save and restore PV guest state common to all domain types.
> > + *
> > + * Copyright Amazon.com Inc. or its affiliates.
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License 
> > for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along 
> > with
> > + * this program; If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <xen/save.h>
> > +
> > +union domain_copy_entry {
> > +    domain_write_entry write;
> > +    domain_read_entry read;
> > +};
> > +
> > +struct domain_context {
> > +    bool log;
> > +    struct domain_save_descriptor desc;
> > +    size_t data_len;
> 
> What is data_len?
> 

It’s used for internal accounting.

> > +    union domain_copy_entry copy;
> > +    void *priv;
> > +};
> > +
> > +static struct {
> > +    const char *name;
> > +    bool per_vcpu;
> > +    domain_save_handler save;
> > +    domain_load_handler load;
> > +} handlers[DOMAIN_SAVE_CODE_MAX + 1];
> > +
> > +void __init domain_register_save_type(unsigned int tc, const char *name,
> > +                                      bool per_vcpu,
> > +                                      domain_save_handler save,
> > +                                      domain_load_handler load)
> > +{
> > +    BUG_ON(tc > ARRAY_SIZE(handlers));
> > +
> > +    ASSERT(!handlers[tc].save);
> > +    ASSERT(!handlers[tc].load);
> > +
> > +    handlers[tc].name = name;
> > +    handlers[tc].per_vcpu = per_vcpu;
> > +    handlers[tc].save = save;
> > +    handlers[tc].load = load;
> > +}
> > +
> > +int domain_save_begin(struct domain_context *c, unsigned int tc,
> > +                      const char *name, const struct vcpu *v, size_t len)
> > +{
> > +    int rc;
> > +
> > +    if ( c->log )
> > +        gdprintk(XENLOG_INFO, "%pv save: %s (%lu)\n", v, name,
> > +                 (unsigned long)len);
> 
> How about using %zu rather than a cast here?
> 

Yes, that would be better.

> > +
> > +    BUG_ON(tc != c->desc.typecode);
> > +    BUG_ON(v->vcpu_id != c->desc.vcpu_id);
> 
> I can't find any answer on my question about this part. I understand the
> code would be buggy if this happen, but is it warrant to crash the host?
> Couldn't you just return an error and continue to run?
> 

Since it means buggy code I could ASSERT and error out, yes.

> > +
> > +    ASSERT(!c->data_len);
> > +    c->data_len = c->desc.length = len;
> > +
> > +    rc = c->copy.write(c->priv, &c->desc, sizeof(c->desc));
> > +    if ( rc )
> > +        return rc;
> > +
> > +    c->desc.length = 0;
> 
> This is confusing, why would you need to reset c->desc.length but not
> the rest of the fields?
> 

Because I use it to accumulate the length of the saved data and then cross 
check it against data_len in domain_save_end() below.

> > +
> > +    return 0;
> > +}
> > +
> > +int domain_save_data(struct domain_context *c, const void *src, size_t len)
> > +{
> > +    if ( c->desc.length + len > c->data_len )
> > +        return -ENOSPC;
> > +
> > +    c->desc.length += len;
> > +
> > +    return c->copy.write(c->priv, src, len);
> > +}
> > +
> > +int domain_save_end(struct domain_context *c)
> > +{
> > +    /*
> > +     * If desc.length does not match the length specified in
> > +     * domain_save_begin(), there should have been more data.
> > +     */
> > +    if ( c->desc.length != c->data_len )
> 
> This suggests you know in advance the size of the record.

That depends on what you mean by 'in advance'. I'd expect the caller of 
domain_save_begin() to know exactly.

> I have seen
> some cases where we don't know the size in advance. A good example if
> when saving grants. You know the maximum of grant used by the guest but
> you don't know yet the number of grants used. You might need to walk the
> "list" twice or allocate a temporary buffer. None of them are ideal.
> 
> Another example is when saving memory, we may want to compact page
> informations to save space.
> 
> This problem is going to be more relevant for LiveUpdate where we need
> to be able to create the stream in a very short amount of time.
> Allocating any temporary buffer and/or walking the list twice is going
> to kill the performance.
> 
> I would suggest to consider a different approach where you update the
> record length at the end.
> 
> FWIW, this below the current approach for the LU stream (IIRC David sent
> a prototype recently). A record is opened using lu_stream_open_record(),
> you then have two way to add data:
>      - lu_stream_append() -> This takes a buffer and write to the stream.
>      - lu_stream_reserve() -> Pre-allocate space in the stream and
> return a pointer to the beginning of the reserved region.
>      - lu_stream_end_reservation() -> Takes the actual size in parameter
> and update the stream size.
>      - lu_stream_close_record() -> Update the header with the total
> length and update the position in the stream.
> 

That sounds quite LU specific. For LM we still need to know up-front the 
maximal size of the buffer, and I was trying to work on the basis of never 
having to update previously saved data but I guess there's no actual harm in 
doing so, so we could avoid domain_save_begin() specifying the length.

> > +        return -EIO;
> 
> I noticed that all the pasding check have been dropped. I still think we
> need implicit padding to harden the code.
> 

I'd still view that as buggy code.

> > +
> > +    c->data_len = 0;
> > +
> > +    return 0;
> > +}
> > +
> > +int domain_save(struct domain *d, domain_write_entry write, void *priv,
> > +                bool dry_run)
> > +{
> > +    struct domain_context c = {
> > +        .copy.write = write,
> > +        .priv = priv,
> > +        .log = !dry_run,
> > +    };
> > +    struct domain_save_header h = {
> > +        .magic = DOMAIN_SAVE_MAGIC,
> > +        .version = DOMAIN_SAVE_VERSION,
> > +    };
> > +    struct domain_save_header e;
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    ASSERT(d != current->domain);
> > +
> > +    if ( d->is_dying )
> 
> I can't find an answer to my question about d->is_dying. What if the
> guest die afterwards? What does protect us against domain_kill()?
> 
> [...]

As I said in a previous response, nothing protects against domain_kill(), this 
check is just supposed to avoid doing 'unnecessary' work for a domain we know 
is already dying. For LU though I guess we do need to save (some) state for 
even a dying domain, so the individual save handlers actually need to make the 
decision on whether they are going to do anything.

> 
> > +int domain_load(struct domain *d, domain_read_entry read, void *priv)
> > +{
> > +    struct domain_context c = {
> > +        .copy.read = read,
> > +        .priv = priv,
> > +        .log = true,
> > +    };
> > +    struct domain_save_header h;
> > +    int rc;
> > +
> > +    ASSERT(d != current->domain);
> > +
> > +    if ( d->is_dying )
> > +        return -EINVAL;
> 
> Same here.
> 
> 
> > diff --git a/xen/include/public/save.h b/xen/include/public/save.h
> > new file mode 100644
> > index 0000000000..7e5f8752bd
> > --- /dev/null
> > +++ b/xen/include/public/save.h
> > @@ -0,0 +1,84 @@
> > +/*
> > + * save.h
> > + *
> > + * Structure definitions for common PV/HVM domain state that is held by
> > + * Xen and must be saved along with the domain's memory.
> > + *
> > + * Copyright Amazon.com Inc. or its affiliates.
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a 
> > copy
> > + * of this software and associated documentation files (the "Software"), to
> > + * deal in the Software without restriction, including without limitation 
> > the
> > + * rights to use, copy, modify, merge, publish, distribute, sublicense, 
> > and/or
> > + * sell copies of the Software, and to permit persons to whom the Software 
> > is
> > + * furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice shall be included 
> > in
> > + * all copies or substantial portions of the Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 
> > OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL 
> > THE
> > + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> > + * DEALINGS IN THE SOFTWARE.
> > + */
> > +
> > +#ifndef __XEN_PUBLIC_SAVE_H__
> > +#define __XEN_PUBLIC_SAVE_H__
> 
> Does this header need to be exposed outside of Xen and the tools?
> 

Probably not.

> > +
> > +#include "xen.h"
> > +
> > +/* Each entry is preceded by a descriptor */
> > +struct domain_save_descriptor {
> > +    uint16_t typecode;
> > +    /*
> > +     * Each entry will contain either to global or per-vcpu domain state.
> > +     * Entries relating to global state should have zero in this field.
> > +     */
> > +    uint16_t vcpu_id;
> > +    uint32_t flags;
> > +    /*
> > +     * When restoring state this flag can be set in a descriptor to cause
> > +     * its content to be ignored.
> 
> Could you give examples where you would want to ignore a descriptor?
> 

I was thinking of the case when, e.g. you want to update something in the 
shared_info... You save a context blob, modify the shared_info record, and then 
restore the context blob with all other records marked as 'ignore' since they 
have not been modified.

> > +     *
> > +     * NOTE: It is invalid to set this flag for HEADER or END records (see
> > +     *       below).
> > +     */
> > +#define _DOMAIN_SAVE_FLAG_IGNORE 0
> > +#define DOMAIN_SAVE_FLAG_IGNORE (1u << _DOMAIN_SAVE_FLAG_IGNORE)
> > +
> > +    /* Entry length not including this descriptor */
> > +    uint64_t length;
> > +};
> > +
> > +/*
> > + * Each entry has a type associated with it. DECLARE_DOMAIN_SAVE_TYPE
> > + * binds these things together.
> > + */
> > +#define DECLARE_DOMAIN_SAVE_TYPE(_x, _code, _type) \
> > +    struct __DOMAIN_SAVE_TYPE_##_x { char c[_code]; _type t; };
> > +
> > +#define DOMAIN_SAVE_CODE(_x) \
> > +    (sizeof(((struct __DOMAIN_SAVE_TYPE_##_x *)(0))->c))
> > +#define DOMAIN_SAVE_TYPE(_x) \
> > +    typeof(((struct __DOMAIN_SAVE_TYPE_##_x *)(0))->t)
> > +
> > +/* Terminating entry */
> > +struct domain_save_end {};
> > +DECLARE_DOMAIN_SAVE_TYPE(END, 0, struct domain_save_end);
> > +
> > +#define DOMAIN_SAVE_MAGIC   0x53415645
> > +#define DOMAIN_SAVE_VERSION 0x00000001
> > +
> > +/* Initial entry */
> > +struct domain_save_header {
> > +    uint32_t magic;             /* Must be DOMAIN_SAVE_MAGIC */
> > +    uint32_t version;           /* Save format version */
> 
> 
> I haven't seen any answer about xen version here. For the record:
> 
> "Let's imagine in 4.14 we introduced a bug in the saving part. This is
> solved in 4.15 but somehow the version is not bumped. How would you
> differentiate the streams saved by Xen 4.14 so you can still migrate?

I'd assume testing would at least save and restore on 4.14. If we then fixed a 
bug then why would we not bump the version?

> 
> If you record the version of Xen in the record automatically, then you
> at least have a way to differentiate the two versions."
> 

OK, I guess redundant version information is not going to be harmful.

  Paul




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.