[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3] RFC: extend the xenstore ring with a 'closing' signal



> -----Original Message-----
> From: David Scott [mailto:dave.scott@xxxxxxxxxx]
> Sent: 03 July 2014 16:15
> To: xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: Paul Durrant; Ian Campbell; john.liuqiming@xxxxxxxxxx;
> yanqiangjun@xxxxxxxxxx; Andrew Cooper; Dave Scott
> Subject: [PATCH v3] RFC: extend the xenstore ring with a 'closing' signal
> 
> Currently hvmloader uses the xenstore ring and then tries to
> reset it back to its initial state. This is not part of the
> ring protocol and, if xenstored reads the ring while it is
> happening, xenstored will conclude it is corrupted. A corrupted
> ring will prevent PV drivers from connecting. This seems to
> be a rare failure.
> 
> Furthermore, when a VM crashes it may jump to a 'crash kernel'
> to create a diagnostic dump. Without the ability to safely
> reset the ring the PV drivers won't be able to reliably
> establish connections, to (for example) stream a memory dump to
> disk.
> 
> This prototype patch contains a simple extension of the
> xenstore ring structure, enough to contain version numbers and
> a 'closing' flag. This memory is currently unused within the
> 4k page and should be zeroed as part of the domain setup
> process. The oxenstored server advertises version 1, and
> hvmloader detects this and sets the 'closing' flag. The server
> then reinitialises the ring, filling it with obviously invalid
> data to help debugging (unfortunately blocks of zeroes are
> valid xenstore packets) and signals hvmloader by the event
> channel that it's safe to boot the guest OS.
> 
> Updates since v2 (thanks to Andy Cooper for the review):
> * hvmloader: use volatile for read of closing flag
> * style improvements
> * remove xenstore version #defines
> 
> Updates since v1 (thanks to Paul Durrant for the review):
> * remove unused client version and associated dead code
> * treat 'closing' as a flag by using "!=0" rather than "==1"
> * hvmloader: move function up and remove forward decl
> * document the existing xenstore ring and the extention in misc/
> 
> Signed-off-by: David Scott <dave.scott@xxxxxxxxxx>
> ---
>  docs/misc/xenstore-ring.txt         |   79
> +++++++++++++++++++++++++++++++++++
>  tools/firmware/hvmloader/xenbus.c   |   39 +++++++++--------
>  tools/ocaml/libs/xb/xb.ml           |   26 +++++++++++-
>  tools/ocaml/libs/xb/xb.mli          |    1 +
>  tools/ocaml/libs/xb/xs_ring.ml      |   11 +++++
>  tools/ocaml/libs/xb/xs_ring_stubs.c |   63
> +++++++++++++++++++++++++++-
>  xen/include/public/io/xs_wire.h     |    5 +++
>  7 files changed, 203 insertions(+), 21 deletions(-)
>  create mode 100644 docs/misc/xenstore-ring.txt
> 
> diff --git a/docs/misc/xenstore-ring.txt b/docs/misc/xenstore-ring.txt
> new file mode 100644
> index 0000000..df2a09f
> --- /dev/null
> +++ b/docs/misc/xenstore-ring.txt
> @@ -0,0 +1,79 @@
> +
> +The domain builder allocates a single page and shares it with the xenstore
> +daemon. This document describes the ring protocol used to communicate
> via
> +this page which is used to transmit and receive
> +[xenstore protocol packets](xenstore.txt).
> +
> +In the original version (we call this "version 0"), the shared page has the
> +following contents (all offsets and lengths are in octets):
> +
> +Offset  Length  Description
> +-----------------------------------------------------------------
> +0       1024    Requests: data sent to the xenstore daemon
> +1024    1024    Replies: data sent to the domain
> +2048    4       Request consumer offset
> +2052    4       Request producer offset
> +2056    4       Response consumer offset
> +2060    4       Response producer offset
> +
> +When the page is allocated it is guaranteed to be full of zeroes.
> +
> +The "Requests" and "Replies" are treated as circular buffers, one for
> +each direction. Each circular buffer is associated wth a producer and
> +consumer offset, which are free-running counters starting from 0. A
> "producer"
> +offset is the offset in the byte stream of the next byte to be written; a
> +"consumer" offset is the offset in the byte stream of the next byte to be
> +read. The byte at offset 'x' in the byte stream will be stored in
> +offset 'x mod 1024' in the circular buffer. "producer - consumer" gives
> +the number of bytes still to be read, and "1024 - (producer - consumer)"
> +therefore gives the amount of space currently available for writing,
> +where we must avoid overwriting unread data.
> +
> +The circular buffers are only used to send sequences of bytes between
> domains.
> +It is the responsibility of the layer above to segment these bytes into
> +packets, as described in [xenstore.txt](xenstore.txt).
> +
> +The client and server domains can run concurrently on different cores and
> +different sockets. We must therefore take care to avoid corruption by:
> +
> +  1. using atomic loads and stores when reading and writing the producer
> +     and consumer offsets. If an offset were to be updated by a non-atomic
> +     store then the other domain may read an invalid offset value.
> +  2. ensuring request and reply data is fully read or written before
> +     updating the offsets by issuing a memory barrier.
> +
> +If we wish to read data but find the circular buffer empty, or wish to write
> +data and find the circular buffer full, we wait on a pre-arranged event
> +channel. When the other party next reads or writes data to the ring, it
> +will notify() this event channel and we will wake.
> +
> +Protocol extension for reconnection
> +-----------------------------------
> +
> +In version 0 of the protocol it is not possible to close and reopen the
> +connection. This means that if the ring is corrupted, it can never be used
> +(safely) again. Extending the protocol to allow reconnection would allow
> +us to:
> +
> +  1. use the ring in the firmware (hvmloader) and safely reset it for use
> +     by the guest
> +  2. re-establish a ring in a 'crash kernel' environment, allowing us to
> +     write crash dumps to PV disks or network devices.
> +
> +In version 1 of the protocol the ring is extended with the following
> +fields:
> +
> +Offset  Length  Description
> +-----------------------------------------------------------------
> +2064    4       Xenstore daemon supported protocol version
> +2068    4       Close request flag
> +
> +In a system supporting only version 0, both these fields are guaranteed
> +to contain 0 by the domain builder.
> +
> +In a system supporting version 1, the xenstore daemon will write "1" into
> +the support protocol version field. The guest xenstore client (eg in

s/support/supported

> +hvmloader) can query the version, and if it is set to "1" it can write
> +"1" to the close request flag and call notify(). The supporting xenstore

Perhaps, rather than 'call notify()' this should be 'signal the store event 
channel'?

> +daemon can reset the ring to an empty state and signal completion by
> +clearing the flag and calling notify() again.

Same here.

> diff --git a/tools/firmware/hvmloader/xenbus.c
> b/tools/firmware/hvmloader/xenbus.c
> index fe72e97..f85832c 100644
> --- a/tools/firmware/hvmloader/xenbus.c
> +++ b/tools/firmware/hvmloader/xenbus.c
> @@ -37,6 +37,19 @@ static struct xenstore_domain_interface *rings; /*
> Shared ring with dom0 */
>  static evtchn_port_t event;                     /* Event-channel to dom0 */
>  static char payload[XENSTORE_PAYLOAD_MAX + 1];  /* Unmarshalling area
> */
> 
> +static void ring_wait(void)
> +{
> +    struct shared_info *shinfo = get_shared_info();
> +    struct sched_poll poll;
> +
> +    memset(&poll, 0, sizeof(poll));
> +    set_xen_guest_handle(poll.ports, &event);
> +    poll.nr_ports = 1;
> +
> +    while ( !test_and_clear_bit(event, shinfo->evtchn_pending) )
> +        hypercall_sched_op(SCHEDOP_poll, &poll);
> +}
> +
>  /* Connect our xenbus client to the backend.
>   * Call once, before any other xenbus actions. */
>  void xenbus_setup(void)
> @@ -68,10 +81,15 @@ void xenbus_shutdown(void)
> 
>      ASSERT(rings != NULL);
> 
> -    /* We zero out the whole ring -- the backend can handle this, and it's
> -     * not going to surprise any frontends since it's equivalent to never
> -     * having used the rings. */
> -    memset(rings, 0, sizeof *rings);
> +    if (rings->server_version > 0) {
> +        rings->closing = 1;
> +        while (*(volatile uint32_t*)&rings->closing == 1)
> +            ring_wait ();
> +    } else
> +        /* If the backend reads the state while we're erasing it then the
> +         * ring state will become corrupted, preventing guest frontends from
> +         * connecting. This is rare. */
> +        memset(rings, 0, sizeof *rings);
> 
>      /* Clear the event-channel state too. */
>      memset(shinfo->vcpu_info, 0, sizeof(shinfo->vcpu_info));
> @@ -81,19 +99,6 @@ void xenbus_shutdown(void)
>      rings = NULL;
>  }
> 
> -static void ring_wait(void)
> -{
> -    struct shared_info *shinfo = get_shared_info();
> -    struct sched_poll poll;
> -
> -    memset(&poll, 0, sizeof(poll));
> -    set_xen_guest_handle(poll.ports, &event);
> -    poll.nr_ports = 1;
> -
> -    while ( !test_and_clear_bit(event, shinfo->evtchn_pending) )
> -        hypercall_sched_op(SCHEDOP_poll, &poll);
> -}
> -
>  /* Helper functions: copy data in and out of the ring */
>  static void ring_write(const char *data, uint32_t len)
>  {
> diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
> index 29d354d..d5cd776 100644
> --- a/tools/ocaml/libs/xb/xb.ml
> +++ b/tools/ocaml/libs/xb/xb.ml
> @@ -84,7 +84,26 @@ let write con s len =
>       | Fd backfd     -> write_fd backfd con s len
>       | Xenmmap backmmap -> write_mmap backmmap con s len
> 
> -let output con =
> +(* If a function throws Xs_ring.Closing, then clear the ring state
> +   and serve the ring again. *)
> +let rec handle_closing f con =
> +     match (try Some (f con) with Xs_ring.Closing -> None) with
> +     | Some x -> x
> +     | None ->
> +             begin match con.backend with
> +             | Fd _ -> raise Xs_ring.Closing (* should never happen, but
> just in case *)
> +             | Xenmmap backend ->
> +                     Xs_ring.close backend.mmap;
> +                     backend.eventchn_notify ();
> +                     (* Clear our old connection state *)
> +                     Queue.clear con.pkt_in;
> +                     Queue.clear con.pkt_out;
> +                     con.partial_in <- init_partial_in ();
> +                     con.partial_out <- "";
> +                     handle_closing f con
> +             end
> +
> +let output = handle_closing (fun con ->
>       (* get the output string from a string_of(packet) or partial_out *)
>       let s = if String.length con.partial_out > 0 then
>                       con.partial_out
> @@ -101,8 +120,9 @@ let output con =
>       );
>       (* after sending one packet, partial is empty *)
>       con.partial_out = ""
> +)
> 
> -let input con =
> +let input = handle_closing (fun con ->
>       let newpacket = ref false in
>       let to_read =
>               match con.partial_in with
> @@ -133,6 +153,7 @@ let input con =
>                       HaveHdr (Partial.of_string buf) else NoHdr (i - sz, buf)
>       );
>       !newpacket
> +)
> 
>  let newcon backend = {
>       backend = backend;
> @@ -145,6 +166,7 @@ let newcon backend = {
>  let open_fd fd = newcon (Fd { fd = fd; })
> 
>  let open_mmap mmap notifyfct =
> +     Xs_ring.set_server_version mmap 1; (* defined in xs_wire.h *)
>       newcon (Xenmmap {
>               mmap = mmap;
>               eventchn_notify = notifyfct;
> diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
> index 58234ae..7f65fa3 100644
> --- a/tools/ocaml/libs/xb/xb.mli
> +++ b/tools/ocaml/libs/xb/xb.mli
> @@ -80,6 +80,7 @@ val read : t -> string -> int -> int
>  val write_fd : backend_fd -> 'a -> string -> int -> int
>  val write_mmap : backend_mmap -> 'a -> string -> int -> int
>  val write : t -> string -> int -> int
> +val handle_closing : (t -> 'a) -> t -> 'a
>  val output : t -> bool
>  val input : t -> bool
>  val newcon : backend -> t
> diff --git a/tools/ocaml/libs/xb/xs_ring.ml b/tools/ocaml/libs/xb/xs_ring.ml
> index 9469609..d7f0fd4 100644
> --- a/tools/ocaml/libs/xb/xs_ring.ml
> +++ b/tools/ocaml/libs/xb/xs_ring.ml
> @@ -14,5 +14,16 @@
>   * GNU Lesser General Public License for more details.
>   *)
> 
> +exception Closing
> +
> +let _ =
> +  Callback.register_exception "Xs_ring.Closing" Closing
> +
>  external read: Xenmmap.mmap_interface -> string -> int -> int =
> "ml_interface_read"
>  external write: Xenmmap.mmap_interface -> string -> int -> int =
> "ml_interface_write"
> +
> +
> +external set_server_version: Xenmmap.mmap_interface -> int -> unit =
> "ml_interface_set_server_version" "noalloc"
> +external get_server_version: Xenmmap.mmap_interface -> int =
> "ml_interface_get_server_version" "noalloc"
> +
> +external close: Xenmmap.mmap_interface -> unit = "ml_interface_close"
> "noalloc"
> diff --git a/tools/ocaml/libs/xb/xs_ring_stubs.c
> b/tools/ocaml/libs/xb/xs_ring_stubs.c
> index 8bd1047..27c98cd 100644
> --- a/tools/ocaml/libs/xb/xs_ring_stubs.c
> +++ b/tools/ocaml/libs/xb/xs_ring_stubs.c
> @@ -35,19 +35,28 @@
> 
>  #define GET_C_STRUCT(a) ((struct mmap_interface *) a)
> 
> +#define ERROR_UNKNOWN (-1)
> +#define ERROR_CLOSING (-2)
> +
>  static int xs_ring_read(struct mmap_interface *interface,
>                               char *buffer, int len)
>  {
>       struct xenstore_domain_interface *intf = interface->addr;
>       XENSTORE_RING_IDX cons, prod; /* offsets only */
>       int to_read;
> +     uint32_t closing;
> 
>       cons = *(volatile uint32*)&intf->req_cons;
>       prod = *(volatile uint32*)&intf->req_prod;
> +     closing = *(volatile uint32*)&intf->closing;
> +
> +     if (closing != 0)
> +             return ERROR_CLOSING;
> +
>       xen_mb();
> 
>       if ((prod - cons) > XENSTORE_RING_SIZE)
> -         return -1;
> +         return ERROR_UNKNOWN;
> 
>       if (prod == cons)
>               return 0;
> @@ -71,9 +80,15 @@ static int xs_ring_write(struct mmap_interface
> *interface,
>       struct xenstore_domain_interface *intf = interface->addr;
>       XENSTORE_RING_IDX cons, prod;
>       int can_write;
> +     uint32_t closing;
> 
>       cons = *(volatile uint32*)&intf->rsp_cons;
>       prod = *(volatile uint32*)&intf->rsp_prod;
> +     closing = *(volatile uint32*)&intf->closing;
> +
> +     if (closing != 0)
> +             return ERROR_CLOSING;
> +
>       xen_mb();
>       if ( (prod - cons) >= XENSTORE_RING_SIZE )
>               return 0;
> @@ -97,8 +112,12 @@ CAMLprim value ml_interface_read(value interface,
> value buffer, value len)
> 
>       res = xs_ring_read(GET_C_STRUCT(interface),
>                          String_val(buffer), Int_val(len));
> -     if (res == -1)
> +     if (res == ERROR_UNKNOWN)
>               caml_failwith("bad connection");
> +
> +     if (res == ERROR_CLOSING)
> +

This looks a bit wrong. The blank line above should be removed and the 
following line indented appropriately.

>       caml_raise_constant(*caml_named_value("Xs_ring.Closing"));
> +
>       result = Val_int(res);
>       CAMLreturn(result);
>  }
> @@ -111,6 +130,46 @@ CAMLprim value ml_interface_write(value interface,
> value buffer, value len)
> 
>       res = xs_ring_write(GET_C_STRUCT(interface),
>                           String_val(buffer), Int_val(len));
> +
> +     if (res == ERROR_CLOSING)
> +

Same here.

>       caml_raise_constant(*caml_named_value("Xs_ring.Closing"));
> +
>       result = Val_int(res);
>       CAMLreturn(result);
>  }
> +
> +CAMLprim value ml_interface_set_server_version(value interface, value v)
> +{
> +     CAMLparam2(interface, v);
> +     struct xenstore_domain_interface *intf = GET_C_STRUCT(interface)-
> >addr;
> +
> +     intf->server_version = Int_val(v);
> +
> +     CAMLreturn(Val_unit);
> +}
> +
> +CAMLprim value ml_interface_get_server_version(value interface)
> +{
> +     CAMLparam1(interface);
> +     struct xenstore_domain_interface *intf = GET_C_STRUCT(interface)-
> >addr;
> +
> +     CAMLreturn(Val_int (intf->server_version));
> +}
> +
> +CAMLprim value ml_interface_close(value interface)
> +{
> +     CAMLparam1(interface);
> +     const char invalid_data[] = { 'd', 'e', 'a', 'd', 'b', 'e', 'e', 'f' };
> +     struct xenstore_domain_interface *intf = GET_C_STRUCT(interface)-
> >addr;
> +     int i;
> +
> +     intf->req_cons = intf->req_prod = intf->rsp_cons = intf->rsp_prod =
> 0;
> +     /* Ensure the unused space is full of invalid xenstore packets. */
> +     for (i = 0; i < XENSTORE_RING_SIZE; i++) {
> +             intf->req[i] = invalid_data[i % sizeof(invalid_data)];
> +             intf->rsp[i] = invalid_data[i % sizeof(invalid_data)];
> +     }

This does not reset the rings to their initial state (i.e. all zeroes). I don't 
think that's necessarily a problem, but it is inconsistent.

> +     xen_mb ();
> +     intf->closing = 0;
> +     CAMLreturn(Val_unit);
> +}
> diff --git a/xen/include/public/io/xs_wire.h
> b/xen/include/public/io/xs_wire.h
> index 585f0c8..022d614 100644
> --- a/xen/include/public/io/xs_wire.h
> +++ b/xen/include/public/io/xs_wire.h
> @@ -108,14 +108,19 @@ enum xs_watch_type
>   * `incontents 150 xenstore_struct XenStore wire protocol.
>   *
>   * Inter-domain shared memory communications. */
> +

Pure whitespace change. Not necessary.

  Paul

>  #define XENSTORE_RING_SIZE 1024
>  typedef uint32_t XENSTORE_RING_IDX;
>  #define MASK_XENSTORE_IDX(idx) ((idx) & (XENSTORE_RING_SIZE-1))
>  struct xenstore_domain_interface {
> +    /* XENSTORE_VERSION_0 */
>      char req[XENSTORE_RING_SIZE]; /* Requests to xenstore daemon. */
>      char rsp[XENSTORE_RING_SIZE]; /* Replies and async watch events. */
>      XENSTORE_RING_IDX req_cons, req_prod;
>      XENSTORE_RING_IDX rsp_cons, rsp_prod;
> +    uint32_t server_version;
> +    /* server_version 1 and later: */
> +    uint32_t closing;             /* Non-zero means close in progress */
>  };
> 
>  /* Violating this is very bad.  See docs/misc/xenstore.txt. */
> --
> 1.7.10.4


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.