[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH RFC] remus: implement remus replicated checkpointing disk



On Tue, Feb 25, 2014 at 6:53 PM, Lai Jiangshan <laijs@xxxxxxxxxxxxxx> wrote:
This patch implements remus replicated checkpointing disk.
It includes two parts:
  generic remus replicated checkpointing disks framework
  drbd replicated checkpointing disks
They will be split into different files in next round.

The patch is still simple due to disk-setup-teardown-script is
still under implementing. I need to use libxl_ao to implement it,
but libxl_ao is hard to use. The work sequence is needed to ugly split
to serveral callbacks like device_hotplug().

And becuase the remus disk script is unimplemented, the drbd_setup() code
can't check the disk now. So it just assumes the user config the disk correctly.

This patch is *UNTESTED*.
(there is a problem with xl&drbd(without remus) in my BOXes).

I request *comments* as many as possible.

Thanks,
Lai

Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxx>


Hi
 sorry for the delayed response. And thanks a lot for this initiative. Apart from the inline feedback,
there are a few things to consider first before going down this route. 

1. The drbd kernel module required for Remus is still out of tree, currently hosted on a wiki page.
The drbd folks didnt want to include the changes into their code unfortunately, as they were offering the
same functionality to one of their paid customers. This is what they told me back in 2011 or so.

To streamline the storage replication module installation, is there a chance of hosting the code in 
xen.org's repos? That way, we could script the download and installation process. Like the qemu
stuff.

2. The tapdisk based replication unfortunately is outdated. Please correct me if I have got this wrong.
Haven't we decided to get rid of blktap2 and go with the qemu disk models? In which case, the tapdisk
remus code has to be ported into some qemu disk variant.

Without getting a resolution to the above two, my stance is that we shouldn't pollute xl with functionality
that requires out-of-band modules that may prove pretty painful to install for the majority of folks out there.

Based on the experience from the last 3 years, most average users of Remus tend to skip disk replication 
altogether.  They install the distro's default drbd, use the disk replication provided with it and then complain
that Remus crashes or fails.  Some have ventured into tapdisk replication but it unfortunately seemed to 
get difficult as xend/blktap2 started getting deprecated.

 
---
 tools/libxl/Makefile                      |    1 +
 tools/libxl/libxl_dom.c                   |   19 +++-
 tools/libxl/libxl_internal.h              |   10 ++
 tools/libxl/libxl_remus.c                 |    2 +
 tools/libxl/libxl_remus_replicated_disk.c |  219 +++++++++++++++++++++++++++++
 5 files changed, 249 insertions(+), 2 deletions(-)
 create mode 100644 tools/libxl/libxl_remus_replicated_disk.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 218f55e..dbf5dd9 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -53,6 +53,7 @@ LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif

 LIBXL_OBJS-y += libxl_remus.o
+LIBXL_OBJS-y += libxl_remus_replicated_disk.o


So I think this part will also require some autoconf based stuff.
Especially, if DRBD & tapdisk are not present, then this whole
thing gets disabled. Just like the libxl_netbuffer and libxl_nonetbuffer

Given that both of these (netbuffer and disk) are associated with Remus
and both are required for Remus to work "correctly", we might as well have
noremus.c and remus.c . Ofcourse it can be modularized a bit to have
netbuffer but no disk replication or vice versa. As long as the person installing
or compiling this stuff is made to state explicitly that he/she does not want
Remus, but only a subset of its functionality for some other purpose.
 

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index a4ffdfd..858f5be 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c

 
     /* The domain was suspended successfully. Start a new network
@@ -1279,7 +1284,10 @@ static int libxl__remus_domain_resume_callback(void *data)
     if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1))
         return 0;

-    /* REMUS TODO: Deal with disk. */
+    /* Deal with disk. */
+    if (libxl__remus_disks_preresume(dss->remus_state))
+        return 0;
+

Bug. This should go before the resume call. Also, I would suggest changing the comment 
to something more meaningful, e.g., "commit disk changes.."


     return 1;
 }

@@ -1326,6 +1334,13 @@ static void remus_checkpoint_dm_saved(libxl__egc *egc,
         goto out;
     }

+    rc = libxl__remus_disks_commit(remus_state);
+    if (rc) {
+        LOG(ERROR, "Failed to commit disks state"
+            " Terminating Remus..");
+        goto out;
+    }
+

Now might be a good time to use the restore callbacks offered by the toolstack to 
get an explicit ack from the backup that it has received the memory checkpoint too
before the network buffer is released. I think I put in a comment related to that somewhere.

 
     if (remus_state->netbuf_state) {
         rc = libxl__remus_netbuf_release_prev_epoch(gc, dss->domid,
                                                     remus_state);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 1bd2bba..8933e5f 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2309,6 +2309,10 @@ typedef struct libxl__remus_state {
     void *netbuf_state;
     libxl__ev_time timeout;
     libxl__ev_child child;
+
+    /* remus disks state */
+    uint32_t nr_disks;
+    struct libxl__remus_disk **disks;
 } libxl__remus_state;

 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
@@ -2336,6 +2340,12 @@ _hidden int libxl__remus_netbuf_start_new_epoch(libxl__gc *gc, uint32_t domid,
 _hidden int libxl__remus_netbuf_release_prev_epoch(libxl__gc *gc, uint32_t domid,
                                                   libxl__remus_state *remus_state);

+_hidden int libxl__remus_disks_postsuspend(libxl__remus_state *state);
+_hidden int libxl__remus_disks_preresume(libxl__remus_state *state);
+_hidden int libxl__remus_disks_commit(libxl__remus_state *state);
+_hidden int libxl__remus_disks_setup(libxl__egc *egc, libxl__domain_suspend_state *dss);
+_hidden void libxl__remus_disks_teardown(libxl__remus_state *state);
+
 _hidden void libxl__remus_setup_initiate(libxl__egc *egc,
                                          libxl__domain_suspend_state *dss);

diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
index cdc1c16..92eb36a 100644
--- a/tools/libxl/libxl_remus.c
+++ b/tools/libxl/libxl_remus.c
@@ -23,6 +23,7 @@ void libxl__remus_setup_initiate(libxl__egc *egc,
                                  libxl__domain_suspend_state *dss)
 {
     libxl__ev_time_init(&dss->remus_state->timeout);
+    libxl__remus_disks_setup(egc, dss);
     if (!dss->remus_state->netbufscript)
         libxl__remus_setup_done(egc, dss, 0);
     else
@@ -51,6 +52,7 @@ void libxl__remus_teardown_initiate(libxl__egc *egc,
     /* stash rc somewhere before invoking teardown ops. */
     dss->remus_state->saved_rc = rc;

+    libxl__remus_disks_teardown(dss->remus_state);
     if (!dss->remus_state->netbuf_state)
         libxl__remus_teardown_done(egc, dss);
     else
diff --git a/tools/libxl/libxl_remus_replicated_disk.c b/tools/libxl/libxl_remus_replicated_disk.c
new file mode 100644
index 0000000..4b16403
--- /dev/null
+++ b/tools/libxl/libxl_remus_replicated_disk.c
@@ -0,0 +1,219 @@
+/*
+ * Copyright (C) 2013
+ * Author Lai Jiangshan <laijs@xxxxxxxxxxxxxx>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+typedef struct libxl__remus_disk
+{
+    const struct libxl_device_disk *disk;
+    const struct libxl__remus_disk_type *type;
+
+    /* ao callbacks for setup & teardown script */
+    int (*setup_cb)(struct libxl__remus_disk *d);
+    int (*teardown_cb)(struct libxl__remus_disk *d);
+} libxl__remus_disk;
+
+typedef struct libxl__remus_disk_type
+{
+    /* checkpointing */
+    int (*postsuspend)(libxl__remus_disk *d);
+    int (*preresume)(libxl__remus_disk *d);
+    int (*commit)(libxl__remus_disk *d);
+

I would also suggest renaming these to something else, not associated with
suspend but associated with checkpoints. start_disk_sync, finish_disk_sync,
start_new_epoch or something along those lines.

 
+    /* setup & teardown */
+    libxl__remus_disk *(*setup)(libxl__gc *gc, libxl_device_disk *disk);
+    void (*teardown)(libxl__remus_disk *d); 
+} libxl__remus_disk_type;
+
+
+/*** drbd implementation ***/
+const int DRBD_SEND_CHECKPOINT = 20;
+const int DRBD_WAIT_CHECKPOINT_ACK = 30;
+typedef struct libxl__remus_drbd_disk
+{
+    libxl__remus_disk remus_disk;
+    int ctl_fd;
+    int ackwait;
+} libxl__remus_drbd_disk;
+
+static int drbd_postsuspend(libxl__remus_disk *d)
+{
+    struct libxl__remus_drbd_disk *drbd = CONTAINER_OF(d, *drbd, remus_disk);
+
+    if (!drbd->ackwait) {
+        if (ioctl(drbd->ctl_fd, DRBD_SEND_CHECKPOINT, 0) <= 0)
+            drbd->ackwait = 1;
+    }
+
+    return 0;
+}
+
+static int drbd_preresume(libxl__remus_disk *d)
+{
+    struct libxl__remus_drbd_disk *drbd = CONTAINER_OF(d, *drbd, remus_disk);
+
+    if (drbd->ackwait) {
+        ioctl(drbd->ctl_fd, DRBD_WAIT_CHECKPOINT_ACK, 0);
+        drbd->ackwait = 0;
+    }
+
+    return 0;
+}
+
+static int drbd_commit(libxl__remus_disk *d)
+{
+    /* nothing to do, all work are done by DRBD's protocal-D. */
+    return 0;
+}
+
+static libxl__remus_disk *drbd_setup(libxl__gc *gc, libxl_device_disk *disk)
+{
+    libxl__remus_drbd_disk *drbd;
+    //if (!(drbd && protocal-D)) // TODO: need to run script async to check
+    //  return NULL
+

We don't need to run any scripts for DRBD (or tapdisk for that matter).

DRBD scripts will get activated when the domain boots and thats the end of it.
On the backup side, it gets activated during the initial phase of Remus, which
is same as live migration.  Since xl already supports live migration with drbd
based disks, we don't need any script related code at all.

With regard to tapdisk-remus (atleast with blktap2), you cant boot the domain
fully unless you start Remus too. This in turn forces the backup to start the 
tapdisk-remus receiving end.  Once again in this case, in Xend, the live migration
infrastructure did all the script setup work.


 
+    GCNEW(drbd);
+
+    drbd->ctl_fd = open(GCSPRINTF("/dev/drbd/by-res/%s", disk->pdev_path), O_RDONLY);
+    drbd->ackwait = 0;
+
+    if (drbd->ctl_fd < 0)
+        return NULL;
+
+    return &drbd->remus_disk;
+}
+
+static void drbd_teardown(libxl__remus_disk *d)
+{
+    struct libxl__remus_drbd_disk *drbd = CONTAINER_OF(d, *drbd, remus_disk);
+
+    close(drbd->ctl_fd);
+}
+
+static const libxl__remus_disk_type drbd_disk_type = {
+  .postsuspend = drbd_postsuspend,
+  .preresume = drbd_preresume,
+  .commit = drbd_commit,
+  .setup = drbd_setup,
+  .teardown = drbd_teardown,
+};
+
+/*** checkpoint disks states and callbacks ***/
+static const libxl__remus_disk_type *remus_disk_types[] =
+{
+    &drbd_disk_type,
+};
+
+int libxl__remus_disks_postsuspend(libxl__remus_state *state)
+{
+    int i;
+    int rc = 0;
+
+    for (i = 0; rc == 0 && i < state->nr_disks; i++)
+        rc = state->disks[i]->type->postsuspend(state->disks[i]);
+
+    return rc;
+}
+
+int libxl__remus_disks_preresume(libxl__remus_state *state)
+{
+    int i;
+    int rc = 0;
+
+    for (i = 0; rc == 0 && i < state->nr_disks; i++)
+        rc = state->disks[i]->type->preresume(state->disks[i]);
+
+    return rc;
+}
+
+int libxl__remus_disks_commit(libxl__remus_state *state)
+{
+    int i;
+    int rc = 0;
+
+    for (i = 0; rc == 0 && i < state->nr_disks; i++)
+        rc = state->disks[i]->type->commit(state->disks[i]);
+
+    return rc;
+}
+
+#if 0
+/* TODO: implement disk setup/teardown script */
+static void disk_exec_timeout_cb(libxl__egc *egc, libxl__ev_time *ev,
+                                      const struct timeval *requested_abs)
+{
+    libxl__remus_disks_state *state = CONTAINER_OF(ev, *aodev, timeout);
+    STATE_AO_GC(state->ao);
+
+    libxl__ev_time_deregister(gc, &state->timeout);
+
+    assert(libxl__ev_child_inuse(&state->child));
+    if (kill(state->child.pid, SIGKILL)) {
+    }
+
+    return;
+}
+
+int libxl__remus_disks_exec_script(libxl__gc *gc,
+    libxl__remus_disks_state *state)
+{
+}
+#endif
+

I don't know if this is needed at all, given that we don't have disk script setup issues.
 
 
+int libxl__remus_disks_setup(libxl__egc *egc, libxl__domain_suspend_state *dss)
+{
+    libxl__remus_state *remus_state = dss->remus_state;
+    int i, j, nr_disks;
+    libxl_device_disk *disks;
+    libxl__remus_disk *remus_disk;
+    const libxl__remus_disk_type *type;
+
+    STATE_AO_GC(dss->ao);
+    disks = libxl_device_disk_list(CTX, dss->domid, &nr_disks);
+    remus_state->nr_disks = nr_disks;
+    GCNEW_ARRAY(remus_state->disks, nr_disks);
+
+    for (i = 0; i < nr_disks; i++) {
+        remus_disk = NULL;
+        for (j = 0; j < ARRAY_SIZE(remus_disk_types); j++) {
+            type = remus_disk_types[j];
+            remus_disk = type->setup(gc, &disks[i]);
+            if (!remus_disk)
+                break;
+
+            remus_state->disks[i] = remus_disk;
+            remus_disk->disk = &disks[i];
+            remus_disk->type = type;
+        }
+        if (!remus_disk) {
+            remus_state->nr_disks = i;
+            libxl__remus_disks_teardown(remus_state);
+            return -1;
+        }
+    }
+    return 0;
+}
+
+void libxl__remus_disks_teardown(libxl__remus_state *state)
+{
+    int i;
+
+    for (i = 0; i < state->nr_disks; i++)
+        state->disks[i]->type->teardown(state->disks[i]);
+    state->nr_disks = 0;
+}
+
--
1.7.1


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.