[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH Remus v5 2/2] libxc/restore: implement Remus checkpointed restore

To: Ian Campbell <ian.campbell@xxxxxxxxxx>
From: Yang Hongyang <yanghy@xxxxxxxxxxxxxx>
Date: Fri, 15 May 2015 09:32:29 +0800
Cc: wei.liu2@xxxxxxxxxx, eddie.dong@xxxxxxxxx, wency@xxxxxxxxxxxxxx, andrew.cooper3@xxxxxxxxxx, yunhong.jiang@xxxxxxxxx, ian.jackson@xxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxx, guijianfeng@xxxxxxxxxxxxxx, rshriram@xxxxxxxxx
Delivery-date: Fri, 15 May 2015 01:33:06 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>



On 05/14/2015 09:05 PM, Ian Campbell wrote:

On Thu, 2015-05-14 at 18:06 +0800, Yang Hongyang wrote:

With Remus, the restore flow should be:
the first full migration stream -> { periodically restore stream }

Signed-off-by: Yang Hongyang <yanghy@xxxxxxxxxxxxxx>
Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
CC: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
CC: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
CC: Wei Liu <wei.liu2@xxxxxxxxxx>
---
  tools/libxc/xc_sr_common.h  |  14 ++++++
  tools/libxc/xc_sr_restore.c | 113 ++++++++++++++++++++++++++++++++++++++++----
  2 files changed, 117 insertions(+), 10 deletions(-)

diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
index f8121e7..3bf27f1 100644
--- a/tools/libxc/xc_sr_common.h
+++ b/tools/libxc/xc_sr_common.h
@@ -208,6 +208,20 @@ struct xc_sr_context
              /* Plain VM, or checkpoints over time. */
              bool checkpointed;

+            /* Currently buffering records between a checkpoint */
+            bool buffer_all_records;
+
+/*
+ * With Remus, we buffer the records sent by the primary at checkpoint,
+ * in case the primary will fail, we can recover from the last
+ * checkpoint state.
+ * This should be enough because primary only send dirty pages at
+ * checkpoint.


I'm not sure how it then follows that 1024 buffers is guaranteed to be
enough, unless there is something on the sending side arranging it to be
so?


There are only few records at every checkpoint in my test, mostly under 10,
probably because I don't do much operations in the Guest. I thought This limit
can be adjusted later by further testing.
Since you and Andy both have doubts on this, I have to reconsider on this,
perhaps there should be no limit. Even if the 1024 limit works for
most of the cases, there might be cases that exceed the limit. So I will
add another member 'allocated_rec_num' in the context, when the
'buffered_rec_num' exceed the 'allocated_rec_num', I will reallocate the buffer.
The initial buffer size will be 1024 records which will work for most cases.

/*
 * With Remus, we buffer the records sent by the primary at checkpoint,
 * in case the primary will fail, we can recover from the last
 * checkpoint state.
 * This should be enough for most of the cases because primary only send
 * dirty pages at checkpoint.
 */
#define DEFAULT_BUF_RECORDS 1024
            struct xc_sr_record *buffered_records;
            unsigned allocated_rec_num;
            unsigned buffered_rec_num;

+ */
+#define MAX_BUF_RECORDS 1024
+            struct xc_sr_record *buffered_records;
+            unsigned buffered_rec_num;
+
              /*
               * Xenstore and Console parameters.
               * INPUT:  evtchn & domid
diff --git a/tools/libxc/xc_sr_restore.c b/tools/libxc/xc_sr_restore.c
index 9ab5760..8468ffc 100644
--- a/tools/libxc/xc_sr_restore.c
+++ b/tools/libxc/xc_sr_restore.c
@@ -468,11 +468,69 @@ static int handle_page_data(struct xc_sr_context *ctx, 
struct xc_sr_record *rec)
      return rc;
  }

+static int process_record(struct xc_sr_context *ctx, struct xc_sr_record *rec);
+static int handle_checkpoint(struct xc_sr_context *ctx)
+{
+    xc_interface *xch = ctx->xch;
+    int rc = 0;
+    unsigned i;
+
+    if ( !ctx->restore.checkpointed )
+    {
+        ERROR("Found checkpoint in non-checkpointed stream");
+        rc = -1;


Is it usual in migrv2 to set errno as well?

+        goto err;
+    }
+
+    if ( ctx->restore.buffer_all_records )
+    {
+        IPRINTF("All records buffered");
+
+        /*
+         * We need to set buffer_all_records to false in
+         * order to process records instead of buffer records.
+         * buffer_all_records should be set back to true after
+         * we successfully processed all records.
+         */
+        ctx->restore.buffer_all_records = false;


I'm not personally a fan of changing global state in order to simulate
the action of what should be a parameter to a function.

Preferable IMHO would be to have process_record gain a parameter to
override the ctx state but become an internal helper (perhaps with a
name change) and then have API function process_record and
process_buffered_records or some such which call it in the right way.

Andy may have a differing opinion though.



.


--
Thanks,
Yang.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

Follow-Ups:
- Re: [Xen-devel] [PATCH Remus v5 2/2] libxc/restore: implement Remus checkpointed restore
  - From: Ian Campbell

References:
- [Xen-devel] [PATCH Remus v5 0/2] Remus support for Migration-v2
  - From: Yang Hongyang
- [Xen-devel] [PATCH Remus v5 2/2] libxc/restore: implement Remus checkpointed restore
  - From: Yang Hongyang
- Re: [Xen-devel] [PATCH Remus v5 2/2] libxc/restore: implement Remus checkpointed restore
  - From: Ian Campbell

Prev by Date: Re: [Xen-devel] [PATCH Remus v5 2/2] libxc/restore: implement Remus checkpointed restore
Next by Date: Re: [Xen-devel] [PATCH v7 07/14] x86: dynamically get/set CBM for a domain
Previous by thread: Re: [Xen-devel] [PATCH Remus v5 2/2] libxc/restore: implement Remus checkpointed restore
Next by thread: Re: [Xen-devel] [PATCH Remus v5 2/2] libxc/restore: implement Remus checkpointed restore
Index(es):
- Date
- Thread

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.