Xen project Mailing List

Re: [Xen-devel] [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]

To: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>

From: Shriram Rajagopalan <rshriram@xxxxxxxxx>

Date: Tue, 12 Nov 2013 10:24:09 -0600

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Tue, 12 Nov 2013 16:25:00 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, Nov 12, 2013 at 9:38 AM, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx> wrote:

Shriram Rajagopalan writes ("Re: [PATCH 4 of 5 V3] tools/libxl: Control network buffering in remus callbacks [and 1 more messages] [and 1 more messages]"):

> On Mon, Nov 4, 2013 at 10:40 AM, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx> wrote:

...

> The patch seems harmless enough. How do you want to go about this?

If you think this is the right approach, you should start on it right
away based on my patch. The reason I posted it quickly was because I
wanted to unblock you.

The nested-ao patch makes sense for Remus, even without fixing this timeout issue.

I can modify my stuff accordingly. Probably create a nested-ao per iteration and drop

it at the start of the next iteration.

However, the timeout part is not convincing enough. For example,

libxl__domain_suspend_common_callback [the version before your patch]

has two 6 second wait loops in the worst case..

LOG(DEBUG, "issuing %s suspend request via XenBus control node",

dss->hvm ? "PVHVM" : "PV");

libxl__domain_pvcontrol_write(gc, XBT_NULL, domid, "suspend");

LOG(DEBUG, "wait for the guest to acknowledge suspend request");

watchdog = 60;

while (!strcmp(state, "suspend") && watchdog > 0) {

usleep(100000);

state = libxl__domain_pvcontrol_read(gc, XBT_NULL, domid);

if (!state) state = "";

watchdog--;

}

and then once again

LOG(DEBUG, "wait for the guest to suspend");

watchdog = 60;

while (watchdog > 0) {

xc_domaininfo_t info;

usleep(100000);

ret = xc_domain_getinfolist(CTX->xch, domid, 1, &info);

Now I know where the 200ms overhead per checkpoint comes from.

Shouldn't this also be made into an event loop? Irrespective of whether it is invoked in

Remus' context or normal suspend/resume/save/restore/migrate context.

And if this remains, the Remus checkpoint interval is much lower compared to this.

Typically between 25-100ms.

> Do you want to post/commit this patch ? Because I have to modify my
> patches accordingly. Or should I post this along with my patch
> series, avoiding the need to wait on you before I post mine ?

You can certainly take it into the start of your patch series, yes.
Like what Roger did with the nested ao patch.

Currently there isn't any other reason to make the change in this
patch, so I don't think it should be committed right away. But if for
some reason it does get committed to staging, you or we can just drop
it from the start of your series.

The only reason it might get committed to staging without other remus patches

would be to fix the issue I cited above.

cheers

shriram

_______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel