
Re: [Xen-API] timing loops


  • To: xen-api@xxxxxxxxxxxxx
  • From: George Shuklin <george.shuklin@xxxxxxxxx>
  • Date: Wed, 11 Jul 2012 00:07:14 +0400
  • Delivery-date: Tue, 10 Jul 2012 20:07:35 +0000
  • List-id: User and development list for XCP and XAPI <xen-api.lists.xen.org>

Well, that stuff was the main reason we built an additional layer of 'out of xapi' tools in dom0. For example, right now we use an 'absolute kill' function that kills a domain immediately, without queueing behind a long list of timeout-waiting graceful shutdown requests to the domain...
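Roughly, such a helper boils down to something like this (just a sketch: 'absolute_kill' is an illustrative name, not our actual tool, and it assumes an xl toolstack is reachable in dom0 and that the domid is already known):

    (* Sketch only: destroy the domain immediately, bypassing xapi's
       graceful-shutdown queue.  Assumes the xl toolstack is in dom0. *)
    let absolute_kill domid =
      let cmd = Printf.sprintf "xl destroy %d" domid in
      match Unix.system cmd with
      | Unix.WEXITED 0 -> Printf.printf "domain %d destroyed\n" domid
      | _ -> Printf.eprintf "failed to destroy domain %d\n" domid

    let () = absolute_kill (int_of_string Sys.argv.(1))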

And about cancellation of SR stuff...
There is a set of scenarios to think about.

1. A normally short operation, performed in normal mode. We can simply say 'no cancellation if the execution time is less than X'. That means if we unplug a VBD quickly, all is fine and the user has no chance to cancel the operation. (If by luck the user succeeds, we can say 'oops, your request was too late'.) Simple implementation: if the operation is 'normally quick', we wait for a small timeout before processing the cancellation request (see the sketch below). If the operation has completed by then, there is nothing to cancel and everything is fine. If the operation is still in progress, see #3.

2. A normally long operation, performed in normal mode. We want to cancel a vdi-copy, for example. I think this can easily be done by sending a kill to 'sparse_dd' and removing the new VDI.
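A sketch of that 'grace period before honouring cancel' idea (illustrative names only, not xapi API; gettimeofday is used just for brevity, a monotonic clock as Dave suggests below would be better):

    (* Illustrative only.  Run [op] in a worker thread; ignore any cancel
       request until [grace_period] seconds have passed.  If the operation
       finished in time there is nothing to cancel; otherwise hand the task
       over to the 'forceful' path (#3). *)
    let grace_period = 2.0

    let run_with_delayed_cancel op cancel_requested =
      let finished = ref false in
      let (_ : Thread.t) =
        Thread.create (fun () -> ignore (op ()); finished := true) () in
      let start = Unix.gettimeofday () in  (* a monotonic clock would be better *)
      while not !finished
            && (not (cancel_requested ())
                || Unix.gettimeofday () -. start < grace_period) do
        Thread.delay 0.1
      done;
      if !finished then `Completed        (* "your request was too late" *)
      else `Needs_force_cancel            (* still running, cancel asked: see #3 *)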

Now hard part.

Before proposing behaviour, here is a really bad scenario I saw in my XCP practice: the storage server goes offline, but the PBD is still plugged. There is no way to run 'lvchange' on an LVM volume whose PV has disappeared, or to do anything with NFS when the NFS server is gone, and so on. We cannot do a VBD-unplug, a PBD-unplug and so on. The situation gets worse if we get stuck with an innocent SR (e.g. during a VM reboot) while dealing with a dead (or dying) SR attached to the same host. For example, I saw that once with a dying SFP with a massive error rate. I was unable to migrate the domain away, there was no shutdown, and my only solution was to reboot the host and do a manual power-state reset (or, actually, wait until the host came back online and mark those machines as 'down').

I think SMs should provide some way to say 'nope, that stuff is dead' and allow forceful VBD/PBD unplug operations.

That requires the concept of a 'compromised host'. Such a host normally does not accept new VMs, every VM going through a reboot is actually shut down (see below) and started on the next available non-compromised host, and only two operations are allowed for a VM (apart from casual memory-set/rename and so on):
1) Shutdown/reboot (which actually restarts the VM on a different host)
2) Urgent migration.

Both of them behave differently compared to the normal operations: they do not _DESTROY_ the domain (unplug tapdisk and so on). They try to destroy the domain, and put it into a 'paused' state if that is not possible (e.g. a hung tapdisk does not free shared memory, or in some other way prevents the domain from really disappearing). Those 'paused' domains have their UUIDs changed to 'deadbeef' (like xapi now marks unkillable stray domains during startup). The main idea: we allow VM migration even if killing the original domain has failed. We migrate the domain and put the old one into an endlessly paused state with a '-d-'ying flag. Same for shutdown/reboot: we report the VM as 'off' even if the domain has not completely died. After all domains have been migrated/rebooted/shut down, we can freely perform an (even self-initiated) urgent reboot.
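As a very rough sketch of what that 'liberating' shutdown could look like (the helper functions are passed in as placeholders on purpose, they are not existing xapi/xenops calls):

    (* Very rough sketch.  try_destroy, pause and set_uuid are placeholders
       supplied by the caller; they are NOT real xapi/xenops APIs. *)
    type urgent_result = Fully_destroyed | Left_paused_as_zombie

    let urgent_shutdown ~try_destroy ~pause ~set_uuid domid =
      if try_destroy domid then
        Fully_destroyed
      else begin
        (* Could not really destroy (e.g. a hung tapdisk still holds the
           domain's memory): park it, rename it out of the way, and report
           the VM as 'off' so it can be restarted on a healthy host. *)
        pause domid;
        set_uuid domid ("deadbeef-" ^ string_of_int domid);
        Left_paused_as_zombie
      end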

One more note: during that state xapi should be able to restart and continue to operate (within the compromised-host limits).

I had a situation with one 'dying' domain and a bunch of normal domains, and decided to restart xapi. Of course xapi was not able to start (it found the unkillable deadbeef domain) and I was forced to reboot some good VMs because of one bad one.

Ok, now to second part.

3. Normally short operations that take too long. If that happens we cannot cancel them, e.g. because 'lvs' simply hangs on every call and we cannot do anything with LVM. We allow a task to be marked as 'forcefully cancelled' only if the host is marked as degraded. In that case we allow only the 'liberating' calls (reboot/shutdown/migrate) for VMs and the situation is solved. In other words, we reject cancellation of those operations in normal mode, but allow them to simply be 'forgotten' for an urgent evacuation/reboot.

4. Normally long operations we cannot kill. If our kill to sparse_dd or another long-running command does not succeed (e.g. we do an sr-create of LVMoISCSI, but the dd of the first 100MB hangs), we mark the host as 'bad'. Here we use a long timeout before doing this (e.g. 30s: if a program does not react to kill -9 for 30 seconds, it is hung in a syscall); a sketch of this rule follows below.
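Sketch of that 30-second rule (mark_host_degraded is a placeholder, not a real xapi call; it also assumes the pid is not our own child, so existence is polled with signal 0 rather than waitpid):

    (* Sketch: SIGKILL the process, give it 30s, otherwise mark the host bad. *)
    let kill_or_degrade ~mark_host_degraded pid =
      let alive p =
        try Unix.kill p 0; true
        with Unix.Unix_error (Unix.ESRCH, _, _) -> false in
      (try Unix.kill pid Sys.sigkill
       with Unix.Unix_error (Unix.ESRCH, _, _) -> ());
      let deadline = Unix.gettimeofday () +. 30.0 in
      let rec wait () =
        if not (alive pid) then `Killed
        else if Unix.gettimeofday () > deadline then begin
          (* ignored SIGKILL for 30s: almost certainly stuck in an
             uninterruptible syscall on dead storage *)
          mark_host_degraded ();
          `Host_degraded
        end else (Thread.delay 1.0; wait ())
      in
      wait ()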


... And I know that stuff is dirty and ugly. But all block devices can behave like crap sometimes. For example, not long ago there was a nasty bug in linux-raid which caused raid10 to deadlock: every IO simply went in and never returned. A similar bug is now in LVM (with a large number of disks; I have not reported it yet because I could not reliably reproduce it).

And a virtualization platform should be able to overcome all of THAT.

On 10.07.2012 18:36, Dave Scott wrote:

Hopefully in the future the whole stack will support cancellation -- so the user can apply their own timeout values in their code instead of us doing it one-size-fits-all. A lot of the domain-level stuff can now be cancelled (which may cause the domain to crash if it happens at a bad time.. but this does at least cause things to unwind usually).

Most of the storage interface is uncancellable, which is a big problem since it involves off-box RPCs. We either need to fix that directly or offer users the big red button labeled "driver domain restart" which will unstick things.

One bad thing about not supporting cancellation is that it encourages people to close connections and walk away, unaware that a large amount of resources (and locks) are still being consumed server-side. One good thing to do would be to send heartbeats to any running CLIs and auto-cancel when the connection is broken unless some "--async" option is given which would return immediately with a Task.

In the meantime we always tune the timeouts to fail eventually if the system gets truly stuck under high load. This leads to fairly long timeouts, which isn't ideal for everyone. There's a tension between high timeouts for stress testing and low timeouts for user experience -- we can't do both :(

Cheers,
Dave

-----Original Message-----
From: Anil Madhavapeddy [mailto:anil@xxxxxxxxxx]
Sent: 10 July 2012 15:24
To: Dave Scott
Cc: xen-api@xxxxxxxxxxxxx
Subject: Re: [Xen-API] timing loops

How do you decide on a reasonable value of n, given that real timeouts shift so dramatically with dom0 system load? Or rather, what areas of xapi aren't fully event-driven and require such timeouts?

I can imagine the device/udev layer being icky in this regard, but a good way to wrap all such instances might be to have a single event-dispatch daemon which combines all the system events and timeouts, and coordinates the remainder of the xapi process cluster (which will not need arbitrary timeouts as a result). Or is it just too impractical since there are so many places where such timeouts are required?

-anil

On 10 Jul 2012, at 15:18, Dave Scott wrote:

Hi,

With all the recent xapi disaggregation work, are we now more vulnerable to failures induced by moving the system clock around, affecting timeout logic in our async-style interfaces where we wait for 'n' seconds for an event notification?

I've recently added 'oclock' as a dependency which gives us access to a monotonic clock source, which is perfect (I believe) for reliably 'timing out'. I started a patch to convert the whole codebase over but it was getting much too big and hard to test, because sometimes we really do want a calendar date, and other times we really want a point in time.

Maybe I should make a subset of my patch which fixes all the new timing loops that have been introduced. What do you think? Would you like to confess to having written:

    let start = Unix.gettimeofday () in
    while not (p ()) && (Unix.gettimeofday () -. start < timeout) do
      Thread.delay 1.
    done

I've got a nice higher-order function to replace this which does:

    let until p timeout interval =
      let start = Oclock.gettime Oclock.monotonic in
      while not (p ())
            && Int64.(to_float (sub (Oclock.gettime Oclock.monotonic) start)) /. 1e9 < timeout do
        Thread.delay interval
      done

I believe this is one of many things that lwt (and JS core) does a nice job of.

Cheers,
Dave



_______________________________________________
Xen-api mailing list
Xen-api@xxxxxxxxxxxxx
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api


 

