Xen project Mailing List

Re: [MirageOS-devel] Error handling in Mirage - request for comments!

To: Christophe Troestler <Christophe.Troestler@xxxxxxxxxxx>

From: Thomas Leonard <talex5@xxxxxxxxx>

Date: Wed, 4 Feb 2015 11:52:28 +0000

Cc: "mirageos-devel@xxxxxxxxxxxxxxxxxxxx" <mirageos-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Wed, 04 Feb 2015 11:52:33 +0000

List-id: Developer list for MirageOS <mirageos-devel.lists.xenproject.org>

On 3 February 2015 at 23:16, Christophe Troestler <Christophe.Troestler@xxxxxxxxxxx> wrote: > On Mon, 2 Feb 2015 12:30:39 +0000, Thomas Leonard wrote: >> >> On 2 February 2015 at 00:45, Christophe Troestler >> <Christophe.Troestler@xxxxxxxxxxx> wrote: >> > On Sat, 31 Jan 2015 21:07:22 +0000, Thomas Leonard wrote: >> >> >> >> [â] But the real problem here is that by making exceptions fatal we >> >> turn every exception used in existing OCaml code into a security >> >> vulnerability. [â] >> > >> > Since OCaml supports exceptions, I think there is no doubt that we >> > must handle them effectively â not just wipe them under the carpet and >> > let them abort the program/unikernel. Ideally Lwt should deal >> > sensibly with exceptions (and the discussion seems to imply there are >> > areas of improvement â but it is unclear to me how you can do that >> > efficiently without using a try block in all binds). >> >> Lwt already puts a try block in all (sleeping) binds, so this is >> handled for you: >> >> let bind t f = >> let t = repr t in >> match t.state with >> | Return v -> >> f v >> | Fail _ as state -> >> thread { state } >> | Sleep sleeper -> >> let res = temp t in >> let data = !current_data in >> add_immutable_waiter sleeper >> (function >> | Return v -> current_data := data; connect res (try f v >> with exn -> fail exn) >> | Fail _ as state -> fast_connect res state >> | _ -> assert false); >> res > > Thanks for checking. At first, the difference of treatment between > the two âf vâ is a little odd. Do you happen to know the reason for > it? I assume it's a speed optimisation, avoiding setting up an exception handler in the fast path but at the cost of making (raised) exceptions less predictable. Whether this trade-off is worth it, I don't know. The case where it makes a difference is when you start a thread without waiting for it and then get on with something else: let process x = let a = (x >>= f) in stuff () >>= fun () -> a If x is already returned *and* f immediately throws (rather than using fail) then we abort before calling stuff, otherwise we call it and then resolve to a failed thread later. Either behaviour is "safe", in that we don't leak resources. The case we really need to avoid is where we return a thread that will never resolve, since anything that waits on that while holding some resource will leak it. Lwt.bind only creates such a thread (res) in the Sleep case, so only that try..with is vital. >> > There are however errors that you want to mandate that they are dealt >> > with but not necessarily at the calling point. A typical example to >> > me is âreadâ because it is often much "higher" in the program that an >> > appropriate response to a failing read can be made â this is important >> > for libraries that will post-process âreadâs and will have no idea >> > what to do if âreadâ fails. >> >> Yes, this is a case where you might want an Or_error.t threaded >> through everything. Or, it might be easier to wrap the whole thing >> so that any error code returned by the read is thrown and then >> caught again. > > The problem with Or_error.t is that you have to write tedious code to > propagate it at every bind. If we adopt this, an bind infix operator > that propagates the error automatically is really needed (trying to > catch with today's messages, I see that >>=? is already used for that > purpose). Also, with Or_error.t (or exceptions), there is no way to > ensure that the error is eventually acted upon so why not to simply > use âfailâ? That makes sense to me. Arguably, Lwt's >>= is already the error-propagating >>=?, and for people who want to handle errors we should provide an operator that exposes Fail and Return using a Or_error.t. >> The problem we have in Mirage is that our types are abstract. For >> example, if you have a functor >> >> module FS (B : BLOCK) = ... >> >> then the author of the FS functor doesn't know exactly what errors the >> concrete B might throw. For example, if B is a remote disk then it >> might throw network errors. So ensuring that callers enumerate all >> errors isn't very helpful in this case. >> >> You can give FS an error code of [`Block_error of B.error] and just >> propagate it to the caller, which is what we do now. That seems quite >> clever, but I'm struggling to imagine a case where it's actually >> useful. > > That's what I was going to suggest... Declaring several `Block_error > if necessary allows to act on "classes" of errors â which is all one > can hope for given the abstraction. > >> It's probably worth compiling a list of examples where Mirage code >> currently matches on specific returned Error codes and takes specific >> actions. My guess is there aren't enough to justify the complexity >> though. > > That would be nice. I don't have much familiarity with a lot of parts > of Mirage so I may be missing use cases. As I understand the > discussion, Mirage code should be revisited, not only to see how > things are done now but also what the problems are (with the current > scheme) and what properties of the error handling we would like in > these cases. Might be useful to add some test cases to the documentation. For example, trying to connect to a disk now results in e.g. Fatal error: exception Failure("block1") Ideally, I'd want something like: Block.connect: unable to match '/home/mirage/mirage-skeleton/block/disk.img' to any available Xen disk (available: 'xvda1') ... attempting to connect "block1" Currently, we work around this using "printf" calls in libraries, so hopefully the user will see the real cause in the console output, but that's not ideal. In a busy system, trying to correlate messages logged at various points in the log files isn't fun. >> [...] in every one of these cases where we match on an `Error, we just >> propagate it (i.e. the default behaviour of an exception would have >> been correct). >> >> Here are some examples: >> >> | `Error (`Invalid_console x) -> fail (Failure (Printf.sprintf >> "Invalid_console %s" x)) >> | `Error (`Block_device e) -> fail (Failure (Fs.string_of_block_error e)) >> | `Error (KV.Unknown_key key) -> fail (Invalid_argument key) >> | `Error (`Unknown msg) -> Lwt.fail (Failure msg) > > I don't understand the point of this type of code. Why wasn't âErrorâ > declared as an exception (possibly with an exception printer) and > âfailâ directly invoked on it? This also highlights that people shy > away from having to handle even a simple `Error to the point that it > seems preferable to degenerate it as a string! Yes, this is the kind of code I'd like to get rid of. >> The only case where we do something interesting is `Error (Unknown_key >> _), where we note that it was a read: >> >> | `Error (TMPL.Unknown_key _) -> fail (Failure ("read " ^ name)) >> >> However, Unknown_key wouldn't be an exception under any proposed scheme >> anyway > > In the previous discussion, I had the impression that having a scheme > to propagate errors seamlessly while forcing certain (classes of) > errors to be handled before they reach the main loop was important. > If we do not care about this, what is wrong with the âfailâ mechanism > of Lwt? It's not yet clear whether we should care about it or not. My current feeling is that we shouldn't, and should instead adopt an Erlang-style "let it crash" policy, with a generic exception handler in the HTTP server and any similar polling loops (which appears to be effectively what we're doing anyway, since we always seem to use an exception in the end). However, there is an argument that it is more correct to consider errors where they occur and decide how to handle them at each point. If there's a way to let people who want to catch and handle individual errors do so reliably, without adding so much boilerplate that we add more bugs than we remove, I'm happy to hear it. There are some thing that should certainly be error codes (Unknown_key) and some things that I think certainly should be exceptions (`Unknown), but the ones in the middle (`Disconnected) are less obvious. Probably it doesn't matter too much. If Disconnected is an error code it's not a big problem, provided I can easily turn it into an exception if it really isn't a case I want to handle. And if it's an exception then that's fine, as long as I can easily catch it and handle it. -- Dr Thomas Leonard http://0install.net/ GPG: 9242 9807 C985 3C07 44A6 8B9A AE07 8280 59A5 3CC1 GPG: DA98 25AE CAD0 8975 7CDA BD8E 0713 3F96 CA74 D8BA _______________________________________________ MirageOS-devel mailing list MirageOS-devel@xxxxxxxxxxxxxxxxxxxx http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.