[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [MirageOS-devel] Error handling in Mirage - request for comments!

On 3 February 2015 at 23:16, Christophe Troestler
<Christophe.Troestler@xxxxxxxxxxx> wrote:
> On Mon, 2 Feb 2015 12:30:39 +0000, Thomas Leonard wrote:
>> On 2 February 2015 at 00:45, Christophe Troestler
>> <Christophe.Troestler@xxxxxxxxxxx> wrote:
>> > On Sat, 31 Jan 2015 21:07:22 +0000, Thomas Leonard wrote:
>> >>
>> >> [â] But the real problem here is that by making exceptions fatal we
>> >> turn every exception used in existing OCaml code into a security
>> >> vulnerability. [â]
>> >
>> > Since OCaml supports exceptions, I think there is no doubt that we
>> > must handle them effectively â not just wipe them under the carpet and
>> > let them abort the program/unikernel.  Ideally Lwt should deal
>> > sensibly with exceptions (and the discussion seems to imply there are
>> > areas of improvement â but it is unclear to me how you can do that
>> > efficiently without using a try block in all binds).
>> Lwt already puts a try block in all (sleeping) binds, so this is
>> handled for you:
>> let bind t f =
>>   let t = repr t in
>>   match t.state with
>>     | Return v ->
>>         f v
>>     | Fail _ as state ->
>>         thread { state }
>>     | Sleep sleeper ->
>>         let res = temp t in
>>         let data = !current_data in
>>         add_immutable_waiter sleeper
>>           (function
>>              | Return v -> current_data := data; connect res (try f v
>> with exn -> fail exn)
>>              | Fail _ as state -> fast_connect res state
>>              | _ -> assert false);
>>         res
> Thanks for checking.  At first, the difference of treatment between
> the two âf vâ is a little odd.  Do you happen to know the reason for
> it?

I assume it's a speed optimisation, avoiding setting up an exception
handler in the fast path but at the cost of making (raised) exceptions
less predictable. Whether this trade-off is worth it, I don't know.

The case where it makes a difference is when you start a thread
without waiting for it and then get on with something else:

let process x =
  let a = (x >>= f) in
  stuff () >>= fun () ->

If x is already returned *and* f immediately throws (rather than using
fail) then we abort before calling stuff, otherwise we call it and
then resolve to a failed thread later.

Either behaviour is "safe", in that we don't leak resources. The case
we really need to avoid is where we return a thread that will never
resolve, since anything that waits on that while holding some resource
will leak it. Lwt.bind only creates such a thread (res) in the Sleep
case, so only that try..with is vital.

>> > There are however errors that you want to mandate that they are dealt
>> > with but not necessarily at the calling point.  A typical example to
>> > me is âreadâ because it is often much "higher" in the program that an
>> > appropriate response to a failing read can be made â this is important
>> > for libraries that will post-process âreadâs and will have no idea
>> > what to do if âreadâ fails.
>> Yes, this is a case where you might want an Or_error.t threaded
>> through everything. Or, it might be easier to wrap the whole thing
>> so that any error code returned by the read is thrown and then
>> caught again.
> The problem with Or_error.t is that you have to write tedious code to
> propagate it at every bind.  If we adopt this, an bind infix operator
> that propagates the error automatically is really needed (trying to
> catch with today's messages, I see that >>=? is already used for that
> purpose).  Also, with Or_error.t (or exceptions), there is no way to
> ensure that the error is eventually acted upon so why not to simply
> use âfailâ?

That makes sense to me. Arguably, Lwt's >>= is already the
error-propagating >>=?, and for people who want to handle errors we
should provide an operator that exposes Fail and Return using a

>> The problem we have in Mirage is that our types are abstract. For
>> example, if you have a functor
>>   module FS (B : BLOCK)  = ...
>> then the author of the FS functor doesn't know exactly what errors the
>> concrete B might throw. For example, if B is a remote disk then it
>> might throw network errors. So ensuring that callers enumerate all
>> errors isn't very helpful in this case.
>> You can give FS an error code of [`Block_error of B.error] and just
>> propagate it to the caller, which is what we do now. That seems quite
>> clever, but I'm struggling to imagine a case where it's actually
>> useful.
> That's what I was going to suggest...  Declaring several `Block_error
> if necessary allows to act on "classes" of errors â which is all one
> can hope for given the abstraction.
>> It's probably worth compiling a list of examples where Mirage code
>> currently matches on specific returned Error codes and takes specific
>> actions. My guess is there aren't enough to justify the complexity
>> though.
> That would be nice.  I don't have much familiarity with a lot of parts
> of Mirage so I may be missing use cases.  As I understand the
> discussion, Mirage code should be revisited, not only to see how
> things are done now but also what the problems are (with the current
> scheme) and what properties of the error handling we would like in
> these cases.

Might be useful to add some test cases to the documentation. For
example, trying to connect to a disk now results in e.g.

Fatal error: exception Failure("block1")

Ideally, I'd want something like:

Block.connect: unable to match
'/home/mirage/mirage-skeleton/block/disk.img' to any available Xen
disk (available: 'xvda1')
... attempting to connect "block1"

Currently, we work around this using "printf" calls in libraries, so
hopefully the user will see the real cause in the console output, but
that's not ideal. In a busy system, trying to correlate messages
logged at various points in the log files isn't fun.

>> [...] in every one of these cases where we match on an `Error, we just
>> propagate it (i.e. the default behaviour of an exception would have
>> been correct).
>> Here are some examples:
>> | `Error (`Invalid_console x) -> fail (Failure (Printf.sprintf 
>> "Invalid_console %s" x))
>> | `Error (`Block_device e) -> fail (Failure (Fs.string_of_block_error e))
>> | `Error (KV.Unknown_key key) -> fail (Invalid_argument key)
>> | `Error (`Unknown msg) -> Lwt.fail (Failure msg)
> I don't understand the point of this type of code.  Why wasn't âErrorâ
> declared as an exception (possibly with an exception printer) and
> âfailâ directly invoked on it?  This also highlights that people shy
> away from having to handle even a simple `Error to the point that it
> seems preferable to degenerate it as a string!

Yes, this is the kind of code I'd like to get rid of.

>> The only case where we do something interesting is `Error (Unknown_key
>> _), where we note that it was a read:
>> | `Error (TMPL.Unknown_key _) -> fail (Failure ("read " ^ name))
>> However, Unknown_key wouldn't be an exception under any proposed scheme 
>> anyway
> In the previous discussion, I had the impression that having a scheme
> to propagate errors seamlessly while forcing certain (classes of)
> errors to be handled before they reach the main loop was important.
> If we do not care about this, what is wrong with the âfailâ mechanism
> of Lwt?

It's not yet clear whether we should care about it or not. My current
feeling is that we shouldn't, and should instead adopt an Erlang-style
"let it crash" policy, with a generic exception handler in the HTTP
server and any similar polling loops (which appears to be effectively
what we're doing anyway, since we always seem to use an exception in
the end).

However, there is an argument that it is more correct to consider
errors where they occur and decide how to handle them at each point.
If there's a way to let people who want to catch and handle individual
errors do so reliably, without adding so much boilerplate that we add
more bugs than we remove, I'm happy to hear it.

There are some thing that should certainly be error codes
(Unknown_key) and some things that I think certainly should be
exceptions (`Unknown), but the ones in the middle (`Disconnected) are
less obvious.

Probably it doesn't matter too much. If Disconnected is an error code
it's not a big problem, provided I can easily turn it into an
exception if it really isn't a case I want to handle. And if it's an
exception then that's fine, as long as I can easily catch it and
handle it.

Dr Thomas Leonard        http://0install.net/
GPG: 9242 9807 C985 3C07 44A6  8B9A AE07 8280 59A5 3CC1
GPG: DA98 25AE CAD0 8975 7CDA  BD8E 0713 3F96 CA74 D8BA

MirageOS-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.