[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [MirageOS-devel] Error handling in Mirage - request for comments!



On 30 January 2015 at 22:07, Daniel BÃnzli <daniel.buenzli@xxxxxxxxxxxx> wrote:
>
>
> Le vendredi, 30 janvier 2015 Ã 16:53, Thomas Leonard a Ãcrit :
>
>> What problems do you see in Lwt's error handling?
>>
>
> [...]
>
>> Isn't Lwt already an error monad? Can you define "well principled" here?
>
>
> I meant the way they deal with exceptions (and cancelation). I think they 
> should have lifted the error handling to value land and told users not to use 
> exceptions (catch them between yields if you need to use them or use code 
> that uses them) rather than try to cope with them. Besides, the fact that 
> they use exceptions to perform cancellation itself leads to further 
> absurdities (which seems to indicate that cancelation was afterthought, but 
> that's the kind of concept you need think about from day 0 in a system to get 
> to something), as I wrote elsewhere:
>
> "Lwt has both cancelable and non-cancelable threads and uses an exception for 
> thread cancellation. Sometimes this may lead to surprising results e.g. 
> Lwt.pick [t, t'] may return a cancelled thread if t terminated and t' was 
> cancelled."
>
> Lwt's combinator algebra is broken: a cancelled thread should be a neutral 
> element for Lwt.pick.

Sure, but I don't see what difference any of this cancellation stuff
makes. Layering a second error monad on top of Lwt won't un-break
cancellation.

Let's say someone sends me an HTTP request and I do:

  with_gnt (process x)

For some reason, perhaps a timeout, someone cancels x (which we agree
is a bad idea, but it happens). I would expect:

1. "process" probably forgets to catch the Cancelled exception and Lwt
propagates it.
2. "with_gnt" releases the grant.
3. the HTTP handler logs the exception and returns a "server error"
response to the caller.
4. the unikernel continues serving requests.

We can agree this isn't perfect. If Cancelled had been a return type
we could have handled it gracefully and retried or given the user a
better message. Since it's something people often want to handle
specially, Cancelled shouldn't be an exception in my view. But, that's
the way it is.

If we lay a second error monad on Lwt and have with_gnt and the HTTP
server only match on those errors, the Cancelled exception will
terminate the unikernel. That doesn't seem better to me.

So, whether we continue to use Lwt.cancel or not, I don't see it as an
argument for separating error handling out of Lwt.

But the real problem here is that by making exceptions fatal we turn
every exception used in existing OCaml code into a security
vulnerability. To kill my service, an attacker only needs to find some
way of triggering some code path to throw an exception.

My upload service uses Int64.of_string to read the Content-Length.
Currently, if you give it a non-integer length it will log an
exception but continue serving requests. I consider that to be correct
behaviour (for my service).

You can argue that of_string should return an option or an error code
(I agree), but it doesn't and there are plenty more cases like this.
Some are built in and hard to remove, such as out-of-memory or
division-by-zero. Many exist in libraries.

- If I tried to tell the user what percentage of their file had been
uploaded, they could crash my unikernel with a zero-length file.

- If I accepted JSON, they could crash it with a malformed message (Yojson's
from_string throws).

- If I accepted XML, they could send invalid XML (even xmlm throws).

- And even if the XML parser reports exceptions, what if the unicode
library it uses throws?

How can I write a secure (available) unikernel if every exception
turns into a crash?

And remember, stopping the unikernel means e.g. stopping all block
device access mid-flow. For example, if the filesystem is updating the
disk then we will stop part way through. Yes, a good journalling FS
will recover on reboot, but letting an attacker crash it at will
doesn't seem sensible.

In some cases, of course, you might want a different trade off. A
service holding top-secret documents probably should stop if anything
throws an exception (or, at least, unplug the network device and log
the problem). But the development effort will be much higher and you
don't let non-privileged users touch the service at all, which
simplifies things.

> My point here is that I think mirage should define it's own error monad and 
> solve error handling in value land (w.r.t. the concurrency monads), 
> independently from the higher-level concurrency primitives used. By providing 
> appropriate combinators and other resource holding combinators that ensure a 
> resource is only held in a given scope and that interact with that error 
> monad so that we should not ever see that kind of horrible code:
>
> https://github.com/mirage/mirage-www/pull/274/files#diff-abb8c0c60c75065f86ff29942a483ec4R88
>
> and automatically guide the programmer in doing the right thing.

OK, but how? Do you have a concrete proposal? What would your error
monad do differently to Lwt that would make it work where Lwt doesn't?

>> I'd agree if that was "abort the current transaction (which may cover
>> the whole unikernel)". Realistically, there are always going to be
>> error conditions that result in exceptions that should not terminate
>> it (e.g. running out of memory serving requests should only abort some
>> requests, etc).
>
>
> Then these things should not be exceptions but be threaded through an 
> appropriate error monad in combination with resource holding combinators to 
> correctly relinquish held resources.
>
>> If I understand your position:
>>
>> - Every exception raised MUST terminate the unikernel. This includes
>> out-of-memory, division-by-zero, int_of_string on an out-of-range int,
>> etc, in any code path. Aborting the operation (e.g. HTTP request) that
>> caused the problem, logging the exception and continuing is not an
>> option.
>
> Yes. That is, anything non recoverable. int_of_string should in fact never 
> have raised in the first place but have returned an option type.
>
>> - It is therefore acceptable for a module to leak resources and/or
>> leave the system in an invalid state if it it receives an exception
>> from any code it calls.
>
> No. If you need to recover then don't use exceptions. Use an error monad and 
> resource holding combinators.

I'm confused that you present these as alternatives. The Lwt error
monad uses exceptions as values. It seems like you need some open type
at any rate, and in 4.01 exceptions are the only option. (a universal
type works if you give up matching)

>> I don't think exception safety is much work in most
>> cases (I always try to do this in my own code, anyway).
>
>
> If by exception safety you mean, handling exceptions correctly. Then no. 
> There's always the exception you forget to handle [...]

That's not what I mean. I mean handling *any* exception (including
ones you didn't know about). I mean using try_lwt..finally and similar
that don't depend on the specific types of exceptions.

In Mirage terms, it can't be any more work because you already have to
handle `Unknown. Whatever you do for that, do the same for Lwt.Fail
(and this is why I'd like the unify them).

Pure code is always exception safe, since there can be no global
changes to roll back. Otherwise, it basically just means anything that
allocates an external resource must also free it (try..finally or
"with_*").

>, you have to put the handlers at the right place, and there's a lack of 
>locality in program understanding (because exceptions may flow beyond a 
>handler if it's not handled by it) that's too much thinking and subtelties. 
>Besides there's no exhaustiveness check for exceptions. That's the reason why 
>I prefer to have a monad that forces you into doing the right things by type 
>direction.

Exhaustiveness is great for concrete implementations, but that doesn't
work for Mirage interfaces. We can't know what errors an
implementation of KV_RO may need to report, for example. You can add
`Unknown, but that defeats the point of exhaustiveness.

>> I would however replace all the network codes with a generic
>> (`Network_error of exn), where the exn might be e.g. a Refused
>> exception with more information about why it was refused. This makes
>> it easy for callers who don't care to handle them all at once (with
>> Lwt.fail), allows extra network errors to be added by implementations,
>> and allows attaching more details about the causes.
>
> I'm really not fond of this. You can use open polymorphic variants for this 
> or a closed one with a few known common cases and a custom one with a 
> printable universal type. But then it seems you are designing with the idea 
> of using Lwt's failed state as an error mechanism which I disagree with.

This is possible, but there are trade-offs:

1. It's hard to propagate the error when you don't care (99% of the
cases I suspect). How would you rewrite the bad code you linked above
in mirage-www with this scheme?

2. How do you attach extra detail to the cases (a universal 'cause'
nested in each one)?

3. You can't match on a printable universal type even if you know the
concrete implementation.

Anyway, it would be great if you could write up your proposal so I can
add it to the list. These schemes all have tradeoffs and it's hard to
see them all together in an email thread.


-- 
Dr Thomas Leonard        http://0install.net/
GPG: 9242 9807 C985 3C07 44A6  8B9A AE07 8280 59A5 3CC1
GPG: DA98 25AE CAD0 8975 7CDA  BD8E 0713 3F96 CA74 D8BA

_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.