Xen project Mailing List

Re: [MirageOS-devel] Error handling in Mirage - request for comments!

To: Daniel BÃnzli <daniel.buenzli@xxxxxxxxxxxx>

From: Thomas Leonard <talex5@xxxxxxxxx>

Date: Sat, 31 Jan 2015 21:07:22 +0000

Cc: "mirageos-devel@xxxxxxxxxxxxxxxxxxxx" <mirageos-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Sat, 31 Jan 2015 21:07:31 +0000

List-id: Developer list for MirageOS <mirageos-devel.lists.xenproject.org>

On 30 January 2015 at 22:07, Daniel BÃnzli <daniel.buenzli@xxxxxxxxxxxx> wrote: > > > Le vendredi, 30 janvier 2015 Ã 16:53, Thomas Leonard a Ãcrit : > >> What problems do you see in Lwt's error handling? >> > > [...] > >> Isn't Lwt already an error monad? Can you define "well principled" here? > > > I meant the way they deal with exceptions (and cancelation). I think they > should have lifted the error handling to value land and told users not to use > exceptions (catch them between yields if you need to use them or use code > that uses them) rather than try to cope with them. Besides, the fact that > they use exceptions to perform cancellation itself leads to further > absurdities (which seems to indicate that cancelation was afterthought, but > that's the kind of concept you need think about from day 0 in a system to get > to something), as I wrote elsewhere: > > "Lwt has both cancelable and non-cancelable threads and uses an exception for > thread cancellation. Sometimes this may lead to surprising results e.g. > Lwt.pick [t, t'] may return a cancelled thread if t terminated and t' was > cancelled." > > Lwt's combinator algebra is broken: a cancelled thread should be a neutral > element for Lwt.pick. Sure, but I don't see what difference any of this cancellation stuff makes. Layering a second error monad on top of Lwt won't un-break cancellation. Let's say someone sends me an HTTP request and I do: with_gnt (process x) For some reason, perhaps a timeout, someone cancels x (which we agree is a bad idea, but it happens). I would expect: 1. "process" probably forgets to catch the Cancelled exception and Lwt propagates it. 2. "with_gnt" releases the grant. 3. the HTTP handler logs the exception and returns a "server error" response to the caller. 4. the unikernel continues serving requests. We can agree this isn't perfect. If Cancelled had been a return type we could have handled it gracefully and retried or given the user a better message. Since it's something people often want to handle specially, Cancelled shouldn't be an exception in my view. But, that's the way it is. If we lay a second error monad on Lwt and have with_gnt and the HTTP server only match on those errors, the Cancelled exception will terminate the unikernel. That doesn't seem better to me. So, whether we continue to use Lwt.cancel or not, I don't see it as an argument for separating error handling out of Lwt. But the real problem here is that by making exceptions fatal we turn every exception used in existing OCaml code into a security vulnerability. To kill my service, an attacker only needs to find some way of triggering some code path to throw an exception. My upload service uses Int64.of_string to read the Content-Length. Currently, if you give it a non-integer length it will log an exception but continue serving requests. I consider that to be correct behaviour (for my service). You can argue that of_string should return an option or an error code (I agree), but it doesn't and there are plenty more cases like this. Some are built in and hard to remove, such as out-of-memory or division-by-zero. Many exist in libraries. - If I tried to tell the user what percentage of their file had been uploaded, they could crash my unikernel with a zero-length file. - If I accepted JSON, they could crash it with a malformed message (Yojson's from_string throws). - If I accepted XML, they could send invalid XML (even xmlm throws). - And even if the XML parser reports exceptions, what if the unicode library it uses throws? How can I write a secure (available) unikernel if every exception turns into a crash? And remember, stopping the unikernel means e.g. stopping all block device access mid-flow. For example, if the filesystem is updating the disk then we will stop part way through. Yes, a good journalling FS will recover on reboot, but letting an attacker crash it at will doesn't seem sensible. In some cases, of course, you might want a different trade off. A service holding top-secret documents probably should stop if anything throws an exception (or, at least, unplug the network device and log the problem). But the development effort will be much higher and you don't let non-privileged users touch the service at all, which simplifies things. > My point here is that I think mirage should define it's own error monad and > solve error handling in value land (w.r.t. the concurrency monads), > independently from the higher-level concurrency primitives used. By providing > appropriate combinators and other resource holding combinators that ensure a > resource is only held in a given scope and that interact with that error > monad so that we should not ever see that kind of horrible code: > > https://github.com/mirage/mirage-www/pull/274/files#diff-abb8c0c60c75065f86ff29942a483ec4R88 > > and automatically guide the programmer in doing the right thing. OK, but how? Do you have a concrete proposal? What would your error monad do differently to Lwt that would make it work where Lwt doesn't? >> I'd agree if that was "abort the current transaction (which may cover >> the whole unikernel)". Realistically, there are always going to be >> error conditions that result in exceptions that should not terminate >> it (e.g. running out of memory serving requests should only abort some >> requests, etc). > > > Then these things should not be exceptions but be threaded through an > appropriate error monad in combination with resource holding combinators to > correctly relinquish held resources. > >> If I understand your position: >> >> - Every exception raised MUST terminate the unikernel. This includes >> out-of-memory, division-by-zero, int_of_string on an out-of-range int, >> etc, in any code path. Aborting the operation (e.g. HTTP request) that >> caused the problem, logging the exception and continuing is not an >> option. > > Yes. That is, anything non recoverable. int_of_string should in fact never > have raised in the first place but have returned an option type. > >> - It is therefore acceptable for a module to leak resources and/or >> leave the system in an invalid state if it it receives an exception >> from any code it calls. > > No. If you need to recover then don't use exceptions. Use an error monad and > resource holding combinators. I'm confused that you present these as alternatives. The Lwt error monad uses exceptions as values. It seems like you need some open type at any rate, and in 4.01 exceptions are the only option. (a universal type works if you give up matching) >> I don't think exception safety is much work in most >> cases (I always try to do this in my own code, anyway). > > > If by exception safety you mean, handling exceptions correctly. Then no. > There's always the exception you forget to handle [...] That's not what I mean. I mean handling *any* exception (including ones you didn't know about). I mean using try_lwt..finally and similar that don't depend on the specific types of exceptions. In Mirage terms, it can't be any more work because you already have to handle `Unknown. Whatever you do for that, do the same for Lwt.Fail (and this is why I'd like the unify them). Pure code is always exception safe, since there can be no global changes to roll back. Otherwise, it basically just means anything that allocates an external resource must also free it (try..finally or "with_*"). >, you have to put the handlers at the right place, and there's a lack of >locality in program understanding (because exceptions may flow beyond a >handler if it's not handled by it) that's too much thinking and subtelties. >Besides there's no exhaustiveness check for exceptions. That's the reason why >I prefer to have a monad that forces you into doing the right things by type >direction. Exhaustiveness is great for concrete implementations, but that doesn't work for Mirage interfaces. We can't know what errors an implementation of KV_RO may need to report, for example. You can add `Unknown, but that defeats the point of exhaustiveness. >> I would however replace all the network codes with a generic >> (`Network_error of exn), where the exn might be e.g. a Refused >> exception with more information about why it was refused. This makes >> it easy for callers who don't care to handle them all at once (with >> Lwt.fail), allows extra network errors to be added by implementations, >> and allows attaching more details about the causes. > > I'm really not fond of this. You can use open polymorphic variants for this > or a closed one with a few known common cases and a custom one with a > printable universal type. But then it seems you are designing with the idea > of using Lwt's failed state as an error mechanism which I disagree with. This is possible, but there are trade-offs: 1. It's hard to propagate the error when you don't care (99% of the cases I suspect). How would you rewrite the bad code you linked above in mirage-www with this scheme? 2. How do you attach extra detail to the cases (a universal 'cause' nested in each one)? 3. You can't match on a printable universal type even if you know the concrete implementation. Anyway, it would be great if you could write up your proposal so I can add it to the list. These schemes all have tradeoffs and it's hard to see them all together in an email thread. -- Dr Thomas Leonard http://0install.net/ GPG: 9242 9807 C985 3C07 44A6 8B9A AE07 8280 59A5 3CC1 GPG: DA98 25AE CAD0 8975 7CDA BD8E 0713 3F96 CA74 D8BA _______________________________________________ MirageOS-devel mailing list MirageOS-devel@xxxxxxxxxxxxxxxxxxxx http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.