[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [Xen-changelog] [xen master] libxl: event: Fix hang when mixing blocking and eventy calls
commit 5a533d0ba575e3ba7536b22bc3bc156e9b8a679b Author: Ian Jackson <ian.jackson@xxxxxxxxxxxxx> AuthorDate: Fri Jan 10 12:37:43 2020 +0000 Commit: Ian Jackson <ian.jackson@xxxxxxxxxxxxx> CommitDate: Mon Jan 27 16:03:17 2020 +0000 libxl: event: Fix hang when mixing blocking and eventy calls If the application calls libxl with ao_how==0 and also makes calls like _occurred, libxl will sometimes get stuck. The bug happens as follows (for example): Thread A libxl_do_thing(,ao_how==0) libxl_do_thing starts, sets up some callbacks libxl_do_thing exit path calls AO_INPROGRESS libxl__ao_inprogress goes into event loop eventloop_iteration sleeps on: - do_thing's current fd set - sigchld pipe if applicable - its poller Thread B libxl_something_occurred the something is to do with do_thing, above do_thing_next_callback does some more work do_thing_next_callback becomes interested in fd N thread B returns to application Note that nothing wakes up thread A. A is not listening on fd N. So do_thing_* will not spot when fd N signals. do_thing will not make further timely progress. If there is no timeout thread A will never wake up. The problem here occurs because thread A is waiting on an out of date osevent set. There is also the possibility that a thread might block waiting for libxl osevents but outside libxl, eg if the application used libxl_osevent_beforepoll. We will deal with that in a moment. See the big comment in libxl_event.c for a fairly formal correctness argument. This depends on libxl__egc_ao_cleanup_1_baton being called everywhere an egc or ao is disposed of. Firstly egcs: in this patch we rename libxl__egc_cleanup, which means we catch all the disposal sites. Secondly aos: these are disposed of by (i) AO_CREATE_FAIL (ii) ao__inprogress and (iii) an event which completes the ao later. (i) and (ii) we handle by adding the call to _baton. In the case of (iii) any such function must be an event-generating function so it has an egc too, so it will pass on the baton when the egc is disposed. Reported-by: George Dunlap <george.dunlap@xxxxxxxxxx> Signed-off-by: Ian Jackson <ian.jackson@xxxxxxxxxxxxx> Reviewed-by: George Dunlap <george.dunlap@xxxxxxxxxx> Tested-by: George Dunlap <george.dunlap@xxxxxxxxxx> --- v2: Call libxl__egc_ao_cleanup_1_baton (renamed from __egc_cleanup) on all exits from ao_inprogress, even requests for async processing. Fixes a remaining instance of this bug (!) This involves disposing of ao->poller somewhat earlier. v2: New correctness arguments in libxl_event.c comment and in commit message. --- tools/libxl/libxl_event.c | 178 ++++++++++++++++++++++++++++++++++++++++--- tools/libxl/libxl_internal.h | 33 ++++++-- 2 files changed, 194 insertions(+), 17 deletions(-) diff --git a/tools/libxl/libxl_event.c b/tools/libxl/libxl_event.c index 268a5da120..b50d4e5074 100644 --- a/tools/libxl/libxl_event.c +++ b/tools/libxl/libxl_event.c @@ -37,6 +37,140 @@ static void ao__check_destroy(libxl_ctx *ctx, libxl__ao *ao); /* + * osevent update baton handling + * + * We need the following property (the "unstale liveness property"): + * + * Whenever any thread is blocking in the libxl event loop[1], at + * least one thread must be using an up to date osevent set. It is OK + * for all but one threads to have stale event sets, because so long + * as one waiting thread has the right event set, any actually + * interesting event will, if nothing else, wake that "right" thread + * up. It will then make some progress and/or, if it exits, ensure + * that some other thread becomes the "right" thread. + * + * [1] TODO: Right now we are considering only the libxl event loop. + * We need to consider application event loop outside libxl too. + * + * Argument that our approach is sound: + * + * The issue we are concerned about is libxl sleeping on an out of + * date fd set, or too long a timeout, so that it doesn't make + * progress. If the property above is satisfied, then if any thread + * is waiting in libxl at least one such thread will be waiting on a + * sufficient osevent set, so any relevant osevent will wake up a + * libxl thread which will either handle the event, or arrange that at + * least one other libxl thread has the right set. + * + * There are two calls to poll in libxl: one is the fd recheck, which + * is not blocking. There is only the one blocking call, in + * eventloop_iteration. poll runs with the ctx unlocked, so osevents + * might be added after it unlocks the ctx - that is what we are + * worried about. + * + * To demonstrate that the unstale liveness property is satisfied: + * + * We define a baton holder as follows: a libxl thread is a baton + * holder if + * (a) it has an egc or an ao and holds the ctx lock, or + * (b) it has an active non-app poller and no osevents have been + * added since it released the lock, or + * (c) it has an active non-app poller which has been woken + * (by writing to its pipe), so it will not sleep + * We will maintain the invariant (the "baton invariant") that + * whenever there is any active poller, there is at least + * one baton holder. ("non-app" means simply "not poller_app".) + * + * No thread outside libxl can have an active non-app poller: pollers + * are put on the active list by poller_get which is called in three + * places: libxl_event_wait, which puts it before returning; + * libxl__ao_create but only in the synchronous case, in which case + * the poller is put before returning; and the poller_app, during + * initialisation. + * + * So any time when all libxl threads are blocking (and therefore do + * not have the ctx lock), the non-app active pollers belong to those + * threads. If at least one is a baton holder (the invariant), that + * thread has a good enough event set. + * + * Now we will demonstrate that the "baton invariant" is maintained: + * + * The rule is that any thread which might be the baton holder is + * responsible for checking that there continues to be a baton holder + * as needed. + * + * Firstly, consider the case when the baton holders (b) cease to be + * baton holders because osevents are added. + * + * There are only two kinds of osevents: timeouts and fds. Every + * other internal event source reduces to one of these eventually. + * Both of these cases are handled (in the case of fd events, add and + * modify, separately), calling pollers_note_osevent_added. + * + * This walks the poller_active list, marking the active pollers + * osevents_added=1. Such a poller cannot be the baton holder. But + * pollers_note_osevent_added is called only from ev_* functions, + * which are only called from event-chain libxl code: ie, code with an + * ao or an egc. So at this point we are a baton holder, and there is + * still a baton holder. + * + * Secondly, consider the case where baton holders (a) cease to be + * batton holders because they dispose of their egc or ao. We call + * libxl__egc_ao_cleanup_1_baton on every exit path. We arrange that + * everything that disposes of an egc or an ao checks that there is a + * new baton holder by calling libxl__egc_ao_cleanup_1_baton. + * + * This function handles the invariant explicitly: if we have any + * non-app active pollers it looks for one which is up to date (baton + * holder category (b)), and failing that it picks a victim to turn + * into the baton holder category (c) by waking it up. (Correctness + * depends on this function not spotting its own thread as the + * baton-holder, since it is on its way to not being the baton-holder, + * so it must be called after the poller has been put back.) + * + * Thirdly, we must consider the case (c). A thread in category (c) + * will reenter libxl when it gains the lock and necessarily then + * becomes a baton holder in category (a). + * + * So the "baton invariant" is maintained. QED. + */ +static void pollers_note_osevent_added(libxl_ctx *ctx) { + libxl__poller *poller; + LIBXL_LIST_FOREACH(poller, &ctx->pollers_active, active_entry) + poller->osevents_added = 1; +} + +void libxl__egc_ao_cleanup_1_baton(libxl__gc *gc) + /* Any poller we had must have been `put' already. */ +{ + libxl__poller *search, *wake=0; + + LIBXL_LIST_FOREACH(search, &CTX->pollers_active, active_entry) { + if (search == CTX->poller_app) + /* This one is special. We can't give it the baton. */ + continue; + if (!search->osevents_added) + /* This poller is up to date and will wake up as needed. */ + return; + if (!wake) + wake = search; + } + + if (!wake) + /* no-one in libxl waiting for any events */ + return; + + libxl__poller_wakeup(gc, wake); + + wake->osevents_added = 0; + /* This serves to make _1_baton idempotent. It is OK even though + * that poller may currently be sleeping on only old osevents, + * because it is going to wake up because we've just prodded it, + * and it pick up new osevents on its next iteration (or pass + * on the baton). */ +} + +/* * The counter osevent_in_hook is used to ensure that the application * honours the reentrancy restriction documented in libxl_event.h. * @@ -194,6 +328,7 @@ int libxl__ev_fd_register(libxl__gc *gc, libxl__ev_fd *ev, ev->func = func; LIBXL_LIST_INSERT_HEAD(&CTX->efds, ev, entry); + pollers_note_osevent_added(CTX); rc = 0; @@ -214,6 +349,8 @@ int libxl__ev_fd_modify(libxl__gc *gc, libxl__ev_fd *ev, short events) rc = OSEVENT_HOOK(fd,modify, noop, ev->fd, &ev->nexus->for_app_reg, events); if (rc) goto out; + if ((events & ~ev->events)) + pollers_note_osevent_added(CTX); ev->events = events; rc = 0; @@ -315,6 +452,7 @@ static int time_register_finite(libxl__gc *gc, libxl__ev_time *ev, LIBXL_TAILQ_INSERT_SORTED(&CTX->etimes, entry, ev, evsearch, /*empty*/, timercmp(&ev->abs, &evsearch->abs, >)); + pollers_note_osevent_added(CTX); return 0; } @@ -1121,6 +1259,7 @@ static int beforepoll_internal(libxl__gc *gc, libxl__poller *poller, *nfds_io = used; poller->fds_deregistered = 0; + poller->osevents_added = 0; libxl__ev_time *etime = LIBXL_TAILQ_FIRST(&CTX->etimes); if (etime) { @@ -1442,7 +1581,7 @@ static void egc_run_callbacks(libxl__egc *egc) } } -void libxl__egc_cleanup(libxl__egc *egc) +void libxl__egc_cleanup_2_ul_cb_gc(libxl__egc *egc) { EGC_GC; egc_run_callbacks(egc); @@ -1752,13 +1891,15 @@ int libxl_event_wait(libxl_ctx *ctx, libxl_event **event_r, rc = eventloop_iteration(egc, poller); if (rc) goto out; - /* we unlock and cleanup the egc each time we go through this loop, - * so that (a) we don't accumulate garbage and (b) any events - * which are to be dispatched by callback are actually delivered - * in a timely fashion. + /* we unlock and cleanup the egc each time we go through this + * loop, so that (a) we don't accumulate garbage and (b) any + * events which are to be dispatched by callback are actually + * delivered in a timely fashion. _1_baton will be + * called to pass the baton iff we actually leave; otherwise + * we are still carrying it. */ CTX_UNLOCK; - libxl__egc_cleanup(egc); + libxl__egc_cleanup_2_ul_cb_gc(egc); CTX_LOCK; } @@ -2031,14 +2172,24 @@ int libxl__ao_inprogress(libxl__ao *ao, * synchronous cancellation ability. */ } + /* The call to egc..1_baton is below, only if we are leaving. */ CTX_UNLOCK; - libxl__egc_cleanup(&egc); + libxl__egc_cleanup_2_ul_cb_gc(&egc); CTX_LOCK; } + + /* Dispose of this early so libxl__egc_ao_cleanup_1_baton + * doesn't mistake us for a baton-holder. No-one much is + * going to look at this ao now so setting this to 0 is fine. + * We can't call _baton below _leave because _leave destroys + * our gc, which _baton needs. */ + libxl__poller_put(CTX, ao->poller); + ao->poller = 0; } else { rc = 0; } + libxl__egc_ao_cleanup_1_baton(gc); ao->in_initiator = 0; ao__manip_leave(CTX, ao); @@ -2051,6 +2202,9 @@ int libxl__ao_inprogress(libxl__ao *ao, static int ao__abort(libxl_ctx *ctx, libxl__ao *parent) /* Temporarily unlocks ctx, which must be locked exactly once on entry. */ { + libxl__egc egc; + LIBXL_INIT_EGC(egc,ctx); + int rc; ao__manip_enter(parent); @@ -2071,9 +2225,6 @@ static int ao__abort(libxl_ctx *ctx, libxl__ao *parent) /* We keep calling abort hooks until there are none left */ while (!LIBXL_LIST_EMPTY(&parent->abortables)) { - libxl__egc egc; - LIBXL_INIT_EGC(egc,ctx); - assert(!parent->complete); libxl__ao_abortable *abrt = LIBXL_LIST_FIRST(&parent->abortables); @@ -2086,15 +2237,20 @@ static int ao__abort(libxl_ctx *ctx, libxl__ao *parent) "ao %p: abrt=%p: aborting", parent, abrt->ao); abrt->callback(&egc, abrt, ERROR_ABORTED); + /* The call to egc..1_baton is in the out block below. */ libxl__ctx_unlock(ctx); - libxl__egc_cleanup(&egc); + libxl__egc_cleanup_2_ul_cb_gc(&egc); libxl__ctx_lock(ctx); } rc = 0; out: + libxl__egc_ao_cleanup_1_baton(&egc.gc); ao__manip_leave(ctx, parent); + /* The call to egc..2_ul_cb_gc is above. This is sufficient + * because only code inside the loop adds anything to the egc, and + * we ensures that the egc is clean when we leave the loop. */ return rc; } diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 0c8ed8d9f6..f2ff5e6c2d 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -634,9 +634,23 @@ struct libxl__poller { * event is deregistered, we set the fds_deregistered of all non-idle * pollers. So afterpoll can tell whether any POLLNVAL is * plausibly due to an fd being closed and reopened. + * + * Additionally, we record whether any fd or time event sources + * have been registered. This is necessary because sometimes we + * need to wake up the only libxl thread stuck in + * eventloop_iteration so that it will pick up new fds or earlier + * timeouts. osevents_added is cleared by beforepoll, and set by + * fd or timeout event registration. When we are about to leave + * libxl (strictly, when we are about to give up an egc), we check + * whether there are any pollers. If there are, then at least one + * of them must have osevents_added clear. If not, we wake up the + * first one on the list. Any entry on pollers_active constitutes + * a promise to also make this check, so the baton will never be + * dropped. */ LIBXL_LIST_ENTRY(libxl__poller) active_entry; bool fds_deregistered; + bool osevents_added; }; struct libxl__gc { @@ -2350,7 +2364,10 @@ _hidden libxl_device_model_version libxl__default_device_model(libxl__gc *gc); LIBXL_STAILQ_INIT(&(egc).ev_immediates); \ } while(0) -_hidden void libxl__egc_cleanup(libxl__egc *egc); +_hidden void libxl__egc_ao_cleanup_1_baton(libxl__gc *gc); + /* Passes the baton for added osevents. See comment for + * osevents_added in struct libxl__poller. */ +_hidden void libxl__egc_cleanup_2_ul_cb_gc(libxl__egc *egc); /* Frees memory allocated within this egc's gc, and and report all * occurred events via callback, if applicable. May reenter the * application; see restrictions above. The ctx must be UNLOCKED. */ @@ -2361,9 +2378,11 @@ _hidden void libxl__egc_cleanup(libxl__egc *egc); libxl__egc egc[1]; LIBXL_INIT_EGC(egc[0],ctx); \ EGC_GC -#define EGC_FREE libxl__egc_cleanup(egc) - -#define CTX_UNLOCK_EGC_FREE do{ CTX_UNLOCK; EGC_FREE; }while(0) +#define CTX_UNLOCK_EGC_FREE do{ \ + libxl__egc_ao_cleanup_1_baton(&egc->gc); \ + CTX_UNLOCK; \ + libxl__egc_cleanup_2_ul_cb_gc(egc); \ + }while(0) /* @@ -2468,8 +2487,9 @@ _hidden void libxl__egc_cleanup(libxl__egc *egc); #define AO_INPROGRESS ({ \ libxl_ctx *ao__ctx = libxl__gc_owner(&ao->gc); \ + /* __ao_inprogress will do egc..1_baton if needed */ \ CTX_UNLOCK; \ - EGC_FREE; \ + libxl__egc_cleanup_2_ul_cb_gc(egc); \ CTX_LOCK; \ int ao__rc = libxl__ao_inprogress(ao, \ __FILE__, __LINE__, __func__); \ @@ -2481,8 +2501,9 @@ _hidden void libxl__egc_cleanup(libxl__egc *egc); libxl_ctx *ao__ctx = libxl__gc_owner(&ao->gc); \ assert(rc); \ libxl__ao_create_fail(ao); \ + libxl__egc_ao_cleanup_1_baton(&egc->gc); \ libxl__ctx_unlock(ao__ctx); /* gc is now invalid */ \ - EGC_FREE; \ + libxl__egc_cleanup_2_ul_cb_gc(egc); \ (rc); \ }) -- generated by git-patchbot for /home/xen/git/xen.git#master _______________________________________________ Xen-changelog mailing list Xen-changelog@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/xen-changelog
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |