[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [MirageOS-devel] Irmin API newbie questions



On 4 March 2015 at 16:40, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
>> I've now extended this to support events, so if the user has the page
>> open in two tabs they stay in sync. However, there's still the problem
>> that you can't remove a watch, so once you start watching you have to
>> keep on doing so. Not a problem for my app (I just watch the master
>> branch) but might be a problem in general.
>
> Currently, if you do not use an event stream anymore, the stream reader 
> should be GCed automatically which will cause the stream writer (Irmin) to be 
> clean-up on the next write. However I'm not totally sure the readers are 
> properly GCed, so I need to check that before the release[1]. Did you notice 
> some leaks there? But yes, you as don't control when the GC happen it might 
> be better to leave the watch deallocation to the user. I need to think about 
> that.
>
> [1] https://github.com/mirage/irmin/issues/124

Sorry, I meant in the back-end. I have:

  let watch_all t =
    ignore (Lazy.force (t.listener));
    W.watch_all t.w

https://github.com/talex5/cuekeeper/blob/354a71805c3fc701b511d0d753524ac49430320b/html_storage.ml#L194

Forcing t.listener does "Dom.addEventListener Dom_html.window
Storage.event", which starts sending things to W.notify. However, I
have no way to know when events aren't needed any more so I can
disable the listener.

It's lazy so that only stores that have been watched at some point
create a listener (Irmin seems to create stores frequently - it looks
like every branch creates a new instance of the back-end, which seems
unnecessary).

>> I've also now made my app do merges: every item displayed on the
>> screen records the Irmin commit from which it came. Usually this is
>> the latest version, but if you're e.g. editing a form then it's the
>> version when you started the edit. When you submit a change, it
>> commits against that revision and then merges the result back to
>> master. This means that if you make two different edits to something
>> in two tabs, the conflict will be detected instead of silently
>> discarding the first change. (Ideally I want to resolve the conflicts,
>> but for now it just discards the second change and reports the error
>> to the user).
>>
>> I'm not sure I'm using the API correctly though. Here's my code for
>> monitoring the master branch and updating the display when it changes:
>>
>> https://github.com/talex5/cuekeeper/blob/83a5e3c146f2406b8491b1c9915cb7fef6753869/ck_update.ml#L30
>>
>>  let make ~on_update branch =
>>    let watch_tags = I.watch_tags (branch "Watch master") in
>>    get_head_commit branch >>= fun initial_head ->
>>    let mutex = Lwt_mutex.create () in
>>    match I.branch (branch "Get branch name") with
>>    | `Head _ -> failwith "Not a tag!"
>>    | `Tag branch_name ->
>>    let t = {
>>      branch;
>>      head = initial_head;
>>      updated = Lwt_condition.create ();
>>      mutex;
>>    } in
>>    async (fun () ->
>>      on_update >>= fun on_update ->
>>      watch_tags |> Lwt_stream.iter_s (function
>>        | (n, Some _commit) when n = branch_name ->
>>            (* (ignore the commit ID in the update message; we want
>> the latest) *)
>>            Lwt_mutex.with_lock t.mutex (fun () ->
>>              I.head (branch "Get latest commit") >>= function
>>              | None -> Ck_utils.error "Branch '%s' has disappeared!"
>> branch_name
>>              | Some head ->
>>                  if head <> t.head then (
>>                    t.head <- head;
>>                    on_update head >>= fun () ->
>>                    Lwt_condition.broadcast t.updated ();
>>                    return ()
>>                  ) else (
>>                    return ()
>>                  )
>>            )
>>        | _ -> return ()
>>      )
>>    );
>>    return t
>
> Hum, this clearly demonstrates that I need to improve the watch API ... I 
> completely agree that getting the initial head is important: we need 
> something similar to test_and_set at this level I think. I'm happy to move 
> away from Lwt streams and use a more conventional callback API, where it's 
> easier to control the scheduling (ie. hide the stream and expose only your 
> "make" function directly in Irmin). I need to check with Dave what he needs 
> exactly for Xenstore watches...
>
>> Some points you have to worry about here:
>>
>> - Irmin makes it easy to read data directly from a branch (tag), but
>> this is usually unsafe, since a tag can update while you're reading
>> it. Therefore, I think I always have to first get the head commit,
>> then create a new store for that revision, then read from that. It
>> might be safer if a branch just provided a way to get a commit, and
>> reads had to be done from a fixed commit.
>
> Hum, reading a tag should not be unsafe as the RW backend should ensure that 
> updates are atomic. See Get_unix.with_write_file[2] for example where we use 
> the fact that "rename" is the only operation guaranteed to be atomic on POSIX 
> filesystem.

That's OK if you're only reading a single key, but usually you'll want
a consistent set of values, all from the same commit. e.g. if I do

  I.read t "key1" >>= fun value1 ->
  I.read t "key2" >>= fun value2 ->

then value1 and value2 will usually but not always be from the same
commit, which is an easy mistake to make. This would be safer:

  Branch.head_commit t >>= fun c ->
  I.read c "key1" >>= fun value1 ->
  I.read c "key2" >>= fun value2 ->

> Not sure which guarantee do you have for HTML5 local storage though ... I'm 
> afraid you don't have many[3].

> [2] https://github.com/mirage/ocaml-git/blob/master/lib/unix/git_unix.ml#L300
> [3] 
> http://balpha.de/2012/03/javascript-concurrency-and-locking-the-html5-localstorage/

I think that only matters when the read and write happen in different
event loop turns. The spec says:

http://www.w3.org/TR/webstorage/#threads

"Because of the use of the storage mutex, multiple browsing contexts
will be able to access the local storage areas simultaneously in such
a manner that scripts cannot detect any concurrent script execution."

So I think I can do an atomic test-and-set, but I can't hold a lock
while waiting for an async event. I'd be happy to have something like:

let rec merge master view =
  Branch.head master >>= fun old_head ->
  let merged = merge old_head view in
  Branch.update_tag master ~from:old_head merged >>= function
  | `Ok -> ()
  | `Out_of_date -> merge master view

>> - I only need to be notified once after a series of changes. I discard
>> the commit in the watch notification and always read the latest.
>> Otherwise, I might update the GUI to show stale data. Since a merge
>> can add many commits at once, the stream API doesn't seem useful (it
>> still doesn't ensure you see each commit individually, if that's what
>> you wanted).
>
> In some cases this might be useful (if you store incremental updates that you 
> want to replay) but I agree that this is not terribly useful in the main case.
>
>> - The functions that update the store shouldn't return until the
>> update has been see by the model ("on_update" has been called). I
>> therefore create a mutex and a Lwt_condition so that the commit
>> functions can be sure their change has been seen.
>
> agreed and I think that's a good property to move into the watch API.

I'm not sure this bit should be moved to the main library. In many
cases you wouldn't care about waiting for the updates to be seen. But
we could have a wrapper for the branch API that provides this
(operations that set the branch pointer don't return until the
notification callback has run, or isn't going to run if the head
didn't move).

>> All the code for changing the model goes via "merge_to_master". This
>> takes a "base" argument (the commit which the user made the change
>> from), creates a view (temporary branch), runs "fn" to make the
>> requested change (which should always succeed), merges the temporary
>> branch back to master, and waits for the on_update to notice it:
>>
>>  let merge_to_master t ~base ~msg fn =
>>    let path = I.Key.empty in
>>    R.make_view base >>= fun view ->
>>    fn view >>= fun result ->
>>    Lwt_mutex.with_lock t.mutex (fun () ->
>>      (* At this point, head cannot contain our commit because we
>> haven't made it yet,
>>       * and no updates can happen while we hold the lock. *)
>>      let old_head = t.head in
>>      let updated = Lwt_condition.wait t.updated in
>>      R.V.merge_path (t.branch msg) path view >>= function
>>      | `Conflict msg -> Ck_utils.error "Conflict during merge: %s
>> (discarding change)" msg
>>      | `Ok () ->
>>      I.head (t.branch "Get latest commit") >|= function
>>      | Some new_head when new_head <> old_head ->
>>          (* [updated] cannot have fired yet because we still hold the
>> lock. When it does
>>           * fire next, it must contain our update. It must fire soon,
>> as head has changed. *)
>>          updated
>>      | _ ->
>>          (* Our change had no effect, so don't wait for head to move.
>>           * Or, the branch was deleted - no point waiting then either. *)
>>          return ()
>>    ) >>= fun updated ->    (* Changes have been committed. *)
>>    updated >>= fun () ->   (* [on_update] has been called. *)
>>    return result
>>
>> Does this approach seem sensible? I feel like the API perhaps wasn't
>> intended to be used this way...
>
> I can see why you are doing all of this, and this is certainly not optimal 
> for an API point of view. One think I can do to help is to add an invariant 
> to all the high-level updates (such as merge) functions to only perform the 
> whole update only if the head didn't change since the beginning of the 
> operations. It's an easy check to add and it will help a lot to reason on 
> what happens during concurrent updates. The only reason I didn't do it before 
> is that I was worried about starvation issues and other concurrency 
> scheduling related issues - but I guess they are better than data loss ... 
> and we could just picky bag on Lwt_mutex / Lwt_condition properties.
>
> [4] https://github.com/mirage/irmin/issues/157
>
>> I'm still worried about losing data due to races. JavaScript (I think)
>> ensures that access to local storage from different threads doesn't
>> get interleaved, but since Irmin uses Lwt we don't benefit from this.
>> I suspect that if two tabs try to merge at once then one's commit will
>> be silently discarded. Probably the back-end update function should
>> take the expected old value as an argument so it can abort if the
>> state has changed since then.
>
> Not sure to understand what Lwt exactly changes there. I'm happy to modify 
> the signature of the backends to add more checks, but modifying the 
> "RW.update" signature to take an optional current value to test would mean 
> modifying the "Irmin.update" signature as well. Not sure that's exactly what 
> we want. Could add a new "RW.test_and_set" function though.[4] I'll think 
> about it.

That would be good.

> [5] https://github.com/mirage/irmin/issues/155
>
>> It would also be useful to get the commit object from the merge before
>> updating the branch. That way, I could try loading it as a sanity
>> check (and thus reject any update that would result in an unreadable
>> state). I guess I need to create another temporary branch, merge to
>> that, then force-update master to it. Seems like a lot of work,
>> though.
>
> I think it's easy to add a "try_merge" function which do not update the head. 
> The current way to do that is indeed to create a temporary branch.
>
> [6] https://github.com/mirage/irmin/issues/156
>
>> Anyway, despite these minor issues, Irmin is working great so far!
>
> As usual, thanks for the very useful feedback! That'll definitely help 
> improving the watch API and the overhaul usability.

Thanks!


-- 
Dr Thomas Leonard        http://0install.net/
GPG: 9242 9807 C985 3C07 44A6  8B9A AE07 8280 59A5 3CC1
GPG: DA98 25AE CAD0 8975 7CDA  BD8E 0713 3F96 CA74 D8BA

_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.