
Re: [MirageOS-devel] Irmin GC

On 15 October 2015 at 15:16, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
>> Proposal:
>> - Irmin should provide a smallish GC-safe core (BC?) that hides the
>> internal stores completely and provides an API that will not GC data
>> you're using. This API will distinguish between a "commit ID" (which
>> might or might not represent a commit in the repository) and a
>> "commit" (which refers to a commit in the repository and will prevent
>> it from being removed as long as you keep an OCaml reference to it).
>> - The core will need to provide named branches, commits and mutable
>> indexes/staging-areas that act as GC roots.
>> - Ir_view and Ir_sync should be implemented on top of this API, so
>> they don't have to worry about GC. Higher-level operations such as
>> merge should probably move to Ir_view too.
> I am not sure about the distinction between commit and commit_id. What does 
> it mean in terms of API? Do you duplicate every function to take both kinds 
> as argument?

No, you just have a single function:

    BC.Repo.commit_of_id: t -> commit_id -> commit option Lwt.t

If this returns None then the commit wasn't in the store. If it
returns Some commit then that commit will stay in the store as long as
you hold the commit value. Then e.g. "task_of_commit" doesn't need to
return an option*, because you know the task will still be there.

* (actually, task_of_commit_id currently throws an exception if the
commit isn't in the store, which isn't ideal)
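
To make the idea concrete, here is a rough, synchronous sketch (no Lwt) of
how such a core could behave. Everything in it is a toy and an assumption
for illustration: the record types, "add", "gc" and the in-memory tables
are hypothetical and are not the Irmin API. The store only remembers the
handles it has handed out via the stdlib Weak module, and its GC treats
every handle that is still alive as a root.

```ocaml
(* Toy model only: all names here are hypothetical, not the Irmin API. *)
type commit_id = string

(* A commit is a heap-allocated handle returned by [commit_of_id]. *)
type commit = { id : commit_id; task : string }

type t = {
  objects : (commit_id, string) Hashtbl.t;  (* id -> task: the "store" *)
  mutable live : commit Weak.t list;        (* handles given to callers *)
}

let create () = { objects = Hashtbl.create 16; live = [] }

let add (t : t) id task = Hashtbl.replace t.objects id task

(* Returns None if the ID is not in the store. Otherwise the store keeps
   a weak pointer to the handle so that [gc] can see whether it is still
   referenced by the caller. *)
let commit_of_id (t : t) id =
  match Hashtbl.find_opt t.objects id with
  | None -> None
  | Some task ->
    let c = { id; task } in
    let w = Weak.create 1 in
    Weak.set w 0 (Some c);
    t.live <- w :: t.live;
    Some c

(* No option needed: a commit value implies the data is present. *)
let task_of_commit (c : commit) = c.task

(* Store-level GC: keep only entries whose handle is still alive. A real
   store would also treat branches and staging areas as roots. *)
let gc (t : t) =
  t.live <- List.filter (fun w -> Weak.check w 0) t.live;
  let roots =
    List.filter_map
      (fun w -> Option.map (fun c -> c.id) (Weak.get w 0))
      t.live
  in
  Hashtbl.filter_map_inplace
    (fun id task -> if List.mem id roots then Some task else None)
    t.objects
```

An entry for which no handle was ever requested is dropped by "gc", while
one whose handle the caller still uses afterwards is kept. When a dropped
handle actually disappears depends on the OCaml runtime's own GC, so the
sketch makes no promise about how soon an unreferenced commit is swept.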

> Also, how does the user decide when to create a commit or a commit_id? 
> Persistent commit vs. non-persistent commit might make sense, but what 
> happens if the parents of a persistent commit are not persistent: do they 
> become persistent? Are they GC'ed as well?

You could think of a commit_id as like a weak ref to a commit.

> I fully agree about the GC-safe core and making Ir_view and Ir_sync use them.
>> Issues:
>> - A "commit" should keep its contents (trees and blobs) from being
>> GC'd, but what about its parents? If we want to allow shallow clones,
>> we might need to allow for a commit's parents to be missing.
> Support for shallow clones is a needed feature, I think, for performance 
> reasons, but I think it can be separated from GC issues (e.g. just assume 
> that some pointers might be dangling in the block store).
>> - GC with remote HTTP stores could be tricky. For custom protocols, GC
>> can be linked to the TCP connection, but HTTP is often spread over
>> multiple connections. Probably OK for the high-level API, but we might
>> have to remove the low one (I'm not very familiar with this REST API,
>> and so might be confused).
> Yes, kill the low-level store if that's possible. I think its main use 
> currently is for merges and to simplify the watch hooks.
> Also, historically it was the first bit to be implemented as it requires 
> little logic from the server: using the low-level API, the client is 
> responsible for doing everything (at the cost of multiple round-trips per 
> high-level operation). The high-level API gives more work to the server, but 
> it exposes less private things to the clients and is more efficient in terms 
> of round-trips.
>> - If the user runs "git gc" manually on a Git-format store then all
>> bets are off, of course. Likewise if you have a store shared by
>> multiple processes.
> I think it is important to keep multi-process safety if possible. It could 
> be as simple as the GC adding a lock file somewhere (which will stop the 
> world). If we enforce having only one Irmin process running over a local 
> store, the invariant should be checked carefully.

I don't see how multi-process can ever work if you allow anonymous branches.

> Last missing issues: temporary objects stored in the block store but not yet 
> related to GC roots:
> A. When you transform a staging area into a new commit (for instance in 
> views, but also when you do a simple update):
> (A1) iterate first over all the new blobs and tree objects to serialise them 
> in the block store and get their hash.
> (A2) create a commit object containing the new hash of the tree root, and 
> serialise it in the block store to get the commit ID
> (A3) (optionally) update a branch reference to point to the new ID.
> My main concern with external GC is that before (A3) is done, objects saved 
> in (A1) and (A2) are unsafe and can be deleted at any moment.

Taking the GC lock file should sort this out, I think. BC can provide
a commit function that takes the lock, serialises everything at once,
then releases it.
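
Roughly what I have in mind, as a single-process toy. All names here are
hypothetical; a mutable flag stands in for the lock file, and the
"serialisation" is just string concatenation hashed with the stdlib Digest
module. The point is only that steps A1 to A3 happen inside one critical
section, and that the lock is released even if serialisation raises:

```ocaml
(* Toy model only: hypothetical names, not the Irmin API. *)
type store = {
  objects : (string, string) Hashtbl.t;   (* hash -> serialised object *)
  branches : (string, string) Hashtbl.t;  (* branch -> commit hash (roots) *)
  mutable gc_locked : bool;               (* stands in for the GC lock file *)
}

let create () =
  { objects = Hashtbl.create 16; branches = Hashtbl.create 4;
    gc_locked = false }

(* Run [f] with the lock held; always release it, even on exceptions. *)
let with_gc_lock s f =
  if s.gc_locked then failwith "GC lock already held";
  s.gc_locked <- true;
  Fun.protect ~finally:(fun () -> s.gc_locked <- false) f

(* Write one object and return its content hash. *)
let put s data =
  let h = Digest.to_hex (Digest.string data) in
  Hashtbl.replace s.objects h data;
  h

(* A1-A3 under a single lock, so a GC that takes the same lock never
   observes blobs or trees that are not yet reachable from a root. *)
let commit s ~branch ~blobs ~message =
  with_gc_lock s (fun () ->
    let blob_ids = List.map (put s) blobs in                        (* A1 *)
    let tree_id = put s ("tree " ^ String.concat " " blob_ids) in   (* A1 *)
    let commit_id =
      put s (Printf.sprintf "commit %s %s" tree_id message) in      (* A2 *)
    Hashtbl.replace s.branches branch commit_id;                    (* A3 *)
    commit_id)
```

The GC would wrap its mark-and-sweep in the same "with_gc_lock". With a
real lock file shared between processes, the failwith would become a wait
or a retry.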

> B. When you merge commits:
> (B1) inductively merge blobs and tree objects, serialise them in the store to 
> get their ID
> B2, B3: same as (A2) and (A3)
> Again, (B1), (B2) are unsafe.
> Thomas

Dr Thomas Leonard        http://roscidus.com/blog/
GPG: DA98 25AE CAD0 8975 7CDA  BD8E 0713 3F96 CA74 D8BA
