
Re: [MirageOS-devel] Irmin GC

> Proposal:
> - Irmin should provide a smallish GC-safe core (BC?) that hides the
> internal stores completely and provides an API that will not GC data
> you're using. This API will distinguish between a "commit ID" (which
> might or might not represent a commit in the repository) and a
> "commit" (which refers to a commit in the repository and will prevent
> it from being removed as long as you keep an OCaml reference to it).
> - The core will need to provide named branches, commits and mutable
> indexes/staging-areas that act as GC roots.
> - Ir_view and Ir_sync should be implemented on top of this API, so
> they don't have to worry about GC. Higher-level operations such as
> merge should probably move to Ir_view too.

I am not sure about the distinction between commit and commit_id. What does it 
mean in terms of the API? Do you duplicate every function so that it can take 
both kinds as argument? Also, how does the user decide when to create a commit 
or a commit_id? Persistent vs. non-persistent commits might make sense, but 
what happens if the parents of a persistent commit are not persistent: do they 
become persistent? Are they GC'ed as well?
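
To make the question concrete, here is one possible reading of the proposal, 
written as a small Python model rather than against Irmin's real API. All the 
names (Repo, Commit, lookup, gc) are hypothetical. A commit ID is a plain 
string; a Commit is a handle that pins its ID for as long as the program keeps 
a reference to it. Python's weakref stands in for the "OCaml reference" of the 
proposal, and the sketch assumes that the parents of a pinned commit are kept 
too, which is only one of the possible answers to the question above.

```python
import weakref


class Commit:
    """Live handle: while it is referenced, its ID acts as a GC root."""

    def __init__(self, commit_id):
        self.id = commit_id


class Repo:
    def __init__(self):
        self.objects = {}  # commit ID -> list of parent IDs
        # commit ID -> live handle; entries vanish when the handle is dropped
        self._live = weakref.WeakValueDictionary()

    def add(self, commit_id, parents=()):
        self.objects[commit_id] = list(parents)

    def lookup(self, commit_id):
        """Turn a plain ID into a pinning handle, or None if it is unknown."""
        if commit_id not in self.objects:
            return None
        handle = self._live.get(commit_id)
        if handle is None:
            handle = Commit(commit_id)
            self._live[commit_id] = handle
        return handle

    def gc(self):
        """Remove every commit not reachable from a live handle."""
        keep, todo = set(), list(self._live.keys())
        while todo:
            cid = todo.pop()
            if cid in keep or cid not in self.objects:
                continue
            keep.add(cid)
            todo.extend(self.objects[cid])  # assumption: parents are kept too
        removed = sorted(set(self.objects) - keep)
        for cid in removed:
            del self.objects[cid]
        return removed
```

With this split, functions that need the data take a Commit, and only lookup 
takes a bare ID, so nothing has to be duplicated.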

I fully agree about the GC-safe core and making Ir_view and Ir_sync use it.

> Issues:
> - A "commit" should keep its contents (trees and blobs) from being
> GC'd, but what about its parents? If we want to allow shallow clones,
> we might need to allow for a commit's parents to be missing.

I think support for shallow clones is needed for performance reasons, but it 
can be separated from the GC issues (e.g. just assume that some pointers in 
the block store might be dangling).
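
As a rough illustration of "assume some pointers might be dangling", the 
traversal below walks a block store from a set of roots and simply records the 
IDs it cannot find instead of failing. It is a Python model, not Irmin code; 
the store is assumed to be a dict from object ID to the list of IDs that 
object points to, and the function name is made up.

```python
def reachable(objects, roots):
    """Return (present, dangling): the IDs reachable from roots that are in
    the store, and the IDs that are pointed to but missing from it."""
    present, dangling, todo = set(), set(), list(roots)
    while todo:
        oid = todo.pop()
        if oid in present or oid in dangling:
            continue
        if oid not in objects:
            # Shallow clone: a pointer to an object that was never fetched.
            dangling.add(oid)
            continue
        present.add(oid)
        todo.extend(objects[oid])
    return present, dangling
```

A GC built on such a walk would keep everything in the first set and ignore 
the second.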

> - GC with remote HTTP stores could be tricky. For custom protocols, GC
> can be linked to the TCP connection, but HTTP is often spread over
> multiple connections. Probably OK for the high-level API, but we might
> have to remove the low one (I'm not very familiar with this REST API,
> and so might be confused).

Yes, kill the low-level store if that's possible. I think its main uses 
currently are merges and simplifying the watch hooks.

Historically, it was also the first part to be implemented, as it requires 
little logic from the server: with the low-level API, the client is 
responsible for doing everything (at the cost of multiple round-trips per 
high-level operation). The high-level API gives more work to the server, but 
it exposes fewer private details to the clients and needs fewer round-trips.

> - If the user runs "git gc" manually on a Git-format store then all
> bets are off, of course. Likewise if you have a store shared by
> multiple processes.

I think it is important to stay multi-process safe if possible. It could be 
as simple as the GC taking a lock file somewhere (which will stop the world). 
If we instead enforce having only one Irmin process running over a local 
store, that invariant should be checked carefully.
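
A minimal sketch of the lock-file idea, in Python and with made-up names 
(gc_lock, gc.lock): the lock is taken by creating the file exclusively, so a 
second process trying to take it fails immediately. This only covers 
acquiring and releasing the lock. It assumes every other reader and writer 
checks for the file before touching the store, and it does not handle a stale 
lock left behind by a crashed process.

```python
import os
from contextlib import contextmanager


@contextmanager
def gc_lock(store_dir, name="gc.lock"):
    """Hold an exclusive lock file in store_dir for the duration of the GC."""
    path = os.path.join(store_dir, name)
    # O_EXCL makes creation fail with FileExistsError if the lock is held.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner, for debugging
        yield path
    finally:
        os.close(fd)
        os.remove(path)
```

The GC would then run as the body of "with gc_lock(store_dir):".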

One last issue that is missing: temporary objects that are stored in the block 
store but not yet reachable from any GC root:

A. When you transform a staging area into a new commit (for instance in views, 
but also when you do a simple update):

(A1) first iterate over all the new blobs and tree objects to serialise them 
in the block store and get their hashes.
(A2) create a commit object containing the new hash of the tree root, and 
serialise it in the block store to get the commit ID.
(A3) (optionally) update a branch reference to point to the new ID.

My main concern with external GC is that before (A3) is done, objects saved in 
(A1) and (A2) are unsafe and can be deleted at any moment.
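
The window can be reproduced with a toy model. The Python below is not Irmin 
code: the block store is a dict from hash to child hashes, the branches are a 
dict from name to commit hash, and put, sweep and commit_update are 
hypothetical helpers. sweep plays the external GC and only knows about branch 
heads as roots.

```python
import hashlib


def put(store, kind, payload, children=()):
    """Serialise an object into the block store and return its hash.
    Only the child pointers are kept, which is all the GC looks at."""
    data = kind + ":" + payload + ":" + ",".join(children)
    oid = hashlib.sha1(data.encode()).hexdigest()
    store[oid] = list(children)
    return oid


def sweep(store, branches):
    """External GC: delete everything not reachable from a branch head."""
    keep, todo = set(), list(branches.values())
    while todo:
        oid = todo.pop()
        if oid in store and oid not in keep:
            keep.add(oid)
            todo.extend(store[oid])
    for oid in list(store):
        if oid not in keep:
            del store[oid]


def commit_update(store, branches, gc_before_a3):
    blob = put(store, "blob", "hello")              # (A1)
    tree = put(store, "tree", "root", [blob])       # (A1)
    commit = put(store, "commit", "first", [tree])  # (A2)
    if gc_before_a3:
        sweep(store, branches)   # GC runs inside the unsafe window
    branches["master"] = commit                     # (A3)
    return commit
```

If the sweep happens between (A2) and (A3), the branch ends up pointing at a 
commit that is no longer in the store; if it happens after (A3), all three 
objects survive.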

B. When you merge commits:

(B1) inductively merge blobs and tree objects, and serialise them in the store 
to get their IDs.
(B2), (B3): same as (A2) and (A3).

Again, the objects written in (B1) and (B2) are unsafe until (B3) is done.

MirageOS-devel mailing list


