Re: [MirageOS-devel] irmin storage overhead and dedup

Hi Gregory,

> I'm trying to figure out what kind of storage overhead and dedup I get in 
> Irmin. First I tried to convert the google email archive (2.4G) to the IMAP 
> server Irmin format. After conversion the size of the git repository was 
> twice the size of the original archive. I do have some additional structures 
> that I create, like per mailbox index and summary statistics and per email 
> message flags so perhaps the extra size is coming from those structures 
> though it seems a bit high. I will have to estimate the expected size from 
> additional structures to understand this result. Next I dumped into irmin 
> 2,000 of 1M files with random ascii content which resulted in the git 
> repository size of 950M. I figure Irmin compresses the content, right? To 
> verify this I dumped 2,000 of 2.4M image files with concatenated counter to 
> make the content unique. The size of repository for this was 4.6G, which is 
> expected. Then I repeated the last test but with identical images and this 
> time the size was 27M, which was clearly a nice proof of the deduping by 
> Irmin. My question is whether the compression in Irmin is configurable? Can 
> it be configurable per individual content? For instance, I don't want to 
> compress images as there is nothing to gain from the space saving and 
> consequently there is unnecessary resource usage but I do want to compress 
> the text if the compression overhead is reasonable. I can figure out the type 
> of content from MIME type in IMAP server.

Indeed, Irmin deduplicates identical contents automatically, as it uses the 
digest of internal objects as internal keys (similar to content-addressable 
stores or hash-consing).
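
To make the deduplication concrete, here is a minimal sketch of a 
content-addressable store in Python. It is not Irmin's implementation: the 
`ContentStore` class and its `add` method are hypothetical names made up for 
the illustration. The only part taken from Git is the blob digest, which is 
the SHA-1 of a `blob <size>\0` header followed by the content.

```python
import hashlib


def git_blob_digest(content: bytes) -> str:
    """Return the SHA-1 Git assigns to a blob: hash of 'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ContentStore:
    """Hypothetical in-memory store keyed by content digest (illustration only)."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        key = git_blob_digest(content)
        # Identical content maps to the same key, so it is stored only once.
        self.objects.setdefault(key, content)
        return key


store = ContentStore()
image = b"\x89PNG fake image bytes" * 1000

# 2,000 identical "images" collapse into a single stored object.
keys = [store.add(image) for _ in range(2000)]

# Appending a counter makes each content unique, so each gets its own object.
unique_keys = [store.add(image + str(i).encode()) for i in range(3)]
```

This mirrors the two image tests in the quoted message: identical files 
share one key, while files made unique with a counter each take their own 
slot.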

For compression, it depends on the backend. If you use `Irmin_fs` you will 
have none. If you use the Git backend, you have two kinds of compression:
- the Git serialisation format extensively uses the zlib library. Basically 
every chunk of bytes is compressed, be it a blob, a tree or a commit.
- you can run `git gc` at the root of your Git repository and the Git tool will 
try to compact similar contents together in `.pack` files. The algorithm sorts 
the contents by filename, then by size, and then uses a sliding window to find 
similar contents and compute a diff-based representation. Some kind of `git gc` 
is also run before doing a push/pull, as the `.pack` files are exactly what is 
sent over the network. See [1] for a great explanation of the Git format.
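
As a rough illustration of the first kind of compression, the sketch below 
uses Python's standard `zlib` module (not ocaml-git) to deflate a blob header 
plus content, the way a loose Git object is stored. The `loose_object` helper, 
the 100,000-byte size and the alphanumeric alphabet are assumptions chosen for 
the example. The ratio you get in practice depends on the alphabet and the 
content.

```python
import random
import string
import zlib

rng = random.Random(0)  # fixed seed so the example is reproducible
size = 100_000

# Random ASCII drawn from 62 symbols: each byte carries less than 8 bits of
# information, so zlib can shrink it even though there is no repetition.
alphabet = string.ascii_letters + string.digits
ascii_text = "".join(rng.choice(alphabet) for _ in range(size)).encode()

# Uniformly random bytes stand in for already-compressed data such as JPEGs.
random_bytes = bytes(rng.getrandbits(8) for _ in range(size))


def loose_object(content: bytes) -> bytes:
    """Deflate 'blob <size>\\0' + content, as in a loose Git blob object."""
    return zlib.compress(b"blob %d\x00" % len(content) + content)


ascii_ratio = len(loose_object(ascii_text)) / size
binary_ratio = len(loose_object(random_bytes)) / size

print(f"random ASCII : {ascii_ratio:.2f} of original size")
print(f"random bytes : {binary_ratio:.2f} of original size")
```

Random ASCII shrinks noticeably, while high-entropy bytes do not shrink at 
all, which is consistent with the image test in the quoted message.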

It would be possible to control the level of compression by explicitly passing 
a `~level` argument to `Zlib.compress`[2] in ocaml-git[3], but this is not done 
currently[4]. I'm not sure it is possible to disable the compression completely 
and still be compatible with the Git format though. I'll add an option to set 
the compression level in the next release.
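
For what a per-content level could look like, here is a small sketch using 
Python's standard `zlib` module rather than camlzip or ocaml-git. The 
`level_for_mime` and `compress_part` helpers and the list of MIME prefixes are 
hypothetical, picked only for the example. In zlib, level 0 still produces a 
valid zlib stream made of stored (uncompressed) blocks, so any zlib reader can 
inflate it. Whether ocaml-git would expose the level this way is an 
assumption.

```python
import zlib


def level_for_mime(mime_type: str) -> int:
    """Pick a zlib level from the MIME type (hypothetical policy)."""
    already_compressed = (
        "image/", "video/", "audio/", "application/zip", "application/gzip",
    )
    # Level 0 stores the bytes without deflating; 6 is zlib's usual default.
    return 0 if mime_type.startswith(already_compressed) else 6


def compress_part(content: bytes, mime_type: str) -> bytes:
    """Compress one message part with the level chosen for its MIME type."""
    return zlib.compress(content, level_for_mime(mime_type))


text = b"Subject: irmin storage overhead and dedup\r\n" * 500

stored = compress_part(text, "image/jpeg")    # level 0: stored blocks only
deflated = compress_part(text, "text/plain")  # level 6: actually compressed

# Both outputs are ordinary zlib streams and inflate back to the input.
restored_from_stored = zlib.decompress(stored)
restored_from_deflated = zlib.decompress(deflated)

print(len(text), len(stored), len(deflated))
```

A level-0 stream is slightly larger than its input because of the zlib 
framing, so the saving is in CPU time, not in space.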


[1] http://stefan.saasen.me/articles/git-clone-in-haskell-from-the-bottom-up/
[2] http://nit.gforge.inria.fr/camlzip/Zlib.html#VALCompress
[3] https://github.com/mirage/ocaml-git/blob/master/lib/misc.ml#L75
[4] https://github.com/mirage/ocaml-git/issues/41
