Xen project Mailing List

Re: [MirageOS-devel] irmin storage overhead and dedup

To: Gregory Tsipenyuk <gt303@xxxxxxxxx>

From: Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx>

Date: Wed, 31 Dec 2014 11:27:06 +0100

Cc: mirageos-devel <mirageos-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Wed, 31 Dec 2014 10:27:12 +0000

List-id: Developer list for MirageOS <mirageos-devel.lists.xenproject.org>

Hi Gregory,

> I'm trying to figure out what kind of storage overhead and dedup I get in
> Irmin. First I tried to convert the Google email archive (2.4G) to the IMAP
> server Irmin format. After conversion the size of the git repository was
> twice the size of the original archive. I do have some additional structures
> that I create, like per mailbox index and summary statistics and per email
> message flags so perhaps the extra size is coming from those structures
> though it seems a bit high. I will have to estimate the expected size from
> additional structures to understand this result. Next I dumped into Irmin
> 2,000 files of 1M each with random ASCII content, which resulted in a git
> repository size of 950M. I figure Irmin compresses the content, right? To
> verify this I dumped 2,000 image files of 2.4M each with a concatenated
> counter to make the content unique. The size of the repository for this was
> 4.6G, which is expected. Then I repeated the last test but with identical
> images and this time the size was 27M, which was clearly a nice proof of the
> deduping by Irmin. My question is whether the compression in Irmin is
> configurable? Can it be configurable per individual content? For instance, I
> don't want to compress images as there is nothing to gain from the space
> saving and consequently there is unnecessary resource usage, but I do want to
> compress the text if the compression overhead is reasonable. I can figure out
> the type of content from the MIME type in the IMAP server.

Indeed, Irmin deduplicates identical contents automatically, as it uses the
digest of internal objects as internal keys (that's similar to content
addressable stores or hash-consing).

For the compression, it depends on the backend. If you use `Irmin_fs` you
will have none. If you use the Git backend, you have two kinds of
compression:

- the Git serialisation format extensively uses the zlib library. Basically
  every chunk of bytes is compressed, whether it is a blob or the tree and
  commit metadata

- you can run `git gc` at the root of your Git repository and the Git tool
  will try to compact similar contents together in `.pack` files. The
  algorithm sorts the contents by filename, then by size, and then uses a
  sliding window to find similar contents and compute a diff-based
  representation. Some kind of `git gc` is also run before doing a push/pull,
  as the `.pack` files are exactly what is sent over the network. See [1] for
  a great explanation of the Git format.

It would be possible to control the level of compression by explicitly
passing a `~level` argument to `Zlib.compress` [2] in ocaml-git [3], but this
is not done currently [4]. I'm not sure it is possible to disable the
compression completely and still be compatible with the Git format, though.

I'll add this option in the next release.

Best,
Thomas

[1] http://stefan.saasen.me/articles/git-clone-in-haskell-from-the-bottom-up/
[2] http://nit.gforge.inria.fr/camlzip/Zlib.html#VALCompress
[3] https://github.com/mirage/ocaml-git/blob/master/lib/misc.ml#L75
[4] https://github.com/mirage/ocaml-git/issues/41

_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel
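
The two mechanisms discussed in the reply above (digest-keyed deduplication and per-object zlib compression) can be sketched in a few lines. This is only an illustration under stated assumptions, not Irmin or ocaml-git code: it uses Python's standard `hashlib` and `zlib` modules, and the names `ContentStore` and `level_for` are hypothetical. Real Git also hashes a small header together with the content, which is left out here.

```python
import hashlib
import os
import zlib


class ContentStore:
    """Toy content-addressable store (hypothetical, for illustration only)."""

    def __init__(self):
        # hex digest of the raw content -> zlib-compressed bytes
        self.objects = {}

    def put(self, data: bytes, level: int = 6) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content always hashes to the same key,
        # so a second copy is never stored again.
        if key not in self.objects:
            self.objects[key] = zlib.compress(data, level)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self.objects[key])


def level_for(mime_type: str) -> int:
    # Assumed policy: skip compression work for images (usually already
    # compressed), use a moderate level for everything else.
    return 0 if mime_type.startswith("image/") else 6


store = ContentStore()

# Repetitive text compresses well, and storing it twice adds nothing.
text = b"Subject: hello\r\n" * 1000
k1 = store.put(text, level_for("text/plain"))
k2 = store.put(text, level_for("text/plain"))

# Random bytes stand in for an already-compressed image.
image_like = os.urandom(100_000)
k3 = store.put(image_like, level_for("image/jpeg"))

print("same key for identical text:", k1 == k2)
print("objects stored:", len(store.objects))
print("text:", len(text), "->", len(store.objects[k1]), "bytes")
print("image-like:", len(image_like), "->", len(store.objects[k3]), "bytes")
```

With zlib, level 0 writes the data as stored (uncompressed) blocks inside an otherwise valid zlib stream, so the object grows by a few bytes of framing but can still be read back with an ordinary decompress call. Whether ocaml-git would expose the level this way is not settled in this thread; see [4].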
