Xen project Mailing List

Re: [MirageOS-devel] irmin storage overhead and dedup

To: Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx>

From: Gregory Tsipenyuk <gt303@xxxxxxxxx>

Date: Wed, 31 Dec 2014 09:56:51 -0500

Cc: mirageos-devel <mirageos-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Wed, 31 Dec 2014 14:56:57 +0000

List-id: Developer list for MirageOS <mirageos-devel.lists.xenproject.org>

Thanks for the clarification and for the references!

Gregory

> On Dec 31, 2014, at 5:27 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
>
> Hi Gregory,
>
>> I'm trying to figure out what kind of storage overhead and dedup I get in
>> Irmin. First I tried to convert the google email archive (2.4G) to the IMAP
>> server Irmin format. After conversion the size of the git repository was
>> twice the size of the original archive. I do have some additional structures
>> that I create, like per mailbox index and summary statistics and per email
>> message flags, so perhaps the extra size is coming from those structures,
>> though it seems a bit high. I will have to estimate the expected size from
>> additional structures to understand this result. Next I dumped into irmin
>> 2,000 1M files with random ascii content, which resulted in a git
>> repository size of 950M. I figure Irmin compresses the content, right? To
>> verify this I dumped 2,000 2.4M image files, each with a concatenated
>> counter to make the content unique. The size of the repository for this was
>> 4.6G, which is expected. Then I repeated the last test but with identical
>> images and this time the size was 27M, which was clearly a nice proof of the
>> deduping by Irmin. My question is whether the compression in Irmin is
>> configurable? Can it be configured per individual content? For instance, I
>> don't want to compress images, as there is nothing to gain in space and
>> consequently the resource usage is unnecessary, but I do want to compress
>> the text if the compression overhead is reasonable. I can figure out the
>> type of content from the MIME type in the IMAP server.
>
> Indeed, Irmin deduplicates identical contents automatically, as it uses the
> digest of internal objects as internal keys (that's similar to content
> addressable stores or hash-consing).
>
> For the compression, it depends on the backend. If you use `Irmin_fs` you
> will have none. If you use the Git backend, you have two kinds of
> compression:
> - the Git serialisation format extensively uses the zlib library. Basically
> every chunk of bytes is compressed, whether the blobs or the tree and
> commit metadata
> - you can run `git gc` at the root of your Git repository and the Git tool
> will try to compact similar contents together in `.pack` files. The algorithm
> sorts the contents by filename, then by size, and then uses a sliding window
> to find similar contents and compute a diff-based representation. Some kind
> of `git gc` is also run before doing a push/pull, as the `.pack` files are
> exactly what is sent over the network. See [1] for a great explanation of the
> Git format.
>
> It would be possible to control the level of compression by explicitly
> passing a ~level argument to Zlib.compress[2] in ocaml-git[3] but it is not
> done currently[4]. I'm not sure it is possible to disable the compression
> completely and still be compatible with the Git format though. I'll add this
> option in the next release.
>
> Best,
> Thomas
>
> [1] http://stefan.saasen.me/articles/git-clone-in-haskell-from-the-bottom-up/
> [2] http://nit.gforge.inria.fr/camlzip/Zlib.html#VALCompress
> [3] https://github.com/mirage/ocaml-git/blob/master/lib/misc.ml#L75
> [4] https://github.com/mirage/ocaml-git/issues/41
>
_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel
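
The two mechanisms discussed in this thread, deduplication by content digest and zlib compression with a per-object level, can be illustrated with a small Python sketch. This is not Irmin's or ocaml-git's code: `ContentStore` and `choose_level` are hypothetical names made up for the illustration, the store hashes the raw bytes only (Git also hashes an object header), and the MIME-based policy is just one assumed way to pick a level. It uses zlib level 0, which writes the data uncompressed inside a valid zlib stream, as a stand-in for "no compression".

```python
import hashlib
import os
import zlib


class ContentStore:
    """Toy content-addressable store: the key is the SHA-1 of the raw bytes."""

    def __init__(self):
        self.objects = {}  # hex digest -> zlib stream of the content

    def put(self, data: bytes, level: int = 6) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        if key not in self.objects:
            self.objects[key] = zlib.compress(data, level)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self.objects[key])

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.objects.values())


def choose_level(mime_type: str) -> int:
    """Hypothetical policy: do not spend CPU on already-compressed media."""
    return 0 if mime_type.startswith("image/") else 6


store = ContentStore()
image = os.urandom(100_000)          # stand-in for incompressible image data
text = b"Subject: hello\r\n" * 5_000  # highly repetitive text

# Twenty identical images end up as a single stored object.
image_keys = {store.put(image, choose_level("image/jpeg")) for _ in range(20)}
text_key = store.put(text, choose_level("text/plain"))

print("objects:", len(store.objects))
print("bytes stored:", store.stored_bytes())
```

Because the key depends only on the content, the choice of compression level does not affect deduplication: the same bytes always land under the same key, whatever level was used when they were first written.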
