
Re: [MirageOS-devel] irmin storage overhead and dedup

Thanks for the clarification and for the references!


> On Dec 31, 2014, at 5:27 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
> Hi Gregory,
>> I'm trying to figure out what kind of storage overhead and dedup I get in
>> Irmin. First I tried to convert the Google email archive (2.4G) to the IMAP
>> server Irmin format. After conversion the size of the git repository was
>> twice the size of the original archive. I do have some additional structures 
>> that I create, like per mailbox index and summary statistics and per email 
>> message flags so perhaps the extra size is coming from those structures 
>> though it seems a bit high. I will have to estimate the expected size from 
>> additional structures to understand this result. Next I dumped 2,000 1M
>> files with random ASCII content into Irmin, which resulted in a git
>> repository size of 950M. I figure Irmin compresses the content, right? To
>> verify this I dumped 2,000 2.4M image files, each with a counter
>> concatenated to make the content unique. The size of the repository for this was 4.6G, which is
>> expected. Then I repeated the last test but with identical images and this 
>> time the size was 27M, which was clearly a nice proof of the deduping by 
>> Irmin. My question is whether the compression in Irmin is configurable? Can 
>> it be configured per individual content? For instance, I don't want to
>> compress images, as there is no space saving to gain and the compression
>> just wastes resources, but I do want to compress
>> the text if the compression overhead is reasonable. I can figure out the
>> type of content from the MIME type in the IMAP server.
> Indeed, Irmin deduplicates identical contents automatically, as it uses the
> digest of internal objects as internal keys (that's similar to
> content-addressable stores or hash-consing).
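
To make the digest-as-key idea concrete, here is a minimal sketch of a content-addressable store. It is not Irmin's implementation: the `ContentStore` class and the choice of SHA-1 are assumptions made for illustration only. It mirrors the two image tests above, where identical blobs collapse into one object and blobs made unique by a counter do not.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: values are keyed by their SHA-1 digest."""

    def __init__(self):
        self._objects = {}

    def add(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self._objects.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
image = b"\x89PNG placeholder image bytes" * 1000

# 2,000 identical "images" collapse into a single stored object.
identical_keys = {store.add(image) for _ in range(2000)}

# Concatenating a counter makes every blob unique, so nothing is shared.
unique_keys = {store.add(image + str(i).encode()) for i in range(2000)}
```

Note that this kind of dedup only catches byte-for-byte identical objects; sharing between merely similar objects is what the `.pack` delta step is for.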
> For the compression, it depends on the backend. If you use `Irmin_fs` you 
> will have none. If you use the Git backend, you have two kinds of 
> compression:
> - the Git serialisation format extensively uses the zlib library. Basically
> every chunk of bytes is compressed, be it the blobs or the tree and
> commit metadata
> - you can run `git gc` at the root of your Git repository and the Git tool 
> will try to compact similar contents together in `.pack` files. The algorithm
> sorts the contents by filename, then by size, and then uses a sliding window to
> find similar contents and compute a diff-based representation. Some kind of
> `git gc` is also run before doing a push/pull, as the `.pack` files are 
> exactly what is sent over the network. See [1] for a great explanation of the 
> Git format.
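
The first kind of compression can be sketched with the standard library alone. The example below builds a Git loose blob (a `blob <size>\0` header followed by the content, named by its SHA-1 and stored zlib-compressed) and compares how random letters and random bytes compress. The `git_blob` helper, the 100,000-byte size, and the use of random bytes as a stand-in for already-compressed images are assumptions for illustration; the delta compression done by `git gc` is not modelled here.

```python
import hashlib
import random
import string
import zlib


def git_blob(content: bytes):
    """Return (object id, zlib-compressed loose object) for a Git blob."""
    # A loose blob is "blob <size>\0" followed by the raw content.
    store = b"blob " + str(len(content)).encode() + b"\x00" + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


rng = random.Random(0)  # fixed seed so the run is repeatable
size = 100_000

# Random letters use only 52 distinct byte values, so zlib can shrink them.
ascii_blob = "".join(rng.choice(string.ascii_letters) for _ in range(size)).encode()
# Random bytes stand in for already-compressed data such as image files.
binary_blob = bytes(rng.getrandbits(8) for _ in range(size))

_, ascii_packed = git_blob(ascii_blob)
_, binary_packed = git_blob(binary_blob)

ascii_ratio = len(ascii_packed) / size
binary_ratio = len(binary_packed) / size
print(f"random letters: {ascii_ratio:.2f}, random bytes: {binary_ratio:.2f}")
```

The exact ratio for text depends on the alphabet and on how much real redundancy the content has, so the numbers here are not expected to match the 950M figure reported above.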
> It would be possible to control the level of compression by explicitly 
> passing a ~level argument to Zlib.compress[2] in ocaml-git[3] but it is not 
> done currently[4]. I'm not sure it is possible to disable the compression 
> completely and still be compatible with the Git format though. I'll add this 
> option in the next release.
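
As a rough illustration of what a per-content level could look like, here is a sketch using Python's `zlib` module rather than ocaml-git. The `choose_level` and `pack` helpers and the list of MIME prefixes are hypothetical, not an existing Irmin or ocaml-git API. Level 0 makes zlib emit stored (uncompressed) blocks inside an otherwise ordinary zlib stream that any zlib reader can inflate, which suggests that skipping compression this way need not change the container format, although that has not been checked against ocaml-git here.

```python
import zlib

# Hypothetical policy: media types that are usually compressed already.
INCOMPRESSIBLE_PREFIXES = ("image/", "audio/", "video/")


def choose_level(mime_type: str) -> int:
    """Pick a zlib level from a MIME type (hypothetical helper)."""
    if mime_type.startswith(INCOMPRESSIBLE_PREFIXES):
        return 0  # stored blocks only, no compression work
    return 6  # a common middle-of-the-road level


def pack(content: bytes, mime_type: str) -> bytes:
    """Compress content with a level chosen from its MIME type."""
    return zlib.compress(content, choose_level(mime_type))


text = b"Subject: irmin storage overhead and dedup\r\n" * 500

stored = pack(text, "image/jpeg")    # level 0: slightly larger than the input
deflated = pack(text, "text/plain")  # level 6: much smaller for repetitive text

# Both streams inflate with the same call, whatever level produced them.
roundtrip_ok = zlib.decompress(stored) == text and zlib.decompress(deflated) == text
```

The reader never needs to know which level was used, so the choice can be made per message part at write time.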
> Best,
> Thomas
> [1] http://stefan.saasen.me/articles/git-clone-in-haskell-from-the-bottom-up/
> [2] http://nit.gforge.inria.fr/camlzip/Zlib.html#VALCompress
> [3] https://github.com/mirage/ocaml-git/blob/master/lib/misc.ml#L75
> [4] https://github.com/mirage/ocaml-git/issues/41

MirageOS-devel mailing list


