Xen project Mailing List

Re: [MirageOS-devel] irmin storage overhead and dedup

To: Gregory Tsipenyuk <gt303@xxxxxxxxx>

From: Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx>

Date: Mon, 5 Jan 2015 16:51:41 +0100

Cc: mirageos-devel <mirageos-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Mon, 05 Jan 2015 15:51:45 +0000

List-id: Developer list for MirageOS <mirageos-devel.lists.xenproject.org>

> Does it make sense to have "benchmark" folder under Irmin to check in the
> tests?

I've just created https://github.com/mirage/irmin-rt

Thomas

>
>> On Dec 31, 2014, at 5:32 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
>>
>>> I looked at the metadata that gets created for every email message and it's
>>> small - less than 100 bytes. So I ran a simple test of appending 20,000
>>> unique 100 bytes ascii messages. I would have expected the repository size
>>> to be on the order of a few megabytes, instead it was 4.7G. This is roughly
>>> 234K overhead per 100 bytes message, which would be quite impractical for
>>> the email storage with the metadata essentially exceeding the message
>>> storage.
>>
>> Did you start from an empty repository? Would be interested to run your code
>> locally to check what happens.
>>
>> More generally all the benchs/experiments you are running are very useful,
>> it would be nice to put them somewhere online and turn them into functional
>> tests to run them regularly to check that the serialisation format doesn't
>> go crazy.
>>
>> Thanks!
>> Thomas
>>
>>>
>>> Gregory
>>>
>>>> On Dec 30, 2014, at 7:07 PM, Gregory Tsipenyuk <gt303@xxxxxxxxx> wrote:
>>>>
>>>> Hi Thomas,
>>>>
>>>> I'm trying to figure out what kind of storage overhead and dedup I get in
>>>> Irmin. First I tried to convert the google email archive (2.4G) to the
>>>> IMAP server Irmin format. After conversion the size of the git repository
>>>> was twice the size of the original archive. I do have some additional
>>>> structures that I create, like per mailbox index and summary statistics
>>>> and per email message flags so perhaps the extra size is coming from those
>>>> structures though it seems a bit high. I will have to estimate the
>>>> expected size from additional structures to understand this result. Next I
>>>> dumped into irmin 2,000 of 1M files with random ascii content which
>>>> resulted in the git repository size of 950M. I figure Irmin compresses the
>>>> content, right? To verify this I dumped 2,000 of 2.4M image files with
>>>> concatenated counter to make the content unique. The size of repository
>>>> for this was 4.6G, which is expected. Then I repeated the last test but
>>>> with identical images and this time the size was 27M, which was clearly a
>>>> nice proof of the deduping by Irmin. My question is whether the
>>>> compression in Irmin is configurable? Can it be configurable per
>>>> individual content? For instance, I don't want to compress images as there
>>>> is nothing to gain from the space saving and consequently there is
>>>> unnecessary resource usage but I do want to compress the text if the
>>>> compression overhead is reasonable. I can figure out the type of content
>>>> from MIME type in IMAP server.
>>>>
>>>> Thanks
>>>> Gregory
>>>
>>
>
_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel
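
The two effects discussed in the thread above (identical content being stored once, and the per-message overhead figure) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyStore` and `overhead_per_message` are hypothetical names, the store is an in-memory dictionary keyed by a SHA-1 of the content with zlib-compressed values, and none of it is Irmin's API or on-disk format. The overhead calculation assumes "4.7G" means 4.7 x 10^9 bytes.

```python
import hashlib
import zlib


class ToyStore:
    """Toy content-addressed store (hypothetical, not Irmin's implementation).

    Each blob is keyed by the SHA-1 of its content and kept zlib-compressed.
    """

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        if key not in self.objects:
            self.objects[key] = zlib.compress(content)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self.objects[key])

    def stored_bytes(self) -> int:
        # Total size of the compressed payloads held by the store.
        return sum(len(value) for value in self.objects.values())


def overhead_per_message(repo_bytes: float, n_messages: int, payload_bytes: int) -> float:
    """Average bytes stored per message beyond the message payload itself."""
    return repo_bytes / n_messages - payload_bytes


if __name__ == "__main__":
    store = ToyStore()
    image = bytes(range(256)) * 64  # stand-in for an image file

    # Identical content added many times ends up as a single object.
    for _ in range(2000):
        store.add(image)
    print("objects after identical adds:", len(store.objects))

    # Appending a counter makes each blob unique, so each one is stored.
    for counter in range(2000):
        store.add(image + str(counter).encode())
    print("objects after unique adds:", len(store.objects))

    # Overhead for the 20,000 x 100-byte message test reported in the thread.
    print("overhead per message (bytes):", overhead_per_message(4.7e9, 20_000, 100))
```

The toy model only shows why identical blobs cost almost nothing and unique blobs each cost their own storage. It does not explain the 4.7G result for small messages, which is the open question the thread is trying to reproduce.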
