Xen project Mailing List

Re: [MirageOS-devel] irmin storage overhead and dedup

To: Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx>

From: Gregory Tsipenyuk <gt303@xxxxxxxxx>

Date: Wed, 31 Dec 2014 09:47:26 -0500

Cc: mirageos-devel <mirageos-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Wed, 31 Dec 2014 14:47:39 +0000

List-id: Developer list for MirageOS <mirageos-devel.lists.xenproject.org>

I ran all tests on empty repositories. Does it make sense to have "benchmark" folder under Irmin to check in the tests?

> On Dec 31, 2014, at 5:32 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
>
>> I looked at the metadata that gets created for every email message and it's
>> small - less than 100 bytes. So I ran a simple test of appending 20,000
>> unique 100 bytes ascii messages. I would have expected the repository size
>> to be on the order of a few megabytes, instead it was 4.7G. This is roughly
>> 234K overhead per 100 bytes message, which would be quite impractical for
>> the email storage with the metadata essentially exceeding the message
>> storage.
>
> Did you start from an empty repository? Would be interested to run your code
> locally to check what happens.
>
> More generally all the benchs/experiments you are running are very useful, it
> would be nice to put them somewhere online and turn them into functional
> tests to run them regularly to check that the serialisation format doesn't go
> crazy.
>
> Thanks!
> Thomas
>
>>
>> Gregory
>>
>>> On Dec 30, 2014, at 7:07 PM, Gregory Tsipenyuk <gt303@xxxxxxxxx> wrote:
>>>
>>> Hi Thomas,
>>>
>>> I'm trying to figure out what kind of storage overhead and dedup I get in
>>> Irmin. First I tried to convert the google email archive (2.4G) to the IMAP
>>> server Irmin format. After conversion the size of the git repository was
>>> twice the size of the original archive. I do have some additional
>>> structures that I create, like per mailbox index and summary statistics and
>>> per email message flags so perhaps the extra size is coming from those
>>> structures though it seems a bit high. I will have to estimate the expected
>>> size from additional structures to understand this result. Next I dumped
>>> into irmin 2,000 of 1M files with random ascii content which resulted in
>>> the git repository size of 950M. I figure Irmin compresses the content,
>>> right? To verify this I dumped 2,000 of 2.4M image files with concatenated
>>> counter to make the content unique. The size of repository for this was
>>> 4.6G, which is expected. Then I repeated the last test but with identical
>>> images and this time the size was 27M, which was clearly a nice proof of
>>> the deduping by Irmin. My question is whether the compression in Irmin is
>>> configurable? Can it be configurable per individual content? For instance,
>>> I don't want to compress images as there is nothing to gain from the space
>>> saving and consequently there is unnecessary resource usage but I do want
>>> to compress the text if the compression overhead is reasonable. I can
>>> figure out the type of content from MIME type in IMAP server.
>>>
>>> Thanks
>>> Gregory

_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel
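
For anyone wanting to reproduce the per-message overhead figure quoted in the thread, the arithmetic can be sketched as below. This is only an illustration and is not part of the original messages. The helper name `per_item_overhead` is hypothetical, and reading "4.7G" as decimal gigabytes is an assumption; the thread does not say which unit the size was reported in.

```python
def per_item_overhead(repo_bytes: int, n_items: int, payload_bytes: int) -> float:
    """Average number of bytes stored per item beyond the payload itself."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return repo_bytes / n_items - payload_bytes


# Figures from the thread: 20,000 unique 100-byte messages, 4.7G repository.
# Assumption: "G" is taken as 10**9 bytes.
REPO_BYTES = 4_700_000_000
N_MESSAGES = 20_000
MESSAGE_BYTES = 100

overhead = per_item_overhead(REPO_BYTES, N_MESSAGES, MESSAGE_BYTES)
print(f"about {overhead / 1000:.0f} kB of overhead per message")
```

Under these assumptions the result is in the same range as the "roughly 234K" mentioned above. A different reading of the units (for example GiB and KiB) shifts the number somewhat, but not the conclusion that the overhead dwarfs the 100-byte payload.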
