Re: [MirageOS-devel] irmin storage overhead and dedup
Thanks! I'll check it in.

> On Jan 5, 2015, at 10:51 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
>
>> Does it make sense to have "benchmark" folder under Irmin to check in the
>> tests?
>
> I've just created https://github.com/mirage/irmin-rt
>
> Thomas
>
>>
>>> On Dec 31, 2014, at 5:32 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx>
>>> wrote:
>>>
>>>> I looked at the metadata that gets created for every email message and
>>>> it's small - less than 100 bytes. So I ran a simple test of appending
>>>> 20,000 unique 100 bytes ascii messages. I would have expected the
>>>> repository size to be on the order of a few megabytes, instead it was
>>>> 4.7G. This is roughly 234K overhead per 100 bytes message, which would be
>>>> quite impractical for the email storage with the metadata essentially
>>>> exceeding the message storage.
>>>
>>> Did you start from an empty repository? Would be interested to run your
>>> code locally to check what happens.
>>>
>>> More generally all the benchs/experiments you are running are very useful,
>>> it would be nice to put them somewhere online and turn them into functional
>>> tests to run them regularly to check that the serialisation format doesn't
>>> go crazy.
>>>
>>> Thanks!
>>> Thomas
>>>
>>>>
>>>> Gregory
>>>>
>>>>> On Dec 30, 2014, at 7:07 PM, Gregory Tsipenyuk <gt303@xxxxxxxxx> wrote:
>>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> I'm trying to figure out what kind of storage overhead and dedup I get in
>>>>> Irmin. First I tried to convert the google email archive (2.4G) to the
>>>>> IMAP server Irmin format. After conversion the size of the git
>>>>> repository was twice the size of the original archive. I do have some
>>>>> additional structures that I create, like per mailbox index and summary
>>>>> statistics and per email message flags so perhaps the extra size is
>>>>> coming from those structures though it seems a bit high. I will have to
>>>>> estimate the expected size from additional structures to understand this
>>>>> result. Next I dumped into irmin 2,000 of 1M files with random ascii
>>>>> content which resulted in the git repository size of 950M. I figure Irmin
>>>>> compresses the content, right? To verify this I dumped 2,000 of 2.4M
>>>>> image files with concatenated counter to make the content unique. The
>>>>> size of repository for this was 4.6G, which is expected. Then I repeated
>>>>> the last test but with identical images and this time the size was 27M,
>>>>> which was clearly a nice proof of the deduping by Irmin. My question is
>>>>> whether the compression in Irmin is configurable? Can it be configurable
>>>>> per individual content? For instance, I don't want to compress images as
>>>>> there is nothing to gain from the space saving and consequently there is
>>>>> unnecessary resource usage but I do want to compress the text if the
>>>>> compression overhead is reasonable. I can figure out the type of content
>>>>> from MIME type in IMAP server.
>>>>>
>>>>> Thanks
>>>>> Gregory
>>>>
>>>
>>
>
_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel
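
For readers following the thread, the per-message overhead figure and the dedup result quoted above can be illustrated with a small sketch. This is not Irmin code and says nothing about Irmin's actual on-disk format: ToyStore, git_blob_id and per_item_overhead are hypothetical names invented here. The store only mimics the way git addresses blobs (SHA-1 over a "blob <size>\0" header plus the content, then zlib compression), and sizes are assumed to be decimal (4.7G taken as 4.7e9 bytes).

```python
import hashlib
import zlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 that git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class ToyStore:
    """Hypothetical in-memory content-addressed store (not Irmin's API).

    Each unique blob is kept once, zlib-compressed, keyed by its hash.
    """

    def __init__(self) -> None:
        self.objects = {}

    def add(self, data: bytes) -> str:
        key = git_blob_id(data)
        # Identical content hashes to the same key, so it is stored once.
        if key not in self.objects:
            header = b"blob %d\0" % len(data)
            self.objects[key] = zlib.compress(header + data)
        return key

    def stored_bytes(self) -> int:
        """Total size of all compressed objects currently held."""
        return sum(len(blob) for blob in self.objects.values())


def per_item_overhead(repo_bytes: int, items: int, payload_bytes: int) -> float:
    """Average bytes used per item beyond the payload itself."""
    return repo_bytes / items - payload_bytes
```

Under these assumptions, adding the same payload 2,000 times to a ToyStore leaves a single object, which is the behaviour the identical-images test points to, and per_item_overhead(4_700_000_000, 20_000, 100) evaluates to 234,900 bytes, in line with the "roughly 234K" quoted in the thread.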