
Re: [MirageOS-devel] irmin storage overhead and dedup

Thanks! I'll check it in.

> On Jan 5, 2015, at 10:51 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
>> Does it make sense to have a "benchmark" folder under Irmin to check in the 
>> tests?
> I've just created https://github.com/mirage/irmin-rt
> Thomas
>>> On Dec 31, 2014, at 5:32 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> 
>>> wrote:
>>>> I looked at the metadata that gets created for every email message and 
>>>> it's small - less than 100 bytes. So I ran a simple test of appending 
>>>> 20,000 unique 100-byte ASCII messages. I would have expected the 
>>>> repository size to be on the order of a few megabytes; instead it was 
>>>> 4.7G. This is roughly 234K of overhead per 100-byte message, which would 
>>>> be quite impractical for email storage, with the metadata essentially 
>>>> exceeding the message storage.
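
For reference, the arithmetic behind the figure quoted above can be checked with a short sketch. It assumes that "4.7G" means 4.7 * 10^9 bytes (decimal units), which is a guess about how the size was measured; the names are illustrative only.

```python
# Back-of-the-envelope check of the quoted figures.
# Assumption: "4.7G" uses decimal units (1G = 10**9 bytes).
REPO_BYTES = 4.7e9     # repository size reported after the test
MESSAGES = 20_000      # number of appended messages
PAYLOAD_BYTES = 100    # size of each message body


def overhead_per_message(repo_bytes, messages, payload_bytes):
    """Average bytes stored per message beyond its own payload."""
    return repo_bytes / messages - payload_bytes


per_message = overhead_per_message(REPO_BYTES, MESSAGES, PAYLOAD_BYTES)
raw_payload = MESSAGES * PAYLOAD_BYTES  # 2,000,000 bytes of actual content
amplification = REPO_BYTES / raw_payload

print(f"overhead per message: {per_message / 1e3:.1f} KB")
print(f"repository size / raw payload: {amplification:.0f}x")
```

Under that assumption the per-message overhead comes out just under 235K, in line with the "roughly 234K" above, and the repository is a few thousand times larger than the raw payload of about 2M.
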
>>> Did you start from an empty repository? Would be interested to run your 
>>> code locally to check what happens. 
>>> More generally, all the benchmarks/experiments you are running are very 
>>> useful. It would be nice to put them somewhere online and turn them into 
>>> functional tests that run regularly, to check that the serialisation 
>>> format doesn't go crazy.
>>> Thanks!
>>> Thomas
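
A minimal sketch of what such a regularly run size check might look like. This is not part of Irmin or irmin-rt; `dir_size`, `check_overhead` and the `max_ratio` budget are hypothetical names, and a temporary directory stands in for a real repository.

```python
import os
import tempfile


def dir_size(path):
    """Apparent size in bytes of all regular files under path.

    Note: this sums file lengths; it does not account for filesystem
    block rounding the way `du` does.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def check_overhead(store_path, payload_bytes, max_ratio):
    """Return (ratio, ok): on-disk size over payload size, and whether
    that ratio stays within the allowed budget."""
    ratio = dir_size(store_path) / payload_bytes
    return ratio, ratio <= max_ratio


# Demo: a temporary directory stands in for the repository under test.
with tempfile.TemporaryDirectory() as repo:
    payload = b"x" * 100
    for i in range(50):
        with open(os.path.join(repo, f"msg-{i}"), "wb") as f:
            f.write(payload)
    demo_ratio, demo_ok = check_overhead(repo, 50 * len(payload), max_ratio=2.0)
    print(f"size ratio: {demo_ratio:.2f}, within budget: {demo_ok}")
```

In a real test the directory would be the repository written by the benchmark, and the budget would be chosen from a known-good run so that a regression like the 4.7G case fails loudly.
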
>>>> Gregory
>>>>> On Dec 30, 2014, at 7:07 PM, Gregory Tsipenyuk <gt303@xxxxxxxxx> wrote:
>>>>> Hi Thomas,
>>>>> I'm trying to figure out what kind of storage overhead and dedup I get in 
>>>>> Irmin. First I tried to convert the Google email archive (2.4G) to the 
>>>>> IMAP server Irmin format. After conversion the size of the git 
>>>>> repository was twice the size of the original archive. I do have some 
>>>>> additional structures that I create, like a per-mailbox index, summary 
>>>>> statistics, and per-message flags, so perhaps the extra size is 
>>>>> coming from those structures, though it seems a bit high. I will have to 
>>>>> estimate the expected size of the additional structures to understand 
>>>>> this result.
>>>>>
>>>>> Next I dumped 2,000 1M files with random ASCII content into Irmin, 
>>>>> which resulted in a git repository size of 950M. I figure Irmin 
>>>>> compresses the content, right? To verify this I dumped 2,000 2.4M 
>>>>> image files, each with a counter concatenated to make the content 
>>>>> unique. The size of the repository for this was 4.6G, which is expected. 
>>>>> Then I repeated the last test but with identical images, and this time 
>>>>> the size was 27M, which was clearly a nice proof of deduping by Irmin.
>>>>>
>>>>> My question is whether the compression in Irmin is configurable. Can it 
>>>>> be configured per individual content? For instance, I don't want to 
>>>>> compress images, as there is no space saving to gain and so the resource 
>>>>> usage is unnecessary, but I do want to compress text if the 
>>>>> compression overhead is reasonable. I can figure out the type of content 
>>>>> from the MIME type in the IMAP server.
>>>>> Thanks 
>>>>> Gregory
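
To make the dedup and per-content compression question concrete, here is a toy sketch of a content-addressed store that skips compression for some MIME types. It is not Irmin's API and does not show how Irmin is configured; `ContentStore` and `INCOMPRESSIBLE_PREFIXES` are hypothetical. The general idea mirrors Git's object store, where each object is zlib-compressed and keyed by a SHA-1 hash derived from its content, which is why identical content is stored only once. The toy hashes the raw bytes only, without Git's object header.

```python
import hashlib
import zlib

# MIME prefixes treated as already compressed (illustrative, hypothetical list).
INCOMPRESSIBLE_PREFIXES = ("image/", "video/", "audio/")


class ContentStore:
    """Toy in-memory content-addressed store with per-type compression."""

    def __init__(self):
        self.blobs = {}  # hex digest -> (compressed flag, stored bytes)

    def put(self, data, mime_type):
        """Store data once per distinct content and return its key."""
        key = hashlib.sha1(data).hexdigest()
        if key in self.blobs:
            return key  # dedup: identical content is already stored
        compress = not mime_type.startswith(INCOMPRESSIBLE_PREFIXES)
        stored = zlib.compress(data) if compress else data
        self.blobs[key] = (compress, stored)
        return key

    def get(self, key):
        """Return the original bytes for a key."""
        compressed, stored = self.blobs[key]
        return zlib.decompress(stored) if compressed else stored

    def stored_bytes(self):
        """Total bytes held by the store after dedup and compression."""
        return sum(len(stored) for _flag, stored in self.blobs.values())


store = ContentStore()
text = b"hello irmin " * 1000
k1 = store.put(text, "text/plain")
k2 = store.put(text, "text/plain")   # same content, same key, stored once
image = bytes(range(256)) * 40       # stand-in for image data
k3 = store.put(image, "image/jpeg")  # stored as-is, no compression attempted
print(len(store.blobs), store.stored_bytes())
```

The MIME type only decides whether compression is attempted; the key depends on the content alone, so dedup works the same way for compressed and uncompressed blobs.
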



