
Re: [MirageOS-devel] irmin storage overhead and dedup

> Does it make sense to have a "benchmark" folder under Irmin to check in the 
> tests?

I've just created https://github.com/mirage/irmin-rt


>> On Dec 31, 2014, at 5:32 AM, Thomas Gazagnaire <thomas@xxxxxxxxxxxxxx> wrote:
>>> I looked at the metadata that gets created for every email message and it's 
>>> small - less than 100 bytes. So I ran a simple test of appending 20,000 
>>> unique 100-byte ASCII messages. I would have expected the repository size 
>>> to be on the order of a few megabytes; instead it was 4.7G. This is roughly 
>>> 234K of overhead per 100-byte message, which would be quite impractical for 
>>> email storage, with the metadata essentially exceeding the message 
>>> storage.
>> Did you start from an empty repository? I would be interested to run your 
>> code locally to check what happens. 
>> More generally, all the benchmarks/experiments you are running are very 
>> useful; it would be nice to put them somewhere online and turn them into 
>> functional tests that run regularly, to check that the serialisation format 
>> doesn't go crazy.
>> Thanks!
>> Thomas
>>> Gregory
>>>> On Dec 30, 2014, at 7:07 PM, Gregory Tsipenyuk <gt303@xxxxxxxxx> wrote:
>>>> Hi Thomas,
>>>> I'm trying to figure out what kind of storage overhead and dedup I get in 
>>>> Irmin. First I tried to convert the Google email archive (2.4G) to the 
>>>> IMAP server Irmin format. After conversion the size of the git repository 
>>>> was twice the size of the original archive. I do have some additional 
>>>> structures that I create, like a per-mailbox index, summary statistics, 
>>>> and per-message flags, so perhaps the extra size is coming from those 
>>>> structures, though it seems a bit high. I will have to estimate the 
>>>> expected size of the additional structures to understand this result. Next I 
>>>> dumped into Irmin 2,000 1M files with random ASCII content, which 
>>>> resulted in a git repository size of 950M. I figure Irmin compresses the 
>>>> content, right? To verify this I dumped 2,000 2.4M image files, each with 
>>>> a counter concatenated to make the content unique. The size of the repository 
>>>> for this was 4.6G, which is expected. Then I repeated the last test but 
>>>> with identical images, and this time the size was 27M, which was clearly a 
>>>> nice proof of the deduping by Irmin. My question is whether the 
>>>> compression in Irmin is configurable. Can it be configured per 
>>>> individual content? For instance, I don't want to compress images, as there 
>>>> is nothing to gain in space savings and consequently there is 
>>>> unnecessary resource usage, but I do want to compress the text if the 
>>>> compression overhead is reasonable. I can figure out the type of content 
>>>> from the MIME type in the IMAP server.
>>>> Thanks 
>>>> Gregory
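
For anyone wanting to sanity-check the sizes quoted in this thread, here is a
minimal Python sketch of the arithmetic. It is only a back-of-the-envelope
check, not part of Irmin or of the benchmark code discussed above. The helper
names (overhead_per_item, size_ratio) are made up for illustration, and it
assumes that K, M and G are decimal units, which the thread does not state.

```python
# Back-of-the-envelope check of the repository sizes quoted in the thread.
# Assumption: K, M and G are decimal units (10^3, 10^6, 10^9).
K, M, G = 10**3, 10**6, 10**9


def overhead_per_item(repo_bytes, n_items, payload_bytes):
    """Average bytes stored per item beyond the item's own payload."""
    return repo_bytes / n_items - payload_bytes


def size_ratio(repo_bytes, n_items, payload_bytes):
    """Repository size divided by the total raw payload size."""
    return repo_bytes / (n_items * payload_bytes)


# 20,000 unique 100-byte messages -> 4.7G repository
msg_overhead = overhead_per_item(4.7 * G, 20_000, 100)

# 2,000 files of 1M random ASCII -> 950M repository
ascii_ratio = size_ratio(950 * M, 2_000, 1 * M)

# 2,000 unique 2.4M images -> 4.6G repository
unique_ratio = size_ratio(4.6 * G, 2_000, 2.4 * M)

# 2,000 identical 2.4M images -> 27M repository
identical_ratio = size_ratio(27 * M, 2_000, 2.4 * M)

print(f"overhead per message: {msg_overhead / K:.1f}K")
print(f"random ASCII, repo/raw: {ascii_ratio:.3f}")
print(f"unique images, repo/raw: {unique_ratio:.3f}")
print(f"identical images, repo/raw: {identical_ratio:.4f}")
```

Under the decimal-unit assumption this gives about 234.9K of overhead per
100-byte message, consistent with the "roughly 234K" figure above, while the
identical-image repository comes out at well under 1% of the raw payload.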
