
Re: [MirageOS-devel] irmin storage overhead and dedup



> I looked at the metadata that gets created for every email message and it's
> small - less than 100 bytes. So I ran a simple test of appending 20,000
> unique 100-byte ASCII messages. I would have expected the repository size to
> be on the order of a few megabytes; instead it was 4.7G. This is roughly 234K
> of overhead per 100-byte message, which would be quite impractical for
> email storage, with the metadata essentially exceeding the message storage.

Did you start from an empty repository? I would be interested in running your
code locally to check what happens.

More generally, all the benchmarks/experiments you are running are very useful.
It would be nice to put them somewhere online and turn them into functional
tests that run regularly, to check that the serialisation format doesn't go
crazy.
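
As a rough illustration of what such a check could look like, here is a minimal
Python sketch. It does not call Irmin at all: it only measures a directory on
disk and compares the average per-message overhead against a budget. The
function names and the idea of a fixed byte budget are assumptions made for
this example, not something taken from the benchmarks discussed here.

```python
import os


def dir_size_bytes(root):
    """Total size of the regular files under root (e.g. a repository's .git)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def overhead_per_message(repo_bytes, n_messages, payload_bytes):
    """Average bytes stored per message beyond the payload itself."""
    return repo_bytes / n_messages - payload_bytes


def check_overhead(repo_bytes, n_messages, payload_bytes, max_overhead):
    """True if the average per-message overhead stays within max_overhead."""
    return overhead_per_message(repo_bytes, n_messages, payload_bytes) <= max_overhead


# Figures quoted in this thread: 4.7G for 20,000 messages of 100 bytes each,
# reading "G" as 10**9 bytes (an assumption; the thread does not say).
reported = overhead_per_message(4.7e9, 20_000, 100)
```

With the numbers from the thread, `reported` comes out at about 234,900 bytes
per message, which matches the "roughly 234K" figure. A regular test run could
call `dir_size_bytes` on the freshly written repository and fail when
`check_overhead` returns False.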

Thanks!
Thomas




> 
> Gregory
> 
>> On Dec 30, 2014, at 7:07 PM, Gregory Tsipenyuk <gt303@xxxxxxxxx> wrote:
>> 
>> Hi Thomas,
>> 
>> I'm trying to figure out what kind of storage overhead and dedup I get in
>> Irmin. First I tried to convert the Google email archive (2.4G) to the IMAP
>> server Irmin format. After conversion the size of the git repository was
>> twice the size of the original archive. I do have some additional structures
>> that I create, like a per-mailbox index and summary statistics and
>> per-message flags, so perhaps the extra size is coming from those structures,
>> though it seems a bit high. I will have to estimate the expected size of the
>> additional structures to understand this result. Next I dumped into Irmin
>> 2,000 1M files with random ASCII content, which resulted in a git
>> repository size of 950M. I figure Irmin compresses the content, right? To
>> verify this I dumped 2,000 2.4M image files, each with a counter concatenated
>> to make the content unique. The size of the repository for this was 4.6G,
>> which is expected. Then I repeated the last test but with identical images,
>> and this time the size was 27M, which was clearly a nice proof of the
>> deduping by Irmin. My question is whether the compression in Irmin is
>> configurable. Can it be configured per individual content? For instance, I
>> don't want to compress images, as there is no space saving to gain and
>> consequently there is unnecessary resource usage, but I do want to compress
>> the text if the compression overhead is reasonable. I can figure out the
>> type of content from the MIME type in the IMAP server.
>> 
>> Thanks 
>> Gregory
> 
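
The dedup and compression behaviour described above is what one would expect
from a content-addressed store: Git names each object by a hash of its content
and zlib-compresses it, so identical content is stored once and text shrinks
while already-compressed images do not. The toy Python sketch below illustrates
that idea together with a per-MIME-type compression choice. It is not Irmin's
API; `ToyStore`, `should_compress` and the `ALREADY_COMPRESSED` list are
hypothetical names invented for this example.

```python
import hashlib
import zlib

# Hypothetical policy: MIME type prefixes treated as already compressed.
ALREADY_COMPRESSED = ("image/", "video/", "audio/", "application/zip")


def should_compress(mime_type):
    """Compress everything except types that are unlikely to shrink."""
    return not mime_type.startswith(ALREADY_COMPRESSED)


class ToyStore:
    """In-memory content-addressed store (illustration only, not Irmin)."""

    def __init__(self):
        self.blobs = {}  # SHA-1 hex digest of the content -> stored bytes

    def put(self, data, mime_type):
        key = hashlib.sha1(data).hexdigest()
        if key not in self.blobs:  # identical content is stored only once
            if should_compress(mime_type):
                # A one-byte tag records whether the body is compressed.
                self.blobs[key] = b"Z" + zlib.compress(data)
            else:
                self.blobs[key] = b"R" + data
        return key

    def get(self, key):
        stored = self.blobs[key]
        body = stored[1:]
        return zlib.decompress(body) if stored[:1] == b"Z" else body

    def stored_bytes(self):
        return sum(len(v) for v in self.blobs.values())
```

Putting the same bytes twice returns the same key and leaves `stored_bytes()`
unchanged, which mirrors the identical-images test; putting data with an
`image/` MIME type stores it raw, so no CPU time is spent on compression that
would gain nothing.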


_______________________________________________
MirageOS-devel mailing list
MirageOS-devel@xxxxxxxxxxxxxxxxxxxx
http://lists.xenproject.org/cgi-bin/mailman/listinfo/mirageos-devel

 

