
[Xen-devel] Project discussion log



[23:59] <vr34> Hi Jesus
[00:01] <jgbarah> Hi!
[00:01] <vr34> So, let me summarize what I have understood of the project?
[00:02] <vr34> i am able to run perceval on the mbox link given
[00:02] <vr34> i need to use perceval.backends.mbox while using it in a script, right?
[00:03] <vr34> also, to get archives analyzed - on what basis is this done?
[00:04] <vr34> also could you suggest a step-by-step approach for this microtask?
[00:04] <jgbarah> Well, now it is perceval.backends.core.mbox, if I'm not wrong, but yes, right
[00:05] <jgbarah> The idea is: when you analyze with Perceval, you get a JSON document per message.
[00:05] <jgbarah> Those documents will be uploaded to ElasticSearch.
[00:06] <vr34> all documents to the same index?
[00:06] <jgbarah> Once they are in ElasticSearch, they will be annotated by thread
[00:06] <jgbarah> (Yes, all documents to the same index)
[00:07] <jgbarah> There is a well-known algorithm for annotating the threads, we will use it
[00:07] <jgbarah> The annotation could be done directly before uploading to ElasticSearch, but that has problems,
[00:07] <jgbarah> such as that in many cases, threads span several archive files
[00:08] <jgbarah> So it is better to first upload to ES, then retrieve the index and analyze
[00:08] <vr34> Oh okay..
[00:08] <jgbarah> For uploading / downloading to ES, you can use elasticsearch-dsl, a Python module for ES
[00:09] <jgbarah> For uploading, you can try uploading document by document, but if possible, use the bulk mode
[00:09] <jgbarah> (there is a helper module provided by elasticsearch-dsl for that)
[00:09] <vr34> sure, got it
[00:09] <jgbarah> For downloading, you could get document by document, maintaining state in your program while you annotate
[00:10] <jgbarah> And uploading in batches, once you're done (or just use the bulk helper with a Python generator)
[00:11] <jgbarah> The result should be a thread id for each message, which should be always the same.
[00:11] <jgbarah> For example, it could be the unique id of the first message in the thread
[00:12] <vr34> is this like a primary key for each of the messages?
[00:12] <jgbarah> (I mean, the Message-ID of the first message in the thread)
[00:13] <jgbarah> Each email message should have a Message-ID field, which should be unique. That one.
[00:13] <vr34> okay
[00:13] <jgbarah> Example: Message-id: <6c195e50-0fae-5008-4f34-df5bc7231d38@xxxxxxxxxxxx>
[00:14] <jgbarah> Maybe more clear now?
[00:14] <vr34> Yes, a lot clearer now!
[00:14] <vr34> thanks a lot
[00:14] <jgbarah> Great! You're welcome!
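
The thread id idea discussed above can be sketched roughly as follows. This is a simplified stand-in that walks In-Reply-To / References links up to the first known message; it is not the well-known threading algorithm mentioned in the session. It assumes each message is a dict with "Message-ID", "In-Reply-To" and "References" keys; the function names and the sample messages are hypothetical.

```python
def parent_of(message):
    """Return the Message-ID this message replies to, or None."""
    in_reply_to = (message.get("In-Reply-To") or "").split()
    if in_reply_to:
        return in_reply_to[0]
    # Fall back to the last entry in References (usually the direct parent).
    references = (message.get("References") or "").split()
    return references[-1] if references else None


def assign_thread_ids(messages):
    """Map each Message-ID to the Message-ID of the first known message in its thread."""
    parents = {m["Message-ID"]: parent_of(m) for m in messages}
    thread_ids = {}
    for message_id in parents:
        seen = set()
        root = message_id
        # Climb while the parent is a known message; stop if a cycle is detected.
        while parents.get(root) in parents and root not in seen:
            seen.add(root)
            root = parents[root]
        thread_ids[message_id] = root
    return thread_ids


# Hypothetical sample data, for illustration only.
messages = [
    {"Message-ID": "<a@example.org>"},
    {"Message-ID": "<b@example.org>", "In-Reply-To": "<a@example.org>"},
    {"Message-ID": "<c@example.org>",
     "References": "<a@example.org> <b@example.org>"},
    {"Message-ID": "<d@example.org>"},
]
thread_ids = assign_thread_ids(messages)
```

Because the whole index is available when this runs, a reply stored in a different archive file than its parent still resolves to the same root. Replies whose parent is missing from the index become their own root in this sketch, which a full threading algorithm would handle better.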
[00:14] <vr34> reg the microtask
[00:15] <vr34> could you suggest what i could work on everyday?
[00:16] <vr34> i can start with generating the json file output from perceval
[00:16] <vr34> and dsl today
[00:17] <jgbarah> Yes, please. You can organize as you may want, and I will be happy to receive your progress messages every day, or when you feel comfortable
[00:17] <jgbarah> The pace depends on you.
[00:17] <vr34> okay, sure.
[00:17] <jgbarah> I would start by writing a simple script parsing a file, given its url (or its file name, if you prefer)
[00:18] <jgbarah> You have an example in the GrimoireLab training manual
[00:18] <jgbarah> Then, I would improve the script to upload the documents to ES
[00:19] <jgbarah> Then, I would write a script to download documents, annotate each with anything, and re-upload them to a new index
[00:19] <jgbarah> Just to become familiar with ES and elasticsearch-dsl, and if possible with the bulk mode
[00:19] <jgbarah> Then, I would improve that script to run the threading algorithm
[00:20] <jgbarah> And when you're done with that, you're done ;-)
[00:20] <vr34> Ah okay!
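
The upload and re-upload steps outlined above could be sketched as Python generators that feed a bulk helper. This assumes the `helpers.bulk` and `helpers.scan` functions of the `elasticsearch` Python client, and Perceval-style items carrying "uuid" and "data" keys. The index names ("mbox_raw", "mbox_annotated"), the "annotated" field and the function names are hypothetical. Only the generator logic runs here; the ES calls are left in comments because they need a running server.

```python
def upload_actions(items, index, doc_type="message"):
    """Yield one bulk action per Perceval-style item."""
    for item in items:
        yield {
            "_index": index,
            "_type": doc_type,  # ES 5.x still expects a document type
            "_id": item["uuid"],
            "_source": item["data"],
        }


def annotate_actions(hits, new_index, annotate):
    """Yield bulk actions that copy hits to new_index with extra fields."""
    for hit in hits:
        source = dict(hit["_source"])  # copy, so the original hit is untouched
        source.update(annotate(source))
        yield {
            "_index": new_index,
            "_type": hit["_type"],
            "_id": hit["_id"],
            "_source": source,
        }


# With a running ES instance, the generators could be consumed like this:
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch(["http://localhost:9200"])
#   helpers.bulk(es, upload_actions(items, "mbox_raw"))
#   hits = helpers.scan(es, index="mbox_raw")
#   helpers.bulk(es, annotate_actions(hits, "mbox_annotated", my_annotate))

# Hypothetical sample data standing in for Perceval items and ES hits.
items = [{"uuid": "1",
          "data": {"Message-ID": "<a@example.org>", "Subject": "Hello"}}]
actions = list(upload_actions(items, "mbox_raw"))
fake_hits = [{"_index": "mbox_raw", "_type": "message", "_id": "1",
              "_source": items[0]["data"]}]
annotated = list(
    annotate_actions(fake_hits, "mbox_annotated", lambda doc: {"annotated": True})
)
```

Using generators keeps memory use flat, since documents are produced only as the bulk helper asks for them.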
[00:20] <jgbarah> Anything else?
[00:20] <vr34> That's about it! Thank you so much!
[00:21] <jgbarah> Oh, I forgot to mention: this should work with Python3 and ES 5.3, if possible
[00:21] <vr34> Right
[00:21] <jgbarah> For Perceval, use the latest version available via pip
[00:21] <jgbarah> Very likely there is going to be a new one during the weekend
[00:22] <vr34> Oh okay, i will update it to the latest version
[00:22] <jgbarah> And a final note: for transparency and reference, please get the log of this session, and send it to the xen mailing list
[00:22] <vr34> another question
[00:22] <jgbarah> copying Lars and myself
[00:22] <jgbarah> Yes please
[00:22] <vr34> Sure
[00:23] <vr34> would we be using logstash in this project for parsing logs? i didn't see any mention of it anywhere
[00:23] <jgbarah> No, we use Perceval to parse (mbox files in this case), and then upload directly with Python
[00:24] <jgbarah> You can say that you're writing your own LS ;-)
[00:24] <vr34> haha okay!
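
To illustrate the "one JSON document per message" idea without Logstash, here is a minimal sketch that uses Python's standard `mailbox` module as a stand-in for Perceval. Perceval's real output has a richer structure, so treat the document layout, the function name `mbox_to_json_docs` and the sample messages as assumptions made for illustration.

```python
import json
import mailbox
import os
import tempfile


def mbox_to_json_docs(path):
    """Yield one JSON string per message in the mbox file at path."""
    box = mailbox.mbox(path)
    try:
        for message in box:
            # Headers become top-level fields; duplicate headers keep the last value.
            doc = {name: str(value) for name, value in message.items()}
            payload = message.get_payload()
            # Multipart payloads are lists of parts; this sketch keeps plain text only.
            doc["body"] = payload if isinstance(payload, str) else ""
            yield json.dumps(doc)
    finally:
        box.close()


# Hypothetical two-message archive, written to a temporary file.
sample = (
    "From alice@example.org Thu Apr  6 00:00:00 2017\n"
    "Message-ID: <a@example.org>\n"
    "Subject: Hello\n"
    "\n"
    "First message\n"
    "\n"
    "From bob@example.org Thu Apr  6 00:05:00 2017\n"
    "Message-ID: <b@example.org>\n"
    "In-Reply-To: <a@example.org>\n"
    "Subject: Re: Hello\n"
    "\n"
    "A reply\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w") as handle:
        handle.write(sample)
    docs = [json.loads(doc) for doc in mbox_to_json_docs(path)]
```

Each resulting document carries the headers needed later for threading, such as Message-ID and In-Reply-To.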
[00:24] <jgbarah> Nothing else on my side. Anything else from you?
[00:25] <vr34> do i give updates on irc/mail?
[00:25] <jgbarah> Please update by mail, since that works asynchronously, and ping me on irc whenever you find me, if you need it
[00:25] <vr34> Sure, thanks a lot!
[00:25] <jgbarah> We can schedule irc slots when you need.
[00:26] <jgbarah> Thanks to you for your interest in this project
[00:26] <jgbarah> See you!
[00:26] <vr34> and thanks for helping me contribute! See you !
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

