Xen project Mailing List

[23:59] <vr34> Hi Jesus

[00:01] <jgbarah> Hi!

[00:01] <vr34> So, let me summarize what I have understood of the project?

[00:02] <vr34> i am able to run perceval on the mbox link given

[00:02] <vr34> i need to use perceval.backends.mbox while using it in a script, right?

[00:03] <vr34> also, to get archives analyzed - on what basis is this done?

[00:04] <vr34> also could you suggest a step-by-step approach for this microtask?

[00:04] <jgbarah> Well, now it is perceval.backends.core.mbox, if I'm not wrong, but yes, right

[00:05] <jgbarah> The idea is: when you analyze with Perceval, you get a JSON document per message.

[00:05] <jgbarah> Those documents will be uploaded to ElasticSearch.

[00:06] <vr34> all documents to the same index?

[00:06] <jgbarah> Once they are in ElasticSearch, they will be annotated by thread

[00:06] <jgbarah> (Yes, all documents to the same index)

[00:07] <jgbarah> There is a well-known algorithm for annotating the threads, we will use it

[00:07] <jgbarah> The annotation could be done directly before uploading to ElasticSearch, but that has problems,

[00:07] <jgbarah> such as that in many cases, threads span several archive files

[00:08] <jgbarah> So it is better to first upload to ES, then retrieve the index and analyze

[00:08] <vr34> Oh okay..

[00:08] <jgbarah> For uploading / downloading to ES, you can use elasticsearch-dsl, a Python module for ES

[00:09] <jgbarah> For uploading, you can try uploading document by document, but if possible, use the bulk mode

[00:09] <jgbarah> (there is a helper module provided by elasticsearch-dsl for that)
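
A minimal sketch of what the bulk mode could look like, under several assumptions: the bulk helper is taken to be `elasticsearch.helpers.bulk` from the `elasticsearch` package that elasticsearch-dsl builds on, the index name `xen-devel` and document type `message` are made up for illustration, and using the Message-ID as the document id is only one possible choice. The generator itself needs nothing but the standard library:

```python
def bulk_actions(items, index, doc_type="message"):
    """Yield one bulk 'index' action per Perceval-style item.

    Each item is assumed to be a dict whose 'data' field holds the
    message headers, including 'Message-ID'.
    """
    for item in items:
        yield {
            "_op_type": "index",
            "_index": index,
            "_type": doc_type,  # document types are still expected by ES 5.x
            "_id": item["data"]["Message-ID"],  # assumption: Message-ID as id
            "_source": item,
        }


# Hypothetical usage against a local ES instance (not executed here):
#   from elasticsearch import Elasticsearch
#   from elasticsearch.helpers import bulk
#   es = Elasticsearch(["http://localhost:9200"])
#   bulk(es, bulk_actions(items, index="xen-devel"))
```

Because the actions come from a generator, the helper can send them in chunks without holding every document in memory at once.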

[00:09] <vr34> sure, got it

[00:09] <jgbarah> For downloading, you could get document by document, maintaining state in your program while you annotate

[00:10] <jgbarah> And uploading in batches, once you're done (or just use the bulk helper with a Python generator)

[00:11] <jgbarah> The result should be a thread id for each message, which should always be the same.

[00:11] <jgbarah> For example, it could be the unique id of the first message in the thread

[00:12] <vr34> is this like a primary key for each of the messages?

[00:12] <jgbarah> (I mean, the Message-ID of the first message in the thread)

[00:13] <jgbarah> Each email message should have a Message-ID field, which should be unique. That one.

[00:13] <vr34> okay

[00:13] <jgbarah> Example: Message-id: <6c195e50-0fae-5008-4f34-df5bc7231d38@xxxxxxxxxxxx>
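
To make the idea of a thread id concrete, here is a simplified sketch that assigns to every message the Message-ID of the earliest known ancestor in its reply chain. It is not necessarily the well-known algorithm mentioned earlier in the conversation: it only follows the `In-Reply-To` header (falling back to the last entry of `References`), and it assumes each message is a plain dict keyed by header name and that `In-Reply-To` holds a bare Message-ID.

```python
def thread_ids(messages):
    """Map each Message-ID to the Message-ID of the root of its thread."""
    parent = {}
    for msg in messages:
        mid = msg["Message-ID"].strip()
        reply_to = (msg.get("In-Reply-To") or "").strip()
        refs = (msg.get("References") or "").split()
        # Prefer In-Reply-To; otherwise use the most recent reference.
        parent[mid] = reply_to or (refs[-1] if refs else None)

    def root(mid):
        seen = set()
        # Walk up while the parent is a known message; 'seen' guards
        # against malformed archives in which replies form a loop.
        while parent.get(mid) in parent and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    return {mid: root(mid) for mid in parent}


msgs = [
    {"Message-ID": "<1@example>"},
    {"Message-ID": "<2@example>", "In-Reply-To": "<1@example>"},
    {"Message-ID": "<3@example>", "References": "<1@example> <2@example>"},
]
ids = thread_ids(msgs)
```

All parent links are collected before any root is resolved, so the order in which messages arrive (for instance, from different archive files) does not change the result.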

[00:14] <jgbarah> Maybe more clear now?

[00:14] <vr34> Yes, a lot clearer now!

[00:14] <vr34> thanks a lot

[00:14] <jgbarah> Great! You're welcome!

[00:14] <vr34> reg the microtask

[00:15] <vr34> could you suggest what i could work on everyday?

[00:16] <vr34> i can start with generating the json file output from perceval

[00:16] <vr34> and dsl today

[00:17] <jgbarah> Yes, please. You can organize as you want, and I will be happy to receive your progress messages every day, or whenever you feel comfortable

[00:17] <jgbarah> The pace depends on you.

[00:17] <vr34> okay, sure.

[00:17] <jgbarah> I would start by writing a simple script parsing a file, given its url (or its file name, if you prefer)

[00:18] <jgbarah> You have an example in the GrimoireLab training manual
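
A first script along those lines might look like the sketch below. It assumes Perceval's `MBox` backend in `perceval.backends.core.mbox`, constructed with a `uri` naming the origin and a `dirpath` pointing at a local directory of mbox files, and it assumes each fetched item carries the message headers under `item['data']`. The helper names `fetch_mbox` and `subjects` are invented for illustration.

```python
def subjects(items):
    """Yield (Message-ID, Subject) pairs from Perceval-style items."""
    for item in items:
        data = item["data"]
        yield data.get("Message-ID"), data.get("Subject")


def fetch_mbox(uri, dirpath):
    """Return a generator of items parsed by Perceval's mbox backend."""
    # Imported lazily so that subjects() can be tried without Perceval.
    from perceval.backends.core.mbox import MBox
    return MBox(uri=uri, dirpath=dirpath).fetch()


# Hypothetical usage (requires Perceval and a directory with mbox files):
#   for message_id, subject in subjects(fetch_mbox("xen-devel", "archives")):
#       print(message_id, subject)
```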

[00:18] <jgbarah> Then, I would improve the script to upload the documents to ES

[00:19] <jgbarah> Then, I would write a script to download documents, annotate each with anything, and re-upload them again to a new index

[00:19] <jgbarah> Just to become familiar with ES and elasticsearch-dsl, and if possible with the bulk mode

[00:19] <jgbarah> Then, I would improve that script to run the threading algorithm

[00:20] <jgbarah> And when you're done with that, you're done ;-)

[00:20] <vr34> Ah okay!

[00:20] <jgbarah> Anything else?

[00:20] <vr34> That's about it! Thank you so much!

[00:21] <jgbarah> Oh, I forgot to mention: this should work with Python3 and ES 5.3, if possible

[00:21] <vr34> Right

[00:21] <jgbarah> For Perceval, use the latest version available via pip

[00:21] <jgbarah> Very likely there is going to be a new one during the weekend

[00:22] <vr34> Oh okay, i will update it to the latest version

[00:22] <jgbarah> And a final note: for transparency and reference, please get the log of this session, and send it to the xen mailing list

[00:22] <vr34> another question

[00:22] <jgbarah> copying Lars and myself

[00:22] <jgbarah> Yes please

[00:22] <vr34> Sure

[00:23] <vr34> would we be using logstash in this project for parsing logs? i didn't see any mention of it anywhere

[00:23] <jgbarah> No, we use Perceval to parse (mbox files in this case), and then upload directly with Python

[00:24] <jgbarah> You can say that you're writing your own LS ;-)

[00:24] <vr34> haha okay!

[00:24] <jgbarah> Nothing else on my side. Anything else from you?

[00:25] <vr34> do i give updates on irc/mail?

[00:25] <jgbarah> Please update by mail, since that works asynchronously, and ping me on irc whenever you find me, if you need it

[00:25] <vr34> Sure, thanks a lot!

[00:25] <jgbarah> We can schedule irc slots when you need.

[00:26] <jgbarah> Thanks to you for your interest in this project

[00:26] <jgbarah> See you!

[00:26] <vr34> and thanks for helping me contribute! See you !

[Xen-devel] Project discussion log