[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history
Hi, I promised you to do a risk/benefit analysis on this series and here is my report. With your permission I plan to push it on Sunday night or Monday morning, if you think that is a convenient time. Summary: There are three kinds of risk here: * There is a nonneglible chance that these changes have a significant adverse performance impact on post-flight reporting, so that overall throughput is adversely affected. I have tried to exclude it by both reasoning and testing but it remains a risk. I propose to deal with this risk by pushing the change to osstest pretest at the beginning of the week, so that when it makes it through the self-push gate I am around to monitor it. I will check to see that it is DTRT, and, particularly, that the reporting is not overly slow. * I expect a certain amount of additional delay during the transitional period, when some flights are using old code and some new code. I propose to deal with this issue by negotiating a good time to do this when we can afford to, effectively, lose a few hours' throughput. * There is a pretty small chance that these changes breaks everything by causing all flights to crash during host reporting. This will be obvious, especially if I'm watching it all closely. If this happens it will need to be reverted. If we decide this series is a problem, after it has gone into production, we can simply revert it. There is nothing else in the osstest push gate right now. The old code will still function and we could confidently force push it. The upside of this change is to undo a regression in our ability to diagnose host problems. Particularly, if a host has a low probability or intermittent fault, we will want to be able to look further back than the current ~200 jobs (not sure how long that is without looking it up but it is only a few days I think, at least for some hosts). Ian. Patch-by-patch notes: sg-report-host-history: Improve debugging output This is just additional prints. If they accidentally refer to wrong variables, this would generate perl nonfatal warnings in debug mode (which we do not use in production). sg-report-host-history: New --no-install option for testing By inspection and testing this code does nothing if the new option is not passed. sg-report-host-history: Move `computeflightsrange' after hosts I double checked that global variables used and set by computeflightsrange. It uses and sets $flightlimit; nothing else uses this and it is set by the option parser. It uses $limit, which is only set by the option parser. It sets $minflight and $flightcond; these are used only by mainquery, which still comes after computeflightsrange. sg-report-host-history: Actually honour $minflight The effect of this is to limit the output from some of sg-report-host-history's queries. If this is wrong somehow the worst case is that information would be missing from the host history reports. That information would be for flights earlier than a minimum flight number, so it would be quite obvious. In principle the code code have a bug which causes the queries to fail, for example if the parameters or syntax are wrong. But the new syntax is unconditional and such a bug should therefore be spotted during testing. sg-report-host-history: Get job status from mainquery This unconditionally joins the jobs table to the runvars table in the `mainquery'. (Unconditionality means the query syntax is right.) The jobs table is much smaller. A handful of empirical tests suggest this change does not slow things down significantly. It not particularly likely, but it is possible that this will be different in production. The change to the $infoq is slightly confusing. There is now a dummy "AND ?!='X'" condition in the query. Its purpose is to consume a redundant job name argument which is not needed any more. jobs are never called X so this condition is always true. Testing shows this works. sg-report-host-history: Add $cachekey argument to jobquery This patch does nothing but add an unused argument. Syntax errors and missed call sites (even on non-taken paths) would be caught by perl. sg-report-host-history: Store per-job query results in %$jr This is quite complex. It stores new data in a hash %$jr which is about the size of the host history report. Those host history reports have limited size so we expect this to be OK from a performance point of view. If not, we would see slow sg-report-host-history processes (see mitigation above). In principle this code might cause perl errors and cause sg-report-host-history to crash, maybe because of a wrong or undefined reference. But I have tested both the cache hit and cache miss cases. sg-report-host-history: Write cache entries sg-report-host-history: Write cache entries for tail, too This dumps the data out to the HTML. There is new fiddly quoting code but it is largely unconditional so has been executed and tested, so it will probably not crash entirely. There remains a risk that the quoting algorithm or something else is wrong and generates corrupted HTML. That would not be a crisis for us as users, but it might affect the program's ability to read it in. See the next section for that: sg-report-host-history: Read cache entries The biggest risk here is that the logfile parser which reads the cache entries finds something it doesn't like and crashes, refusing to parse it. If this occurs it is because of strange data in the osstest database: weird job names or something, which trigger quoting/unquoting bugs. But this code has been manually tested on existing recent data. So existing data is good. And we aren't making new changes to osstest. sg-report-host-history: Move job runvars query later This is fine because it just sets local (my) variables. Perl would notice if we had got things wrong. sg-report-host-history: Cache runvar queries (power information) This relies on the changes made so far and does not add significant risks of its own. Revert "sg-report-host-history: Reduce limit from 2000 to 200" This is the purpose of the exercise. The risk is that the changes are not sufficient to, in practice, give adequate performance. During the transition (while some jobs are using new code and some old) there will be some delays as things are needlessly regenerated, but afterwards all should be well. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |