[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history


  • To: Jürgen Groß <jgross@xxxxxxxx>
  • From: Ian Jackson <ian.jackson@xxxxxxxxxx>
  • Date: Wed, 20 Nov 2019 17:54:30 +0000
  • Authentication-results: esa5.hc3370-68.iphmx.com; dkim=none (message not signed) header.i=none; spf=None smtp.pra=ian.jackson@xxxxxxxxxx; spf=Pass smtp.mailfrom=Ian.Jackson@xxxxxxxxxx; spf=None smtp.helo=postmaster@xxxxxxxxxxxxxxx
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 20 Nov 2019 17:54:46 +0000
  • Ironport-sdr: zt9pxEgYYxMF0qkrzUt+Sdcy46/n+ldkZ+1z1NYaQBh+0ZNiWDIw9jJlnG94BMrrViCiVx033t BPp4e0YIb0uQlT0Kh++TAZ0DTIOJgpXT6X7dOzmZoMfLV2CXVYxRt0HUGcH+CcMVc6nFg549S8 8IvBbrDcLAiduC/DcF/+cx/7xcvkCxopMHNFbqw3TcYAOerOpvFvYUn5Dz7qxT04Zu0IoMAUM2 x4vwi/9EQ5OCwAajcNaNX46VmgSs7OGM96tyrD6ayBl32tGUdMT2hn56sfJWbJhWA3PnAOHTxt PWA=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi, I promised you to do a risk/benefit analysis on this series and
here is my report.  With your permission I plan to push it on Sunday
night or Monday morning, if you think that is a convenient time.


Summary:

There are three kinds of risk here:

* There is a nonneglible chance that these changes have a significant
  adverse performance impact on post-flight reporting, so that
  overall throughput is adversely affected.  I have tried to exclude
  it by both reasoning and testing but it remains a risk.

  I propose to deal with this risk by pushing the change to osstest
  pretest at the beginning of the week, so that when it makes it
  through the self-push gate I am around to monitor it.  I will check
  to see that it is DTRT, and, particularly, that the reporting is not
  overly slow.

* I expect a certain amount of additional delay during the
  transitional period, when some flights are using old code and some
  new code.

  I propose to deal with this issue by negotiating a good time to do
  this when we can afford to, effectively, lose a few hours'
  throughput.

* There is a pretty small chance that these changes breaks everything
  by causing all flights to crash during host reporting.

  This will be obvious, especially if I'm watching it all closely.
  If this happens it will need to be reverted.

If we decide this series is a problem, after it has gone into
production, we can simply revert it.  There is nothing else in the
osstest push gate right now.  The old code will still function and we
could confidently force push it.

The upside of this change is to undo a regression in our ability to
diagnose host problems.  Particularly, if a host has a low probability
or intermittent fault, we will want to be able to look further back
than the current ~200 jobs (not sure how long that is without looking
it up but it is only a few days I think, at least for some hosts).

Ian.


Patch-by-patch notes:


sg-report-host-history: Improve debugging output

This is just additional prints.  If they accidentally refer to wrong
variables, this would generate perl nonfatal warnings in debug mode
(which we do not use in production).


sg-report-host-history: New --no-install option for testing

By inspection and testing this code does nothing if the new option is
not passed.


sg-report-host-history: Move `computeflightsrange' after hosts

I double checked that global variables used and set by
computeflightsrange.  It uses and sets $flightlimit; nothing else uses
this and it is set by the option parser.  It uses $limit, which is
only set by the option parser.  It sets $minflight and $flightcond;
these are used only by mainquery, which still comes after
computeflightsrange.


sg-report-host-history: Actually honour $minflight

The effect of this is to limit the output from some of
sg-report-host-history's queries.  If this is wrong somehow the worst
case is that information would be missing from the host history
reports.  That information would be for flights earlier than a minimum
flight number, so it would be quite obvious.

In principle the code code have a bug which causes the queries to
fail, for example if the parameters or syntax are wrong.  But the new
syntax is unconditional and such a bug should therefore be spotted
during testing.


sg-report-host-history: Get job status from mainquery

This unconditionally joins the jobs table to the runvars table in the
`mainquery'.  (Unconditionality means the query syntax is right.)

The jobs table is much smaller.  A handful of empirical tests suggest
this change does not slow things down significantly.  It not
particularly likely, but it is possible that this will be different in
production.

The change to the $infoq is slightly confusing.  There is now a dummy
"AND ?!='X'" condition in the query.  Its purpose is to consume a
redundant job name argument which is not needed any more.  jobs are
never called X so this condition is always true.  Testing shows this
works.


sg-report-host-history: Add $cachekey argument to jobquery

This patch does nothing but add an unused argument.  Syntax errors and
missed call sites (even on non-taken paths) would be caught by perl.


sg-report-host-history: Store per-job query results in %$jr

This is quite complex.  It stores new data in a hash %$jr which is
about the size of the host history report.  Those host history reports
have limited size so we expect this to be OK from a performance point
of view.  If not, we would see slow sg-report-host-history processes
(see mitigation above).

In principle this code might cause perl errors and cause
sg-report-host-history to crash, maybe because of a wrong or undefined
reference.  But I have tested both the cache hit and cache miss cases.


sg-report-host-history: Write cache entries
sg-report-host-history: Write cache entries for tail, too

This dumps the data out to the HTML.  There is new fiddly quoting code
but it is largely unconditional so has been executed and tested, so it
will probably not crash entirely.  There remains a risk that the
quoting algorithm or something else is wrong and generates corrupted
HTML.  That would not be a crisis for us as users, but it might affect
the program's ability to read it in.  See the next section for that:


sg-report-host-history: Read cache entries

The biggest risk here is that the logfile parser which reads the cache
entries finds something it doesn't like and crashes, refusing to parse
it.

If this occurs it is because of strange data in the osstest database:
weird job names or something, which trigger quoting/unquoting bugs.
But this code has been manually tested on existing recent data.  So
existing data is good.  And we aren't making new changes to osstest.


sg-report-host-history: Move job runvars query later

This is fine because it just sets local (my) variables.  Perl would
notice if we had got things wrong.


sg-report-host-history: Cache runvar queries (power information)

This relies on the changes made so far and does not add significant
risks of its own.


Revert "sg-report-host-history: Reduce limit from 2000 to 200"

This is the purpose of the exercise.

The risk is that the changes are not sufficient to, in practice, give
adequate performance.  During the transition (while some jobs are
using new code and some old) there will be some delays as things are
needlessly regenerated, but afterwards all should be well.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.