Re: [Xen-devel] Ongoing/future speculative mitigation work

  To: Tamas K Lengyel
  From: Andrew Cooper
  Date: Thu, 25 Oct 2018 19:39:52 +0100
  Cc: Julien Grall, Jan Beulich, Stefano Stabellini, Daniel Kiper, Marek Marczykowski-Górecki, Lars Kurth, Konrad Rzeszutek Wilk, George Dunlap, Dario Faggioli, Matt Wilson, Boris Ostrovsky, Wei Liu, George Dunlap, Xen-devel, Roger Pau Monné
On 25/10/18 19:35, Tamas K Lengyel wrote:
> On Thu, Oct 25, 2018 at 12:13 PM Andrew Cooper
> <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 25/10/18 18:58, Tamas K Lengyel wrote:
>>> On Thu, Oct 25, 2018 at 11:43 AM Andrew Cooper
>>> <andrew.cooper3@xxxxxxxxxx> wrote:
>>>> On 25/10/18 18:35, Tamas K Lengyel wrote:
>>>>> On Thu, Oct 25, 2018 at 11:02 AM George Dunlap <george.dunlap@xxxxxxxxxx> wrote:
>>>>>> On 10/25/2018 05:55 PM, Andrew Cooper wrote:
>>>>>>> On 24/10/18 16:24, Tamas K Lengyel wrote:
>>>>>>>>> A solution to this issue was proposed, whereby Xen synchronises siblings
>>>>>>>>> on vmexit/entry, so we are never executing code in two different
>>>>>>>>> privilege levels.  Getting this working would make it safe to continue
>>>>>>>>> using hyperthreading even in the presence of L1TF.  Obviously, it's going
>>>>>>>>> to come with a perf hit, but compared to disabling hyperthreading, all it's
>>>>>>>>> got to do is beat a 60% perf hit to make it the preferable option for
>>>>>>>>> making your system L1TF-proof.
>>>>>>>> Could you shed some light on what tests were done where that 60%
>>>>>>>> performance hit was observed? We have performed intensive stress-tests
>>>>>>>> to confirm this but according to our findings turning off
>>>>>>>> hyper-threading is actually improving performance on all machines we
>>>>>>>> tested thus far.
>>>>>>> Aggregate inter and intra host disk and network throughput, which is a
>>>>>>> reasonable approximation of a load of webserver VMs on a single
>>>>>>> physical server.  Small packet IO was hit worst, as it has a very high
>>>>>>> vcpu context switch rate between dom0 and domU.  Disabling HT means you
>>>>>>> have half the number of logical cores to schedule on, which doubles the
>>>>>>> mean time to next timeslice.
>>>>>>> In principle, for a fully optimised workload, HT gets you ~30% extra due
>>>>>>> to increased utilisation of the pipeline functional units.  Some
>>>>>>> resources are statically partitioned, while some are competitively
>>>>>>> shared, and it's now been well proven that actions on one thread can have
>>>>>>> a large effect on others.
>>>>>>> Two arbitrary vcpus are not an optimised workload.  If the perf
>>>>>>> improvement you get from not competing in the pipeline is greater than
>>>>>>> the perf loss from Xen's reduced capability to schedule, then disabling
>>>>>>> HT would be an improvement.  I can certainly believe that this might be
>>>>>>> the case for Qubes style workloads where you are probably not very
>>>>>>> overprovisioned, and you probably don't have long running IO and CPU
>>>>>>> bound tasks in the VMs.
>>>>>> As another data point, I think it was MSCI who said they always disabled
>>>>>> hyperthreading, because they also found that their workloads ran slower
>>>>>> with HT than without.  Presumably they were doing massive number
>>>>>> crunching, such that each thread was waiting on the ALU a significant
>>>>>> portion of the time anyway; at which point the superscalar scheduling
>>>>>> and/or reduction in cache efficiency would have brought performance from
>>>>>> "no benefit" down to "negative benefit".
>>>>> Thanks for the insights. Indeed, we are primarily concerned with
>>>>> performance of Qubes-style workloads which may range from
>>>>> no-oversubscription to heavily oversubscribed. It's not a workload we
>>>>> can predict or optimize before-hand, so we are looking for a default
>>>>> that would be 1) safe and 2) performant in the most general case
>>>>> possible.
>>>> So long as you've got the XSA-273 patches, you should be able to park
>>>> and reactivate hyperthreads using `xen-hptool cpu-{online,offline} $CPU`.
>>>> You should be able to effectively change hyperthreading configuration at
>>>> runtime.  It's not quite the same as changing it in the BIOS, but in terms
>>>> of competition for pipeline resources, it should be good enough.
>>> Thanks, indeed that is a handy tool to have. We often can't disable
>>> hyperthreading in the BIOS anyway because most BIOSes don't allow you
>>> to do that when TXT is used.
>> Hmm - that's an odd restriction.  I don't immediately see why such a
>> restriction would be necessary.
>>> That said, with this tool we still
>>> require some way to determine when to do parking/reactivation of
>>> hyperthreads. We could certainly park hyperthreads when we see the
>>> system is being oversubscribed in terms of number of vCPUs being
>>> active, but for real optimization we would have to understand the
>>> workloads running within the VMs if I understand correctly?
>> TBH, I'd perhaps start with an admin control which lets them switch
>> between the two modes, and some instructions on how/why they might want
>> to try switching.
>> Trying to second-guess the best HT setting automatically is most likely
>> going to be a lost cause.  It will be system specific as to whether the
>> same workload is better with or without HT.
> This may just not be practically possible in the end as the system
> administrator may have no idea what workload will be running on any
> given system. It may also vary from one user to the next on the
> same system, without the users being allowed to tune such details of
> the system. If we can show that with core-scheduling deployed for most
> workloads performance is improved by x % it may be a safe option. But
> if every system needs to be tuned and evaluated in terms of its
> eventual workload, that task becomes problematic. I appreciate the
> insights though!

To a first approximation, a superuser knob of "switch between single and
dual threaded mode" can be used by people to experiment as to which is
faster overall.

If it really is the case that disabling HT makes things faster, then
you've suddenly gained (almost-)core scheduling "for free" alongside
that perf improvement.
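
As a rough illustration of what such a knob might look like, here is a
minimal sketch that builds the `xen-hptool cpu-{online,offline}` commands
for every secondary thread.  The `(cpu, core, socket)` tuple format and both
function names are assumptions made for this example, not an existing Xen
interface; how the topology is obtained is left open.  The sketch only
constructs the command lines and does not run them.

```python
from collections import defaultdict


def secondary_threads(topology):
    """Return CPU ids of every thread except the lowest-numbered one per core.

    `topology` is an iterable of (cpu, core, socket) tuples.  This input
    format is hypothetical; fill it in from whatever topology source you trust.
    """
    by_core = defaultdict(list)
    for cpu, core, socket in topology:
        # Core ids are only unique within a socket, so key on both.
        by_core[(socket, core)].append(cpu)

    parked = []
    for cpus in by_core.values():
        # Keep the first thread of each core, select the rest.
        parked.extend(sorted(cpus)[1:])
    return sorted(parked)


def hptool_commands(topology, smt_enabled):
    """Build one xen-hptool argument list per secondary thread.

    smt_enabled=True brings the sibling threads back online,
    smt_enabled=False parks them.
    """
    action = "cpu-online" if smt_enabled else "cpu-offline"
    return [["xen-hptool", action, str(cpu)]
            for cpu in secondary_threads(topology)]
```

Each returned list could be handed to something like `subprocess.run(cmd,
check=True)` as root in dom0.  Keeping command construction separate from
execution makes it easy to review what would be parked before committing to
it.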


