
Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt core logic handling

On 08/03/16 17:26, Jan Beulich wrote:
>>>> On 08.03.16 at 18:05, <george.dunlap@xxxxxxxxxx> wrote:
>> On 08/03/16 15:42, Jan Beulich wrote:
>>>>>> On 08.03.16 at 15:42, <George.Dunlap@xxxxxxxxxxxxx> wrote:
>>>> On Tue, Mar 8, 2016 at 1:10 PM, Wu, Feng <feng.wu@xxxxxxxxx> wrote:
>>>>>> -----Original Message-----
>>>>>> From: George Dunlap [mailto:george.dunlap@xxxxxxxxxx]
>>>>>> 2. Try to test engineered situations where we expect this to be a
>>>>>> problem, to see how big of a problem it is (proving the theory to be
>>>>>> accurate or inaccurate in this case)
>>>>> Maybe we can run a SMP guest with all the vcpus pinned to a dedicated
>>>>> pCPU, we can run some benchmark in the guest with VT-d PI and without
>>>>> VT-d PI, then see the performance difference between these two scenarios.
>>>> This would give us an idea what the worst-case scenario would be.
>>> How would a single VM ever give us an idea about the worst
>>> case? Something getting close to worst case is a ton of single
>>> vCPU guests all temporarily pinned to one and the same pCPU
>>> (could be multi-vCPU ones, but the more vCPU-s the more
>>> artificial this pinning would become) right before they go into
>>> blocked state (i.e. through one of the two callers of
>>> arch_vcpu_block()), the pinning removed while blocked, and
>>> then all getting woken at once.
>> Why would removing the pinning be important?
> It's not important by itself, other than to avoid all vCPU-s then
> waking up on the one pCPU.
>> And I guess it's actually the case that it doesn't need all VMs to
>> actually be *receiving* interrupts; it just requires them to be
>> *capable* of receiving interrupts, for there to be a long chain all
>> blocked on the same physical cpu.
> Yes.
>>>>  But
>>>> pinning all vcpus to a single pcpu isn't really a sensible use case we
>>>> want to support -- if you have to do something stupid to get a
>>>> performance regression, then as far as I'm concerned it's not a
>>>> problem.
>>>> Or to put it a different way: If we pin 10 vcpus to a single pcpu and
>>>> then pound them all with posted interrupts, and there is *no*
>>>> significant performance regression, then that will conclusively prove
>>>> that the theoretical performance regression is of no concern, and we
>>>> can enable PI by default.
>>> The point isn't the pinning. The point is what pCPU they're on when
>>> going to sleep. And that could involve quite a few more than just
>>> 10 vCPU-s, provided they all sleep long enough.
>>> And the "theoretical performance regression is of no concern" is
>>> also not a proper way of looking at it, I would say: Even if such
>>> a situation would happen extremely rarely, if it can happen at all,
>>> it would still be a security issue.
>> What I'm trying to get at is -- exactly what situation?  What actually
>> constitutes a problematic interrupt latency / interrupt processing
>> workload, how many vcpus must be sleeping on the same pcpu to actually
>> risk triggering that latency / workload, and how feasible is it that
>> such a situation would arise in a reasonable scenario?
>> If 200us is too long, and it only takes 3 sleeping vcpus to get there,
>> then yes, there is a genuine problem we need to try to address before we
>> turn it on by default.  If we say that up to 500us is tolerable, and it
>> takes 100 sleeping vcpus to reach that latency, then this is something I
>> don't really think we need to worry about.
>> "I think something bad may happen" is really difficult to work with.
> I understand that, but coming up with proper numbers here isn't
> easy. Fact is - it cannot be excluded that on a system with
> hundreds of pCPU-s and thousands of vCPU-s, that all vCPU-s
> would at some point pile up on one pCPU's list.

So it's already the case that when a vcpu is woken, it is inserted into
the runqueue by priority order, both for credit1 and credit2; and this
is an insertion sort, so the amount of time it takes to do the insert is
expected to be the time it takes to traverse half of the list.  This
isn't an exact analog, because in that case it's the number of
*runnable* vcpus, not the number of *blocked* vcpus; but it demonstrates
the point that 1) we already have code that assumes that walking a list
of vcpus per pcpu is a reasonably bounded thing 2) we have years of no
major performance problems reported to back that assumption up.
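
To make that cost concrete, the insert is essentially a linear walk to
find the first lower-priority entry.  This is a simplified user-space
sketch of that kind of priority-ordered insert (hypothetical structures,
not the actual csched/csched2 code):

```c
#include <stddef.h>

/* Hypothetical simplified run-queue entry; the real Xen structures
 * carry much more state than a bare priority. */
struct rq_entry {
    int prio;               /* higher value = higher priority */
    struct rq_entry *next;
};

/* Insert in priority order: walk the list until the first entry with
 * lower priority.  Cost is O(n) in the entries already queued -- on
 * average half the list, as described above. */
static void rq_insert(struct rq_entry **head, struct rq_entry *v)
{
    struct rq_entry **pp = head;

    while (*pp && (*pp)->prio >= v->prio)
        pp = &(*pp)->next;
    v->next = *pp;
    *pp = v;
}
```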

I guess the slight difference there is that it's already well-understood
that too many *active* vcpus will overload your system and slow things
down; in the case of the pi wake-ups, the problem is that too many
*inactive* vcpus will overload your system and slow things down.

Still -- I have a hard time constructing in my mind a scenario where
huge numbers of idle vcpus for some reason decide to congregate on a
single pcpu.

Suppose we had 1024 pcpus, and 1023 VMs each with 5 vcpus, of which 1
was spinning at 100% and the other 4 were idle.  I'm not seeing a
situation where any of the schedulers put all (1023*4) idle vcpus on a
single pcpu.

For the credit1 scheduler, I'm basically positive that it can't happen
even once, even by chance.  You'd never be able to accrete more than a
dozen vcpus on that one pcpu before they were stolen away.

For the credit2 scheduler, it *might* be possible that if the busy vcpu
on each VM never changes (which itself is pretty unlikely), *and* the
sum of the "load" for all (1023*4) idle vcpus was less than 1 (i.e.,
idle vcpus took less than 0.02% of the cpu time), then you *might*
possibly after a long time end up at a situation where you had all vcpus
on a single pcpu.  But that "accretion" process would take a very long
time; and as soon as any vcpu had a brief "spike" above the "0.02%", a
whole bunch of them get moved somewhere else.
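
For reference, the kind of per-vcpu load tracking I mean is a decaying
average of recent runtime.  Here is a fixed-point sketch in that spirit;
the shift and decay constants are made up for illustration, not the
scheduler's actual parameters:

```c
#include <stdint.h>

/* Illustrative fixed-point "load" in the spirit of credit2's per-vcpu
 * load tracking.  LOAD_ONE represents a vcpu that ran 100% of the
 * time; the 1/16 decay factor is an assumption for this sketch. */
#define LOAD_FRAC_BITS 18
#define LOAD_ONE       ((uint32_t)1 << LOAD_FRAC_BITS)

/* new_load = old_load * (15/16) + sample * (1/16), where sample is
 * LOAD_ONE if the vcpu ran this period and 0 if it was idle.  An idle
 * vcpu's load decays toward 0; a brief "spike" of activity pushes it
 * up quickly, which is what triggers rebalancing decisions. */
static uint32_t load_update(uint32_t load, int ran)
{
    uint32_t sample = ran ? LOAD_ONE : 0;

    return load - (load >> 4) + (sample >> 4);
}
```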

And in any case, are you really going to have 1023 devices so that you
can hand one to each of those 1023 guests?  Because it's only vcpus of
VMs *which have a device assigned* that end up on the block list.

If I may go "meta" for a moment here -- this is exactly what I'm talking
about with "Something bad may happen" being difficult to work with.
Rather than you spelling out exactly the situation which you think may
happen, (which I could then either accept or refute on its merits) *I*
am now spending a lot of time and effort trying to imagine what
situations you may be talking about and then refuting them myself.

If you have concerns, you need to make those concerns concrete, or at
least set clear criteria for how someone could go about addressing your
concerns.  And yes, it is *your* job, as the person doing the objecting
(and even moreso as the x86 maintainer), to make your concerns explicit
and/or set those criteria, and not Feng's job (or even my job) to try to
guess what it is that might make you happy.

> How many would be tolerable on a single list depends upon host
> characteristics, so a fixed number won't do anyway. 

Sure, but if we can run through a list of 100 vcpus in 25us on a typical
server, then we can be pretty certain 100 vcpus will never exceed 500us
on basically any server.

On the other hand, if 50 vcpus takes 500us on whatever server Feng uses
for his tests, then yes, we don't really have enough "slack" to be sure
that we won't run into problems at some point.

But at this point we're just pulling numbers out of the air -- when we
have actual data we can make a better judgement about what might or
might not be acceptable.

> Hence I
> think the better approach, instead of improving lookup, is to
> distribute vCPU-s evenly across lists. Which in turn would likely
> require those lists to no longer be tied to pCPU-s, an aspect I
> had already suggested during review. As soon as distribution
> would be reasonably even, the security concern would vanish:
> Someone placing more vCPU-s on a host than that host can
> handle is responsible for the consequences. Quite contrary to
> someone placing more vCPU-s on a host than a single pCPU can
> reasonably handle in an interrupt handler.

I don't really understand your suggestion.  The PI interrupt is
necessarily tied to a specific pcpu; unless we start having multiple PI
interrupts, we only have as many interrupts as we have pcpus, right?
Are you saying that rather than put vcpus on the list of the pcpu it's
running on, we should set the interrupt to that of an arbitrary pcpu
that happens to have room on its list?
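
If that *is* the idea, then the list-selection part at least is simple
enough.  A sketch of one possible policy (purely illustrative, and it
deliberately ignores the open question above of how the notification
interrupt would be routed):

```c
#include <limits.h>

#define NR_PI_LISTS 8

/* Per-list counts of blocked vcpus (illustrative only; the real
 * structure would need locking and actual list heads). */
static unsigned int pi_list_len[NR_PI_LISTS];

/* Place a blocking vcpu on the currently shortest list, so that no
 * single list -- and hence no single wakeup handler -- ever has to
 * walk a disproportionate share of the blocked vcpus. */
static unsigned int pi_pick_list(void)
{
    unsigned int i, best = 0, best_len = UINT_MAX;

    for (i = 0; i < NR_PI_LISTS; i++) {
        if (pi_list_len[i] < best_len) {
            best = i;
            best_len = pi_list_len[i];
        }
    }
    pi_list_len[best]++;
    return best;
}
```

But until the interrupt-routing question is answered, this only solves
the easy half of the problem.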

