
Re: [Xen-devel] [RFC PATCH 21/24] ARM: vITS: handle INVALL command



On 03/12/16 00:46, Stefano Stabellini wrote:
> On Fri, 2 Dec 2016, Andre Przywara wrote:
>> Hi,

Hi Stefano,

I started to answer this email some days ago, but then spent some time
on actually implementing what I suggested, hence the delay ...

>>
>> sorry for chiming in late ....
>>
>> I've been spending some time thinking about this, and I think we can in
>> fact get away without ever propagating commands from domains to the host.
>>
>> I made a list of all commands that possibly require host ITS command
>> propagation. There are two groups:
>> 1: enabling/disabling LPIs: INV, INVALL
>> 2: mapping/unmapping devices/events: DISCARD, MAPD, MAPTI.
>>
>> The second group can be handled by mapping all required devices up
>> front; I will elaborate on that in a different email.
>>
>> For the first group, read below ...
>>
>> On 01/12/16 01:19, Stefano Stabellini wrote:
>>> On Fri, 25 Nov 2016, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> On 18/11/16 18:39, Stefano Stabellini wrote:
>>>>> On Fri, 11 Nov 2016, Stefano Stabellini wrote:
>>>>>> On Fri, 11 Nov 2016, Julien Grall wrote:
>>>>>>> On 10/11/16 20:42, Stefano Stabellini wrote:
>>>>>>> That's why the approach we had in the previous series was "host ITS
>>>>>>> commands should be limited when emulating guest ITS commands". From
>>>>>>> what I recall, in that series the host and guest LPIs were fully
>>>>>>> separated (enabling a guest LPI was not enabling the host LPI).
>>>>>>
>>>>>> I am interested in reading what Ian suggested to do when the physical
>>>>>> ITS queue is full, but I cannot find anything specific about it in the
>>>>>> doc.
>>>>>>
>>>>>> Do you have a suggestion for this?
>>>>>>
>>>>>> The only things that come to mind right now are:
>>>>>>
>>>>>> 1) check if the ITS queue is full and busy loop until it is not
>>>>>>    (spin_lock style)
>>>>>> 2) check if the ITS queue is full and sleep until it is not
>>>>>>    (mutex style)
>>>>>
>>>>> Another, probably better idea, is to map all pLPIs of a device when the
>>>>> device is assigned to a guest (including Dom0). This is what was written
>>>>> in Ian's design doc. The advantage of this approach is that Xen doesn't
>>>>> need to take any actions on the physical ITS command queue when the
>>>>> guest issues virtual ITS commands, therefore completely solving this
>>>>> problem at the root. (Although I am not sure about enable/disable
>>>>> commands: could we avoid issuing enable/disable on pLPIs?)
>>>>
>>>> In the previous design document (see [1]), the pLPIs are enabled when the
>>>> device is assigned to the guest. This means that it is not necessary to
>>>> send commands for that. It also means we may receive a pLPI before the
>>>> associated vLPI has been configured.
>>>>
>>>> That said, given that LPIs are edge-triggered, there is no deactivate
>>>> state (see 4.1 in ARM IHI 0069C). So as soon as the priority drop is
>>>> done, the same LPI could potentially be raised again. This could
>>>> generate a storm.
>>>
>>> Thank you for raising this important point. You are correct.
>>>
>>>> The priority drop is necessary if we don't want to block the reception of
>>>> interrupts for the current physical CPU.
>>>>
>>>> What I am more concerned about is that this problem can also happen
>>>> during normal operation (i.e. the pLPI is bound to a vLPI and the vLPI
>>>> is enabled). For edge-triggered interrupts, there is no way to prevent
>>>> them from firing again. Maybe it is time to introduce rate-limited
>>>> interrupts for ARM. Any opinions?
>>>
>>> Yes. It could be as simple as disabling the pLPI when Xen receives a
>>> second pLPI before the guest EOIs the first corresponding vLPI, which
>>> shouldn't happen in normal circumstances.
>>>
>>> We need a simple per-LPI inflight counter, incremented when a pLPI is
>>> received, decremented when the corresponding vLPI is EOIed (the LR is
>>> cleared).
>>>
>>> When the counter > 1, we disable the pLPI and request a maintenance
>>> interrupt for the corresponding vLPI.
>>
>> So why do we need a _counter_? This is about edge-triggered interrupts,
>> so I think we can just accumulate all of them into one.
> 
> The counter is not to re-inject the same amount of interrupts into the
> guest, but to detect interrupt storms.

I was wondering whether an interrupt "storm" could already be defined as
"receiving an LPI while there is already one pending (in the guest's
virtual pending table) and the vLPI is disabled by the guest". I admit that
declaring two interrupts a storm is a bit of a stretch, but the guest
probably had a reason for disabling it even though it fires, so Xen
should just follow suit.
The only difference is that we don't do it _immediately_ when the guest
tells us (via INV), but only if needed (when the LPI actually fires).
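
To illustrate, a minimal sketch in C of that rule (helper names like
vcpu_pending_table() and disable_host_lpi() are made up here, not
existing code):

/* Called when a pLPI fires whose vLPI is currently disabled by the guest. */
static void handle_disabled_vlpi(struct vcpu *v, uint32_t vlpi, uint32_t plpi)
{
    /*
     * If the pending bit was already set, this LPI has fired at least
     * twice while the guest had it disabled: treat that as a "storm" and
     * mask it on the host. It gets re-enabled once the guest issues an
     * INV/INVALL that enables the vLPI again.
     */
    if ( test_and_set_bit(vlpi, vcpu_pending_table(v)) )
        disable_host_lpi(plpi);
}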

>> So here is what I think:
>> - We use the guest provided pending table to hold a pending bit for each
>> VLPI. We can unmap the memory from the guest, since software is not
>> supposed to access this table as per the spec.
>> - We use the guest provided property table, without trapping it. There
>> is nothing to be "validated" in that table, since it's a really tight
>> encoding and every value written in there is legal. We only look at bit
>> 0 for this exercise here anyway.
> 
> I am following...
> 
> 
>> - Upon reception of a physical LPI, we look it up to find the VCPU and
>> virtual LPI number. This is what we need to do anyway and it's a quick
>> two-level table lookup at the moment.
>> - We use the VCPU's domain and the VLPI number to index the guest's
>> property table and read the enabled bit. Again a quick table lookup.
> 
> They should both be O(2), correct?

The second is even O(1). And even the first one could be a single table,
if desperately needed.
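
For reference, the two-level pLPI lookup I have in mind looks roughly
like this (purely illustrative, field names and sizes may well change):

/* One entry per host LPI, filled in at MAPTI time. */
union host_lpi {
    uint64_t data;
    struct {
        uint32_t virt_lpi;      /* the vLPI this pLPI maps to */
        uint16_t dom_id;        /* owning domain, DOMID_INVALID if unassigned */
        uint16_t vcpu_id;       /* target VCPU within that domain */
    };
};

#define HOST_LPIS_PER_PAGE  (PAGE_SIZE / sizeof(union host_lpi))
#define MAX_HOST_LPI_PAGES  1024                /* illustrative limit */

static union host_lpi *lpi_data[MAX_HOST_LPI_PAGES];    /* first level */

/* Two dependent loads to get from a host LPI to (domain, VCPU, vLPI). */
static union host_lpi *host_lpi_entry(uint32_t plpi)
{
    union host_lpi *page = lpi_data[plpi / HOST_LPIS_PER_PAGE];

    return page ? &page[plpi % HOST_LPIS_PER_PAGE] : NULL;
}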

>>  - If the VLPI is enabled, we EOI it on the host and inject it.
>>  - If the VLPI is disabled, we set the pending bit in the VCPU's
>>    pending table and EOI on the host - to allow other IRQs.
>> - On a guest INV command, we check whether that vLPI is now enabled:
>>  - If it is disabled now, we don't need to do anything.
>>  - If it is enabled now, we check the pending bit for that VLPI:
>>   - If it is 0, we don't do anything.
>>   - If it is 1, we inject the VLPI and clear the pending bit.
>> - On a guest INVALL command, we just need to iterate over the virtual
>> LPIs.
> 
> Right, much better.
> 
> 
>> If you look at the conditions above, the only immediate action is
>> when a VLPI gets enabled _and_ its pending bit is set. So we can do
>> 64-bit read accesses over the whole pending table to find non-zero words
>> and thus set bits, which should be rare in practice. We can store the
>> highest mapped VLPI to avoid iterating over the whole of the table.
>> Ideally the guest has no direct control over the pending bits, since
>> this is what the device generates. Also we limit the number of VLPIs in
>> total per guest anyway.
> 
> I wonder if we could even use a fully packed bitmask with only the
> pending bits, so 1 bit per vLPI, rather than 1 byte per vLPI. That would
> be a nice improvement.

The _pending_ table is exactly that: one bit per VLPI. So by doing a
64-bit read we cover 64 VLPIs. And normally if an LPI fires it will
probably be enabled (otherwise the guest would have disabled it in the
device), so we inject it and don't need this table. It's
really just for storing the pending status should an LPI arrive while
the guest had _disabled_ it. I assume this is rather rare, so the table
will mostly be empty: that's why I expect most reads to be 0 and the
iteration of the table to be very quick. As an additional optimization
we could store the highest and lowest virtually pending LPI, to avoid
scanning the whole table.

We can't do so much about the property table, though, because its layout
is described in the spec - in contrast to the ITS tables, which are IMPDEF.
But as we only need to do something if the LPI is _both_ enabled _and_
pending, scanning the pending table gives us a quite good filter
already. For the few LPIs that hit here, we can just access the right
byte in the property table.
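
So the INVALL walk would look roughly like this (again only a sketch,
assuming helpers like vcpu_pending_table(), guest_proptable(),
domain_max_vlpis() and vgic_inject_lpi()):

#define LPI_PROP_ENABLED    (1U << 0)   /* bit 0 of a property table entry */

/*
 * Scan the pending table in 64-bit chunks and inject every vLPI that is
 * both pending and now enabled in the guest's property table.
 */
static void vgic_handle_invall(struct vcpu *v)
{
    uint64_t *pend = vcpu_pending_table(v);             /* 1 bit per vLPI */
    const uint8_t *prop = guest_proptable(v->domain);   /* 1 byte per vLPI */
    unsigned int nr_vlpis = domain_max_vlpis(v->domain);
    unsigned int i, bit;

    for ( i = 0; i < nr_vlpis / 64; i++ )
    {
        uint64_t pending = pend[i];

        if ( !pending )         /* the common case: nothing pending here */
            continue;

        for ( bit = 0; bit < 64; bit++ )
        {
            uint32_t vlpi = i * 64 + bit;

            if ( !(pending & (1ULL << bit)) )
                continue;

            /* Only pending _and_ enabled vLPIs require any action. */
            if ( prop[vlpi] & LPI_PROP_ENABLED )
            {
                clear_bit(bit, &pend[i]);
                vgic_inject_lpi(v, vlpi);
            }
        }
    }
}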

>> If that still sounds like a DOS vector, we could additionally rate-limit
>> INVALLs, and/or track additions to the pending table after the last
>> INVALL: if there haven't been any new pending bits since the last scan,
>> INVALL is a NOP.
>>
>> Does that make sense so far?
> 
> It makes sense. It should be OK.
> 
> 
>> So that just leaves us with this IRQ storm issue, which I am thinking
>> about now. But I guess this is not a show stopper given we can disable
>> the physical LPI if we sense this situation.
> 
> That is true and it's exactly what we should do.
> 
> 
>>> When we receive the maintenance interrupt and we clear the LR of the
>>> vLPI, Xen should re-enable the pLPI.

So I was wondering why we would need to wait for the guest to actually
EOI it.
Can't we end the interrupt storm condition at the moment the guest
enables the interrupt? LPIs are edge-triggered, so a storm in the
past is easily merged into a single guest LPI once the guest enables it
again. From then on we inject every triggered LPI into the guest.

This special handling for the interrupt storm just stems from the fact
that we have to keep LPIs enabled at the h/w interrupt controller level,
despite the guest having disabled them on its own _virtual_ GIC. So once
the guest enables it again, we are in line with the current GICv2/GICv3,
aren't we? Do we have interrupt storm detection/prevention at the moment?
And please keep in mind that LPIs are _always_ edge-triggered: so once
we EOI an LPI on the host, this "very same" LPI is gone from a h/w GIC
point of view; the next incoming interrupt must have been triggered by a
new interrupt condition in the device (a new network packet, for
instance). In contrast to GICv2, this applies to _every_ LPI.
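
In code, handling a guest INV for a single vLPI would then be as simple
as this (same hypothetical helpers as above):

/* Guest issued INV for this vLPI: we only need to act if it is now enabled. */
static void vgic_handle_inv(struct vcpu *v, uint32_t vlpi, uint32_t plpi)
{
    if ( !(guest_proptable(v->domain)[vlpi] & LPI_PROP_ENABLED) )
        return;                 /* still disabled: nothing to do */

    /* Any storm is over now: unmask the pLPI on the host again. */
    enable_host_lpi(plpi);

    /* A past storm collapses into one injected LPI (edge-triggered). */
    if ( test_and_clear_bit(vlpi, vcpu_pending_table(v)) )
        vgic_inject_lpi(v, vlpi);
}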

So I am not sure we should really care _too_ much about this (apart from
the "guest has disabled it" part): Once we assign a device to a guest,
we lose some control over the machine anyway and at least trust the
device to not completely block the system. I don't see how the ITS
differs in that respect from the GICv3/GICv2.


A quick update on my side: I implemented the scheme I described in my
earlier mail now and it boots to the Dom0 prompt on a fastmodel with an ITS:
- On receiving the PHYSDEVOP_manage_pci_add hypercall in
xen/arch/arm/physdev.c, we MAPD the device on the host, MAPTI a bunch of
interrupts and enable them (a rough sketch of this step follows after
this list). We keep them unassigned in our host pLPI->VLPI table, so we
discard them should they fire.
This hypercall is issued by Dom0 Linux before bringing up any PCI
devices, so it works even for Dom0 without any Linux changes. For DomUs
with PCI passthrough Dom0 is expected to issue this hypercall on behalf
of the to-be-created domain.
- When a guest (be it Dom0 or DomU) actually maps an LPI (MAPTI), we
just enter the virtual LPI number and the target VCPU in our pLPI->vLPI
table and are done. Should it fire now, we know where to inject it, but
we consult the enabled bit in the guest's property table before doing so.
- When a guest (be it Dom0 or DomU) enables or disables an interrupt, we
don't do much, as we already consult the enable bit every time we want to
inject. The only thing I actually do is to inject an LPI if there is a
virtual LPI pending and the LPI is now enabled.
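
For the first step above, the host side mapping looks roughly like this
(the command helpers and allocate_host_lpi() are invented names, the real
code will differ):

/* Map a newly reported PCI device on the host ITS and pre-allocate its LPIs. */
static int map_host_device(struct host_its *its, uint32_t devid,
                           unsigned int nr_events)
{
    unsigned int i;

    its_send_cmd_mapd(its, devid, nr_events);       /* MAPD on the host */

    for ( i = 0; i < nr_events; i++ )
    {
        uint32_t plpi = allocate_host_lpi();

        its_send_cmd_mapti(its, devid, i, plpi);    /* MAPTI: event -> pLPI */
        set_host_lpi_enabled(plpi);                 /* bit 0 in host prop table */
        its_send_cmd_inv(its, devid, i);            /* tell the ITS about it */

        /*
         * Leave the entry unassigned in our pLPI->vLPI table for now, so
         * this LPI is simply discarded should it fire before a guest has
         * MAPTIed it.
         */
        host_lpi_entry(plpi)->dom_id = DOMID_INVALID;
    }

    return 0;
}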

I will spend some time next week on updating the design document,
describing the new approach. I hope it becomes a bit clearer then.

Cheers,
Andre.

>>> Given that the state of the LRs is sync'ed before calling gic_interrupt,
>>> we can be sure to know exactly in what state the vLPI is at any given
>>> time. But for this to work correctly, it is important to configure the
>>> pLPI to be delivered to the same pCPU running the vCPU which handles
>>> the vLPI (as it is already the case today for SPIs).
>>
>> Why would that be necessary?
> 
> Because the state of the LRs of other pCPUs won't be up to date: we
> wouldn't know for sure whether the guest EOI'ed the vLPI or not.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

