
Re: [Xen-devel] [PATCH for-4.13 v4 2/3] x86/passthrough: fix migration of MSI when using posted interrupts



On 14.11.2019 15:42, Roger Pau Monné wrote:
> On Thu, Nov 14, 2019 at 02:35:56PM +0100, Jan Beulich wrote:
>> On 13.11.2019 16:59, Roger Pau Monne wrote:
>>> +    for ( id = find_first_bit(vcpus, d->max_vcpus);
>>> +          id < d->max_vcpus;
>>> +          id = find_next_bit(vcpus, d->max_vcpus, id + 1) )
>>> +    {
>>> +        if ( d->vcpu[id] != current )
>>> +            vcpu_pause(d->vcpu[id]);
>>
>> Isn't this setting us up for a deadlock if two parties come here
>> for the same domain, and both on a vCPU belonging to that domain
>> (and with the opposite one's bit set in the bitmap)? But it looks
>> like d would never be the current domain here - you will want to
>> assert and comment on this, though. At that point the comparisons
>> against current can then go away as well afaict.
> 
> The above is true for syncs triggered by MSI changes that bounce to
> QEMU and then get forwarded to Xen as DOMCTLs, but AFAICT syncs that
> result from a vIO-APIC entry write (patch #3) will have v ==
> current.

Ah, yes. But let's please handle the needs of the vIO-APIC code in
that other patch. That'll also ...

> vIO-APIC writes however use the d->arch.hvm.irq_lock, so it's not
> possible to process multiple vCPUs' vIO-APIC accesses at the same
> time. I'm afraid I don't know which kind of assert should be added
> here. I could add a comment, but that seems fragile.

... help with the assertions to add. You'd put the strict one in
here, as suggested, and then relax it there as needed. One option
would be to require callers in the context of the current domain
to acquire the lock you mention (holding of which you could assert
here, in case current->domain == d).
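Roughly, as a sketch only (untested, and assuming the vIO-APIC path
from patch #3 indeed takes d->arch.hvm.irq_lock around the call):

    /*
     * Callers from a foreign domain must have paused (or be about to
     * pause) all destination vCPUs; same-domain callers would instead
     * be required to hold the vIO-APIC lock.
     */
    ASSERT(current->domain != d ||
           spin_is_locked(&d->arch.hvm.irq_lock));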

>>> @@ -420,20 +384,16 @@ int pt_irq_create_bind(
>>>          delivery_mode = MASK_EXTR(pirq_dpci->gmsi.gflags,
>>>                                    XEN_DOMCTL_VMSI_X86_DELIV_MASK);
>>>  
>>> -        dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>>> +        hvm_intr_get_dests(d, dest, dest_mode, delivery_mode, dest_vcpus);
>>> +        dest_vcpu_id = bitmap_weight(dest_vcpus, d->max_vcpus) != 1 ?
>>> +            -1 : find_first_bit(dest_vcpus, d->max_vcpus);
>>>          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>>>          spin_unlock(&d->event_lock);
>>>  
>>>          pirq_dpci->gmsi.posted = false;
>>>          vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
>>> -        if ( iommu_intpost )
>>> -        {
>>> -            if ( delivery_mode == dest_LowestPrio )
>>> -                vcpu = vector_hashing_dest(d, dest, dest_mode,
>>> -                                           pirq_dpci->gmsi.gvec);
>>> -            if ( vcpu )
>>> -                pirq_dpci->gmsi.posted = true;
>>> -        }
>>> +        if ( vcpu && iommu_intpost )
>>> +            pirq_dpci->gmsi.posted = true;
>>
>> One aspect that I'm curious about: How much posting opportunity do
>> we lose in practice by no longer posting when the guest uses lowest
>> priority mode with multiple destinations?
> 
> Linux seems to use dest_Fixed exclusively, and the same goes for
> FreeBSD.

That's for recent Linux. Go back to 3.0 (just as an example), and
you'll see it was different back then. But yes, our decision here
would of course be guided more by recent guest OS versions.

>>> @@ -442,6 +402,9 @@ int pt_irq_create_bind(
>>>              pi_update_irte(vcpu ? &vcpu->arch.hvm.vmx.pi_desc : NULL,
>>>                             info, pirq_dpci->gmsi.gvec);
>>>  
>>> +        if ( hvm_funcs.deliver_posted_intr )
>>> +            domain_sync_vlapic_pir(d, prev_vcpus);
>>
>> Accessing hvm_funcs here looks like a layering violation. This
>> wants either moving into the function or (seeing the other use)
>> abstracting away. Seeing the conditional here (and below) I also
>> notice that you calculate prev_vcpus in vain when there's no
>> interrupt posting in use.
> 
> I could indeed only fill prev_vcpus when posting is in use.
> 
>> I guess together with the variable size issue mentioned above a
>> possible solution would be:
>> - have one bitmap hanging off of pirq_dpci->gmsi,
>> - have one bitmap per pCPU,
>> - populate the new destination bits into the per-pCPU one,
>> - issue the PIR->IRR sync,
>> - exchange the per-pCPU and per-DPCI pointers.
>> You could then leave the pointers at NULL when no posting is to
>> be used, addressing the apparent layering violation here at the
>> same time.
> 
> Right, the above option avoids having to calculate the possible
> destinations twice (once on setup and once on teardown), however it
> expands the size of gmsi.

Well, size of dynamically allocated structures is clearly secondary
compared to size of objects we put on the stack. (It had bothered
me from the beginning that the multiple destination information was
simply discarded, yet not discarding it would likely have meant
allocating such bitmaps anyway.)
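To illustrate the exchange idea (a rough sketch only; gmsi.dest_map
and the per-pCPU scratch variable are invented names, and allocation
as well as the NULL handling for the non-posting case are left out):

    static DEFINE_PER_CPU(unsigned long *, dpci_dest_scratch);

    /* Populate the new destination set into this pCPU's scratch bitmap. */
    unsigned long *new_map = this_cpu(dpci_dest_scratch);

    hvm_intr_get_dests(d, dest, dest_mode, delivery_mode, new_map);

    /*
     * Sync PIR->IRR for the previous destinations, which at this
     * point are still recorded in the per-DPCI bitmap.
     */
    domain_sync_vlapic_pir(d, pirq_dpci->gmsi.dest_map);

    /*
     * Exchange the pointers: the DPCI now records the new
     * destinations, and the old bitmap becomes this pCPU's scratch
     * space for the next update.
     */
    this_cpu(dpci_dest_scratch) = pirq_dpci->gmsi.dest_map;
    pirq_dpci->gmsi.dest_map = new_map;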

> While here, I've also realized that interrupts injected using
> XEN_DMOP_inject_msi will also be posted, and I'm afraid there's no way
> to track and flush those unless we provide a posted-flush hypercall or
> some such, so that emulators can request a PIR flush when interrupts
> of fully emulated devices are reconfigured.
> 
> OTOH, maybe XEN_DMOP_inject_msi should pause the vCPU, set the bit in
> IRR and unpause?

Might be an option, yes. This op is - afaict - not for production
use anyway, so ought to be fine to be slow. (The pausing wouldn't
be needed when not posting the interrupt, iiuc.) Another option
(without having looked at the code) might be to suppress posting,
if that can be suitably requested through the interface layers.
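For reference, the pause based variant might look about like this
(just a sketch; v and vector stand for the destination vCPU and the
vector decoded from the MSI data):

    /*
     * With the vCPU paused, set the vector directly in the vIRR
     * instead of posting it, so nothing can be left pending in the
     * PIR that would later need flushing.
     */
    vcpu_pause(v);
    vlapic_set_vector(vector, &vcpu_vlapic(v)->regs->data[APIC_IRR]);
    vcpu_unpause(v);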

Jan
