
Re: [Xen-devel] [PATCH RFC] pass-through: sync pir to irr after msix vector been updated



On 13.09.2019 18:38, Joe Jin wrote:
> Hi Jan,
> 
> Thanks for your reply, see my reply in line please.
> 
> On 9/13/19 12:14 AM, Jan Beulich wrote:
>> On 12.09.2019 20:03, Joe Jin wrote:
>>> With below testcase, guest kernel reported "No irq handler for vector":
>>>   1). Passthrough mlx ib VF to 2 pvhvm guests.
>>>   2). Start rds-stress between 2 guests.
>>>   3). Scale down both guests' vcpus from 32 to 6 at the same time.
>>>
>>> After repeating the above test for several iterations, the guest kernel
>>> reported "No irq handler for vector", and IB traffic dropped to zero due
>>> to lost interrupts.
>>>
>>> When a vcpu goes offline, the kernel disables local IRQs, migrates the IRQ
>>> to another cpu, updates the MSI-X table, and re-enables IRQs. If a new
>>> interrupt arrives after local IRQs are disabled but before the MSI-X table
>>> has been updated, the interrupt still uses the old vector and destination
>>> cpu info, so when local IRQs are enabled again it is sent to the wrong cpu
>>> and vector.
>>>
>>> Syncing PIR to IRR after the MSI-X table has been updated appears to help
>>> with this issue.
>>
>> I'm having trouble making the connection, which quite possibly simply
>> means the description needs to be further extended: Sync-ing PIR to
>> IRR has nothing to do with a vector change. It would help if nothing
>> else caused this bitmap propagation, and an interrupt was lost (rather
>> than delivered through the wrong vector, or to the wrong CPU).
>> Furthermore with vector and destination being coupled, after a CPU has
>> been offlined this would generally mean
>> - if there was just a single destination permitted, lack of delivery
>>   altogether,
>> - if there were multiple destinations permitted, delivery to one of
>>   the other CPUs, at which point the vector would still be valid.
> 
> When a cpu goes offline, the guest kernel only migrates IRQs whose affinity
> is set to that cpu alone; for IRQs with multiple permitted destinations, the
> kernel does not do the migration (which would include updating the msi-x
> table with the new destination and vector).
> 
> After the IRQ migration, the kernel checks the IRR for every vector; if the
> APIC IRR bit is set, it retriggers the IRQ towards the new destination (see
> the sketch below). This is intended to avoid losing any interrupt.
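> 
> Roughly, the relevant logic in fixup_irqs() (arch/x86/kernel/irq.c; this is
> a simplified sketch, the exact code differs between kernel versions) looks
> like this:
> 
>     /*
>      * Walk this CPU's vector table after the migration and retrigger
>      * anything still pending in the local APIC IRR, so it gets re-sent
>      * to the new destination instead of being lost.
>      */
>     for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
>         struct irq_desc *desc;
>         struct irq_data *data;
>         struct irq_chip *chip;
>         unsigned int irr;
> 
>         if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
>             continue;
> 
>         irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
>         if (!(irr & (1 << (vector % 32))))
>             continue;
> 
>         desc = __this_cpu_read(vector_irq[vector]);
>         data = irq_desc_get_irq_data(desc);
>         chip = irq_data_get_irq_chip(data);
>         if (chip->irq_retrigger)
>             chip->irq_retrigger(data);
>     }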
> 
> But on Xen, after the msi-x table has been updated, nothing checks for a
> pending IRQ or notifies the guest kernel about one.
> 
>>
>> An interesting aspect would be on which CPU the log message was
>> observed, and how this correlates with the destination sets of the
>> CPUs that have got offlined. From there it would then further be
>> interesting to understand why the interrupt made it to that CPU,
>> since - as said - destination and vector get changed together, and
>> hence with things going wrong it would be of interest to know whether
>> the CPU receiving the IRQ is within the new destination set, or some
>> (random?) other one.
> 
> irq_retrigger() is called after the kernel has updated the vector; it sends
> any pending IRQ(s) to the new destination.
> 
> Here is the kernel log from when the issue happened; the guest kernel is
> 4.1. On a 4.14 guest it is almost the same, just without the "(irq -1)" due
> to kernel changes; the IRQ migration workflow is the same (fixup_irqs()):
> 
> Sep 12 20:26:46 localhost kernel: smpboot: CPU 17 is now offline
> Sep 12 20:26:46 localhost kernel: smpboot: CPU 18 is now offline
> Sep 12 20:26:46 localhost kernel: smpboot: CPU 19 is now offline
> Sep 12 20:26:47 localhost kernel: Broke affinity for irq 251
> Sep 12 20:26:47 localhost kernel: do_IRQ: 20.178 No irq handler for vector (irq -1)
> Sep 12 20:26:47 localhost kernel: smpboot: CPU 20 is now offline
> Sep 12 20:26:47 localhost kernel: smpboot: CPU 21 is now offline
> 
> From the above you can see the IRQ was sent to cpu 20, which was in the
> process of going offline.
> 
> The IRQ arrived at that cpu immediately when IRQs were re-enabled; only
> after the CPU has been fully offlined is "smpboot: CPU 20 is now offline"
> printed.
> 
> Call path in kernel as below:
> cpu_down()
>   |-> cpu_down_maps_locked()
>   |     _cpu_down
>   |       |-> __stop_machine
>   |             |-> stop_cpus()
>   |                   |->__stop_cpus()
>   |                        |- queue_stop_cpus_work() ---+
>   |-> __cpu_die()                                       |
>          |-> pr_info("CPU %u is now offline\n", cpu);   |
>      +--------------------------------------------------+
>      |
>      +
> multi_cpu_stop()
>   |-> local_save_flags
>   |-> take_cpu_down()
>   |      |-> __cpu_disable
>   |            |-> smp_ops.cpu_disable = xen_cpu_disable
>   |                  |-> cpu_disable_common
>   |                        |-> fixup_irqs <== IRQ migration.
>   |-> local_irq_restore()

Ah yes, this makes sense. You want to extend the description to
reflect some of the further explanation above.

>>> --- a/xen/drivers/passthrough/io.c
>>> +++ b/xen/drivers/passthrough/io.c
>>> @@ -412,6 +412,9 @@ int pt_irq_create_bind(
>>>                  pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
>>>                  pirq_dpci->gmsi.gflags = gflags;
>>>              }
>>> +
>>> +            if ( hvm_funcs.sync_pir_to_irr )
>>> +                hvm_funcs.sync_pir_to_irr(d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);
>>
>> If the need for this change can be properly explained, then it
>> still wants converting to alternative_vcall() - like the other
>> caller of this hook. Or perhaps even better, move vlapic.c's
>> wrapper (suitably renamed) into hvm.h, and use it here.
> 
> Yes, I agree. I'm not 100% sure about the change, so I sent it as an RFC.

And btw, please also attach a brief comment here, to clarify
why the syncing is needed precisely at this point.
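For example, a comment along these lines (wording purely illustrative):

    /*
     * The guest may have changed the vector/destination while it had
     * interrupts masked.  Sync PIR to IRR now so that an interrupt which
     * arrived with the old routing becomes visible in IRR, where the
     * guest's post-migration IRR check can notice and retrigger it,
     * instead of the interrupt being lost.
     */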

>> Additionally, the code setting pirq_dpci->gmsi.dest_vcpu_id
>> (right after your code insertion) allows for the field to be
>> invalid, which I think you need to guard against.
> 
> I think you mean multiple destinations, in which case it's -1?

The reasons why it might be -1 are irrelevant here, I think.
You need to handle the case both to avoid an out-of-bounds
array access and to make sure an IRR bit wouldn't still get
propagated too late in some special case.
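
As a rough illustration of both points (using alternative_vcall() and
guarding against an invalid dest_vcpu_id) - purely a sketch, not
necessarily the final form:

    /*
     * Hypothetical guard: avoid an out-of-bounds d->vcpu[] access when
     * dest_vcpu_id is invalid (-1).  Note that simply skipping the sync
     * in that case may still leave IRR propagation deferred, as said
     * above, so the invalid case needs further thought.
     */
    if ( hvm_funcs.sync_pir_to_irr &&
         pirq_dpci->gmsi.dest_vcpu_id >= 0 &&
         pirq_dpci->gmsi.dest_vcpu_id < d->max_vcpus )
        alternative_vcall(hvm_funcs.sync_pir_to_irr,
                          d->vcpu[pirq_dpci->gmsi.dest_vcpu_id]);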

Also - what about the respective other path in the function,
dealing with PT_IRQ_TYPE_PCI and PT_IRQ_TYPE_MSI_TRANSLATE? It
seems to me that there's the same chance of deferring IRR
propagation for too long?

Jan


 

