
Re: [Xen-devel] [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling




> -----Original Message-----
> From: Tian, Kevin
> Sent: Thursday, December 10, 2015 7:40 PM
> To: Wu, Feng <feng.wu@xxxxxxxxx>; xen-devel@xxxxxxxxxxxxx
> Cc: Keir Fraser <keir@xxxxxxx>; Jan Beulich <jbeulich@xxxxxxxx>; Andrew
> Cooper <andrew.cooper3@xxxxxxxxxx>; George Dunlap
> <george.dunlap@xxxxxxxxxxxxx>; Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Subject: RE: [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling
> 
> > From: Wu, Feng
> > Sent: Thursday, December 03, 2015 4:36 PM
> >
> > This is the core logic handling for VT-d posted-interrupts. Basically it
> > deals with how and when to update posted-interrupts in the following
> > scenarios:
> > - vCPU is preempted
> > - vCPU is put to sleep
> > - vCPU is blocked
> >
> > When a vCPU is preempted or put to sleep, we update the posted-interrupts
> > during scheduling by introducing two new architectural scheduler hooks:
> > vmx_pi_switch_from() and vmx_pi_switch_to(). When vCPU is blocked, we
> > introduce a new architectural hooks: arch_vcpu_block() to update
> 
> hooks -> hook
> 
> > posted-interrupts descriptor.
> >
> > Besides that, before VM-entry, we make sure the 'NV' field is set to
> > 'posted_intr_vector' and the vCPU is not on any blocking list, which is
> > needed when the vCPU is running in non-root mode. The reason we do this
> > check is that we change the posted-interrupt descriptor in vcpu_block(),
> > but we don't change it back in vcpu_unblock() or when vcpu_block()
> > returns directly due to event delivery (in fact, we don't need to do it
> > in those two places, which is why we do it before VM-entry).
> >
> > When we handle the lazy context switch for the following two scenarios:
> > - Preempted by a tasklet, which runs in an idle context.
> > - The previous vCPU is offline and there is no runnable vCPU in the
> >   run queue.
> > we don't change the 'SN' bit in the posted-interrupt descriptor. This
> > may incur spurious PI notification events, but since a PI notification
> > event is only sent when 'ON' is clear, and hardware sets 'ON' once the
> > notification is sent, no further notification events are generated
> > until 'ON' is cleared again. Besides that, spurious PI notification
> > events happen from time to time in the Xen hypervisor anyway, e.g. when
> > a guest traps to Xen and a PI notification event arrives, there is
> > nothing Xen actually needs to do about it; the interrupts will be
> > delivered to the guest at the next VM entry.
> >
> > CC: Keir Fraser <keir@xxxxxxx>
> > CC: Jan Beulich <jbeulich@xxxxxxxx>
> > CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > CC: Kevin Tian <kevin.tian@xxxxxxxxx>
> > CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> > CC: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> > Suggested-by: Yang Zhang <yang.z.zhang@xxxxxxxxx>
> > Suggested-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> > Suggested-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> > Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
> > Signed-off-by: Feng Wu <feng.wu@xxxxxxxxx>
> > ---
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index 6c2b512..3368cf2 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -7019,6 +7019,12 @@ void hvm_domain_soft_reset(struct domain *d)
> >      hvm_destroy_all_ioreq_servers(d);
> >  }
> >
> > +void arch_vcpu_block(struct vcpu *v)
> > +{
> > +    if ( v->arch.vcpu_block )
> > +        v->arch.vcpu_block(v);
> > +}
> > +
> >  /*
> >   * Local variables:
> >   * mode: C
> > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> > index 000d06e..0f23fce 100644
> > --- a/xen/arch/x86/hvm/vmx/vmcs.c
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> > @@ -676,6 +676,8 @@ int vmx_cpu_up(void)
> >      if ( cpu_has_vmx_vpid )
> >          vpid_sync_all();
> >
> > +    vmx_pi_per_cpu_init(cpu);
> > +
> >      return 0;
> >  }
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 39dc500..0d9462e 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -83,7 +83,131 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
> >  static void vmx_invlpg_intercept(unsigned long vaddr);
> >  static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
> >
> > +/*
> > + * We maintain a per-CPU linked-list of vCPU, so in PI wakeup handler we
> 
> vCPU -> vCPUs
> 
> > + * can find which vCPU should be woken up.
> > + */
> > +static DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
> > +static DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
> > +
> >  uint8_t __read_mostly posted_intr_vector;
> > +uint8_t __read_mostly pi_wakeup_vector;
> > +
> > +void vmx_pi_per_cpu_init(unsigned int cpu)
> > +{
> > +    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
> > +    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +}
> > +
> > +void vmx_vcpu_block(struct vcpu *v)
> > +{
> > +    unsigned long flags;
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +    if ( !has_arch_pdevs(v->domain) )
> > +        return;
> > +
> > +    ASSERT(v->arch.hvm_vmx.pi_block_cpu == NR_CPUS);
> > +
> > +    /*
> > +     * The vCPU is blocking, we need to add it to one of the per-pCPU lists.
> > +     * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
> > +     * the per-CPU list, we also save it to posted-interrupt descriptor and
> > +     * make it as the destination of the wake-up notification event.
> 
> the 2nd piece - "we also save it to posted-interrupt descriptor" is not
> reflected within this function. Do you mean "we have saved it to..."
> or "we will save it later to..." in other places?
> 
> > +     */
> > +    v->arch.hvm_vmx.pi_block_cpu = v->processor;
> > +
> > +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> > +                      v->arch.hvm_vmx.pi_block_cpu), flags);
> > +    list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> > +                  &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
> > +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> > +                           v->arch.hvm_vmx.pi_block_cpu), flags);
> > +
> > +    ASSERT(!pi_test_sn(pi_desc));
> > +
> > +    /*
> > +     * We don't need to set the 'NDST' field, since it should point to
> > +     * the same pCPU as v->processor.
> > +     */
> > +
> > +    write_atomic(&pi_desc->nv, pi_wakeup_vector);
> > +}
> > +
> > +static void vmx_pi_switch_from(struct vcpu *v)
> > +{
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) ||
> > +         test_bit(_VPF_blocked, &v->pause_flags) )
> > +        return;
> > +
> > +    /*
> > +     * The vCPU has been preempted or went to sleep. We don't need to send
> > +     * notification event to a non-running vcpu, the interrupt information
> 
> I might be wrong on this, but my gut-feeling is that 'non-running' includes
> both runnable and blocked vcpu...

Yes, I should use "We don't need to send a notification event to a runnable
or sleeping vcpu ..."
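
For completeness, here is roughly what the 'SN' helpers used above look
like (a minimal sketch; it assumes the POSTED_INTR_SN bit definition and
the pi_desc layout introduced earlier in this series):

static inline void pi_set_sn(struct pi_desc *pi_desc)
{
    /*
     * Suppress notification events: interrupts are still posted to
     * PIR, but no notification vector is sent to the pCPU.
     */
    set_bit(POSTED_INTR_SN, &pi_desc->control);
}

static inline void pi_clear_sn(struct pi_desc *pi_desc)
{
    /* Re-allow notification events before the vCPU runs again. */
    clear_bit(POSTED_INTR_SN, &pi_desc->control);
}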

> 
> > +     * will be delivered to it before VM-ENTRY when the vcpu is scheduled
> > +     * to run next time.
> > +     */
> > +    pi_set_sn(pi_desc);
> > +}
> > +
> > +static void vmx_pi_switch_to(struct vcpu *v)
> > +{
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
> > +        return;
> > +
> > +    if ( x2apic_enabled )
> > +        write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
> > +    else
> > +        write_atomic(&pi_desc->ndst,
> > +                     MASK_INSR(cpu_physical_id(v->processor),
> > +                     PI_xAPIC_NDST_MASK));
> > +
> > +    pi_clear_sn(pi_desc);
> > +}
> > +
> > +static void vmx_pi_state_to_normal(struct vcpu *v)
> > +{
> > +    unsigned long flags;
> > +    unsigned int pi_block_cpu;
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
> > +        return;
> > +
> > +    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> > +
> > +    /*
> > +     * Set 'NV' field back to posted_intr_vector, so the
> > +     * Posted-Interrupts can be delivered to the vCPU when
> > +     * it is running in non-root mode.
> > +     */
> > +    if ( pi_desc->nv != posted_intr_vector )
> > +        write_atomic(&pi_desc->nv, posted_intr_vector);
> > +
> > +    /* the vCPU is not on any blocking list. */
> > +    pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
> > +    if ( pi_block_cpu == NR_CPUS )
> > +        return;
> > +
> > +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu), flags);
> > +
> > +    /*
> > +     * v->arch.hvm_vmx.pi_block_cpu == NR_CPUS here means the vCPU was
> > +     * removed from the blocking list while we were acquiring the lock.
> > +     */
> > +    if ( v->arch.hvm_vmx.pi_block_cpu == NR_CPUS )
> > +    {
> > +        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> > +                               flags);
> > +        return;
> > +    }
> > +
> > +    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> > +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> > +                           flags);
> > +}
> >
> >  static int vmx_domain_initialise(struct domain *d)
> >  {
> > @@ -106,10 +230,17 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> >
> >      spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
> >
> > +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > +
> > +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> > +
> >      v->arch.schedule_tail    = vmx_do_resume;
> >      v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
> >      v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;
> >
> > +    if ( iommu_intpost )
> > +        v->arch.vcpu_block = vmx_vcpu_block;
> > +
> >      if ( (rc = vmx_create_vmcs(v)) != 0 )
> >      {
> >          dprintk(XENLOG_WARNING,
> > @@ -734,6 +865,7 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
> >      vmx_save_guest_msrs(v);
> >      vmx_restore_host_msrs();
> >      vmx_save_dr(v);
> > +    vmx_pi_switch_from(v);
> >  }
> >
> >  static void vmx_ctxt_switch_to(struct vcpu *v)
> > @@ -758,6 +890,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
> >
> >      vmx_restore_guest_msrs(v);
> >      vmx_restore_dr(v);
> > +    vmx_pi_switch_to(v);
> >  }
> >
> >
> > @@ -2014,6 +2147,40 @@ static struct hvm_function_table __initdata vmx_function_table = {
> >      .altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
> >  };
> >
> > +/* Handle VT-d posted-interrupt when VCPU is blocked. */
> > +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> > +{
> > +    struct arch_vmx_struct *vmx, *tmp;
> > +    struct vcpu *v;
> > +    spinlock_t *lock = &per_cpu(pi_blocked_vcpu_lock, smp_processor_id());
> > +    struct list_head *blocked_vcpus =
> > +                       &per_cpu(pi_blocked_vcpu, smp_processor_id());
> > +    LIST_HEAD(list);
> 
> what's it?

Oh, this should be removed. I missed this while removing some code in this
function from v9.

> 
> > +
> > +    ack_APIC_irq();
> > +    this_cpu(irq_count)++;
> > +
> > +    spin_lock(lock);
> > +
> > +    /*
> > +     * XXX: The length of the list depends on how many vCPUs are
> > +     * currently blocked on this specific pCPU. This may hurt the
> > +     * interrupt latency if the list grows too long.
> > +     */
> > +    list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocked_vcpu_list)
> > +    {
> > +        if ( pi_test_on(&vmx->pi_desc) )
> > +        {
> > +            list_del(&vmx->pi_blocked_vcpu_list);
> > +            vmx->pi_block_cpu = NR_CPUS;
> > +            v = container_of(vmx, struct vcpu, arch.hvm_vmx);
> > +            vcpu_unblock(v);
> > +        }
> > +    }
> > +
> > +    spin_unlock(lock);
> > +}
> > +
> 
> I may be missing something here. Who will clear the ON bit? test read?

There are two places where 'ON' can get cleared:
- In vmx_sync_pir_to_irr(), while syncing interrupt information from
PIR to IRR (see the sketch below).
- If the guest is running in non-root mode, the CPU hardware clears
'ON' when handling the notification event; no VM exit occurs in this case.
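
For reference, the first path looks roughly like this (a sketch of
vmx_sync_pir_to_irr(); the helpers pi_test_and_clear_on() and
pi_get_pir() are assumed from the current tree):

static void vmx_sync_pir_to_irr(struct vcpu *v)
{
    struct vlapic *vlapic = vcpu_vlapic(v);
    unsigned int group, i;
    DECLARE_BITMAP(pending_intr, NR_VECTORS);

    /* 'ON' is cleared here; if it wasn't set, nothing was posted. */
    if ( !pi_test_and_clear_on(&v->arch.hvm_vmx.pi_desc) )
        return;

    /* Copy the pending vectors from PIR into the vlapic's IRR. */
    for ( group = 0; group < ARRAY_SIZE(pending_intr); group++ )
        pending_intr[group] = pi_get_pir(&v->arch.hvm_vmx.pi_desc, group);

    for_each_set_bit(i, pending_intr, NR_VECTORS)
        vlapic_set_vector(i, &vlapic->regs->data[APIC_IRR]);
}

Once the vectors are in IRR, they get injected through the normal
interrupt delivery path at the next VM entry.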

Thanks,
Feng


> 
> It's unclear from design doc:
> 
> +Outstanding Notification (ON): Indicate if there is a notification event
> +outstanding (not processed by processor or software) for this Posted
> +Interrupt Descriptor. When this field is 0, hardware modifies it from 0
> +to 1 when generating a notification event, and the entity receiving the
> +notification event (processor or software) resets it as part of posted
> +interrupt processing.
> 
> 
> Thanks
> Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
