
Re: [Xen-devel] [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling




> -----Original Message-----
> From: Tian, Kevin
> Sent: Thursday, December 10, 2015 7:40 PM
> To: Wu, Feng <feng.wu@xxxxxxxxx>; xen-devel@xxxxxxxxxxxxx
> Cc: Keir Fraser <keir@xxxxxxx>; Jan Beulich <jbeulich@xxxxxxxx>; Andrew
> Cooper <andrew.cooper3@xxxxxxxxxx>; George Dunlap
> <george.dunlap@xxxxxxxxxxxxx>; Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Subject: RE: [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling
> 
> > From: Wu, Feng
> > Sent: Thursday, December 03, 2015 4:36 PM
> >
> > This is the core logic handling for VT-d posted-interrupts. Basically it
> > deals with how and when to update posted-interrupts in the following
> > scenarios:
> > - vCPU is preempted
> > - vCPU is put to sleep
> > - vCPU is blocked
> >
> > When a vCPU is preempted or put to sleep, we update the posted-interrupts
> > during scheduling by introducing two new architectural scheduler hooks:
> > vmx_pi_switch_from() and vmx_pi_switch_to(). When vCPU is blocked, we
> > introduce a new architectural hooks: arch_vcpu_block() to update
> 
> hooks -> hook
> 
> > posted-interrupts descriptor.
> >
> > Besides that, before VM-entry, we make sure the 'NV' field is set to
> > 'posted_intr_vector' and the vCPU is not on any blocking list, which is
> > needed when the vCPU is running in non-root mode. The reason we do this
> > check is that we change the posted-interrupt descriptor in vcpu_block(),
> > but we don't change it back in vcpu_unblock() or when vcpu_block()
> > returns directly due to event delivery (in fact, we don't need to do it
> > in those two places, which is why we do it before VM-entry).
> >
> > When we handle the lazy context switch for the following two scenarios:
> > - Preempted by a tasklet, which runs in an idle context.
> > - The previous vCPU is offline and there is no runnable vCPU in the
> >   run queue.
> > we don't change the 'SN' bit in the posted-interrupt descriptor. This
> > may incur spurious PI notification events, but since a PI notification
> > event is only sent when 'ON' is clear, and hardware sets 'ON' once the
> > notification is sent, no further notification events are generated
> > until 'ON' is cleared again. Besides that, spurious PI notification
> > events happen from time to time in the Xen hypervisor anyway, e.g. when
> > a guest traps to Xen and a PI notification event arrives, there is
> > nothing Xen actually needs to do about it; the interrupts will be
> > delivered to the guest at the next VM entry.
> >
> > CC: Keir Fraser <keir@xxxxxxx>
> > CC: Jan Beulich <jbeulich@xxxxxxxx>
> > CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> > CC: Kevin Tian <kevin.tian@xxxxxxxxx>
> > CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> > CC: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> > Suggested-by: Yang Zhang <yang.z.zhang@xxxxxxxxx>
> > Suggested-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> > Suggested-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> > Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
> > Signed-off-by: Feng Wu <feng.wu@xxxxxxxxx>
> > ---
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index 6c2b512..3368cf2 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -7019,6 +7019,12 @@ void hvm_domain_soft_reset(struct domain *d)
> >      hvm_destroy_all_ioreq_servers(d);
> >  }
> >
> > +void arch_vcpu_block(struct vcpu *v)
> > +{
> > +    if ( v->arch.vcpu_block )
> > +        v->arch.vcpu_block(v);
> > +}
> > +
> >  /*
> >   * Local variables:
> >   * mode: C
> > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> > index 000d06e..0f23fce 100644
> > --- a/xen/arch/x86/hvm/vmx/vmcs.c
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> > @@ -676,6 +676,8 @@ int vmx_cpu_up(void)
> >      if ( cpu_has_vmx_vpid )
> >          vpid_sync_all();
> >
> > +    vmx_pi_per_cpu_init(cpu);
> > +
> >      return 0;
> >  }
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 39dc500..0d9462e 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -83,7 +83,131 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
> >  static void vmx_invlpg_intercept(unsigned long vaddr);
> >  static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
> >
> > +/*
> > + * We maintain a per-CPU linked-list of vCPU, so in PI wakeup handler we
> 
> vCPU -> vCPUs
> 
> > + * can find which vCPU should be woken up.
> > + */
> > +static DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
> > +static DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
> > +
> >  uint8_t __read_mostly posted_intr_vector;
> > +uint8_t __read_mostly pi_wakeup_vector;
> > +
> > +void vmx_pi_per_cpu_init(unsigned int cpu)
> > +{
> > +    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
> > +    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +}
> > +
> > +void vmx_vcpu_block(struct vcpu *v)
> > +{
> > +    unsigned long flags;
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +    if ( !has_arch_pdevs(v->domain) )
> > +        return;
> > +
> > +    ASSERT(v->arch.hvm_vmx.pi_block_cpu == NR_CPUS);
> > +
> > +    /*
> > +     * The vCPU is blocking, we need to add it to one of the per-pCPU lists.
> > +     * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
> > +     * the per-CPU list, we also save it to posted-interrupt descriptor and
> > +     * make it as the destination of the wake-up notification event.
> 
> the 2nd piece - "we also save it to posted-interrupt descriptor" is not
> reflected within this function. Do you mean "we have saved it to..."
> or "we will save it later to..." in other places?
> 
> > +     */
> > +    v->arch.hvm_vmx.pi_block_cpu = v->processor;
> > +
> > +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> > +                      v->arch.hvm_vmx.pi_block_cpu), flags);
> > +    list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> > +                  &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
> > +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> > +                           v->arch.hvm_vmx.pi_block_cpu), flags);
> > +
> > +    ASSERT(!pi_test_sn(pi_desc));
> > +
> > +    /*
> > +     * We don't need to set the 'NDST' field, since it should point to
> > +     * the same pCPU as v->processor.
> > +     */
> > +
> > +    write_atomic(&pi_desc->nv, pi_wakeup_vector);
> > +}
> > +
> > +static void vmx_pi_switch_from(struct vcpu *v)
> > +{
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) ||
> > +         test_bit(_VPF_blocked, &v->pause_flags) )
> > +        return;
> > +
> > +    /*
> > +     * The vCPU has been preempted or went to sleep. We don't need to send
> > +     * notification event to a non-running vcpu, the interrupt information
> 
> I might be wrong on this, but my gut-feeling is that 'non-running' includes
> both runnable and blocked vcpu...

Yes, I should use "We don't need to send a notification event to a runnable
or sleeping vcpu ..."
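
For completeness, here is roughly what the 'SN' helpers used above look
like (a minimal sketch; it assumes the POSTED_INTR_SN bit definition and
the pi_desc layout introduced earlier in this series):

static inline void pi_set_sn(struct pi_desc *pi_desc)
{
    /*
     * Suppress notification events: interrupts are still posted to
     * PIR, but no notification vector is sent to the pCPU.
     */
    set_bit(POSTED_INTR_SN, &pi_desc->control);
}

static inline void pi_clear_sn(struct pi_desc *pi_desc)
{
    /* Re-allow notification events before the vCPU runs again. */
    clear_bit(POSTED_INTR_SN, &pi_desc->control);
}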

> 
> > +     * will be delivered to it before VM-ENTRY when the vcpu is scheduled
> > +     * to run next time.
> > +     */
> > +    pi_set_sn(pi_desc);
> > +}
> > +
> > +static void vmx_pi_switch_to(struct vcpu *v)
> > +{
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
> > +        return;
> > +
> > +    if ( x2apic_enabled )
> > +        write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
> > +    else
> > +        write_atomic(&pi_desc->ndst,
> > +                     MASK_INSR(cpu_physical_id(v->processor),
> > +                     PI_xAPIC_NDST_MASK));
> > +
> > +    pi_clear_sn(pi_desc);
> > +}
> > +
> > +static void vmx_pi_state_to_normal(struct vcpu *v)
> > +{
> > +    unsigned long flags;
> > +    unsigned int pi_block_cpu;
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
> > +        return;
> > +
> > +    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> > +
> > +    /*
> > +     * Set 'NV' field back to posted_intr_vector, so the
> > +     * Posted-Interrupts can be delivered to the vCPU when
> > +     * it is running in non-root mode.
> > +     */
> > +    if ( pi_desc->nv != posted_intr_vector )
> > +        write_atomic(&pi_desc->nv, posted_intr_vector);
> > +
> > +    /* the vCPU is not on any blocking list. */
> > +    pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
> > +    if ( pi_block_cpu == NR_CPUS )
> > +        return;
> > +
> > +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu), flags);
> > +
> > +    /*
> > +     * v->arch.hvm_vmx.pi_block_cpu == NR_CPUS here means the vCPU was
> > +     * removed from the blocking list while we were acquiring the lock.
> > +     */
> > +    if ( v->arch.hvm_vmx.pi_block_cpu == NR_CPUS )
> > +    {
> > +        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> > +                               flags);
> > +        return;
> > +    }
> > +
> > +    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> > +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> > +                           flags);
> > +}
> >
> >  static int vmx_domain_initialise(struct domain *d)
> >  {
> > @@ -106,10 +230,17 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> >
> >      spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
> >
> > +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > +
> > +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> > +
> >      v->arch.schedule_tail    = vmx_do_resume;
> >      v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
> >      v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;
> >
> > +    if ( iommu_intpost )
> > +        v->arch.vcpu_block = vmx_vcpu_block;
> > +
> >      if ( (rc = vmx_create_vmcs(v)) != 0 )
> >      {
> >          dprintk(XENLOG_WARNING,
> > @@ -734,6 +865,7 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
> >      vmx_save_guest_msrs(v);
> >      vmx_restore_host_msrs();
> >      vmx_save_dr(v);
> > +    vmx_pi_switch_from(v);
> >  }
> >
> >  static void vmx_ctxt_switch_to(struct vcpu *v)
> > @@ -758,6 +890,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
> >
> >      vmx_restore_guest_msrs(v);
> >      vmx_restore_dr(v);
> > +    vmx_pi_switch_to(v);
> >  }
> >
> >
> > @@ -2014,6 +2147,40 @@ static struct hvm_function_table __initdata vmx_function_table = {
> >      .altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
> >  };
> >
> > +/* Handle VT-d posted-interrupt when VCPU is blocked. */
> > +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> > +{
> > +    struct arch_vmx_struct *vmx, *tmp;
> > +    struct vcpu *v;
> > +    spinlock_t *lock = &per_cpu(pi_blocked_vcpu_lock, smp_processor_id());
> > +    struct list_head *blocked_vcpus =
> > +                       &per_cpu(pi_blocked_vcpu, smp_processor_id());
> > +    LIST_HEAD(list);
> 
> what's it?

Oh, this should be removed. I missed this while removing some code in this
function from v9.

> 
> > +
> > +    ack_APIC_irq();
> > +    this_cpu(irq_count)++;
> > +
> > +    spin_lock(lock);
> > +
> > +    /*
> > +     * XXX: The length of the list depends on how many vCPUs are
> > +     * currently blocked on this specific pCPU. This may hurt the
> > +     * interrupt latency if the list grows too long.
> > +     */
> > +    list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocked_vcpu_list)
> > +    {
> > +        if ( pi_test_on(&vmx->pi_desc) )
> > +        {
> > +            list_del(&vmx->pi_blocked_vcpu_list);
> > +            vmx->pi_block_cpu = NR_CPUS;
> > +            v = container_of(vmx, struct vcpu, arch.hvm_vmx);
> > +            vcpu_unblock(v);
> > +        }
> > +    }
> > +
> > +    spin_unlock(lock);
> > +}
> > +
> 
> I may be missing something here. Who will clear the ON bit? test read?

There are two places where 'ON' can get cleared:
- In vmx_sync_pir_to_irr(), while syncing interrupt information from
PIR to IRR (see the sketch below).
- If the guest is running in non-root mode, the CPU hardware clears
'ON' when handling the notification event; no VM exit occurs in this case.
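
For reference, the first path looks roughly like this (a sketch of
vmx_sync_pir_to_irr(); the helpers pi_test_and_clear_on() and
pi_get_pir() are assumed from the current tree):

static void vmx_sync_pir_to_irr(struct vcpu *v)
{
    struct vlapic *vlapic = vcpu_vlapic(v);
    unsigned int group, i;
    DECLARE_BITMAP(pending_intr, NR_VECTORS);

    /* 'ON' is cleared here; if it wasn't set, nothing was posted. */
    if ( !pi_test_and_clear_on(&v->arch.hvm_vmx.pi_desc) )
        return;

    /* Copy the pending vectors from PIR into the vlapic's IRR. */
    for ( group = 0; group < ARRAY_SIZE(pending_intr); group++ )
        pending_intr[group] = pi_get_pir(&v->arch.hvm_vmx.pi_desc, group);

    for_each_set_bit(i, pending_intr, NR_VECTORS)
        vlapic_set_vector(i, &vlapic->regs->data[APIC_IRR]);
}

Once the vectors are in IRR, they get injected through the normal
interrupt delivery path at the next VM entry.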

Thanks,
Feng


> 
> It's unclear from design doc:
> 
> +Outstanding Notification (ON): Indicate if there is a notification event
> +outstanding (not processed by processor or software) for this Posted
> +Interrupt Descriptor. When this field is 0, hardware modifies it from 0
> +to 1 when generating a notification event, and the entity receiving the
> +notification event (processor or software) resets it as part of posted
> +interrupt processing.
> 
> 
> Thanks
> Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
