
Re: [Xen-devel] [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling



> From: Wu, Feng
> Sent: Thursday, December 03, 2015 4:36 PM
> 
> This is the core logic handling for VT-d posted-interrupts. Basically it
> deals with how and when to update the posted-interrupt descriptor in the
> following scenarios:
> - vCPU is preempted
> - vCPU is put to sleep
> - vCPU is blocked
> 
> When a vCPU is preempted or put to sleep, we update the posted-interrupt
> descriptor during scheduling by introducing two new architectural scheduler
> hooks: vmx_pi_switch_from() and vmx_pi_switch_to(). When a vCPU is blocked,
> we introduce a new architectural hooks: arch_vcpu_block() to update

hooks -> hook

> the posted-interrupt descriptor.
> 
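To make the new hook concrete for other reviewers: below is a minimal
sketch of the intended call site, assuming the common scheduler's
vcpu_block() is extended along these lines (the exact placement in this
patch may differ):

    void vcpu_block(void)
    {
        struct vcpu *v = current;

        set_bit(_VPF_blocked, &v->pause_flags);

        /* New: let VMX switch the PI descriptor to wakeup mode. */
        arch_vcpu_block(v);

        /* Check for events /after/ blocking: avoids wakeup waiting race. */
        if ( local_events_need_delivery() )
            clear_bit(_VPF_blocked, &v->pause_flags);
        else
            raise_softirq(SCHEDULE_SOFTIRQ);
    }

Note the hook runs before the event check, which is exactly why the
descriptor has to be fixed up again before VM-entry when vcpu_block()
returns directly due to event delivery, as the next paragraph explains.
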
> Besides that, before VM-entry, we will make sure the 'NV' field is set
> to 'posted_intr_vector' and the vCPU is not on any blocking list, which
> is needed when the vCPU is running in non-root mode. The reason we do
> this check is that we change the posted-interrupt descriptor in
> vcpu_block(); however, we don't change it back in vcpu_unblock() or when
> vcpu_block() returns directly due to event delivery (in fact, we don't
> need to do it in those two places, which is why we do it before
> VM-entry).
> 
> When we handle the lazy context switch in the following two scenarios:
> - Preempted by a tasklet, which runs in the idle vCPU's context.
> - The previous vCPU is offline and there is no runnable vCPU in the
>   run queue.
> we don't change the 'SN' bit in the posted-interrupt descriptor. This
> may incur spurious PI notification events, but since a PI notification
> event is only sent when 'ON' is clear, and hardware sets 'ON' once the
> notification is sent, no further notification events are generated
> until 'ON' is cleared again. Besides that, spurious PI notification
> events happen from time to time in the Xen hypervisor anyway: for
> example, when a guest traps to Xen and a PI notification event arrives,
> there is nothing Xen actually needs to do about it; the interrupts will
> be delivered to the guest at the next VM entry.
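
For context on the 'ON'/'SN' reasoning above, here is a self-contained
model of the posted-interrupt descriptor this patch manipulates. The
field layout follows the VT-d spec and the names mirror Xen's
struct pi_desc, but treat it as an illustrative sketch, not the patch's
actual definition:

    #include <stdint.h>

    /* 64-byte aligned, as required by hardware. */
    struct pi_desc {
        uint32_t pir[8];          /* posted-interrupt requests, one bit per vector */
        union {
            struct {
                uint16_t on : 1;      /* Outstanding Notification */
                uint16_t sn : 1;      /* Suppress Notification */
                uint16_t rsvd_1 : 14;
                uint8_t  nv;          /* Notification Vector ('NV' above) */
                uint8_t  rsvd_2;
                uint32_t ndst;        /* Notification Destination (APIC ID) */
            };
            uint64_t control;
        };
        uint32_t rsvd[6];
    } __attribute__ ((aligned (64)));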
> 
> CC: Keir Fraser <keir@xxxxxxx>
> CC: Jan Beulich <jbeulich@xxxxxxxx>
> CC: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> CC: Kevin Tian <kevin.tian@xxxxxxxxx>
> CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> CC: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Suggested-by: Yang Zhang <yang.z.zhang@xxxxxxxxx>
> Suggested-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Suggested-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
> Suggested-by: Jan Beulich <jbeulich@xxxxxxxx>
> Signed-off-by: Feng Wu <feng.wu@xxxxxxxxx>
> ---
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 6c2b512..3368cf2 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -7019,6 +7019,12 @@ void hvm_domain_soft_reset(struct domain *d)
>      hvm_destroy_all_ioreq_servers(d);
>  }
> 
> +void arch_vcpu_block(struct vcpu *v)
> +{
> +    if ( v->arch.vcpu_block )
> +        v->arch.vcpu_block(v);
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> index 000d06e..0f23fce 100644
> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> @@ -676,6 +676,8 @@ int vmx_cpu_up(void)
>      if ( cpu_has_vmx_vpid )
>          vpid_sync_all();
> 
> +    vmx_pi_per_cpu_init(cpu);
> +
>      return 0;
>  }
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 39dc500..0d9462e 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -83,7 +83,131 @@ static int vmx_msr_write_intercept(unsigned int msr,
>                                     uint64_t msr_content);
>  static void vmx_invlpg_intercept(unsigned long vaddr);
>  static int vmx_vmfunc_intercept(struct cpu_user_regs *regs);
> 
> +/*
> + * We maintain a per-CPU linked-list of vCPU, so in PI wakeup handler we

vCPU -> vCPUs

> + * can find which vCPU should be woken up.
> + */
> +static DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
> +static DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
> +
>  uint8_t __read_mostly posted_intr_vector;
> +uint8_t __read_mostly pi_wakeup_vector;
> +
> +void vmx_pi_per_cpu_init(unsigned int cpu)
> +{
> +    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
> +    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +}
> +
> +void vmx_vcpu_block(struct vcpu *v)
> +{
> +    unsigned long flags;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !has_arch_pdevs(v->domain) )
> +        return;
> +
> +    ASSERT(v->arch.hvm_vmx.pi_block_cpu == NR_CPUS);
> +
> +    /*
> +     * The vCPU is blocking, we need to add it to one of the per pCPU lists.
> +     * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
> +     * the per-CPU list, we also save it to posted-interrupt descriptor and
> +     * make it as the destination of the wake-up notification event.

the 2nd piece - "we also save it to posted-interrupt descriptor" is not
reflected within this function. Do you mean "we have saved it to..."
or "we will save it later to..." in other places?

> +     */
> +    v->arch.hvm_vmx.pi_block_cpu = v->processor;
> +
> +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> +                      v->arch.hvm_vmx.pi_block_cpu), flags);
> +    list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> +                  &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
> +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> +                           v->arch.hvm_vmx.pi_block_cpu), flags);
> +
> +    ASSERT(!pi_test_sn(pi_desc));
> +
> +    /*
> +     * We don't need to set the 'NDST' field, since it should point to
> +     * the same pCPU as v->processor.
> +     */
> +
> +    write_atomic(&pi_desc->nv, pi_wakeup_vector);
> +}
> +
> +static void vmx_pi_switch_from(struct vcpu *v)
> +{
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) ||
> +         test_bit(_VPF_blocked, &v->pause_flags) )
> +        return;
> +
> +    /*
> +     * The vCPU has been preempted or went to sleep. We don't need to send
> +     * notification event to a non-running vcpu, the interrupt information

I might be wrong on this, but my gut feeling is that 'non-running' includes
both runnable and blocked vCPUs...

> +     * will be delivered to it before VM-ENTRY when the vcpu is scheduled
> +     * to run next time.
> +     */
> +    pi_set_sn(pi_desc);
> +}
> +
> +static void vmx_pi_switch_to(struct vcpu *v)
> +{
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
> +        return;
> +
> +    if ( x2apic_enabled )
> +        write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
> +    else
> +        write_atomic(&pi_desc->ndst,
> +                     MASK_INSR(cpu_physical_id(v->processor),
> +                     PI_xAPIC_NDST_MASK));
> +
> +    pi_clear_sn(pi_desc);
> +}
> +
> +static void vmx_pi_state_to_normal(struct vcpu *v)
> +{
> +    unsigned long flags;
> +    unsigned int pi_block_cpu;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +    if ( !iommu_intpost || !has_arch_pdevs(v->domain) )
> +        return;
> +
> +    ASSERT(!test_bit(_VPF_blocked, &v->pause_flags));
> +
> +    /*
> +     * Set 'NV' field back to posted_intr_vector, so the
> +     * Posted-Interrupts can be delivered to the vCPU when
> +     * it is running in non-root mode.
> +     */
> +    if ( pi_desc->nv != posted_intr_vector )
> +        write_atomic(&pi_desc->nv, posted_intr_vector);
> +
> +    /* the vCPU is not on any blocking list. */
> +    pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
> +    if ( pi_block_cpu == NR_CPUS )
> +        return;
> +
> +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu), flags);
> +
> +    /*
> +     * v->arch.hvm_vmx.pi_block_cpu == NR_CPUS here means the vCPU was
> +     * removed from the blocking list while we are acquiring the lock.
> +     */
> +    if ( v->arch.hvm_vmx.pi_block_cpu == NR_CPUS )
> +    {
> +        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> +                               flags);
> +        return;
> +    }
> +
> +    list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> +                           flags);
> +}
> 
>  static int vmx_domain_initialise(struct domain *d)
>  {
> @@ -106,10 +230,17 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> 
>      spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
> 
> +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +
> +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> +
>      v->arch.schedule_tail    = vmx_do_resume;
>      v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
>      v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;
> 
> +    if ( iommu_intpost )
> +        v->arch.vcpu_block = vmx_vcpu_block;
> +
>      if ( (rc = vmx_create_vmcs(v)) != 0 )
>      {
>          dprintk(XENLOG_WARNING,
> @@ -734,6 +865,7 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>      vmx_save_guest_msrs(v);
>      vmx_restore_host_msrs();
>      vmx_save_dr(v);
> +    vmx_pi_switch_from(v);
>  }
> 
>  static void vmx_ctxt_switch_to(struct vcpu *v)
> @@ -758,6 +890,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
> 
>      vmx_restore_guest_msrs(v);
>      vmx_restore_dr(v);
> +    vmx_pi_switch_to(v);
>  }
> 
> 
> @@ -2014,6 +2147,40 @@ static struct hvm_function_table __initdata vmx_function_table = {
>      .altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
>  };
> 
> +/* Handle VT-d posted-interrupt when VCPU is blocked. */
> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> +{
> +    struct arch_vmx_struct *vmx, *tmp;
> +    struct vcpu *v;
> +    spinlock_t *lock = &per_cpu(pi_blocked_vcpu_lock, smp_processor_id());
> +    struct list_head *blocked_vcpus =
> +                       &per_cpu(pi_blocked_vcpu, smp_processor_id());
> +    LIST_HEAD(list);

What is this for? It does not appear to be used anywhere in this function.

> +
> +    ack_APIC_irq();
> +    this_cpu(irq_count)++;
> +
> +    spin_lock(lock);
> +
> +    /*
> +     * XXX: The length of the list depends on how many vCPUs are currently
> +     * blocked on this specific pCPU. This may hurt the interrupt latency
> +     * if the list grows to too many entries.
> +     */
> +    list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocked_vcpu_list)
> +    {
> +        if ( pi_test_on(&vmx->pi_desc) )
> +        {
> +            list_del(&vmx->pi_blocked_vcpu_list);
> +            vmx->pi_block_cpu = NR_CPUS;
> +            v = container_of(vmx, struct vcpu, arch.hvm_vmx);
> +            vcpu_unblock(v);
> +        }
> +    }
> +
> +    spin_unlock(lock);
> +}
> +

I may be missing something here. Who will clear the ON bit? pi_test_on()
above only reads it.

It's unclear from the design doc:

+Outstanding Notification (ON): Indicate if there is a notification event outstanding (not
+processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
+hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
+the notification event (processor or software) resets it as part of posted interrupt processing.
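
Presumably the receiving side has to do an atomic test-and-clear of 'ON'
before scanning PIR, so that a new notification can be generated for
interrupts posted afterwards. Something like the sketch below; the helper
name and bit position are what I'd expect from vmx.h, so take this as an
assumption rather than a quote of the patch:

    /* 'ON' is assumed to be bit 0 of the descriptor's control word. */
    #define POSTED_INTR_ON 0

    static inline int pi_test_and_clear_on(struct pi_desc *pi_desc)
    {
        return test_and_clear_bit(POSTED_INTR_ON, &pi_desc->control);
    }

If nothing on the wakeup path does this, 'ON' stays set and all further
notification events are suppressed.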


Thanks
Kevin
