
Re: [Xen-devel] [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling




> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: Monday, January 18, 2016 11:14 PM
> To: Wu, Feng <feng.wu@xxxxxxxxx>
> Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>; Dario Faggioli
> <dario.faggioli@xxxxxxxxxx>; George Dunlap <george.dunlap@xxxxxxxxxxxxx>;
> Tian, Kevin <kevin.tian@xxxxxxxxx>; xen-devel@xxxxxxxxxxxxx; Keir Fraser
> <keir@xxxxxxx>
> Subject: Re: [PATCH v10 6/7] vmx: VT-d posted-interrupt core logic handling
> 
> >>> On 03.12.15 at 09:35, <feng.wu@xxxxxxxxx> wrote:
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -83,7 +83,131 @@ static int vmx_msr_write_intercept(unsigned int
> msr, uint64_t msr_content);
> > +    ASSERT(v->arch.hvm_vmx.pi_block_cpu == NR_CPUS);
> > +
> > +    /*
> > +     * The vCPU is blocking, we need to add it to one of the per pCPU 
> > lists.
> > +     * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it
> for
> > +     * the per-CPU list, we also save it to posted-interrupt descriptor and
> > +     * make it as the destination of the wake-up notification event.
> > +     */
> > +    v->arch.hvm_vmx.pi_block_cpu = v->processor;
> > +
> > +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> > +                      v->arch.hvm_vmx.pi_block_cpu), flags);
> > +    list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> > +                  &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
> > +    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> > +                           v->arch.hvm_vmx.pi_block_cpu), flags);
> > +
> > +    ASSERT(!pi_test_sn(pi_desc));
> > +
> > +    /*
> > +     * We don't need to set the 'NDST' field, since it should point to
> > +     * the same pCPU as v->processor.
> > +     */
> > +
> > +    write_atomic(&pi_desc->nv, pi_wakeup_vector);
> 
> Stray blank line between comment and corresponding code.
> 
> Also the ASSERT() is rather more disconnected from the write
> than seems desirable: Wouldn't it be better to cmpxchg() the
> whole 32-bit value, validating that SN is clear in the result?

I don't quite understand this. The control field is 64 bits; do you
mean cmpxchg() the whole 64-bit value, like the following:

+        do {
+            old.control = new.control = pi_desc->control;
+            new.nv = pi_wakeup_vector;
+        } while ( cmpxchg(&pi_desc->control, old.control, new.control)
+                  != old.control );

This is somewhat like the implementation in the earlier versions;
why do you want to change it back?
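
Or, if your point is rather to tie the SN check to the update itself,
is the below roughly what you have in mind (just a sketch to make sure
I read you correctly, not tested)?

    struct pi_desc old, new;

    do {
        old.control = new.control = pi_desc->control;
        /* Validate SN is still clear at the point we switch vectors. */
        ASSERT(!pi_test_sn(&old));
        new.nv = pi_wakeup_vector;
    } while ( cmpxchg(&pi_desc->control, old.control, new.control)
              != old.control );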

> > +    pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
> > +    if ( pi_block_cpu == NR_CPUS )
> > +        return;
> 
> Are this condition and the one in the immediately preceding if()
> connected in any way?

I am not sure I understand this correctly. Which one is
"the one in the immediately preceding if()"?

> I.e. could the latter perhaps become an
> ASSERT() by connecting the two blocks suitably? I'm simply
> trying to limit the number of different cases to consider mentally
> when looking at this code ...

If the check above finds NR_CPUS, it means the vCPU has already been
removed by pi_wakeup_interrupt(), so we just return. Otherwise we
acquire the spinlock as the following code does, and there are two
scenarios (see the sketch after this list):
- pi_wakeup_interrupt() acquires the spinlock before us. By the time
we get the lock, 'v->arch.hvm_vmx.pi_block_cpu' has already been set
to NR_CPUS by pi_wakeup_interrupt(), so we need to do nothing.
- We get the spinlock before pi_wakeup_interrupt(). In that case we
need to remove the vCPU from the list and set
'v->arch.hvm_vmx.pi_block_cpu' to NR_CPUS ourselves.
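
Put differently, the cleanup side looks roughly like this (a sketch of
the logic described above, not the literal patch code):

    pi_block_cpu = v->arch.hvm_vmx.pi_block_cpu;
    if ( pi_block_cpu == NR_CPUS )
        return;   /* Already removed by pi_wakeup_interrupt(). */

    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu), flags);

    /* Re-check under the lock: pi_wakeup_interrupt() may have won the race. */
    if ( v->arch.hvm_vmx.pi_block_cpu != NR_CPUS )
    {
        list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
        v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
    }

    spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu), flags);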

> 
> > +    spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock, pi_block_cpu),
> flags);
> > +
> > +    /*
> > +     * v->arch.hvm_vmx.pi_block_cpu == NR_CPUS here means the vCPU
> was
> > +     * removed from the blocking list while we are acquiring the lock.
> > +     */
> > +    if ( v->arch.hvm_vmx.pi_block_cpu == NR_CPUS )
> 
> With you wanting to deal with changes behind your back here,
> isn't there come kind of barrier needed between reading and using
> pi_block_cpu, such that the compiler won't convert the local
> variable accesses into multiple reads of
> v->arch.hvm_vmx.pi_block_cpu (which iiuc it is allowed to do)?

I will think about this.
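
One option might be to latch the field exactly once, e.g. (a sketch,
assuming ACCESS_ONCE, or an equivalent volatile cast, is acceptable in
this context):

    unsigned int pi_block_cpu;

    /*
     * Read the field exactly once: pi_wakeup_interrupt() may change it
     * behind our back, and the local copy must stay stable across all
     * of its subsequent uses.
     */
    pi_block_cpu = ACCESS_ONCE(v->arch.hvm_vmx.pi_block_cpu);
    if ( pi_block_cpu == NR_CPUS )
        return;

The later check under the lock would of course still read
v->arch.hvm_vmx.pi_block_cpu afresh.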

> 
> > @@ -106,10 +230,17 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> >
> >      spin_lock_init(&v->arch.hvm_vmx.vmcs_lock);
> >
> > +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > +
> > +    v->arch.hvm_vmx.pi_block_cpu = NR_CPUS;
> > +
> >      v->arch.schedule_tail    = vmx_do_resume;
> >      v->arch.ctxt_switch_from = vmx_ctxt_switch_from;
> >      v->arch.ctxt_switch_to   = vmx_ctxt_switch_to;
> >
> > +    if ( iommu_intpost )
> > +        v->arch.vcpu_block = vmx_vcpu_block;
> 
> Considering the locking question above - couldn't this be done only
> when a device gets added, and the pointer zapped upon removal
> of the last device, at once saving the conditional inside the hook?

That is another solution, but I think it makes a simple thing
difficult; I feel the current logic is simple and clear. Besides, this
hook is not on a critical path, so another conditional in it should not
be a big deal.
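
For completeness, if I understand your suggestion correctly, it would
amount to something like the below (helper names are hypothetical, just
to illustrate), which to me adds plumbing for little gain:

    /* Hypothetical: called when the first posted-interrupt capable device is assigned. */
    void vmx_pi_hooks_assign(struct domain *d)
    {
        struct vcpu *v;

        if ( !iommu_intpost || !is_hvm_domain(d) )
            return;

        for_each_vcpu ( d, v )
            v->arch.vcpu_block = vmx_vcpu_block;
    }

    /* Hypothetical: called when the last such device is removed. */
    void vmx_pi_hooks_deassign(struct domain *d)
    {
        struct vcpu *v;

        for_each_vcpu ( d, v )
            v->arch.vcpu_block = NULL;
    }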

> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > @@ -802,6 +802,8 @@ void vcpu_block(void)
> >
> >      set_bit(_VPF_blocked, &v->pause_flags);
> >
> > +    arch_vcpu_block(v);
> > +
> >      /* Check for events /after/ blocking: avoids wakeup waiting race. */
> >      if ( local_events_need_delivery() )
> >      {
> 
> ISTR some previous discussion on this placement - wasn't it
> determined that putting it in the else part here would be
> sufficient for your needs and better in terms of overhead?
> Or is there some race possible if moved past the invocation
> of local_events_need_delivery()?

The discussion happened several months ago; it is really hard to recall
every detail, and I have to dig it out of my archive. It is unfortunate
that the review has dragged on for so long.

We need to call arch_vcpu_block() before local_events_need_delivery():
the VT-d hardware can issue a notification event while we are in
arch_vcpu_block(); if that happens, the 'ON' bit is set and we will
then see it in local_events_need_delivery(). If we did
arch_vcpu_block() in the else part, we could not handle this case. This
is one reason I can recall; I am not sure whether there were other
concerns, since it has been a long time. The current implementation was
fully discussed with the scheduler maintainers.
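
To spell out the ordering concern with a simplified skeleton of
vcpu_block() (a sketch, not the literal code):

    set_bit(_VPF_blocked, &v->pause_flags);

    /*
     * Switch NV to pi_wakeup_vector (and add the vCPU to the blocking
     * list) before the event check below.  If a notification arrives
     * right here, the hardware sets 'ON' in the posted-interrupt
     * descriptor, local_events_need_delivery() sees the pending event,
     * and we cancel the block instead of sleeping with an interrupt
     * outstanding.
     */
    arch_vcpu_block(v);

    if ( local_events_need_delivery() )
        clear_bit(_VPF_blocked, &v->pause_flags);   /* Don't block after all. */
    else
        raise_softirq(SCHEDULE_SOFTIRQ);            /* Go to sleep. */

If arch_vcpu_block() were moved into the else branch, a notification
arriving after the check but during the vector switch would set 'ON'
without anyone noticing, and the vCPU could go to sleep with an
interrupt pending.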

BTW, I don't think the current implementation introduces much overhead.

> 
> > @@ -839,6 +841,8 @@ static long do_poll(struct sched_poll *sched_poll)
> >      v->poll_evtchn = -1;
> >      set_bit(v->vcpu_id, d->poll_mask);
> >
> > +    arch_vcpu_block(v);
> > +
> >  #ifndef CONFIG_X86 /* set_bit() implies mb() on x86 */
> >      /* Check for events /after/ setting flags: avoids wakeup waiting race. 
> > */
> >      smp_mb();
> 
> Same question here then regarding moving this further down.
> 
> > --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> > +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> 
> > +    unsigned int         pi_block_cpu;
> 
> Looking at all the use sites I wonder whether it wouldn't make for
> easier to read code if you instead stored the lock to acquire. This
> would then perhaps also prepare a road for balancing lists growing
> too large (i.e. rather than having one list per CPU, you could
> then have a variable number of lists, depending on current needs,
> and [kind of] freely select one of them).

Since this feature is still experimental, can we defer the "length of
the list" issue to the next stage, when we turn it on by default?

Thanks,
Feng
