
Re: [Xen-devel] vmx: VT-d posted-interrupt core logic handling

On 10/03/16 11:16, David Vrabel wrote:
> On 10/03/16 10:46, George Dunlap wrote:
>> On 10/03/16 10:35, David Vrabel wrote:
>>> On 10/03/16 10:18, Jan Beulich wrote:
>>>>>>> On 10.03.16 at 11:05, <kevin.tian@xxxxxxxxx> wrote:
>>>>>>  From: Tian, Kevin
>>>>>> Sent: Thursday, March 10, 2016 5:20 PM
>>>>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>>>>>> Sent: Thursday, March 10, 2016 5:06 PM
>>>>>>>> There are many linked list usages today in Xen hypervisor, which
>>>>>>>> have different theoretical maximum possible number. The closest
>>>>>>>> one to PI might be the usage in tmem (pool->share_list) which is
>>>>>>>> page based so could grow 'overly large'. Other examples are
>>>>>>>> magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
>>>>>>>> could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
>>>>>>>> in MSI-x virtualization (which could be 2^11 per spec). Do we
>>>>>>>> also want to create some artificial scenarios to examine them
>>>>>>>> since based on actual operation K-level entries may also become
>>>>>>>> a problem?
>>>>>>>> Just want to figure out how best we can solve all related linked-list
>>>>>>>> usages in current hypervisor.
>>>>>>> As you say, those are (perhaps with the exception of tmem, which
>>>>>>> isn't supported anyway due to XSA-15, and which therefore also
>>>>>>> isn't on by default) in the order of a few thousand list elements.
>>>>>>> And as mentioned above, different bounds apply for lists traversed
>>>>>>> in interrupt context vs such traversed only in "normal" context.
>>>>>> That's a good point. Interrupt context should have more restrictions.
>>>>> Hi, Jan,
>>>>> I'm thinking your earlier idea about evenly distributed list:
>>>>> --
>>>>> Ah, right, I think that limitation was named before, yet I've
>>>>> forgotten about it again. But that only slightly alters the
>>>>> suggestion: To distribute vCPU-s evenly would then require to
>>>>> change their placement on the pCPU in the course of entering
>>>>> blocked state.
>>>>> --
>>>>> Actually after more thinking, there is no hard requirement that
>>>>> the vcpu must block on the pcpu which is configured in 'NDST'
>>>>> of that vcpu's PI descriptor. What really matters, is that the
>>>>> vcpu is added to the linked list of the very pcpu, then when PI
>>>>> notification comes we can always find out the vcpu struct from
>>>>> that pcpu's linked list. Of course one drawback of such placement
>>>>> is additional IPI incurred in wake up path.
>>>>> Then one possible optimized policy within vmx_vcpu_block could 
>>>>> be:
>>>>> (Say PCPU1 which VCPU1 is currently blocked on)
>>>>> - As long as the #vcpus in the linked list on PCPU1 is below a 
>>>>> threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
>>>>> Upon PI notification on PCPU1, local linked list is searched to
>>>>> find VCPU1 and then VCPU1 will be unblocked on PCPU1;
>>>>> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution 
>>>>> algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
>>>>> but NDST set to PCPU2. Upon notification on PCPU2, local linked
>>>>> list is searched to find VCPU1 and then an IPI is sent to PCPU1 to 
>>>>> unblock VCPU1;
>>>> Sounds possible, if the lock handling can be got right. But of
>>>> course there can't be any hard limit like 16, at least not alone
>>>> (on a system with extremely many mostly idle vCPU-s we'd
>>>> need to allow larger counts - see my earlier explanations in this
>>>> regard).
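
To make the placement policy quoted above concrete, here is a minimal sketch in C. It is not taken from any posted patch: NR_PCPUS, PI_LIST_THRESH, pi_blocked_count, pi_choose_ndst and pi_wakeup_needs_ipi are all hypothetical names, the (vm_id + vcpu_id) hash is only one possible reading of "a simple distribution algorithm", and the per-list locking Jan mentions is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PCPUS       4   /* hypothetical number of physical CPUs */
#define PI_LIST_THRESH 16  /* the "say 16" from the proposal; not meant as a hard limit */

/* Hypothetical bookkeeping: how many blocked vCPUs sit on each pCPU's PI list. */
static unsigned int pi_blocked_count[NR_PCPUS];

/*
 * Choose the pCPU whose list a blocking vCPU joins, i.e. the value that
 * would be written to NDST. cur_pcpu is the pCPU the vCPU blocks on.
 * No locking is shown; a real implementation would need it.
 */
static unsigned int pi_choose_ndst(unsigned int cur_pcpu,
                                   unsigned int vm_id, unsigned int vcpu_id)
{
    unsigned int dst;

    assert(cur_pcpu < NR_PCPUS);

    if (pi_blocked_count[cur_pcpu] < PI_LIST_THRESH)
        dst = cur_pcpu;                      /* local list still short */
    else
        dst = (vm_id + vcpu_id) % NR_PCPUS;  /* assumed fallback hash */

    pi_blocked_count[dst]++;
    return dst;
}

/* The wake-up path needs an extra IPI only when the list's pCPU differs. */
static int pi_wakeup_needs_ipi(unsigned int ndst, unsigned int blocked_on)
{
    return ndst != blocked_on;
}
```

Note that the fallback in this sketch can still land on a list that is itself over the threshold (or on cur_pcpu again), which is one reason a fixed limit alone would not be enough.
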
>>> You could also consider only waking the first N VCPUs and just making
>>> the rest runnable.  If you wake more VCPUs than PCPUs at the same time
>>> most of them won't actually be scheduled.
>> "Waking" a vcpu means "changing from blocked to runnable", so those two
>> things are the same.  And I can't figure out what you mean instead --
>> can you elaborate?
>> Waking up 1000 vcpus is going to take strictly more time than checking
>> whether there's a PI interrupt pending on 1000 vcpus to see if they need
>> to be woken up.
> Waking means making it runnable /and/ attempting to make it run.
> So I mean, for the > N'th VCPU don't call __runq_tickle(), only call
> __runq_insert().

I'm not sure that would satisfy Jan; inserting 1000 vcpus into the
runqueue (let alone inserting 4 million vcpus) is still going to take
quite a while, even without looking for a place to run them.
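
As a rough illustration of the cost argument, the sketch below counts the operations involved in "tickle only the first N". The helpers sketch_runq_insert and sketch_runq_tickle are hypothetical stand-ins that only increment counters; they are not the scheduler's __runq_insert() and __runq_tickle(), whose real signatures and costs are not reproduced here.

```c
#include <stddef.h>

/* Counters standing in for the work done on the wake-up path. */
struct wake_stats {
    unsigned long inserts;  /* runqueue insertions performed */
    unsigned long tickles;  /* attempts to find a pCPU to run on */
};

/* Hypothetical stand-ins: each just records that it was called. */
static void sketch_runq_insert(struct wake_stats *s) { s->inserts++; }
static void sketch_runq_tickle(struct wake_stats *s) { s->tickles++; }

/*
 * Make nr_vcpus vCPUs runnable, but only look for a place to run
 * the first n of them.
 */
static struct wake_stats wake_first_n(unsigned long nr_vcpus, unsigned long n)
{
    struct wake_stats s = { 0, 0 };
    unsigned long i;

    for (i = 0; i < nr_vcpus; i++) {
        sketch_runq_insert(&s);      /* every vCPU is still inserted */
        if (i < n)
            sketch_runq_tickle(&s);  /* only the first n are tickled */
    }
    return s;
}
```

Under these assumptions the tickle count is capped at n, but the insert count still grows linearly with the number of vCPUs on the list.
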

