Re: [Xen-devel] Notes on stubdoms and latency on ARM



Hi again,

On 7 July 2017 at 09:41, Dario Faggioli <dario.faggioli@xxxxxxxxxx> wrote:
> On Fri, 2017-07-07 at 18:02 +0300, Volodymyr Babchuk wrote:
>> Hello Dario,
>>
> Hi!
>
>> On 20 June 2017 at 13:11, Dario Faggioli <dario.faggioli@xxxxxxxxxx>
>> wrote:
>> > On Mon, 2017-06-19 at 11:36 -0700, Volodymyr Babchuk wrote:
>> > >
>> > > Thanks. Actually, we discussed this topic internally today. The
>> > > main concern now is not SMCs and OP-TEE (I will be happy to do
>> > > that right in Xen), but vcoprocs and GPU virtualization. Because
>> > > of legal issues, we can't put the latter in Xen. And because of
>> > > the nature of the vCPU framework, we will need multiple calls to
>> > > the vGPU driver per vCPU context switch.
>> > > I'm going to create a worst-case scenario, where multiple vCPUs
>> > > are active and there are no free pCPUs, to see how the credit or
>> > > credit2 scheduler will call my stubdom.
>> > >
>> >
>> > Well, that would be interesting and useful, thanks for offering to
>> > do that.
>>
>> Yeah, so I did that.
>>
> Ok, great! Thanks for doing and reporting about this. :-D
>
>> And I got some puzzling results. I don't know why, but when I have 4
>> (or fewer) active vCPUs on 4 pCPUs, my test takes about 1 second to
>> execute. But if there are 5 (or more) active vCPUs on 4 pCPUs, it
>> takes from 80 to 110 seconds.
>>
> I see. So, I've got just a handful of minutes right now, only enough
> to quickly look at the results and ask a couple of questions. Will
> think about this more in the coming days...
>
>> The details follow, but first let me remind you of my setup.
>> I'm testing on an ARM64 machine with 4 Cortex-A57 cores. I wrote a
>> special test driver for Linux that calls the SMC instruction 100,000
>> times. Also, I hacked Mini-OS to act as a monitor for DomU. This
>> means that Xen traps each SMC invocation and asks Mini-OS to handle
>> it.
>>
> Ok.
>
>> So, every SMC is handled in this way:
>>
>> DomU -> Xen -> Mini-OS -> Xen -> DomU.
>>
> Right. Nice work again.
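
For context, the measurement loop in my test driver is essentially just
the following (a simplified sketch: the SMC function ID here is a
made-up example, and the /proc plumbing and error handling are
omitted):

    #include <linux/arm-smccc.h>

    #define SMC_BENCH_FID    0x82000001  /* hypothetical function ID */
    #define SMC_BENCH_CALLS  100000

    static void smc_bench_run(void)
    {
        struct arm_smccc_res res;
        int i;

        /* Each call traps into Xen, which forwards the event to the
         * Mini-OS monitor and resumes this vCPU only after Mini-OS
         * has replied. */
        for (i = 0; i < SMC_BENCH_CALLS; i++)
            arm_smccc_smc(SMC_BENCH_FID, 0, 0, 0, 0, 0, 0, 0, &res);
    }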
>
>> Now, let's get back to the results.
>>
>> ** Case 1:
>> - Dom0 has 4 vCPUs and is idle
>> - DomU has 4 vCPUs and is idle
>> - Mini-OS has 1 vCPU and is never idle, because its scheduler does
>> not call WFI.
>> I run the test in DomU:
>>
>> root@salvator-x-h3-xt:~# time -p cat /proc/smc_bench
>> Will call SMC 100000 time(s)
>>
> So, given what you said above, this means that the vCPU running this
> will block (when calling SMC) and resume (when the SMC is handled)
> quite frequently, right?
Yes, exactly. There is a vm_event_vcpu_pause(v) call in monitor.c.
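On the Xen side the trap path is roughly the following (a condensed
sketch of monitor_smc() in xen/arch/arm/monitor.c, not the literal
code):

    /* Invoked when Xen traps an SMC issued by the monitored guest. */
    int monitor_smc(void)
    {
        struct vcpu *curr = current;
        vm_event_request_t req = {
            .reason = VM_EVENT_REASON_PRIVILEGED_CALL
        };

        /* The event is synchronous, so monitor_traps() ends up in
         * vm_event_vcpu_pause(curr); the vCPU stays paused until the
         * monitor (Mini-OS here) replies via the vm_event ring. */
        return monitor_traps(curr, 1, &req);
    }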

>
> Also, are you sure (e.g., because of how the Linux driver is done) that
> this always happens on one vCPU?
No, I can't guarantee that. The Linux driver is single-threaded, but I
did nothing to pin it to a particular CPU.
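
If pinning turns out to matter, it can be forced on both ends with the
usual tools, e.g. (the domain name is just an example):

    # taskset -c 0 time -p cat /proc/smc_bench   # pin the test process in DomU
    # xl vcpu-pin DomU 0 2                       # pin DomU's vCPU 0 to pCPU 2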

>
>> Done!
>> real 1.10
>> user 0.00
>> sys 1.10
>
>> ** Case 2:
>> - Dom0 has 4 vCPUs. All of them are executing an endless loop with a
>> sh one-liner:
>> # while : ; do : ; done &
>> - DomU has 4 vCPUs and is idle
>> - Mini-OS has 1 vCPU and is never idle, because its scheduler does
>> not call WFI.
>>
> Ah, I see. This is not ideal, IMO. It's fine for this PoC, of course,
> but I guess you've got plans to change this (if we decide to go the
> stubdom route)?
Sure. There is much to be done in Mini-OS to make it production-grade.

>
>> - In total, there are 6 active vCPUs.
>>
>> I run the test in DomU:
>> real 113.08
>> user 0.00
>> sys 113.04
>>
> Ok, so there's contention for pCPUs. Dom0's vCPUs are CPU hogs, while,
> if my assumption above is correct, the "SMC vCPU" of the DomU is I/O
> bound, in the sense that it blocks on an operation --which turns out to
> be an SMC call to Mini-OS-- then resumes and blocks again almost
> immediately.
>
> Since you are using Credit, can you try to disable context switch rate
> limiting? Something like:
>
> # xl sched-credit -s -r 0
>
> should work.
Yep, you are right. In the environment described above (Case 2), I now
get much better results:

real 1.85
user 0.00
sys 1.85
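
(For anyone reproducing this: "xl sched-credit -s" prints the current
scheduler parameters, so after the change it should report something
like "ratelimit=0us" for Pool-0.)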


> This looks to me like one of those typical scenario where rate limiting
> is counterproductive. In fact, every time that your SMC vCPU is woken
> up, despite being boosted, it finds all the pCPUs busy, and it can't
> preempt any of the vCPUs that are running there, until rate limiting
> expires.
>
> That means it has to wait an interval of time that varies between 0 and
> 1ms. This happens 100000 times, and 1ms*100000 is 100 seconds... which
> is roughly how long the test takes in the overcommitted case.
Yes, it looks like that was the case. Does this mean that ratelimiting
should be disabled for any domain that is backed by a device model?
AFAIK, device models work in exactly the same way.
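
AFAIU the ratelimit is a parameter of a scheduler instance, i.e. of a
cpupool, not of a single domain. So a possible middle ground would be
to disable it only in the pool where the device model / stubdom runs,
for example (using the "minios" pool from Case 7 below):

    # xl sched-credit -s -p minios -r 0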

>> ** Case 7:
>> - Dom0 has 4 vCPUs and is idle.
>> - DomU has 4 vCPUs. Two of them are executing an endless loop with a
>> sh one-liner:
>> # while : ; do : ; done &
>> - Mini-OS has 1 vCPU and is never idle, because its scheduler does
>> not call WFI.
>> - *Mini-OS is running in a separate cpupool with 1 pCPU*:
>> Name               CPUs   Sched     Active   Domain count
>> Pool-0               3    credit       y          2
>> minios               1    credit       y          1
>>
>> I run the test in DomU:
>> real 1.11
>> user 0.00
>> sys 1.10
>>
>> ** Case 8:
>> - Dom0 has 4 vCPUs and is idle.
>> - DomU has 4 vCPUs. Three of them are executing an endless loop with
>> a sh one-liner:
>> # while : ; do : ; done &
>> - Mini-OS has 1 vCPU and is never idle, because its scheduler does
>> not call WFI.
>> - Mini-OS is running in a separate cpupool with 1 pCPU:
>>
>> I run the test in DomU:
>> real 100.12
>> user 0.00
>> sys 100.11
>>
>>
>> As you can see, I tried moving Mini-OS to a separate cpupool, but it
>> didn't help much.
>>
> Yes, but it again makes sense. In fact, now there are 3 CPUs in Pool-0,
> and all of them are kept busy by the 3 DomU vCPUs running endless
> loops. So, when the DomU's SMC vCPU wakes up, it again has to wait for
> the rate limit to expire on one of them.
Yes, since this was caused by the ratelimit, it makes perfect sense. Thank you.
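
For completeness, a pool like the one in Cases 7 and 8 can be created
along these lines (this is roughly what I did; the stubdom's actual
domain name is elided):

    # xl cpupool-cpu-remove Pool-0 3
    # xl cpupool-create name=\"minios\" sched=\"credit\"
    # xl cpupool-cpu-add minios 3
    # xl cpupool-migrate <stubdom> minios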

I tried a number of different cases. The execution time now depends
linearly on the number of over-committed vCPUs (about +200 ms for every
additional busy vCPU). That is what I expected.

-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@xxxxxxxxx
