
Re: [Xen-devel] [ARM] Native application design and discussion (I hope)



Hi Julien,

On 25 April 2017 at 14:43, Julien Grall <julien.grall@xxxxxxx> wrote:
>>>>>> We will also need another type of application: one which is
>>>>>> periodically called by XEN itself, not actually servicing any domain
>>>>>> request. This is needed for a
>>>>>> coprocessor sharing framework scheduler implementation.
>>>>>
>>>>>
>>>>> EL0 apps can be a powerful new tool for us to use, but they are not the
>>>>> solution to everything. This is where I would draw the line: if the
>>>>> workload needs to be scheduled periodically, then it is not a good fit
>>>>> for an EL0 app.
>>>>
>>>>
>>>> From my last conversation with Volodymyr I've got a feeling that notions
>>>> "EL0" and "XEN native application" must be pretty orthogonal.
>>>> In [1] Volodymyr got no performance gain from changing domain's
>>>> exception level from EL1 to EL0.
>>>> Only when Volodymyr stripped the domain's context  abstraction (i.e.
>>>> dropped GIC context store/restore) some noticeable results were reached.
>>>
>>>
>>>
>>> Do you have numbers for part that take times in the save/restore? You
>>> mention GIC and I am a bit surprised you don't mention FPU.
>>
>> I did it in the other thread. Check out [1]. The most speed up I got
>> after removing vGIC context handling
>
>
> Oh, yes. Sorry I forgot this thread. Continuing on that, you said that "Now
> profiler shows that hypervisor spends time in spinlocks and p2m code."
>
> Could you expand here? How the EL0 app will spend time in p2m code?
I don't quite remember. It was somewhere around the p2m context
save/restore functions.
I'll try to recreate that setup and provide more details.

> Similarly, why spinlocks take time? Are they contented?
The problem is that my profiler does not show the stack, so I can't say
which spinlock causes this. But the profiler didn't show the CPU spending
much time in the spinlock wait loop, so it looks like there is no contention.

>>
>>> I would have a look at optimizing the context switch path. Some ideas:
>>>         - there are a lot of unnecessary isb/dsb. The registers used by
>>> the
>>> guests only will be synchronized by eret.
>>
>> I have removed (almost) all of them. No significant changes in latency.
>>
>>>         - FPU is taking time to save/restore, you could make it lazy
>>
>> This also does not takes much time.
>>
>>>         - It might be possible to limit the number of LRs saved/restored
>>> depending on the number of LRs used by a domain.
>>
>> Excuse me, what is LR in this context?
>
>
> Sorry I meant GIC LRs (see GIC save/restore code). They are used to list the
> interrupts injected to the guest. All of they may not be used at the time of
> the context switch.
As I said, I don't call the GIC save and restore routines, so that should
not be an issue (if I got that right).

>>
>> You can take a look at my context switch routines at [2].
>
>
> I had a quick look and I am not sure which context switch you exactly used
> as you split it into 2 helpers but also modify the current one.
>
> Could you briefly describe the context switch you do for EL0 app here?
As I said, I tried to reuse the existing services. My PoC hosts the app
in a separate domain, and this domain has its own vCPU. So, at first I
used the plain old ctxt_switch_from()/ctxt_switch_to() pair from
domain.c. As you know, those two functions save/restore almost all of
the vCPU state except pc, sp, lr and the other general purpose
registers; the remaining context is saved/restored in entry.S.
I just made v->arch.cpu_info->guest_cpu_user_regs.pc point to the app
entry point and changed the saved cpsr to switch right into EL0.
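
In code, that part boils down to something like this (a minimal sketch,
not the actual PoC patch; the helper name and the entry-point argument
are made up for illustration):

#include <xen/sched.h>    /* struct vcpu, struct cpu_info */

/* Sketch only: make the next eret enter the app at EL0. */
void prepare_app_entry(struct vcpu *v, register_t app_entry_point)
{
    struct cpu_user_regs *regs = &v->arch.cpu_info->guest_cpu_user_regs;

    regs->pc   = app_entry_point;  /* eret jumps straight into the app */
    regs->cpsr = PSR_MODE_EL0t;    /* saved cpsr changed so we land in EL0 */
    /* sp is programmed the same way, see step 2 of the flow below */
}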

Then I copied ctxt_switch_from()/ctxt_switch_to() to
ctxt_switch_from_partial()/ctxt_switch_to_partial() and began to
remove all unneeded code (dsb()s/isb()s, GIC context handling, etc.).
So the overall flow is the following (a rough sketch in C follows the list):

0. If it is the first call, I create a 1:1 VM mapping and program the
ttbr0, ttbcr and mair registers of the app vCPU.
1. I pause the calling vCPU.
2. I program the saved pc of the app vCPU to point to the app entry
point, sp to point to the top of the stack, and cpsr to enter EL0 mode.
3. I call ctxt_switch_from_partial() to save the context of the calling vCPU.
4. I enable the TGE bit.
5. I call ctxt_switch_to_partial() to restore the context of the app vCPU.
6. I call __save_context() to save the rest of the context of the
calling vCPU (pc, sp, lr and the other general purpose registers).
7. I invoke switch_stack_and_jump() to restore the rest of the context
of the app vCPU.
8. Now I'm in the EL0 app. Hooray! The app does something, invokes
syscalls (which are handled in the hypervisor) and so on.
9. The app invokes a syscall named app_exit().
10. I use ctxt_switch_from_partial() to save the app state (actually
this is not needed, I think).
11. I use ctxt_switch_to_partial() to restore the calling vCPU state.
12. I unpause the calling vCPU and drop the TGE bit.
13. I call __restore_context() to restore pc, lr and friends. At this
point the code jumps back to step 6 (because I saved pc there), but it
checks a flag variable and sees that this is actually an exit from the app...
14. ... so it exits back to the calling domain.
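
To make this easier to follow, here is roughly how those steps map onto
C. This is only a sketch under the assumptions of my PoC:
ctxt_switch_*_partial(), __save_context()/__restore_context() and
switch_stack_and_jump() are the helpers mentioned above (signatures
guessed here just so the sketch reads), while app_p2m_ready(),
app_build_p2m(), in_app, app_stack, app_entry_point and app_enter_el0
are placeholder names used only for illustration, not existing Xen
interfaces:

#include <xen/sched.h>      /* struct vcpu, vcpu_pause()/vcpu_unpause() */
#include <asm/current.h>    /* switch_stack_and_jump() */
#include <asm/processor.h>  /* HCR_TGE, READ_SYSREG()/WRITE_SYSREG() */

/* Helpers from the PoC [2], declared with guessed signatures. */
void ctxt_switch_from_partial(struct vcpu *v);
void ctxt_switch_to_partial(struct vcpu *v);
void __save_context(struct vcpu *v);
void __restore_context(struct vcpu *v);
void prepare_app_entry(struct vcpu *v, register_t entry);

/* Placeholders for this sketch only. */
bool app_p2m_ready(struct vcpu *app);
void app_build_p2m(struct vcpu *app);
void app_enter_el0(void);
extern register_t app_entry_point, app_stack;

static bool in_app;         /* flag checked after __save_context() returns */

/* Steps 0-8: run the app on behalf of the calling vCPU. */
static void app_run(struct vcpu *caller, struct vcpu *app)
{
    if ( !app_p2m_ready(app) )                  /* step 0: first call only */
        app_build_p2m(app);                     /* 1:1 map, ttbr0/ttbcr/mair */

    vcpu_pause(caller);                         /* step 1 */
    prepare_app_entry(app, app_entry_point);    /* step 2, see sketch above */

    ctxt_switch_from_partial(caller);                       /* step 3 */
    WRITE_SYSREG(READ_SYSREG(HCR_EL2) | HCR_TGE, HCR_EL2);  /* step 4 */
    ctxt_switch_to_partial(app);                            /* step 5 */

    in_app = true;
    __save_context(caller);                     /* step 6: we return here twice */
    if ( in_app )
        switch_stack_and_jump(app_stack, app_enter_el0);    /* steps 7-8 */

    /* Second return from step 6: the app_exit() path below has already
     * restored the caller, so we simply go back to the calling domain
     * (step 14). */
}

/* Steps 9-13: handler for the app_exit() syscall issued by the app. */
static void app_exit_handler(struct vcpu *caller, struct vcpu *app)
{
    ctxt_switch_from_partial(app);              /* step 10: probably not needed */
    ctxt_switch_to_partial(caller);             /* step 11 */
    vcpu_unpause(caller);                       /* step 12 */
    WRITE_SYSREG(READ_SYSREG(HCR_EL2) & ~HCR_TGE, HCR_EL2); /* drop TGE */
    in_app = false;
    __restore_context(caller);                  /* step 13: back after step 6 */
}

The in_app flag is what lets the second return from step 6 fall through
instead of re-entering the app.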

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

