
Re: [Xen-devel] EL0 app, stubdoms on ARM conf call



Hello all,

Thank you all for the call.

As was agreed, I'll provide some details on our use cases. I want
to tell you about four cases: one is OP-TEE related, while the other
three show various aspects of the virtualized coprocessor workflow.

1. OP-TEE use case: DRM playback (secure data path).

A user wants to play a DRM-protected media file. Rights holders don't
want to give the user any means to get a DRM-free copy of that media
file. If you have ever heard about Widevine on Android - that is it.
Long story short, it is possible to decrypt, decode and display a
video frame in such a way that the decrypted data is never accessible
to userspace, the kernel or even the hypervisor. This is possible only
when all data processing is done in secure mode, which leads us to
OP-TEE (or another TEE).
So, for each video frame the media player has to call OP-TEE with the
encrypted frame data.

Good case: 24 FPS movie, optimized data path: the media player
registers shared buffers in OP-TEE only once and then reuses them on
every invocation. That would be one OP-TEE call per frame, or 24 calls
per second.
Worst case: high frame rate movie (60 FPS), data path is not
optimized. The media player registers a shared buffer in OP-TEE, then
asks it to process the frame, then unregisters the buffer. That is
60 * 3 = 180 calls per second.

The call is done using the SMC instruction. Let's assume that the
OP-TEE mediator lives in a stubdom. Here is how the call sequence can
look:

1. DomU issues an SMC, which is trapped by the hypervisor
2. The hypervisor uses the standard approach with a ring buffer and
event mechanism to call the stubdom. It also blocks the DomU vCPU
which caused this trap.
3a. The stubdom mangles the request and asks the hypervisor to issue
the real SMC
(3b. The stubdom mangles the request and issues the SMC by itself -
potentially insecure)
4. After the real SMC, the hypervisor returns control back to the stubdom
5. The stubdom mangles the return value and returns the response to
the hypervisor in a ring buffer
6. The hypervisor unblocks the DomU vCPU and schedules it.

As you can see, there are 6 context switches
(DomU->HYP->Stubdom->HYP->Stubdom->HYP->DomU) and 2 vCPU switches
(DomU->Stubdom->DomU). Both vCPU switches are governed by the
scheduler.
When I say "governed by the scheduler" I mean that there is no
guarantee that the needed domain will be scheduled right away.
This is the sequence for one call. As you remember, there can be up
to 180 such calls per second in this use case. That gives us
180 * 6 ~= 1000 context switches per second.
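
To make the numbers easier to play with, here is a small
self-contained C sketch. The six switches per call come from steps
1-6 above and the 180 calls per second from the worst case described
earlier; this is just a model, not Xen code:

/* Rough per-second model of use case 1 (SMC forwarding via stubdom). */
#include <stdio.h>

int main(void)
{
    /* Worst case from above: 60 FPS, register + process + unregister. */
    const int fps = 60;
    const int optee_calls_per_frame = 3;

    /* Steps 1-6 above: DomU->HYP, HYP->Stubdom, Stubdom->HYP,
     * HYP->Stubdom, Stubdom->HYP, HYP->DomU. */
    const int context_switches_per_call = 6;

    int calls_per_sec = fps * optee_calls_per_frame;       /* 180  */
    int switches_per_sec =
        calls_per_sec * context_switches_per_call;         /* 1080 */

    printf("%d OP-TEE calls/s -> %d context switches/s\n",
           calls_per_sec, switches_per_sec);
    return 0;
}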


2. Coprocessor use case: coprocessor context switch.

Lets assume that coprocessor was used by Dom1 and now it is time to
switch context, so Dom2 can use it. Returning back to GPU case, if we
want to show 60 FPS, then we need at least 60*N context switches,
where N is number of domains that use GPU. This is lower margin,
obviously. Context switch is done in two parts: "context switch from"
and "context switch to". Context switch procedure is device-specific,
so there should be driver for every supported device. This driver does
actual work. We can't have this driver in hypervisor. Let's assume
that driver is running in a Stubdom.
Context switch is requested by the hypervisor. So, best-case scenario
is following:

1. The hypervisor asks the stubdom to do "context switch from"
2. The stubdom sends an event back to the hypervisor when the task is
done (the hypervisor then reconfigures the IOMMU)
3. The hypervisor asks the stubdom to do "context switch to"
4. The stubdom sends an event back to the hypervisor when the task is done

You can't merge the stubdom calls for "context switch from" and
"context switch to" into one, because between steps 2 and 3 the
hypervisor needs to reconfigure the IOMMU for the GPU.
So, there are 4 context switches, two of them governed by the
scheduler. That is 240 context switches per second per domain per
coprocessor. As was said, this is the lower bound.
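
To illustrate the ordering constraint, here is a minimal sketch of the
switch sequence. All type and function names below are hypothetical
placeholders, not an existing Xen or vcoproc framework API:

/*
 * Hypothetical sketch of the coprocessor context switch above. Each
 * ask_stubdom_*() call stands for one request/response round trip with
 * the driver stubdom (steps 1-2 and 3-4), i.e. two scheduler-governed
 * context switches each.
 */
struct coproc;
struct domain;

void ask_stubdom_context_switch_from(struct coproc *cp, struct domain *d);
void ask_stubdom_context_switch_to(struct coproc *cp, struct domain *d);
void iommu_reassign(struct coproc *cp, struct domain *d);

void vcoproc_switch(struct coproc *cp, struct domain *from, struct domain *to)
{
    /* Steps 1-2: the driver in the stubdom saves the "from" domain state. */
    ask_stubdom_context_switch_from(cp, from);

    /* The IOMMU must point at the "to" domain before the coprocessor may
     * touch its memory - this is why the two stubdom round trips cannot
     * be merged into one. */
    iommu_reassign(cp, to);

    /* Steps 3-4: the driver restores the "to" domain state. */
    ask_stubdom_context_switch_to(cp, to);
}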

3. Coprocessor use case: MMIO access from domain to a virtualized device.

Usually communication between the processor and a coprocessor is done
in the following way: the processor writes a command into shared
memory and then kicks the coprocessor, the coprocessor processes the
task, writes the response back to shared memory and issues an IRQ to
the processor. The coprocessor is kicked by writing to one of its
memory-mapped registers.
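
For illustration, the guest-side half of such an exchange could look
roughly like this; the register and command layout are made up for the
example:

/*
 * Illustrative doorbell-style exchange between CPU and coprocessor.
 * The register and command layout below are made up for this example.
 */
#include <stdint.h>

struct coproc_cmd {
    uint32_t opcode;
    uint32_t arg;
};

/* Shared memory the coprocessor also sees (e.g. through the IOMMU). */
static volatile struct coproc_cmd *cmd_ring;
/* Memory-mapped "kick" register of the coprocessor. */
static volatile uint32_t *doorbell_reg;

static void submit_command(uint32_t slot, uint32_t opcode, uint32_t arg)
{
    /* 1. Write the command into shared memory. */
    cmd_ring[slot].opcode = opcode;
    cmd_ring[slot].arg = arg;

    /* 2. Kick the coprocessor by writing to its doorbell register.
     * In the virtualized case this MMIO store is what gets trapped. */
    *doorbell_reg = slot;

    /* 3. The coprocessor processes the task, writes the response back
     * into shared memory and raises an IRQ towards the CPU. */
}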

If the vcoproc is active right now, we *might* be able to pass this
MMIO access straight through to it. But in our current case we
nevertheless need to trap the access and route it to the driver. If
the vcoproc is not active, we always need to route the MMIO access to
the driver, because only the driver knows what to do with the request
right now.

So, summarizing, a domain will write to the MMIO range every time it
wants something from the coprocessor. There can be hundreds of such
accesses for *one* frame (e.g. load texture, load shader, load
geometry, run shader, repeat). Here is how it looks:
1. DomU writes to or reads from an MMIO register.
2. XEN traps this access and notifies the stubdom (it also blocks the
DomU vCPU).
3. The stubdom analyzes the request and does the actual write (or
stores the value internally).
4. The stubdom sends an event back to XEN.
5. XEN unblocks the DomU vCPU.

That gives us four context switches (two of them governed by the
scheduler). As I said, there can be hundreds of such writes for every
frame, which gives us 100 * 60 * 4 = 24 000 switches per second per
domain. This is not the lower bound, but it is not the upper bound
either.
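
Here is a rough sketch of where those four switches come from on the
hypervisor side; all names below are hypothetical placeholders, not
real Xen functions:

/*
 * Hypothetical sketch of steps 1-5 above as seen from the hypervisor;
 * none of these names exist in Xen.
 */
struct vcpu;

struct mmio_access {
    unsigned long gpa;      /* guest physical address of the register */
    unsigned long data;
    int is_write;
};

/* Put the access on the driver stubdom's ring and kick its event channel. */
void forward_to_coproc_stubdom(struct vcpu *v, const struct mmio_access *a);
/* Block the trapping vCPU until the stubdom replies; this pair of
 * transitions is the scheduler-governed DomU <-> stubdom round trip. */
void block_until_stubdom_reply(struct vcpu *v);

void handle_vcoproc_mmio_trap(struct vcpu *v, struct mmio_access *a)
{
    /* Step 2: hand the access to the driver stubdom, block the vCPU. */
    forward_to_coproc_stubdom(v, a);
    block_until_stubdom_reply(v);

    /* Steps 3-5 happen in the stubdom; once it signals completion the
     * vCPU is unblocked and DomU continues. At ~100 accesses per frame
     * and 60 FPS this path runs ~6000 times per second per domain. */
}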

4. Coprocessor use case: interrupt from a virtualized device to a domain.

As I said, the coprocessor will send an interrupt back when it
finishes a task. Again, the driver needs to process this interrupt
before forwarding it to the DomU:

1. XEN receives the interrupt and routes it to the stubdom (probably
the vGIC can do this for us, so we will not trap into HYP).
2. The stubdom receives the interrupt, handles it and asks XEN to
inject it into DomU.

Two context switches, both governed by the scheduler. This is an
additional 100 * 60 * 2 = 12 000 switches per second.
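
Putting cases 3 and 4 together, here is a quick per-domain tally. Same
assumptions as above: ~100 coprocessor commands per frame at 60 FPS,
and roughly one completion IRQ per command (which is what the 12 000
figure implies):

/* Rough per-domain tally of context switches for use cases 3 and 4. */
#include <stdio.h>

int main(void)
{
    const int fps = 60;
    const int accesses_per_frame = 100;  /* MMIO writes, see use case 3 */
    const int switches_per_mmio = 4;     /* steps 1-5 in use case 3     */
    const int switches_per_irq = 2;      /* use case 4                  */

    int mmio = fps * accesses_per_frame * switches_per_mmio;  /* 24000 */
    int irq  = fps * accesses_per_frame * switches_per_irq;   /* 12000 */

    printf("MMIO: %d/s, IRQ: %d/s, total: %d switches/s per domain\n",
           mmio, irq, mmio + irq);
    return 0;
}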


As you can see, the worst scenarios are 3 and 4. We are working to
optimize them. The ideal solution would be to eliminate them entirely,
or at least not to trap IRQs and MMIO accesses for the active vcoproc.
But we need to trap MMIO accesses for an inactive vcoproc in any case.

I think you now have some understanding of our requirements.
Please feel free to ask any questions.
Also I want to say thank you to Oleksandr Andrushchenko and Andrii
Anisov for briefing me about VCF workflows.

On 16 June 2017 at 20:19, Stefano Stabellini <sstabellini@xxxxxxxxxx> wrote:
> On Fri, 16 Jun 2017, Dario Faggioli wrote:
>> On Thu, 2017-06-15 at 13:14 -0700, Stefano Stabellini wrote:
>> > On Thu, 15 Jun 2017, Volodymyr Babchuk wrote:
>> > > Hello Stefano,
>> > > On 15 June 2017 at 21:21, Stefano Stabellini
>> > > <sstabellini@xxxxxxxxxx> wrote:
>> > > > Would you be up for joining a conf call to discuss EL0 apps and
>> > > > stubdoms
>> > > > on ARM in preparation for Xen Developer Summit?
>> > > >
>> > > > If so, would Wednesday the 28th of June at 9AM PST work for you?
>> > >
>> > > I would prefer later time (like 5PM), but 9AM also works for me.
>> >
>> >
>> > Wait, did you get the timezone right?
>> >
>> > 1) 9AM PST = 5PM London = 7PM Kyiv
>> >
>> Count me in.
>>
>> It would be great if someone could send an meeting invite, so that my
>> mailer will do the timezone conversion and set reminders, and I don't
>> risk showing up on the wrong day at the wrong time. :-P
>
> I'll do.



-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@xxxxxxxxx

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

