
Re: [Xen-devel] EL0 app, stubdoms on ARM conf call



Hello all,

Thank you all for the call.

As was agreed, I'll provide some details on our use cases. I want
to tell you about four cases: one is OP-TEE related, while the other
three show various aspects of the virtualized coprocessor workflow.

1. OP-TEE use case: DRM playback (secure data path).

A user wants to play a DRM-protected media file. Rights holders don't
want to give the user any means to get a DRM-free copy of that media
file. If you have ever heard about Widevine on Android - that is it.
Long story short, it is possible to decrypt, decode and display a
video frame in such a way that the decrypted data is never accessible
to userspace, the kernel or even the hypervisor. This is possible only
when all data processing is done in secure mode, which leads us to
OP-TEE (or another TEE).
So, for each video frame the media player has to call OP-TEE with the
encrypted frame data.

Good case: 24 FPS movie, optimized data path: the media player
registers shared buffers in OP-TEE only once and then reuses them on
every invocation. That would be one OP-TEE call per frame, or 24 calls
per second.
Worst case: high frame rate movie (60 FPS), data path is not
optimized. The media player registers a shared buffer in OP-TEE, then
asks it to process the frame, then unregisters the buffer. That is
60 * 3 = 180 calls per second.

The call is done using the SMC instruction. Let's assume that the
OP-TEE mediator lives in a stubdom. Here is how the call sequence can
look:

1. DomU issues an SMC, which is trapped by the hypervisor
2. The hypervisor uses the standard approach with a ring buffer and
event mechanism to call the stubdom. It also blocks the DomU vCPU
which caused this trap.
3a. The stubdom mangles the request and asks the hypervisor to issue
the real SMC
(3b. The stubdom mangles the request and issues the SMC by itself -
potentially insecure)
4. After the real SMC, the hypervisor returns control back to the stubdom
5. The stubdom mangles the return value and returns the response to
the hypervisor in a ring buffer
6. The hypervisor unblocks the DomU vCPU and schedules it.

As you can see, there are 6 context switches
(DomU->HYP->Stubdom->HYP->Stubdom->HYP->DomU) and 2 vCPU switches
(DomU->Stubdom->DomU). Both vCPU switches are governed by the
scheduler.
When I say "governed by the scheduler" I mean that there is no
guarantee that the needed domain will be scheduled right away.
This is the sequence for one call. As you remember, there can be up
to 180 such calls per second in this use case. That gives us
180 * 6 ~= 1000 context switches per second.
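
To make the numbers easier to play with, here is a small
self-contained C sketch. The six switches per call come from steps
1-6 above and the 180 calls per second from the worst case described
earlier; this is just a model, not Xen code:

/* Rough per-second model of use case 1 (SMC forwarding via stubdom). */
#include <stdio.h>

int main(void)
{
    /* Worst case from above: 60 FPS, register + process + unregister. */
    const int fps = 60;
    const int optee_calls_per_frame = 3;

    /* Steps 1-6 above: DomU->HYP, HYP->Stubdom, Stubdom->HYP,
     * HYP->Stubdom, Stubdom->HYP, HYP->DomU. */
    const int context_switches_per_call = 6;

    int calls_per_sec = fps * optee_calls_per_frame;       /* 180  */
    int switches_per_sec =
        calls_per_sec * context_switches_per_call;         /* 1080 */

    printf("%d OP-TEE calls/s -> %d context switches/s\n",
           calls_per_sec, switches_per_sec);
    return 0;
}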


2. Coprocessor use case: coprocessor context switch.

Lets assume that coprocessor was used by Dom1 and now it is time to
switch context, so Dom2 can use it. Returning back to GPU case, if we
want to show 60 FPS, then we need at least 60*N context switches,
where N is number of domains that use GPU. This is lower margin,
obviously. Context switch is done in two parts: "context switch from"
and "context switch to". Context switch procedure is device-specific,
so there should be driver for every supported device. This driver does
actual work. We can't have this driver in hypervisor. Let's assume
that driver is running in a Stubdom.
Context switch is requested by the hypervisor. So, best-case scenario
is following:

1. The hypervisor asks the stubdom to do "context switch from"
2. The stubdom sends an event back to the hypervisor when the task is
done (the hypervisor then reconfigures the IOMMU)
3. The hypervisor asks the stubdom to do "context switch to"
4. The stubdom sends an event back to the hypervisor when the task is done

You can't merge the stubdom calls for "context switch from" and
"context switch to" into one, because between steps 2 and 3 the
hypervisor needs to reconfigure the IOMMU for the GPU.
So, there are 4 context switches, two of them governed by the
scheduler. That is 240 context switches per second per domain per
coprocessor. As was said, this is the lower bound.
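
To illustrate the ordering constraint, here is a minimal sketch of the
switch sequence. All type and function names below are hypothetical
placeholders, not an existing Xen or vcoproc framework API:

/*
 * Hypothetical sketch of the coprocessor context switch above. Each
 * ask_stubdom_*() call stands for one request/response round trip with
 * the driver stubdom (steps 1-2 and 3-4), i.e. two scheduler-governed
 * context switches each.
 */
struct coproc;
struct domain;

void ask_stubdom_context_switch_from(struct coproc *cp, struct domain *d);
void ask_stubdom_context_switch_to(struct coproc *cp, struct domain *d);
void iommu_reassign(struct coproc *cp, struct domain *d);

void vcoproc_switch(struct coproc *cp, struct domain *from, struct domain *to)
{
    /* Steps 1-2: the driver in the stubdom saves the "from" domain state. */
    ask_stubdom_context_switch_from(cp, from);

    /* The IOMMU must point at the "to" domain before the coprocessor may
     * touch its memory - this is why the two stubdom round trips cannot
     * be merged into one. */
    iommu_reassign(cp, to);

    /* Steps 3-4: the driver restores the "to" domain state. */
    ask_stubdom_context_switch_to(cp, to);
}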

3. Coprocessor use case: MMIO access from domain to a virtualized device.

Usually communication between the processor and a coprocessor is done
in the following way: the processor writes a command into shared
memory and then kicks the coprocessor, the coprocessor processes the
task, writes the response back to shared memory and issues an IRQ to
the processor. The coprocessor is kicked by writing to one of its
memory-mapped registers.
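
For illustration, the guest-side half of such an exchange could look
roughly like this; the register and command layout are made up for the
example:

/*
 * Illustrative doorbell-style exchange between CPU and coprocessor.
 * The register and command layout below are made up for this example.
 */
#include <stdint.h>

struct coproc_cmd {
    uint32_t opcode;
    uint32_t arg;
};

/* Shared memory the coprocessor also sees (e.g. through the IOMMU). */
static volatile struct coproc_cmd *cmd_ring;
/* Memory-mapped "kick" register of the coprocessor. */
static volatile uint32_t *doorbell_reg;

static void submit_command(uint32_t slot, uint32_t opcode, uint32_t arg)
{
    /* 1. Write the command into shared memory. */
    cmd_ring[slot].opcode = opcode;
    cmd_ring[slot].arg = arg;

    /* 2. Kick the coprocessor by writing to its doorbell register.
     * In the virtualized case this MMIO store is what gets trapped. */
    *doorbell_reg = slot;

    /* 3. The coprocessor processes the task, writes the response back
     * into shared memory and raises an IRQ towards the CPU. */
}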

If the vcoproc is active right now, we *might* be able to pass this
MMIO access straight through to it. But in our current case we
nevertheless need to trap the access and route it to the driver. If
the vcoproc is not active, we always need to route the MMIO access to
the driver, because only the driver knows what to do with the request
right now.

So, summarizing, a domain will write to the MMIO range every time it
wants something from the coprocessor. There can be hundreds of such
accesses for *one* frame (e.g. load texture, load shader, load
geometry, run shader, repeat). Here is how it looks:
1. DomU writes to or reads from an MMIO register.
2. XEN traps this access and notifies the stubdom (it also blocks the
DomU vCPU).
3. The stubdom analyzes the request and does the actual write (or
stores the value internally).
4. The stubdom sends an event back to XEN.
5. XEN unblocks the DomU vCPU.

That gives us four context switches (two of them governed by the
scheduler). As I said, there can be hundreds of such writes for every
frame, which gives us 100 * 60 * 4 = 24 000 switches per second per
domain. This is not the lower bound, but it is not the upper bound
either.
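
Here is a rough sketch of where those four switches come from on the
hypervisor side; all names below are hypothetical placeholders, not
real Xen functions:

/*
 * Hypothetical sketch of steps 1-5 above as seen from the hypervisor;
 * none of these names exist in Xen.
 */
struct vcpu;

struct mmio_access {
    unsigned long gpa;      /* guest physical address of the register */
    unsigned long data;
    int is_write;
};

/* Put the access on the driver stubdom's ring and kick its event channel. */
void forward_to_coproc_stubdom(struct vcpu *v, const struct mmio_access *a);
/* Block the trapping vCPU until the stubdom replies; this pair of
 * transitions is the scheduler-governed DomU <-> stubdom round trip. */
void block_until_stubdom_reply(struct vcpu *v);

void handle_vcoproc_mmio_trap(struct vcpu *v, struct mmio_access *a)
{
    /* Step 2: hand the access to the driver stubdom, block the vCPU. */
    forward_to_coproc_stubdom(v, a);
    block_until_stubdom_reply(v);

    /* Steps 3-5 happen in the stubdom; once it signals completion the
     * vCPU is unblocked and DomU continues. At ~100 accesses per frame
     * and 60 FPS this path runs ~6000 times per second per domain. */
}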

4. Coprocessor use case: interrupt from a virtualized device to a domain.

As I said, the coprocessor will send an interrupt back when it
finishes a task. Again, the driver needs to process this interrupt
before forwarding it to the DomU:

1. XEN receives the interrupt and routes it to the stubdom (probably
the vGIC can do this for us, so we will not trap into HYP).
2. The stubdom receives the interrupt, handles it and asks XEN to
inject it into DomU.

Two context switches, both governed by the scheduler. This is an
additional 100 * 60 * 2 = 12 000 switches per second.
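
Putting cases 3 and 4 together, here is a quick per-domain tally. Same
assumptions as above: ~100 coprocessor commands per frame at 60 FPS,
and roughly one completion IRQ per command (which is what the 12 000
figure implies):

/* Rough per-domain tally of context switches for use cases 3 and 4. */
#include <stdio.h>

int main(void)
{
    const int fps = 60;
    const int accesses_per_frame = 100;  /* MMIO writes, see use case 3 */
    const int switches_per_mmio = 4;     /* steps 1-5 in use case 3     */
    const int switches_per_irq = 2;      /* use case 4                  */

    int mmio = fps * accesses_per_frame * switches_per_mmio;  /* 24000 */
    int irq  = fps * accesses_per_frame * switches_per_irq;   /* 12000 */

    printf("MMIO: %d/s, IRQ: %d/s, total: %d switches/s per domain\n",
           mmio, irq, mmio + irq);
    return 0;
}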


As you can see, the worst scenarios are 3 and 4. We are working to
optimize them. The ideal solution would be to eliminate them entirely,
or at least not to trap IRQs and MMIO accesses for the active vcoproc.
But we need to trap MMIO accesses for an inactive vcoproc in any case.

I think you now have some understanding of our requirements.
Please feel free to ask any questions.
Also I want to say thank you to Oleksandr Andrushchenko and Andrii
Anisov for briefing me about VCF workflows.

On 16 June 2017 at 20:19, Stefano Stabellini <sstabellini@xxxxxxxxxx> wrote:
> On Fri, 16 Jun 2017, Dario Faggioli wrote:
>> On Thu, 2017-06-15 at 13:14 -0700, Stefano Stabellini wrote:
>> > On Thu, 15 Jun 2017, Volodymyr Babchuk wrote:
>> > > Hello Stefano,
>> > > On 15 June 2017 at 21:21, Stefano Stabellini
>> > > <sstabellini@xxxxxxxxxx> wrote:
>> > > > Would you be up for joining a conf call to discuss EL0 apps and
>> > > > stubdoms
>> > > > on ARM in preparation for Xen Developer Summit?
>> > > >
>> > > > If so, would Wednesday the 28th of June at 9AM PST work for you?
>> > >
>> > > I would prefer later time (like 5PM), but 9AM also works for me.
>> >
>> >
>> > Wait, did you get the timezone right?
>> >
>> > 1) 9AM PST = 5PM London = 7PM Kyiv
>> >
>> Count me in.
>>
>> It would be great if someone could send an meeting invite, so that my
>> mailer will do the timezone conversion and set reminders, and I don't
>> risk showing up on the wrong day at the wrong time. :-P
>
> I'll do.



-- 
WBR Volodymyr Babchuk aka lorc [+380976646013]
mailto: vlad.babchuk@xxxxxxxxx

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

