Xen project Mailing List

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

To: Julien Grall <julien.grall@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Andre Przywara <andre.przywara@xxxxxxx>, Tim Deegan <tim@xxxxxxx>

From: Marc Zyngier <marc.zyngier@xxxxxxx>

Date: Thu, 7 Dec 2017 14:53:05 +0000

Delivery-date: Thu, 07 Dec 2017 14:53:30 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 07/12/17 13:52, Julien Grall wrote:
> (+ Marc)
>
> Hi,
>
> @Marc: My Arm cache knowledge is somewhat limited. Feel free to correct
> me if I am wrong.
>
> Before answering to the rest of the e-mail, let me reinforce what I said
> in my first e-mail. Set/Way are very complex to emulate and an OS using
> them should never expect good performance in virtualization context. The
> difficulty is clearly spelled out in the Arm Arm.

It is actually even worse than that. Software using set/way operations
is simply not virtualizable, full stop. Yes, we paper over it in ugly
ways, but nobody should really use set/way. There is exactly one case
where set/way makes sense, and that's when you're the only CPU left in
the system, your MMU is off, and you're about to go down.

> So the main goal here is to work around such software.

Quite. Said SW is usually a 32bit Linux kernel.

>
> On 06/12/17 17:49, George Dunlap wrote:
>> On 12/06/2017 12:58 PM, Julien Grall wrote:
>>> Hi George,
>>>
>>> On 12/06/2017 12:28 PM, George Dunlap wrote:
>>>> On 12/05/2017 06:39 PM, Julien Grall wrote:
>>>>> Hi all,
>>>>>
>>>>> Even though it is an Arm failure, I have CCed x86 folks to get feedback
>>>>> on the approach. I have a WIP branch I could share if that interests
>>>>> people.
>>>>>
>>>>> A few months ago, we noticed a heisenbug on jobs run by osstest on the
>>>>> cubietrucks (see [1]). From the log, we figured out that the guest vCPU
>>>>> 0 is in data/prefetch abort state at early boot. I have been able to
>>>>> reproduce it reliably, although from the little information I have I
>>>>> think it is related to a cache issue because we don't trap cache
>>>>> maintenance instructions by set/way.
>>>>>
>>>>> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
>>>>> working on a given cache level by S/W. Because the OS is not allowed to
>>>>> infer the S/W to PA mapping, it can only use S/W to nuke the whole
>>>>> cache.
>>>>> "The expected usage of the cache maintenance that operate by
>>>>> set/way is associated with powerdown and powerup of caches, if this is
>>>>> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
>>>>>
>>>>> Those instructions target the local processor and are usually issued in
>>>>> batch to nuke the cache. This means if the vCPU is migrated to
>>>>> another pCPU in the middle of the process, the cache may not be cleaned.
>>>>> This would result in data corruption and a potential crash of the OS.
>>>>
>>>> I don't quite understand the failure mode here: Why does vCPU migration
>>>> cause cache inconsistency in the middle of one of these "cleans", but
>>>> not under normal operation?
>>>
>>> Because they target a specific S/W cache level whereas other cache
>>> operations are working with VA.
>>>
>>> To make it short, the other VA cache instructions will work to Point of
>>> Coherency/Point of Unification and guarantee that the caches will be
>>> consistent. For more details see B2.2.6 in ARM DDI 046C.c.
>>
>> I skimmed that section, and I'm not much the wiser.
>>
>> Just to be clear, this is my question.
>>
>> Suppose we have the following sequence of events (where vN[pM] means
>> vcpu N running on pcpu M):
>>
>> Start with A == 0
>>
>> 1. v0[p1] Read A
>>    p1 has 'A==0' in the cache
>> 2. scheduler migrates v0 to p0
>> 3. v0[p0] A=2
>>    p0 has 'A==2' in the cache
>> 4. scheduler migrates v0 to p1
>> 5. v0[p1] Read A
>>
>> Now, I presume that with the guest not doing anything, the Read of A at
>> #5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
>> or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
>> and p1's version of A gets "invalidated" (to use the terminology from
>> the section mentioned above).
>
> Caches on Arm are coherent and are controlled by the attributes in the
> page-tables.
> Imagine the region is normal cacheable and inner-shareable,
> a data synchronization barrier in #4 will ensure the visibility of A
> to p1. So A will be read as 2.
>
>>
>> So my question is, how does *adding* cache flushing of any sort end up
>> violating the integrity in a situation like the above?
>
> Because the integrity is based on the memory attributes in the
> page-tables. S/W instructions work directly on the cache and will break
> the coherency. Marc pointed me to his talk [1] that explains caches on Arm
> and also the set/way problem (see from slide 8).

On top of bypassing the coherency, S/W CMOs do not prevent lines from
migrating from one CPU to another. So you could happily be flushing by
S/W, and still end up with dirty lines in your cache. Success!

At that point, performance is the least of your worries.

>
>>
>>>>> For those worried about the performance impact, I have looked at the
>>>>> current use of S/W instructions:
>>>>>     - Linux Arm64: The last use in the kernel was at the beginning of 2015
>>>>>     - Linux Arm32: Still uses S/W for boot and secondary CPU bring-up.
>>>>>       No plan to change.
>>>>>     - UEFI: A couple of uses in UEFI, but I have heard they plan to
>>>>>       remove them (need confirmation).
>>>>>
>>>>> I haven't looked at all the OSes. However, given the Arm Arm clearly
>>>>> states S/W instructions are not easily virtualizable, I would expect
>>>>> guest OS developers to try their best to limit the use of the
>>>>> instructions.
>>>>>
>>>>> To limit the performance impact, we could introduce a guest option to
>>>>> tell whether the guest will use S/W. If it does plan to use S/W, PoD
>>>>> will be disabled.
>>>>>
>>>>> Now regarding the hardware domain. At the moment, it has its RAM direct
>>>>> mapped. Supporting direct mapping in PoD will be quite a pain for
>>>>> limited benefit (see why above). In that case I would suggest to impose
>>>>> vCPU pinning for the hardware domain if the S/W are expected to be used.
>>>>> Again, a command line option could be introduced here.
>>>>>
>>>>> Any feedback on the approach is welcome.
>>>>
>>>> I still don't entirely understand the underlying failure mode, but there
>>>> are a couple of things we could consider:
>>>>
>>>> 1. Automatically disabling 'vcpu migration' when caching is turned off.
>>>> This wouldn't prevent a vcpu from being preempted, just from being run
>>>> somewhere else.
>>>
>>> This suggests the guest will directly perform S/W, right? So you leave
>>> the possibility to the guest to flush all caches the vCPU can access.
>>> This is an easy way for the guest to affect the cache entries of other
>>> guests.
>>>
>>> I think this would help some potential data attack.
>>
>> Well, it's the equivalent of your "imposing vcpu pinning" solution
>> above, but only temporary. Was that suggestion meant to allow the
>> hardware domain to directly perform S/W?
>
> Yes, for the hardware domain only because it is more trusted IMHO. I
> thought you meant for every guest. The problem I can see here is you
> would need to trap cache-toggling. When trapping that, you have to trap
> all the virtual memory traps. This means:
>
> Non-secure EL1 using AArch64: SCTLR_EL1, TTBR0_EL1, TTBR1_EL1, TCR_EL1,
> ESR_EL1, FAR_EL1, AFSR0_EL1, AFSR1_EL1, MAIR_EL1, AMAIR_EL1,
> CONTEXTIDR_EL1.
> Non-secure EL1 using AArch32: SCTLR, TTBR0, TTBR1, TTBCR, TTBCR2, DACR,
> DFSR, IFSR, DFAR, IFAR, ADFSR, AIFSR, PRRR, NMRR, MAIR0, MAIR1, AMAIR0,
> AMAIR1, CONTEXTIDR.
>
> Those registers are accessed very often, so you will have a performance
> impact for the whole life of the guest.
>
> However, looking at Marc's slides, this would not work when booting a
> 32-bit hardware domain on ARMv8 because system caches might be present.

Yes, and this further outlines why using S/W is b0rken. You're not
guaranteed that all your cache hierarchy will implement S/W.

>
>>
>>>> 2. It sounds like rather than using PoD, you could use the
>>>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>>>> entry which cause a specific kind of HAP fault when accessed. The fault
>>>> handler then looks in the p2m entry, and if it finds an otherwise valid
>>>> entry, it just fixes the "misconfigured" bits and continues.
>>>
>>> I thought about this. But when do you set the entry to misconfigured?
>>>
>>> Take the example of 32-bit Linux. There are a couple of full
>>> cache cleans during uni-processor boot. So you would need to go
>>> through the p2m multiple times and reset the access bits.
>>
>> Do you want to reset the p2m multiple times? I thought the goal was
>> simply to keep the amount of p2m space you need to flush to a minimum;
>> if you expect the memory which has been faulted in by the *last* flush
>> to be relatively small, you could just always flush all memory that had
>> been touched to that point.
>>
>> If you *do* need to go through the p2m multiple times, then
>> misconfiguration is a much better option than PoD. In PoD, once a page
>> has data on it, it can't be removed from the p2m anymore. For the
>> misconfiguration technique, you can go through and misconfigure the
>> entries in the top-level p2m table as many times as you want. The whole
>> reason for doing it on x86 is that it's a relatively lightweight
>> operation: we use it to modify MMIO mappings, to enable or disable
>> logdirty for migrate, &c.
>
> Does this also work when you share the page-tables with the IOMMU? It
> just occurred to me that for both PoD and "misconfigured bits" we would
> get into trouble because page-tables are shared with the IOMMU.
>
> But I guess, it would be acceptable to say "you use S/W instructions in
> your OS, so you have to pay a worse performance price unless you fix
> your OS".

I think that's a very valid argument. It is definitely a case of "Don't
do that".
Yes, a 32bit Linux kernel will be slow to boot under Xen. If
people care about speed, they will fix it (or boot a non compressed
guest kernel). I think correctness matters a lot more than speed.

Thanks,

	M.
--
Jazz is not dead. It just smells funny...

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
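
The failure mode discussed in this thread (a vCPU migrated between pCPUs while it cleans the cache by set/way) can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions, not how Xen or the hardware is implemented: each pCPU has a private write-back cache modelled as a dictionary of dirty lines, a set/way clean only walks the cache of the pCPU executing it, and a clean by VA is modelled as reaching every cache. All names (`ToySystem`, `boot_sequence`, and so on) are hypothetical.

```python
class ToySystem:
    """Toy model: private write-back caches in front of one shared memory."""

    def __init__(self, n_pcpus):
        self.memory = {}                             # addr -> value in RAM
        self.caches = [{} for _ in range(n_pcpus)]   # per pCPU: addr -> dirty value

    def cached_write(self, pcpu, addr, value):
        # Write-back behaviour: the new value stays dirty in the local cache.
        self.caches[pcpu][addr] = value

    def clean_by_set_way(self, pcpu):
        # Set/way maintenance only walks the cache of the executing pCPU.
        for addr, value in self.caches[pcpu].items():
            self.memory[addr] = value

    def clean_by_va(self, addr):
        # Assumption: a clean by VA is modelled as reaching every cache.
        for cache in self.caches:
            if addr in cache:
                self.memory[addr] = cache[addr]

    def uncached_read(self, addr):
        # A guest running with its caches off reads RAM directly.
        return self.memory[addr]


def boot_sequence(use_set_way):
    """Write A on p0, migrate to p1, clean, then read A with caches off."""
    system = ToySystem(n_pcpus=2)
    addr_a = 0x1000
    system.memory[addr_a] = 0

    system.cached_write(pcpu=0, addr=addr_a, value=2)  # vCPU runs on p0
    current_pcpu = 1                                   # hypervisor migrates it

    if use_set_way:
        system.clean_by_set_way(current_pcpu)          # cleans p1 only
    else:
        system.clean_by_va(addr_a)                     # reaches p0 as well

    return system.uncached_read(addr_a)
```

In this model `boot_sequence(True)` reads the stale value from RAM because the dirty line is still sitting in p0's cache, whereas `boot_sequence(False)` reads the value the guest wrote. Real hardware is more subtle (lines can also migrate between caches on their own, as noted above), so the model understates the problem rather than overstating it.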

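The "misconfigured p2m table" idea quoted above (mark entries so the next access faults, then let the fault handler repair entries that are otherwise valid) can be sketched as follows. This is a flattened, per-page toy model with hypothetical names (`ToyP2M`, `misconfigure_all`, `touched`); the thread describes misconfiguring top-level entries of a real multi-level table, which this sketch does not attempt to reproduce.

```python
class P2MEntry:
    """One guest-frame to machine-frame mapping in the toy table."""

    def __init__(self, mfn):
        self.mfn = mfn
        self.valid = True
        self.misconfigured = False


class ToyP2M:
    def __init__(self, mappings):
        self.entries = {gfn: P2MEntry(mfn) for gfn, mfn in mappings.items()}
        self.touched = set()   # pages accessed since the last misconfigure

    def misconfigure_all(self):
        # Cheap bulk operation: flag every entry so the next access faults.
        for entry in self.entries.values():
            entry.misconfigured = True
        self.touched.clear()

    def handle_fault(self, gfn):
        # Fault handler: if the entry is otherwise valid, fix the bits and go on.
        entry = self.entries[gfn]
        if not entry.valid:
            raise LookupError(f"gfn {gfn:#x} is not mapped")
        entry.misconfigured = False
        self.touched.add(gfn)

    def access(self, gfn):
        # Models a guest access going through the second-stage translation.
        entry = self.entries[gfn]
        if entry.misconfigured:
            self.handle_fault(gfn)
        return entry.mfn


p2m = ToyP2M({0x0: 0x100, 0x1: 0x101, 0x2: 0x102})
p2m.misconfigure_all()
first_mfn = p2m.access(0x1)          # faults once, then gets fixed up
second_mfn = p2m.access(0x1)         # no fault the second time
touched_after_access = set(p2m.touched)
p2m.misconfigure_all()               # can be repeated as often as needed
touched_after_reset = set(p2m.touched)
```

Under these assumptions, `touched` is the set of pages a hypervisor would have to clean when the guest issues a set/way operation, and re-running `misconfigure_all()` restarts the tracking, which is the property that distinguishes this approach from PoD in the discussion above.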