Xen project Mailing List

Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

To: George Dunlap <george.dunlap@xxxxxxxxxx>, Marc Zyngier <marc.zyngier@xxxxxxx>, Julien Grall <julien.grall@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>

From: Andre Przywara <andre.przywara@xxxxxxxxxx>

Date: Mon, 11 Dec 2017 11:10:59 +0000

Cc: George Dunlap <george.dunlap@xxxxxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Mon, 11 Dec 2017 12:32:44 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi,

On 08/12/17 10:56, George Dunlap wrote:
> On 12/07/2017 07:21 PM, Marc Zyngier wrote:
>> On 07/12/17 18:06, George Dunlap wrote:
>>> On 12/07/2017 04:58 PM, Marc Zyngier wrote:
>>>> On 07/12/17 16:44, George Dunlap wrote:
>>>>> On 12/07/2017 04:04 PM, Julien Grall wrote:
>>>>>> Hi Jan,
>>>>>>
>>>>>> On 07/12/17 15:45, Jan Beulich wrote:
>>>>>>>>>> On 07.12.17 at 15:53, <marc.zyngier@xxxxxxx> wrote:
>>>>>>>> On 07/12/17 13:52, Julien Grall wrote:
>>>>>>>> There is exactly one case where set/way makes sense, and that's when
>>>>>>>> you're the only CPU left in the system, your MMU is off, and you're
>>>>>>>> about to go down.
>>>>>>>
>>>>>>> With this and ...
>>>>>>>
>>>>>>>> On top of bypassing the coherency, S/W CMOs do not prevent lines from
>>>>>>>> migrating from one CPU to another. So you could happily be flushing by
>>>>>>>> S/W, and still end up with dirty lines in your cache. Success!
>>>>>>>
>>>>>>> ... this I wonder what value emulating those insns then has in the first
>>>>>>> place. Can't you as well simply skip and ignore them, with the same
>>>>>>> (bad) result?
>>>>>>
>>>>>> The result will be much, much worse. Here is a concrete example with a
>>>>>> Linux Arm 32-bit:
>>>>>>
>>>>>> 1) Cache enabled
>>>>>> 2) Decompress
>>>>>> 3) Nuke cache (S/W)
>>>>>> 4) Cache off
>>>>>> 5) Access new kernel
>>>>>>
>>>>>> If you skip #3, the decompressed data may not have reached the memory, so
>>>>>> you would access stale data.
>>>>>>
>>>>>> This would effectively mean we don't support Linux Arm 32-bit.
>>>>>
>>>>> So Marc said that #3 "doesn't make sense", since although it might be
>>>>> the only cpu on in the system, you're not "about to go down"; but Linux
>>>>> 32-bit is doing that anyway.
>>>>
>>>> "Doesn't make sense" on an ARMv7+ with SMP. That code dates back to
>>>> ARMv4, and has been left untouched ever since. "If it ain't broke..."
>>>>
>>>>> It sounds like from the slides the purpose of #3 might be to get stuff
>>>>> out of the D-cache into the I-cache. But why is the cache turned off?
>>>>
>>>> Linux mandates that the kernel is entered with the MMU off. Which has
>>>> the effect of disabling the caches too (VIVT caches and all that jazz).
>>>>
>>>>> And why doesn't Linux use the VA-based flushes rather than the S/W
>>>>> flushes?
>>>>
>>>> Linux/arm64 does. Changing the 32bit port to use VA CMOs would probably
>>>> break stuff from the late 90s, so that's not going to happen. These
>>>> days, I tend to pick my battles... ;-)
>>>
>>> OK, so let me try to state this "forwards" for those of us not familiar
>>> with the situation:
>>>
>>> 1. Linux expects to start in 'linear' mode, with the MMU disabled.
>>>
>>> 2. On ARM, disabling the MMU disables caching (!). But disabling
>>> caching doesn't flush the cache; it just means the cache is bypassed (!).
>>>
>>> 3. Which means for Linux on ARM, after unzipping the kernel image, you
>>> need to flush the cache before disabling the MMU and starting Linux proper
>>>
>>> 4. For historical reasons, 32-bit ARM Linux uses the S/W instructions to
>>> flush the cache. This still works on 32-bit hardware, and so the Linux
>>> maintainers are loath to change it, even though more reliable VA-based
>>> instructions are available (?).
>>
>> It also works on 64bit HW. It is just not easily virtualizable, which is
>> why we've removed all S/W from the 64bit Linux port a while ago.
>
> From the diagram in your talk, it looked like the "flush the cache"
> operation *doesn't* work anywhere that has a "system cache", even on
> bare metal.

What Marc probably meant is that they still work *within the architectural
limits* that s/w operations provide:
- S/W CMOs are not broadcast, so in a live SMP system they are probably
  not doing what you expect them to do. This isn't an issue for a 32-bit
  Linux kernel decompressor, because this is UP still at this point.
- S/W CMOs are optional to implement for system caches. As Marc
  mentioned, there are not many 32-bit systems with a system cache out
  there. And on those systems you can still boot an uncompressed kernel
  or use a gzip-ed kernel and let the bootloader (grub, U-Boot)
  decompress it. On the other hand there seem to be a substantial number
  of (older) 32-bit systems where VA CMOs have issues.

The problem now is that for the "32-bit kernel on a 64-bit hypervisor"
case those two assumptions are not true: The system has multiple CPUs
running already, also 64-bit hardware is much more likely to have system
caches.
So this is mostly a virtualization problem and thus should be solved here.

To help assess the benefits of adding PoD to Xen:
I did some tracing on Friday with a 32-bit kernel on a (64-bit) Juno
with KVM. I see *four* full cache cleans very early on each boot (first
s/w op + caches turned on, twice), plus one cache clean when each (v)CPU
is brought online (due to the initial "turn MMU and cache on" operation).
During the runtime of the kernel there are no s/w ops, except for (v)CPU
off/on-lining (echo [01] > /sys/devices/system/cpu/cpu<n>/online).
I believe these are bogus, as I see the caches still being on, but
that's how it is. Also this is probably not performance critical due to
the nature of this operation.

Having PoD at this point would be quite helpful, as very early at boot
we don't expect much memory to be already used, so the "full VA space
cache clean" doesn't have much to do. As a result, a 32-bit kernel boot
in KVM is not noticeably slower than a 64-bit kernel boot.
But on the other hand we had PoD naturally already in KVM, so this came
at no cost.

So I believe it would be worth investigating what the actual impact is
on booting a 32-bit kernel, with emulating s/w ops like KVM does (see
below), but cleaning the *whole VA space*.
If this is somewhat acceptable (I assume we have no more than 2GB for a
typical ARM32 guest), it might be worth ignoring PoD, at least for now,
and solving this problem (and the IOMMU consequences) first.

This assumes that a single "full VA flush" cannot be abused as a DoS by
a malicious guest, which should be investigated independently (as this
applies to a PoD implementation as well).

Somewhat optional read for the background of how KVM optimized this ([1]):
KVM's solution to this problem works under the assumption that s/w
operations with the caches (and MMU) on are not really meaningful, so we
don't bother emulating them to the letter. Also we assume that the
purpose of s/w CMOs is to clean the whole cache.
So KVM does two things to avoid too much work:
- The first trapped s/w op flushes the whole guest VA space. It then
  turns "VM op" traps on, to detect when the caches get turned on. This
  basically does the work ("flush my whole cache") already on the first
  s/w op. Further trapped s/w ops are treated as NOPs then.
- When a trapped VM op signals that the caches are turned on again, we
  also clean the whole cache. We then turn VM op trapping *off* again.
  The next trapped s/w op would turn it back on.

Those two features are pretty straightforward to implement and avoid
actual s/w operations most of the time (all but the first s/w op are
emulated as NOPs), but still make it safe within the architectural
limits. Plus this code is not normally triggered during the actual
kernel runtime, but only on early boot (decompressor plus SMP bringup).

>>> 6. Rather than fix this in Linux, KVM has added a work-around in which
>>> the *hypervisor* flushes the caches at certain points (!!!). Julien is
>>> looking into doing the same with Xen.
>>
>> The "at certain points" doesn't quite describe it. We fully emulate S/W
>> instruction using the biggest hammer we can find.
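
As a rough illustration of the two KVM rules described above (the first
trapped s/w op cleans everything and arms VM op trapping; a trapped VM op
that turns the caches on cleans again and disarms it), here is a minimal,
self-contained C model. It is only a sketch and not KVM or Xen code:
`struct guest`, `clean_guest_memory()`, `on_set_way_trap()` and
`on_vm_op_trap()` are hypothetical names, and the "clean" merely counts
pages instead of issuing real cache maintenance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-guest bookkeeping; field names are illustrative only. */
struct guest {
    bool vm_op_trap;        /* true while "VM op" traps are armed */
    size_t populated_pages; /* pages currently backing the guest */
    size_t pages_cleaned;   /* running total of pages cleaned (model only) */
    unsigned full_cleans;   /* number of whole-guest cleans performed */
};

/* Stand-in for "clean the whole guest VA space": here it only counts. */
void clean_guest_memory(struct guest *g)
{
    g->pages_cleaned += g->populated_pages;
    g->full_cleans++;
}

/* Guest executed a trapped set/way cache maintenance instruction. */
void on_set_way_trap(struct guest *g)
{
    if (!g->vm_op_trap) {
        /* First s/w op: do the whole job once, then watch for cache-on. */
        clean_guest_memory(g);
        g->vm_op_trap = true;
    }
    /* Any further s/w op while the trap is armed is treated as a NOP. */
}

/* Guest executed a trapped VM op; caches_now_on is the resulting state. */
void on_vm_op_trap(struct guest *g, bool caches_now_on)
{
    if (caches_now_on) {
        /* Caches turned on again: clean once more and stop trapping. */
        clean_guest_memory(g);
        g->vm_op_trap = false;
    }
}
```

Feeding a decompressor-like sequence through this model (several s/w ops
followed by "caches on") yields two whole-guest cleans rather than one per
s/w instruction, and the cost of each clean grows with `populated_pages`,
which is where having few pages populated at early boot helps.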
>
> Oh, I thought Julien was saying something about flushing the guest's RAM
> every time caching was enabled or disabled.

Yes, that's what it does ([2]), but usually that's at early boot and we
don't have many pages actually populated at this point. Hence Julien's
PoD proposal to allow using the same optimization.

Cheers,
Andre.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/virt/kvm/arm/mmu.c#n1960
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/virt/kvm/arm/mmu.c#n382

>>> Given the variety of hardware that Linux has to run on, it's hard to
>>> understand why 1) 32-bit ARM Linux couldn't detect if it would be
>>> appropriate to use VA-based instructions rather than S/W instructions 2)
>>> There couldn't at least be a Kconfig option to use VA instructions
>>> instead of S/W instructions.
>>
>> [Linux hat on]
>>
>> 1) There is hardly anything to detect. Both sets of CMOs are available
>> on a moderately recent implementation. What you'd want to detect is that
>> the kernel is "virtualizable", which is not an easy task.
>
<snip>
>> An alternative option would be to switch to VA CMOs if compiled for
>> ARMv7 (and maybe v6), assuming that doesn't have any horrible side
>> effect with broken cache implementations (and there are a few out there).
>> You'll have to check that this doesn't regress on any existing HW.
>
> So the idea would be to use the VA-based operations if available, and
> then special-case specific chipsets known to have issues. Linux (and
> Xen and...) end up doing this for lots of different kinds of hardware;
> this would be no different.
>
>> 2) Kconfig options are the way to hell. It took us 5 years to get a
>> 32bit kernel that would boot on about anything, and we're not going to
>> go back.
>
> Well, at the moment you *don't* have a 32-bit kernel that will boot on
> anything.
> It won't boot (it sounds like) on any 32-bit system that has
> a system cache, including a 64-bit hypervisor providing a 32-bit guest.
>
> Alternately, would it make sense to have a PV "cache flush" operation
> for hypervisors? x86 has a way to expose hypervisor capabilities via
> specific CPUID leaves. Does anything like this exist for ARM? If so,
> the code could be, "If virtualized and hypervisor provides PV cache
> flush, use that. Otherwise, fall back to S/W operation."
>
>> Of course, none of that will solve the most important issue, which is to
>> boot an unmodified kernel from yesterday to install a distribution. If
>> you want to be able to do that, you'll have to use the aforementioned
>> hammer.
>
> Well it will take time to code up a solution and get *that* into users'
> hands as well. I would think the fastest way to get *most* distros
> working would be to open a ticket saying it's broken on virtual
> hardware, and asking them to apply a patch. Then the priority of
> getting more "enterprisey" distros working if and when.
>
> Just to be clear -- I'm just trying to help push to explore other
> options here. I'm not opposed to Julien or someone making a work-around
> in Xen. But it's quite a bit of effort to achieve a pretty crappy end,
> so I think it's worth exploring what kind of effort we could spend
> achieving a "proper" fix first.
>
> (Thanks also for taking the time to help explain this.)
>
> -George
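
George's "use the PV flush if offered, otherwise fall back to S/W" rule
boils down to a small selection function. A minimal sketch in C follows.
Everything in it is hypothetical, since neither a PV cache flush interface
nor an ARM mechanism for discovering hypervisor capabilities is established
in this thread: `struct platform`, its two flags and `pick_flush_method()`
are made-up names that only illustrate the decision itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical choices for how the decompressor would clean the cache. */
enum flush_method {
    FLUSH_PV_HYPERCALL, /* ask the hypervisor to do it (assumed interface) */
    FLUSH_SET_WAY       /* existing behaviour: S/W CMOs */
};

/* Hypothetical result of whatever discovery mechanism would be used. */
struct platform {
    bool virtualized;      /* running as a guest? */
    bool hyp_has_pv_flush; /* hypervisor advertises a PV cache flush? */
};

/* "If virtualized and hypervisor provides PV cache flush, use that.
 *  Otherwise, fall back to S/W operation." */
enum flush_method pick_flush_method(const struct platform *p)
{
    if (p->virtualized && p->hyp_has_pv_flush)
        return FLUSH_PV_HYPERCALL;
    return FLUSH_SET_WAY;
}
```

Note that this keeps bare metal and older hypervisors on the current S/W
path, so it would not by itself help an unmodified kernel, which is the
case Marc raises above.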
