
Re: [Xen-devel] [RFC] xen/arm: Handling cache maintenance instructions by set/way

On 07/12/17 13:52, Julien Grall wrote:
> (+ Marc)
> Hi,
> @Marc: My Arm cache knowledge is somewhat limited. Feel free to correct 
> me if I am wrong.
> Before answering the rest of the e-mail, let me reinforce what I said
> in my first e-mail. Set/Way operations are very complex to emulate, and
> an OS using them should never expect good performance in a
> virtualization context. The difficulty is clearly spelled out in the
> Arm Arm.

It is actually even worse than that. Software using set/way operations
is simply not virtualizable, full stop. Yes, we paper over it in ugly
ways, but nobody should really use set/way.

There is exactly one case where set/way makes sense, and that's when
you're the only CPU left in the system, your MMU is off, and you're
about to go down.

> So the main goal here is to work around such software.

Quite. Said SW is usually a 32bit Linux kernel.

> On 06/12/17 17:49, George Dunlap wrote:
>> On 12/06/2017 12:58 PM, Julien Grall wrote:
>>> Hi George,
>>> On 12/06/2017 12:28 PM, George Dunlap wrote:
>>>> On 12/05/2017 06:39 PM, Julien Grall wrote:
>>>>> Hi all,
>>>>> Even though it is an Arm failure, I have CCed x86 folks to get feedback
>>>>> on the approach. I have a WIP branch I could share if that interests
>>>>> people.
>>>>> A few months ago, we noticed a heisenbug on jobs run by osstest on the
>>>>> cubietrucks (see [1]). From the log, we figured out that the guest vCPU
>>>>> 0 is in data/prefetch abort state at early boot. I have been able to
>>>>> reproduce it reliably, although from the little information I have I
>>>>> think it is related to a cache issue because we don't trap cache
>>>>> maintenance instructions by set/way.
>>>>> This is a set of 3 instructions (clean, clean & invalidate, invalidate)
>>>>> working on a given cache level by S/W. Because the OS is not allowed to
>>>>> infer the S/W to PA mapping, it can only use S/W to nuke the whole
>>>>> cache. "The expected usage of the cache maintenance that operate by
>>>>> set/way is associated with powerdown and powerup of caches, if this is
>>>>> required by the implementation" (see D3-2020 ARM DDI 0487B.b).
>>>>> Those instructions target the local processor and usually work in
>>>>> batch to nuke the cache. This means that if the vCPU is migrated to
>>>>> another pCPU in the middle of the process, the cache may not be cleaned.
>>>>> This would result in data corruption and a potential crash of the OS.
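
To make "working in batch" concrete, here is a minimal sketch of the loop an OS runs to nuke one cache level by set/way. It only computes the register operands and does not execute any cache instruction. The bit layout (way in the top bits, set above the line-offset bits, level minus one in bits [3:1]) follows the operand encoding described in the Arm Arm for the set/way operations; the function names and the example geometry are made up for illustration.

```python
def ceil_log2(n: int) -> int:
    """Smallest k such that 2**k >= n, for n >= 1."""
    return (n - 1).bit_length()


def set_way_operand(level: int, way: int, set_index: int,
                    num_ways: int, line_bytes: int) -> int:
    """Build the 32-bit operand for one set/way operation.

    level is 1-based (1 = L1). num_ways and line_bytes describe the
    geometry of that cache level (illustrative inputs; real software
    reads them from the cache ID registers).
    """
    a = ceil_log2(num_ways)      # width of the way field
    l = ceil_log2(line_bytes)    # set field starts above the line offset
    # A 1-way cache has no way field at all, so avoid a shift by 32.
    way_field = (way << (32 - a)) if a else 0
    return way_field | (set_index << l) | ((level - 1) << 1)


def all_set_way_operands(level: int, num_sets: int,
                         num_ways: int, line_bytes: int) -> list:
    """Every operand needed to cover one cache level, way by way."""
    return [set_way_operand(level, w, s, num_ways, line_bytes)
            for w in range(num_ways)
            for s in range(num_sets)]


# Example geometry (assumed): 32KB L1, 4 ways, 64-byte lines, 128 sets.
ops = all_set_way_operands(level=1, num_sets=128, num_ways=4, line_bytes=64)
```

For the assumed geometry that is 512 separate operations, each one acting only on the cache of the CPU that executes it. A vCPU moved to another pCPU part-way through the list ends up with two partially cleaned caches instead of one fully cleaned cache.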
>>>> I don't quite understand the failure mode here: Why does vCPU migration
>>>> cause cache inconsistency in the middle of one of these "cleans", but
>>>> not under normal operation?
>>> Because they target a specific S/W cache level whereas other cache
>>> operations are working with VA.
>>> To make it short, the other VA cache instructions will work to Point of
>>> Coherency/Point of Unification and guarantee that the caches will be
>>> consistent. For more details see B2.2.6 in ARM DDI 0406C.c.
>> I skimmed that section, and I'm not much the wiser.
>> Just to be clear, this is my question.
>> Suppose we have the following sequence of events (where vN[pM] means
>> vcpu N running on pcpu M):
>> Start with A == 0
>> 1. v0[p1] Read A
>>    p1 has 'A==0' in the cache
>> 2. scheduler migrates v0 to p0
>> 3. v0[p0] A=2
>>    p0 has 'A==2' in the cache
>> 4. scheduler migrates v0 to p1
>> 5. v0[p1] Read A
>> Now, I presume that with the guest not doing anything, the Read of A at
>> #5 will end up as '2'; i.e., behind the scenes somewhere, either by Xen
>> or by the hardware, between #1 and #5, p0's version of A gets "cleaned"
>> and p1's version of A gets "invalidated" (to use the terminology from
>> the section mentioned above).
> Caches on Arm are coherent and are controlled by the attributes in the 
> page-tables. Imagine the region is normal cacheable and inner-shareable, 
> a data synchronization barrier in #4 will ensure the visibility of the A 
> to p1. So A will be read as 2.
>> So my question is, how does *adding* cache flushing of any sort end up
>> violating the integrity in a situation like the above?
> Because the integrity is based on the memory attributes in the 
> page-tables. S/W instructions work directly on the cache and will break 
> the coherency. Marc pointed me to his talk [1] that explains caches on Arm
> and also the set/way problem (see from slide 8).

On top of bypassing the coherency, S/W CMOs do not prevent lines from
migrating from one CPU to another. So you could happily be flushing by
S/W, and still end up with dirty lines in your cache. Success!

At that point, performance is the least of your worries.
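
The line-migration problem can be illustrated with a deliberately crude toy model. It is not a description of any real cache: each cache is direct-mapped, "addresses" are just line numbers, and the migration is triggered at a fixed point in the walk. All names are invented for the example.

```python
class ToyCache:
    """Direct-mapped toy cache: set index -> (address, dirty flag)."""

    def __init__(self, num_sets: int):
        self.num_sets = num_sets
        self.lines = {}

    def install(self, addr: int, dirty: bool) -> None:
        # Toy simplification: silently replaces whatever was in the set.
        self.lines[addr % self.num_sets] = (addr, dirty)

    def clean_invalidate_set(self, set_index: int, written_back: list) -> None:
        # Drop the line in this set, writing it back first if dirty.
        line = self.lines.pop(set_index, None)
        if line and line[1]:
            written_back.append(line[0])

    def dirty_addresses(self) -> list:
        return sorted(addr for addr, dirty in self.lines.values() if dirty)


def walk_by_set(local: ToyCache, remote: ToyCache, migrate_after_set):
    """Clean+invalidate every set of `local`, in order.

    After set `migrate_after_set` has been processed, every line held by
    `remote` moves into `local`, standing in for lines migrating between
    CPUs behind the walker's back. Pass None for no migration.
    """
    written_back = []
    for s in range(local.num_sets):
        local.clean_invalidate_set(s, written_back)
        if s == migrate_after_set:
            for addr, dirty in list(remote.lines.values()):
                local.install(addr, dirty)
            remote.lines.clear()
    return written_back


cpu0, cpu1 = ToyCache(4), ToyCache(4)
cpu0.install(1, dirty=True)   # lives in set 1 of cpu0
cpu1.install(4, dirty=True)   # would live in set 0 of either cache
flushed = walk_by_set(cpu0, cpu1, migrate_after_set=2)
leftover = cpu0.dirty_addresses()
```

In this run the walk writes back line 1 and completes over all four sets, yet line 4 is still sitting dirty in cpu0 because it arrived in set 0 after that set had already been processed.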

>>>>> For those worried about the performance impact, I have looked at the
>>>>> current use of S/W instructions:
>>>>>       - Linux Arm64: The last use in the kernel was at the beginning
>>>>>         of 2015
>>>>>       - Linux Arm32: Still uses S/W for boot and secondary CPU
>>>>>         bring-up. No plan to change.
>>>>>       - UEFI: A couple of uses in UEFI, but I have heard they plan to
>>>>>         remove them (need confirmation).
>>>>> I haven't looked at all the OSes. However, given the Arm Arm clearly
>>>>> states S/W instructions are not easily virtualizable, I would expect
>>>>> guest OS developers to try their best to limit the use of the
>>>>> instructions.
>>>>> To limit the performance impact, we could introduce a guest option to
>>>>> tell whether the guest will use S/W. If it does plan to use S/W, PoD
>>>>> will be disabled.
>>>>> Now regarding the hardware domain. At the moment, it has its RAM direct
>>>>> mapped. Supporting direct mapping in PoD will be quite a pain for a
>>>>> limited benefit (see why above). In that case I would suggest imposing
>>>>> vCPU pinning for the hardware domain if S/W instructions are expected
>>>>> to be used. Again, a command line option could be introduced here.
>>>>> Any feedback on the approach is welcome.
>>>> I still don't entirely understand the underlying failure mode, but there
>>>> are a couple of things we could consider:
>>>> 1. Automatically disabling 'vcpu migration' when caching is turned off.
>>>> This wouldn't prevent a vcpu from being preempted, just from being run
>>>> somewhere else.
>>> This suggests the guest will directly perform S/W, right? So you leave
>>> the possibility to the guest to flush all caches the vCPU can access.
>>> This is an easy way for the guest to affect the cache entries of other
>>> guests. I think this could help a potential data attack.
>> Well, it's the equivalent of your "imposing vcpu pinning" solution
>> above, but only temporary.  Was that suggestion meant to allow the
>> hardware domain to directly perform S/W?
> Yes for the hardware domain only because it is more trusted IMHO. I
> thought you meant for every guest. The problem I can see here is you
> would need to trap cache-toggling. When trapping that, you have to trap 
> all the virtual memory traps. This means:
> Non-secure EL1 using AArch64: SCTLR_EL1, TTBR0_EL1, TTBR1_EL1, TCR_EL1, 
> ESR_EL1,
> Non-secure EL1 using AArch32: SCTLR, TTBR0, TTBR1, TTBCR, TTBCR2, DACR, 
> Those registers are accessed very often, so you will have a performance 
> impact for the whole life of the guest.
> However, looking at Marc's slides, this would not work when booting a
> 32-bit hardware domain on ARMv8 because system caches might be present.

Yes, and this further outlines why using S/W is b0rken. You're not
guaranteed that all your cache hierarchy will implement S/W.

>>>> 2. It sounds like rather than using PoD, you could use the
>>>> "misconfigured p2m table" technique that x86 uses: set bits in the p2m
>>>> entry which cause a specific kind of HAP fault when accessed.  The fault
>>>> handler then looks in the p2m entry, and if it finds an otherwise valid
>>>> entry, it just fixes the "misconfigured" bits and continues.
>>> I thought about this. But when do you set the entry to misconfigured?
>>> If you take the example of 32-bit Linux, there are a couple of full
>>> cache cleans during uni-processor boot. So you would need to go
>>> through the p2m multiple times and reset the access bits.
>> Do you want to reset the p2m multiple times?  I thought the goal was
>> simply to keep the amount of p2m space you need to flush to a minimum;
>> if you expect the memory which has been faulted in by the *last* flush
>> to be relatively small, you could just always flush all memory that had
>> been touched to that point.
>> If you *do* need to go through the p2m multiple times, then
>> misconfiguration is a much better option than PoD.  In PoD, once a page
>> has data on it, it can't be removed from the p2m anymore.  For the
>> misconfiguration technique, you can go through and misconfigure the
>> entries in the top-level p2m table as many times as you want.  The whole
>> reason for doing it on x86 is that it's a relatively lightweight
>> operation: we use it to modify MMIO mappings, to enable or disable
>> logdirty for migrate, &c.
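
As a rough illustration of the "misconfigured entry" idea described above, here is a toy sketch. It is not Xen's p2m code: the table is a flat dictionary instead of a multi-level page table, and the class, field and method names are all hypothetical. Clearing the touched set on every misconfiguration pass is just one of the two policies discussed (the other being to keep accumulating).

```python
from dataclasses import dataclass


@dataclass
class P2MEntry:
    mfn: int
    valid: bool = True
    misconfigured: bool = False   # stands in for the special fault bits


class ToyP2M:
    def __init__(self, mapping: dict):
        self.entries = {gfn: P2MEntry(mfn) for gfn, mfn in mapping.items()}
        self.touched = set()      # gfns accessed since the last pass

    def misconfigure_all(self) -> None:
        """Mark every entry so that the next access faults once."""
        for entry in self.entries.values():
            entry.misconfigured = True
        self.touched.clear()

    def translate(self, gfn: int) -> int:
        """Model a guest access: fault, fix up, then continue."""
        entry = self.entries.get(gfn)
        if entry is None or not entry.valid:
            raise KeyError(f"gfn {gfn:#x} is not mapped")
        if entry.misconfigured:
            self._handle_fault(gfn, entry)
        return entry.mfn

    def _handle_fault(self, gfn: int, entry: P2MEntry) -> None:
        # Otherwise valid entry: fix the bits and remember the page.
        entry.misconfigured = False
        self.touched.add(gfn)


p2m = ToyP2M({0: 100, 1: 101, 2: 102})
p2m.misconfigure_all()
first = p2m.translate(1)      # faults once, then resolves to the mfn
again = p2m.translate(1)      # no fault the second time
pages_to_clean = sorted(p2m.touched)
```

When a set/way instruction is trapped, only the pages in `touched` would need to be cleaned by VA, and the pass can be repeated as often as needed because misconfiguring an entry does not remove the mapping.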
> Does this also work when you share the page-tables with the IOMMU? It 
> just occurred to me that for both PoD and "misconfigured bits" we would 
> get into trouble because page-tables are shared with the IOMMU.
> But I guess, it would be acceptable to say "you use S/W instructions in 
> your OS, so you have to pay a worse performance price unless you fix
> your OS".

I think that's a very valid argument. It is definitely a case of "Don't
do that". Yes, a 32bit Linux kernel will be slow to boot under Xen. If
people care about speed, they will fix it (or boot a non compressed
guest kernel). I think correctness matters a lot more than speed.


Jazz is not dead. It just smells funny...
