[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [XEN v2] xen/arm64: io: Decode 32-bit ldr/str post-indexing instructions
Hi Andre/All, Many thanks for your feedback. On 11/01/2022 12:52, Andre Przywara wrote: On Mon, 10 Jan 2022 14:33:11 +0000 Ayan Kumar Halder <ayan.kumar.halder@xxxxxxxxxx> wrote: Hi Ayan,Many thanks for your inputs. It is making better sense now. Much appreciated. A few questions/clarifications :- On 06/01/2022 15:33, Andre Przywara wrote:On Wed, 5 Jan 2022 16:55:11 +0000 Ayan Kumar Halder<ayan.kumar.halder@xxxxxxxxxx> wrote: Hi,Thank you so much for your feedback. I need a couple of clarifications before I can start with the v3 patch. On 08/12/2021 12:00, Andre Przywara wrote:On Mon, 6 Dec 2021 19:31:06 +0000 Julien Grall<julien@xxxxxxx> wrote: Hi,On 01/12/2021 08:41, Bertrand Marquis wrote:Hi Ayan,On 30 Nov 2021, at 19:13, Ayan Kumar Halder<ayan.kumar.halder@xxxxxxxxxx> wrote: Hi Andre, Thanks for your comments. They are useful. On 30/11/2021 09:49, Andre Przywara wrote:On Mon, 29 Nov 2021 19:16:38 +0000 Ayan Kumar Halder<ayan.kumar.halder@xxxxxxxxxx> wrote: Hi,At the moment, Xen is only handling data abort with valid syndrome (i.e. ISV=0). Unfortunately, this doesn't cover all the instructions a domain could use to access MMIO regions. For instance, Xilinx baremetal OS will use: volatile u32 *LocalAddr = (volatile u32 *)Addr; *LocalAddr = Value; This leave the compiler to decide which store instructions to use.As mentioned in the other email, this is wrong, if this points to MMIO: don't let the compiler do MMIO accesses. If a stage 2 fault isn't in an MMIO area, you should not see traps that you cannot handle already. So I don't think it's a good idea to use that as an example. And since this patch only seems to address this use case, I would doubt its usefulness in general.Yes, I should have fixed the comment. Currently, I am testing with baremetal app which uses inline assembly code with post indexing instructions, to access the MMIO. ATM, I am testing with 32 bit MMIO only. On the usefulness, I am kind of torn as it is legitimate for post indexing instructions to be used in an inline-assembly code for accessing MMIO. However, that may not be something commonly seen. @Stefano/Bertrand/Julien/Volodymyr :- As you are the Arm mantainers, can you comment if we should have decoding logic or not ?Andre gave you the official statement from Arm and there is nothing more I can say.I think this would be handy for other hypervisor and OS developper to know what they can expect when running in a virtualized environment. So would it be possible to update the Arm Arm reflecting this statement?I don't think it's within the scope of the ARM ARM to say that. It just says that "there is no syndrome information", and your mileage may vary in working around that. Personally I would say that if you expect your software to work nicely under a hypervisor, then just avoid those instructions. The Linux kernel certainly did so. You can try to do instruction emulation, but doing this right can get tricky quickly: think about I$/D$ coherency, MMU on or off, etc.I am trying to get all the restrictions that need to be checked. I have referred https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/Register-restrictions-for-A64-instructions?lang=en and "Arm A64 Instruction Set Architecture - DDI 0596" - LDR (immediate). So far I only see the following restrictions:- Rn -ne Rt Rt -ne SP You had mentioned the following cases :- 1. XZR vs SP - I see that both these refer to register no 31. Xen gets the register number (for Rn/Rt) only, so I am not sure what is to be done here.But the emulation code in Xen needs to know whether number 31 refers to the stack pointer or to the zero register. This depends on the context of the instruction. It's nothing magical about it, you just need to figure out which it is in your case and make sure that the code uses the right value. In the LDR case n==31 means SP, but t==31 means XZR. The emulation would need to consider this.I see what you mean. I had a look at xen/tools/qemu-xen-dir-remote/target/arm/cpu.h, CPUARMState {}. I don't see anywhere we need to refer register by its name. IIUC, the instruction decoding logic needs to know the register number only.But there is of course still a difference between them: they refer to two separate registers, they just share the encoding of x31 in the instruction. Each instruction then *knows* what register it needs to go to when it sees number 31. By its very nature, you don't need to save XZR, so GPR[31] is typically the SP, if you have an array holding register values, and you probably handle the XZR case in code.In fact, Xen would only read CPUARMState->xregs[31] for SP/XZR in Aarch64 context. Let me know what I might be missing here. Beata/Richard - I see you have committed most recently on the file. Please feel free to add your inputs.2. MMU on or off - As I see in try_handle_mmio(), one gets the physical address in gpa. So I am not sure what is to be done here.You *might* be fine, if Xen takes care of that, I don't know. But it needs to explicitly consider those two cases - and also make sure that caching effects are cared for.3. I/D coherency - I don't understand how this affects instruction decoding.In the ARM architecture the I cache and D cache are not necessarily coherent. So whatever the CPU reads via the data bus does not need to be the same data that the instruction fetcher read a while ago. And while there is a well-known sequence to make the current data side visible to the instruction side, there is nothing architectural for the other way around. For example the CPU fetches a bunch of instructions into the I cache, then consumes it from there. Now some code changes those instruction words in "memory". Without explicit D-cache-clean-to-the-PoU and I-cache-invalidate+ISB the CPU might still execute the old instructions.This makes sense. Referring https://developer.arm.com/documentation/den0024/a/Caches/Cache-maintenance, I assume that the following instructions need to be executed before the instruction is decoded. Xn <--- contains the address of the instruction to be decoded. STR Wt, [Xn] DC CVAU, Xn // Clean datacache by VA to point of unification (PoU) DSB ISH // Ensure visibility of the datacleaned from cache IC IVAU, Xn // Invalidate instruction cache by VA to PoU DSB ISH // Ensure completion of the invalidations ISB // Synchronize the fetched instruction streamThat is what I sketched above: the typical sequence to make sure the instruction fetcher reads the right instructions, after you updated them. BUT this is not what we need here, we need to other way around: read the instruction that the CPU executed, but via the data bus, from memory. And I wouldn't know of a neat architectural way to ensure this.But if your emulation code tries to read the instruction from memory, it fetches the data-visible side of it - which is not what the CPU executed, so end up emulating the wrong thing.I am guessing that this should not be a very unusual case. Isn't this similar to a host processor loading and modifying a firmware in the memory. The firmware can then be executed by a different micro-controller or some programmable engine. I mean the data pipeline or instruction pipeline should read the same contents from memory.... if they are coherent. Otherwise you need explicit maintenance, see above. I mean it's the same situation with an external processor: this core reads instructions from memory and is probably buffering/caching them (crucial for performance). Now you need to tell if that buffer might contain stale data, and if the data in memory has changed meanwhile.Please also keep in mind that the architecture does not *guarantee* coherency, but in reality it might actually be coherent - either because of the particular silicon implementation, or because of side effects in the system (data cache line being evicted, ISB executed). So seeing the effects of coherency in your testing does not mean you are safe, it might just happen by chance.Is there some good way to test this to ensure the best possible reliability ?That's the problem, I think: You can reason on the basis of the architecture as described in the ARM ARM, but this brings you to the problems above. You can now try to hand wave away the problems, by stating that this will never happen in practice, but you then leave the steady ground the architecture gives you. That's why I was asking before if this is really necessary code to have, or if you would just be better off fixing the guests. Cheers, AndrePlease help me to understand further. - AyanI will leave this decision to Stefano and Julien.I have had a chat on IRC with Stefano about this. I think the main sticking point is the Arm Arm doesn't clearly state those instructions should not be used by a virtualized OS on MMIO regions.I don't understand why the ARM ARM would need to say that. Certainly you realise that immediately when trying to use them, and apparently it was not a problem in the last 8ish years of Xen/ARM's existence. So it's your decision on having the emulation, I personally would only do it when there is a *good* use case. And please apply the demanded scrutiny on the review - including all the corner cases like Rn=Rt, XZR vs. SP (as Jan said) and possibly MMU status. I have submitted a v3 patch for this addressing most of the comments."[XEN v3] xen/arm64: io: Decode ldr/str post-indexing instructions". Please have a look. - Ayan Cheers, AndreTo me, this topic looks similar to the set/way instruction dilemma. They are a pain to virtualize (and the Arm Arm clearly hint it) but we had to do it because some OSes relied on them. I think the main difference is the Arm Arm doesn't hint they should not be used (it only says a valid syndrome is not provided) and the implementation should hopefully be smaller and self-contained. So I would be inclined to allow Xen to decode post-indexing instructions (pending the review). Cheers,
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |