[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Minios-devel] [UNIKRAFT PATCH] plat/kvm: Fix current thread retrieval in interrupt context on x86_64





On 20.06.19 12:30, Costin Lupu wrote:
Hi Simon,

Please see inline.

On 6/19/19 12:35 AM, Simon Kuenzer wrote:
Hey Costin,

On 17.06.19 21:15, Costin Lupu wrote:
On 6/17/19 5:15 PM, Simon Kuenzer wrote:
Hey Costin,

thanks a lot for the patch. I have a couple of questions which I need
for my understanding. I put those inline...

Thanks,

Simon

On 11.06.19 21:33, Costin Lupu wrote:
Commit 017fffd5 introduced support for setting the current thread
pointer on top
of interrupt stacks in order to retrieve the current thread in
interrupt context
as well. Unfortunately, the wrong stack was picked for KVM platform.
This patch
fixes that and sets the thread on cpu_intr_stack instead.

cpu_intr_stack was resized to STACK_SIZE because this is a mandatory
condition
when saving threads on top. However, given that it also needs a
STACK_SIZE
alignment, a new section was created for it, .intrstack, in order to
avoid
breaking the entire binary image layout. Without this new section, the
entire
.text section would have a STACK_SIZE alignment (i.e. 64KB) and this
would imply
that the multiboot header, which is included in .text section, would
also be
moved at an address higher than STACK_SIZE, even though it must stay
in the
first 8KB of the binary.

Maybe, this is a stupid question: What if we take the pre-allocated
bootstack also for the interrupts? As soon as we enable scheduling the
bootstack is not used anymore by any thread. Except the case where we do
not have scheduling, the bootstack will be still used during life time,
would this be an issue?


In theory this should work. But what about the case when we don't have
scheduling? Wouldn't we still need a different stack?


Hum... this should be fine, I expect the CPU can handle this.

What do you mean? The CPU will handle it properly only if you provide it
a different stack on which it will save the registers of the interrupted
context.

But we have actually also stacks for the NMI and the trap handler. So
maybe this suggestion is not generic enough.

Of course it's not generic enough. One way would be to align NMI and
trap stacks as well, but that would just bloat the binary. Fortunately,
(1) the NMI stack isn't used and (2) current thread is not needed when
traps are handled, so we can skip updating these 2 stacks for now.


Signed-off-by: Costin Lupu <costin.lupu@xxxxxxxxx>
---
    plat/kvm/Makefile.uk      |  1 +
    plat/kvm/memory.c         |  6 ------
    plat/kvm/x86/link64.lds.S | 10 ++++++++++
    plat/kvm/x86/memory.c     | 44
++++++++++++++++++++++++++++++++++++++++++++
    plat/kvm/x86/traps.c      |  5 ++++-
    5 files changed, 59 insertions(+), 7 deletions(-)
    create mode 100644 plat/kvm/x86/memory.c

diff --git a/plat/kvm/Makefile.uk b/plat/kvm/Makefile.uk
index 71c4c419..8eb162d4 100644
--- a/plat/kvm/Makefile.uk
+++ b/plat/kvm/Makefile.uk
@@ -46,6 +46,7 @@ LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
$(LIBKVMPLAT_BASE)/x86/lcpu.c
    LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
$(LIBKVMPLAT_BASE)/x86/intctrl.c
    LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
$(LIBKVMPLAT_BASE)/x86/tscclock.c
    LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
$(LIBKVMPLAT_BASE)/x86/time.c
+LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
$(LIBKVMPLAT_BASE)/x86/memory.c|x86
    ifeq ($(findstring y,$(CONFIG_KVM_KERNEL_VGA_CONSOLE)
$(CONFIG_KVM_DEBUG_VGA_CONSOLE)),y)
    LIBKVMPLAT_SRCS-$(CONFIG_ARCH_X86_64) +=
$(LIBKVMPLAT_BASE)/x86/vga_console.c
    endif
diff --git a/plat/kvm/memory.c b/plat/kvm/memory.c
index e96bad2c..7f2fb46a 100644
--- a/plat/kvm/memory.c
+++ b/plat/kvm/memory.c
@@ -178,9 +178,3 @@ int _ukplat_mem_mappings_init(void)
    {
        return 0;
    }
-
-void ukplat_stack_set_current_thread(void *thread_addr)
-{
-    *((unsigned long *) _libkvmplat_cfg.bstack.end) =
-        (unsigned long) thread_addr;
-}
diff --git a/plat/kvm/x86/link64.lds.S b/plat/kvm/x86/link64.lds.S
index 362ba3e6..6103fc2d 100644
--- a/plat/kvm/x86/link64.lds.S
+++ b/plat/kvm/x86/link64.lds.S
@@ -99,6 +99,16 @@ SECTIONS
            . = ALIGN(__PAGE_SIZE);
        }
    +    /* We keep the interrupt stack on a different section
+     * given that it may have a big alignment and it would
+     * change the entire binary layout
+     */
+    .intrstack :
+    {
+        *(.intrstack)
+        . = ALIGN(__PAGE_SIZE);
+    }
+

Would every platform need to do this?


On Xen, the interrupt is already aligned to STACK_SIZE. On linuxu, the
signals use the current process stack, we should change it to using an
alternative stack if we want to use the same approach.

        _end = .;
          .comment       0 : { *(.comment) }
diff --git a/plat/kvm/x86/memory.c b/plat/kvm/x86/memory.c
new file mode 100644
index 00000000..b8c7c7e7
--- /dev/null
+++ b/plat/kvm/x86/memory.c
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: BSD-3-Clause */
+/*
+ * Authors: Costin Lupu <costin.lupu@xxxxxxxxx>
+ *
+ * Copyright (c) 2019, University Politehnica of Bucharest. All
rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above
copyright
+ *    notice, this list of conditions and the following disclaimer in
the
+ *    documentation and/or other materials provided with the
distribution.
+ * 3. Neither the name of the copyright holder nor the names of its
+ *    contributors may be used to endorse or promote products derived
from
+ *    this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THIS HEADER MAY NOT BE EXTRACTED OR MODIFIED IN ANY WAY.
+ */
+
+#include <uk/plat/memory.h>
+
+
+extern char cpu_intr_stack[];
+
+void ukplat_stack_set_current_thread(void *thread_addr)
+{
+    *((unsigned long *) cpu_intr_stack) =
+        (unsigned long) thread_addr;

In general, instead of using the stack top, we could save the current
thread always on the stack bottom. Those bytes would be reserved and
would only be popped on a misbehaving program. Would that avoid the
issue that you are facing? Maybe we could get away from the problem that
the stacks have to have a fixed size for just getting the current
value...


How do you determine the stack bottom when you are using it, e.g. when
retrieving the current thread? I think you would use the same way as for
getting the current thread from the stack top. So you still need this
alignment constraint.


You are right.

+}
diff --git a/plat/kvm/x86/traps.c b/plat/kvm/x86/traps.c
index 27ef6d93..fe1dd5a4 100644
--- a/plat/kvm/x86/traps.c
+++ b/plat/kvm/x86/traps.c
@@ -25,7 +25,9 @@
     */
      #include <string.h>
+#include <uk/essentials.h>
    #include <uk/arch/lcpu.h>
+#include <uk/plat/config.h>
    #include <x86/desc.h>
    #include <kvm-x86/traps.h>
    @@ -59,7 +61,8 @@ static void gdt_init(void)
      static struct tss64 cpu_tss;
    -static char cpu_intr_stack[4096]; /* IST1 */
+ __section(".intrstack")  __align(STACK_SIZE)
+char cpu_intr_stack[STACK_SIZE];  /* IST1 */

You removed the stack actually from the .bss section (instead of text)
and moved it to an own section. What if you keep it on the .bss but with
the bigger size? I expect this should not be a problem for the text
section.


In the commit message it should have been '.text segment', or the first
segment of the binary, which contains both .text and .bss sections. So,
yeah, it's the segment that gets realigned. If we don't move it to
another section, it would remain in .bss, which is what this fix tries
to avoid. If we move it to a new section, it will also be moved to a
different segment and that's how the segment containing .text will keep
its original alignment.


I am still not getting exactly why the .bss section has influence on the
positioning on the .text section. The multiboot header is just one
exceptional data section that we put into the beginning of the .text
section. The rest should still go to its respective section. What is the
bad thing about using the .bss section?


Aligning cpu_intr_stack to 0x10000 also aligns .bss section to 0x10000.
For reasons I do not know, the linker also decides to move the .text
section to 0x10000 offset inside the binary file (please see
attachment). Now this is bad, because the .text section also contains
the multiboot information which must stay in the first 0x2000 bytes of
the binary. Moreover, the segment which will contain both sections in
the end will be aligned to 0x10000.

One solution would be to move multiboot information in a section of its
own, out of the .text section. But unfortunately, we still cannot
control the offset where the linker will decide to put it in the binary.
 From what I could find, the only way to fix that would be to put it with
objcopy at the beginning of the binary.

Therefore, the least intrusive solution would be to just move the
aligned interrupt stack in a section of its own and the rest of the
binary would just keep its original layout.

Anyways, since we need a quick fix, I would suggest something else. The
problem I have with the irq_stack section is that it complicates the
linked layout. Additionally for being complete, the nmi and trap stack
would also need to go in there. The other problem we have is that the
stack sizes and/or their alignment need to be fulfilled.
  - If we put the thread context pointer on top of the stack (low
    address), all stacks need to be exactly sized and aligned to
    STACK_SIZE.
  - If we put this pointer to the stack bottom (high address), we need
    to make sure that the high address is aligned to STACK_SIZE.
    STACK_SIZE becomes automatically the maximum stack size but we could
    have stacks that are smaller sized than this. However, I am not
    sure how we could teach the alignment to the linker.

Although I do not like the alternative but it may solve our bug right
now: What if we store the current thread pointer to a platform-internal
variable instead of the stack. On every context switch, we update this
variable and get_current() is returning its value. In principle, with
SMP you would need to introduce this variable CPU-wise but we do not
have this now. This way we could keep the stack sizes that we currently
have.
I think we have to revisit this low-level platform API at some point
considering all the lessons we have learned and that longer-term we
should also protect the each stacks from over- and underflows (for
example with an unmapped page before and after each stack). But this
involves a bigger restructuring anyways and this we should not do with
this patch.

What do you think?


This is something we need to do especially if we want to support
different stack sizes in the system. I also had this discussion with
Florian a few weeks ago about how this is also necessary in order to
avoid some races in the scheduler for which we had to apply a workaround.

However, this needs a more complex analysis and for sure it would have a
big impact, with potential side effects, for the whole system. I would
wait to do that until we'll have a proper regression testing system.

This current fix is the simplest one for now and it doesn't bring any
side effects. I strongly believe this is the best solution for now.

Hum. Understood. Yeah, I agree we need to revisit this when we are going re-visiting the API and re-architecture the platform interfaces. As far as I can see, we are currently using the same technique to save and get the current thread context also on the other platforms. Because of consistency reasons, it would not be wise to change this to something completely different for the KVM platform.

Could you make sure with a v2 that also the trap stack is properly sized and aligned? This implies storing the current thread pointer twice, right? I am not really a fan of this but if we want to keep get_current in its current form in order to avoid further implication to uksched and the other platforms for now, we should at least have a TODO comment there that explains our reason. I expect we will have a better suited platform API later that will avoid this, so this would be anyways temporary.


    static char cpu_trap_stack[4096]; /* IST2 */
    static char cpu_nmi_stack[4096];  /* IST3 */

Cheers,
Costin


Thanks,

Simon

_______________________________________________
Minios-devel mailing list
Minios-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/minios-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.