|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 2/2] x86/pv: Provide better SYSCALL backwards compatibility in FRED mode
On 26.03.2026 22:05, Andrew Cooper wrote:
> On 26/03/2026 9:14 am, Jan Beulich wrote:
>> On 25.03.2026 18:02, Andrew Cooper wrote:
>>> In FRED mode, the SYSCALL instruction does not modify %rcx/%r11. Software
>>> using SYSCALL spills %rcx/%r11 around the invocation, which is why FRED not
>>> doing this goes largely unnoticed.
>>>
>>> However, consider the following migration scenario:
>>>
>>> * VM suspends. Hypercall, so SYSCALL, %rcx/%r11 left unmodified
>>> * VM moves to a non-FRED system
>>> * Xen resumes the VM with a real SYSRET instruction
>>>
>>> Instead of resuming at the instruction following the SYSCALL instruction,
>>> the
>>> VM is resumed at whatever dead value was in %rcx.
>> Would it? In restore_all_guest we load %r11 and %rcx from the stack
>> frame's EFLAGS and RIP fields. If we didn't, various other things wouldn't
>> work either.
>
> Hmm. I suppose so. regs->rip/eflags is always going to be
> reconstructed properly for the records in the transmitted stream.
>
> What will be wrong is the %rcx/%r11 put onto the guest stack.
Okay, this is addressed by ...
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -2405,6 +2405,8 @@ void asmlinkage entry_from_pv(struct cpu_user_regs
>>> *regs)
>>>
>>> regs->ssx = l ? FLAT_KERNEL_SS : FLAT_USER_SS32;
>>> regs->csx = l ? FLAT_KERNEL_CS64 : FLAT_USER_CS32;
>>> + regs->rcx = regs->rip;
>>> + regs->r11 = regs->rflags;
... this change.
>> Don't you also need to set TRAP_syscall here, for the new code in
>> eretu_exit_to_guest to actually make a difference?
>
> It is create_bounce_frame() which sets up TRAP_syscall.
Hmm, right. I was misled by {l,c}star_enter and sysenter_entry setting
the flag explicitly. That looks to be necessary only for the pv_hypercall()
path out of lstar_enter; everything else goes through create_bounce_frame().
>> (There actually is
>> a paragraph about this in the comment out of context above, which then
>> may also want adjusting.)
>>
>> Further a question as to limiting overhead: Doing this on every SYSCALL
>> entry ...
>>
>>> @@ -26,7 +27,16 @@ FUNC(entry_FRED_R3, 4096)
>>> END(entry_FRED_R3)
>>>
>>> FUNC(eretu_exit_to_guest)
>>> - POP_GPRS
>>> + /*
>>> + * PV guests aren't aware of FRED. If Xen in IDT mode would have
>>> used
>>> + * a SYSRET instruction, preserve the legacy behaviour for
>>> %rcx/%r11
>>> + */
>>> + testb $TRAP_syscall >> 8, UREGS_entry_vector + 1(%rsp)
>>> +
>>> + POP_GPRS /* Preserves flags */
>>> +
>>> + cmovnz EFRAME_rip(%rsp), %rcx
>>> + cmovnz EFRAME_eflags(%rsp), %r11
>> ... and every exit-to-guest isn't very nice when concern is about just the
>> specific case of migrating FRED -> non-FRED. Couldn't we instead make the
>> adjustment when generating the save record for the register state of the
>> vCPU?
>
> Ignoring migration for a moment, there are two further cases where
> things go wrong. Consider a VM which logically does this:
>
> // user mode
> SYSCALL
> mov %rcx, dbg_syscall_was_here
>
> // kernel mode
> entry_SYSCALL:
> ... setup stack
> mov %rcx, UREGS_rip(%rsp)
>
>
> Both of these positions under FRED have unexpected content in %rcx/%r11.
>
> In userspace it is common to spill %rcx/%r11 and restore them around
> SYSCALL, but that's not an ABI. This is addressed by the hunk in
> entry_from_pv().
Right, and the eretu_exit_to_guest change mirrors what we do ahead of
SYSRET (or its conversion to IRET). While there are cases (for both
paths) where RESTORE_ALL / POP_GPRS would already have restored the
intended values, there are enough other cases where the value would
have changed.
Overall, however, I think that the patch description would want
altering, to cover more aspects (as discussed).
Jan
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |