
Re: [Xen-devel] [PATCH v5 11/28] xsplice: Implement support for applying/reverting/replacing patches.



>>> On 01.04.16 at 23:04, <konrad.wilk@xxxxxxxxxx> wrote:
> On Fri, Apr 01, 2016 at 07:28:00AM -0600, Jan Beulich wrote:
>> >>> On 24.03.16 at 21:00, <konrad.wilk@xxxxxxxxxx> wrote:
>> > From: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
>> > 
>> > Implement support for the apply, revert and replace actions.
>> > 
>> > To perform an action on a payload, the hypercall sets up a data
>> > structure to schedule the work.  A hook is added in all the
>> > return-to-guest paths to check for work to do and execute it if needed.
>> 
>> Looking at the diffstat alone I cannot see this being the case.
>> Perhaps it's just the description here which needs to become
>> more precise.
> 
> s/all the/common/ ?

Considering that none of the entry.S files get touched, I think it's
the "return-to-guest paths" part which is wrong.

>> > In this way, patches can be applied with all CPUs idle and without
>> > stacks.  The first CPU to do_xsplice() becomes the master and raises a
>> > reschedule softirq to get all the other CPUs to enter do_xsplice()
>> > with no stack.  Once all CPUs have rendezvoused, all CPUs disable IRQs
>> > and NMIs are ignored. The system is then quiescent and the master
>> > performs the action.  After this, all CPUs enable IRQs and NMIs are
>> > re-enabled.
>> > 
>> > Note that it is unsafe to patch do_nmi and the xSplice internal functions.
>> > Patching functions on the NMI/MCE path is liable to end in disaster.
>> 
>> So what measures are (planned to be) taken to make sure this
>> doesn't happen by accident?
> 
> None in this patch.
> 
> Please keep in mind that this issue is not only a problem with
> xSplice but also with alternative assembler patching.

Except that an issue there would cause only a boot failure, which
would be unlikely to re-occur on the next boot attempt. Whereas
things going wrong at runtime would likely mean worse
consequences.

> I haven't figured out a nice way to take care of this and was hoping
> to brainstorm that. For right now this part is left out as 'TODO'.

That's what I've assumed, which is fine for the moment. Just maybe
make this more explicit by adding (to the above) something like "For
now this needs to be ensured by the people creating patches."
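
Longer term even a crude blacklist would help. A minimal sketch (the
helper name and the list contents are made up here):

    static const unsigned long blacklist[] = {
        (unsigned long)do_nmi,
    };

    static bool_t overlaps_blacklist(unsigned long old_addr, uint32_t old_size)
    {
        unsigned int i;

        /* Reject a patch whose target range covers a blacklisted entry point. */
        for ( i = 0; i < ARRAY_SIZE(blacklist); i++ )
            if ( blacklist[i] >= old_addr &&
                 blacklist[i] < old_addr + old_size )
                return 1;

        return 0;
    }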

>> > The action to perform is one of:
>> > - APPLY: For each function in the module, store the first 5 bytes of the
>> >   old function and replace it with a jump to the new function.
>> > - REVERT: Copy the previously stored bytes into the first 5 bytes of the
>> >   old function.
>> > - REPLACE: Revert each applied module and then apply the new module.
>> > 
>> > To prevent a deadlock with any other barrier in the system, the master
>> > will wait for up to 30ms before timing out.
>> > Measurements found patch application to take about 100 μs on a
>> > 72 CPU system, whether idle or fully loaded.
>> 
>> That's for an individual patch I suppose? What if REPLACE has to
>> revert dozens or hundreds of patches?
> 
> User-space can change the timeout to a larger value.

The question wasn't about the timeout expiring, but about the latency
of the operation (and the resulting effect on the system).
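
To put numbers in perspective: the APPLY step per function is just a few
byte writes - roughly the following sketch (x86 only; apply_jmp() and
PATCH_INSN_SIZE are made-up names, the fields are those of the structure
quoted further down):

    #define PATCH_INSN_SIZE 5    /* jmp rel32: E9 xx xx xx xx */

    static void apply_jmp(struct xsplice_patch_func *f)
    {
        uint8_t *old_ptr = (uint8_t *)(unsigned long)f->old_addr;
        int32_t delta = f->new_addr - f->old_addr - PATCH_INSN_SIZE;

        memcpy(f->undo, old_ptr, PATCH_INSN_SIZE);   /* saved for REVERT */
        old_ptr[0] = 0xe9;                           /* jmp rel32 opcode */
        memcpy(old_ptr + 1, &delta, sizeof(delta));
    }

That's cheap per function, but with all CPUs spinning IRQs-off a REPLACE
touching hundreds of functions may still be noticeable.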

>> > --- a/docs/misc/xsplice.markdown
>> > +++ b/docs/misc/xsplice.markdown
>> > @@ -841,7 +841,8 @@ The implementation must also have a mechanism for:
>> >   * Be able to lookup in the Xen hypervisor the symbol names of functions 
>> > from the ELF payload.
>> >   * Be able to patch .rodata, .bss, and .data sections.
>> >   * Further safety checks (blacklist of which functions cannot be patched, check
>> > -   the stack, make sure the payload is built with same compiler as hypervisor).
>> > +   the stack, make sure the payload is built with same compiler as hypervisor,
>> > +   and NMI/MCE handlers and do_nmi for right now - until an safe solution is found).
>> 
>> The whole thing doesn't parse anymore for me with the addition,
>> so I can't really conclude what you mean to say here (and hence
>> whether that addresses the earlier question).
> 
> It adds handling of NMI/MCE to the 'TODO' list so that I don't forget
> about it.

Please try to re-word the addition then to make the result readable
again.

>> > +/* There can be only one outstanding patching action. */
>> > +static struct xsplice_work xsplice_work;
>> > +
>> > +/* Indicate whether the CPU needs to consult xsplice_work structure. */
>> > +static DEFINE_PER_CPU(bool_t, work_to_do);
>> 
>> Peeking ahead to the uses of this, I can't see why this is needed
>> alongside xsplice_work.do_work.
> 
> Andrew asked for this - he noticed that we pound on the xsplice_work
> structure quite often across a lot of CPUs. Instead of pounding on the
> same cache line, we can just read from a per-CPU cache line. Hence this
> addition.

Then please make the comment say so in a more explicit way.
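
E.g. something like (wording illustrative only):

    /*
     * Mirror of xsplice_work.do_work: each CPU polls its own cache line
     * here instead of all CPUs hammering the shared xsplice_work line
     * while waiting for work.
     */
    static DEFINE_PER_CPU(bool_t, work_to_do);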

>> > +        sec = xsplice_elf_sec_by_name(elf, names[i]);
>> > +        if ( !sec )
>> > +        {
>> > +            printk(XENLOG_ERR "%s%s: %s is missing!\n",
>> > +                   XSPLICE, elf->name, names[i]);
>> > +            return -EINVAL;
>> > +        }
>> > +
>> > +        if ( !sec->sec->sh_size )
>> > +            return -EINVAL;
>> > +    }
>> > +
>> > +    return 0;
>> > +}
>> 
>> So you check for there being one such section. Is having multiple
>> of them okay / meaningful?
> 
> /me blinks. You can have multiple ELF sections with the same name?
> I will double-check the spec over the weekend to see.

Of course you can. Remember me telling you that using section
names for identification is a bad idea (even if widely done that
way)? That's precisely because section names in the spec serve
no meaning other than making sections identifiable to the human
eye. For certain sections there's a more or less strict convention
on what their names should be, but that's all there is to names.
Section types and attributes really are what describe their
purposes.

And even if that really was forbidden by the spec, you'd still need
to make sure the binary meets the spec.
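
A uniqueness check would be cheap enough. A sketch (elf->nsec and
elf->sec[].name are assumed here, along the lines of what
xsplice_elf_sec_by_name() implies):

    static int check_unique_section(const struct xsplice_elf *elf,
                                    const char *name)
    {
        unsigned int i, cnt = 0;

        /* Count all sections carrying this name, not just the first hit. */
        for ( i = 0; i < elf->nsec; i++ )
            if ( !strcmp(elf->sec[i].name, name) )
                cnt++;

        return cnt == 1 ? 0 : -EINVAL;
    }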

>> > +    sec = xsplice_elf_sec_by_name(elf, ".xsplice.funcs");
>> > +    if ( sec )
>> 
>> Assuming check_special_sections() got invoked before you get here,
>> why the conditional?
> 
> Andrew asked for it. Albeit it is pointless, as 'check_special_sections'
> will already bail out if this section is not found.

It's not clear why he would have wanted the check here. Imo, if
anything this should be an ASSERT() then.
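
I.e.:

    sec = xsplice_elf_sec_by_name(elf, ".xsplice.funcs");
    ASSERT(sec);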

>> > +    }
>> > +
>> > +    for ( i = 0; i < payload->nfuncs; i++ )
>> > +    {
>> > +        unsigned int j;
>> > +
>> > +        f = &(payload->funcs[i]);
>> > +
>> > +        if ( !f->new_addr || !f->old_addr || !f->old_size || !f->new_size )
>> 
>> Isn't new_size == 0 a particularly easy to deal with case?
> 
> To NOP the function out? No. I would have to include the headers for the
> nop[] arrays in the arch xSplice code and expand on its patching (or
> expose the text_poke code).
> 
> That will grow this patch, which is big enough already. I don't mind
> doing it in further patches (and I believe it is mentioned as a TODO in
> the design doc so that I won't forget).

That's fine, except for one aspect: the resulting error value. EINVAL is
appropriate only for invalid input. When valid input
just doesn't get handled suitably, EOPNOTSUPP or some such should
be used instead.
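
E.g. (sketch):

    if ( !f->new_addr || !f->old_addr || !f->old_size )
        return -EINVAL;      /* malformed payload */
    if ( !f->new_size )
        return -EOPNOTSUPP;  /* valid, NOP-ing out just isn't implemented yet */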

>> > +    default:
>> > +        rc = -EINVAL;
>> > +        break;
>> > +    }
>> > +
>> > +    data->rc = rc;
>> 
>> Oh, here it is. But why would an xsplice_work.cmd (which probably
>> shouldn't make it here anyway) mark the payload having an error?
> 
> In xsplice_action() we set data->rc to -EAGAIN right before we kick
> off the schedule_work(). That way the user can check the 'status'
> of the patching as we attempt to do it.
> 
> But once the patching has been complete we MUST set it to zero
> (or to an error if the patching failed).

Okay, makes sense. But I realize a word was missing from my reply.
I meant "But why would an invalid xsplice_work.cmd (which probably
shouldn't make it here anyway) mark the payload having an error?"

I.e. the question particularly is about the default case (with the
suggestion to just ditch it, or make it an ASSERT_UNREACHABLE()).
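
E.g.:

    default:
        ASSERT_UNREACHABLE();
        break;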

>> It didn't change state or anything after all.
> .. snip..
>> > +void check_for_xsplice_work(void)
>> > +{
>> > +    /* Set at -1, so will go up to num_online_cpus - 1. */
>> > +    if ( atomic_inc_and_test(&xsplice_work.semaphore) )
>> > +    {
>> > +        struct payload *p;
>> > +        unsigned int total_cpus;
>> > +
>> > +        p = xsplice_work.data;
>> > +        if ( !get_cpu_maps() )
>> > +        {
>> > +            printk(XENLOG_ERR "%s%s: CPU%u - unable to get cpu_maps lock!\n",
>> > +                   XSPLICE, p->name, cpu);
>> > +            per_cpu(work_to_do, cpu) = 0;
>> > +            xsplice_work.data->rc = -EBUSY;
>> > +            xsplice_work.do_work = 0;
>> 
>> On x86 such possibly simultaneous accesses may be okay, but is
>> that universally the case? Wouldn't it better be only the monarch
>> which updates shared data?
> 
> Monarch? Oh you mean arch specific code path?

Oh, I think you call it "master" elsewhere. IA64 terminology which
I came to like.

>> > +        /* All CPUs are waiting, now signal to disable IRQs. */
>> > +        xsplice_work.ready = 1;
>> > +        smp_wmb();
>> > +
>> > +        atomic_inc(&xsplice_work.irq_semaphore);
>> > +        if ( !xsplice_do_wait(&xsplice_work.irq_semaphore, timeout, total_cpus,
>> > +                              "Timed out on IRQ semaphore") )
>> 
>> I'd prefer if the common parts of that message moved into
>> xsplice_do_wait() - no reason to have more string literal space
>> occupied than really needed. Also, looking back, the respective
>> function parameter could do with a more descriptive name.
>> 
>> And then - does it make sense to wait the perhaps full 30ms
>> on this second step? Rendezvoused CPUs should react rather
>> quickly to the signal to disable interrupts.
> 
> We don't reset the timeout - the timeout is for both calls
> to xsplice_do_wait.

I understand that's the way it is right now, but that's what I'm putting
under question. Rendezvousing CPUs is quite a bit more at risk of
taking some time compared to having already rendezvoused CPUs
disable their IRQs.
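
A sketch of what I have in mind (wait_for_cpus() and the 1ms grace
period are made up; NOW() and MILLISECS() are the existing time
helpers):

    static int wait_for_cpus(atomic_t *counter, unsigned int total,
                             s_time_t timeout)
    {
        s_time_t deadline = NOW() + timeout;

        while ( atomic_read(counter) != total )
        {
            if ( NOW() > deadline )
                return -EBUSY;
            cpu_relax();
        }

        return 0;
    }

    /* Rendezvous may legitimately take a while... */
    if ( wait_for_cpus(&xsplice_work.semaphore, total_cpus, MILLISECS(30)) )
        goto abort;
    /* ...but IRQ disabling by already rendezvoused CPUs should not. */
    xsplice_work.ready = 1;
    smp_wmb();
    if ( wait_for_cpus(&xsplice_work.irq_semaphore, total_cpus, MILLISECS(1)) )
        goto abort;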

>> 
>> > +        {
>> > +            local_irq_save(flags);
>> > +            /* Do the patching. */
>> > +            xsplice_do_action();
>> > +            /* To flush out pipeline. */
>> > +            arch_xsplice_post_action();
>> 
>> The comment needs to become more generic, and maybe the
>> function name more specific.
> 
> Serialize CPUs? Flush CPU pipeline? Flush CPU i-cache?

Perhaps mention all of these as examples of what may need doing
in that hook.

>> > +            local_irq_restore(flags);
>> > +        }
>> > +        set_nmi_callback(saved_nmi_callback);
>> > +
>> > + abort:
>> > +        per_cpu(work_to_do, cpu) = 0;
>> > +        xsplice_work.do_work = 0;
>> > +
>> > +        smp_wmb(); /* Synchronize with waiting CPUs. */
>> 
>> What "synchronization" does this barrier do?
> 
> For ->do_work being set to zero - as they are all reading and re-reading it.

But barriers don't do any synchronization, they only guarantee
certain ordering requirements. I.e. a comment on a barrier would
normally indicate which operations it is that it enforces ordering
for.
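
E.g. (a sketch; note this also wants the barrier to sit before the two
stores for the comment to hold):

        /*
         * Publish the final ->rc before clearing ->do_work, so that a
         * waiting CPU which observes do_work == 0 is guaranteed to
         * also observe the final return code.
         */
        smp_wmb();
        per_cpu(work_to_do, cpu) = 0;
        xsplice_work.do_work = 0;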

>> > +        ASSERT(local_irq_is_enabled());
>> 
>> Is there really anything between the earlier identical ASSERT() and
>> this one which could leave interrupts off?
> 
> The patching calls (in later patches) - the load and unload hooks. Those
> could inadvertently enable interrupts.

In which case the ASSERT() would better be placed right after those
hooks (and get added alongside adding the hooks, which is pending
a decision on whether those should stay anyway).

>> > +        put_cpu_maps();
>> > +
>> > +        printk(XENLOG_INFO "%s%s finished with rc=%d\n", XSPLICE,
>> > +               p->name, p->rc);
>> 
>> And no record left of what was done with that payload?
> 
> ? But it does - the rc is mentioned. Or are you saying that it should
> also say whether it was applied/reverted/replaced, etc?

Yes, exactly.

>> > +    }
>> > +    else
>> > +    {
>> > +        /* Wait for all CPUs to rendezvous. */
>> > +        while ( xsplice_work.do_work && !xsplice_work.ready )
>> > +        {
>> > +            cpu_relax();
>> > +            smp_rmb();
>> 
>> What is the barrier doing inside this and the other loop below?
> 
> The goal is to get the updated values of ->do_work and ->ready.
> 
> That is to do the same thing we do with frontends and backends - sync
> out the rp/rc so that the other end can read the updated value.

But that's guaranteed already by cpu_relax() being a barrier (see
other loops in the code base invoking cpu_relax()). You have no
ordering requirement here, all you want is that the compiler not
move the two reads out of the loop.
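
I.e. simply:

        /* Wait for all CPUs to rendezvous; cpu_relax() is already a
           compiler barrier, so the flags get re-read each iteration. */
        while ( xsplice_work.do_work && !xsplice_work.ready )
            cpu_relax();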

>> >  static int __init xsplice_init(void)
>> >  {
>> > +    BUILD_BUG_ON( sizeof(struct xsplice_patch_func) != 64 );
>> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_addr) != 8 );
>> > +    BUILD_BUG_ON( offsetof(struct xsplice_patch_func, new_size) != 24 );
>> 
>> What assumptions get broken if these sizes change?
> 
> I think you mean the offsets? They can change. I added them to make sure
> that on a 32-bit hypervisor (ARM) the offsets would be the same as on a
> 64-bit hypervisor (x86).
> 
> The size however MUST remain the same - otherwise the toolstack won't
> produce proper payloads anymore.

Well, that's a very bad thing then imo. You'd better version the
structure instead.
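
E.g. (purely illustrative, both the constant and the field):

    #define XSPLICE_PAYLOAD_VERSION 1

    struct xsplice_patch_func {
        uint32_t version;        /* XSPLICE_PAYLOAD_VERSION */
        uint32_t _pad;
        const char *name;
        uint64_t new_addr;
        uint64_t old_addr;
        uint32_t new_size;
        uint32_t old_size;
        uint8_t undo[8];
    };

    /* At load time, reject payloads built against another layout: */
    if ( func->version != XSPLICE_PAYLOAD_VERSION )
        return -EOPNOTSUPP;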

>> > --- a/xen/include/xen/xsplice.h
>> > +++ b/xen/include/xen/xsplice.h
>> > @@ -11,12 +11,30 @@ struct xsplice_elf_sec;
>> >  struct xsplice_elf_sym;
>> >  struct xen_sysctl_xsplice_op;
>> >  
>> > +#include <xen/elfstructs.h>
>> > +/*
>> > + * The structure which defines the patching. This is what the hypervisor
>> > + * expects in the '.xsplice.funcs' section of the ELF file.
>> > + *
>> > + * This MUST be in sync with what the tools generate.
>> 
In which case this needs to be part of the public interface, I would
>> say.
> 
> Ah yes. Where should it go? sysctl.h is where the XSPLICE ops are - or
> should this have its own public/xsplice.h header file?

If it's just this, then I don't think a separate header is warranted,
and hence putting it in sysctl.h would seem fine to me.

>> > +    uint8_t pad[24];
>> 
>> What is this padding field good for?
> 
> I want the size of the structure to be exactly 64 bytes across
> all architectures. That way I don't have to mess with the compat layer
> or have the design doc differ between ARM and x86.

For avoiding the compat layer, guaranteeing same size is not
sufficient anyway. And the design doc doesn't need to mention
structure sizes at all.

> It gives us enough space to stash other 'private' fields we may want,
> or even to expand this structure in the future and fill it out
> with extra fields. Argh, but for that we need a version field
> somewhere... Maybe another section .xsplice_version which just
> has a value of 1?
> 
> However the one thing I am not sure about is the padding.
> 
> The 'const char *name' on 32-bit ARM will be 4 bytes, hence the
> compiler will insert 4 bytes of padding (at least that is what pahole
> tells me, and the BUILD_BUG_ON confirms). But if I build with clang or
> another compiler it may very well lay this out differently.
> 
> Any suggestions? I could do:
> 
> struct xsplice_patch_func {
>     const char *name;
> #ifdef CONFIG_32
>     uint32_t _name_pad;
> #endif
>     uint64_t new_addr;
>     uint64_t old_addr;
>     uint32_t new_size;
>     uint32_t old_size;
>     uint8_t undo[8];
>     uint8_t pad[24];
> };
> 
> But that is not very nice.

Indeed. Perhaps a union, but I'm questioning the need of uniformity
across architectures in the first place.

> Also if we make this a public header I can't very well expose the
> 'undo' internals - it ought to be opaque. Ideas?

Split it, and link the image provided structure from the internally
used one. Or have the internally used one be a superset of the
one coming from the image, and simply copy the data (at once
allowing you to discard the respective section after loading the
image).
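
A sketch of the latter (all names illustrative):

    /* Public ABI - what the payload's .xsplice.funcs section carries. */
    struct xsplice_patch_func {
        const char *name;
        uint64_t new_addr;
        uint64_t old_addr;
        uint32_t new_size;
        uint32_t old_size;
        uint8_t pad[32];
    };

    /* Hypervisor-internal superset; the image's data gets copied in,
     * allowing the section to be discarded after load. */
    struct xsplice_patch_func_internal {
        struct xsplice_patch_func f;
        uint8_t undo[8];    /* opaque patch state, never exposed */
    };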

Jan
