
Re: [Xen-devel] [PATCH v3 2/5] x86: use PDEP/PEXT for maddr/direct-map-offset conversion when available



>>> On 25.09.18 at 19:15, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 10/09/18 11:00, Jan Beulich wrote:
>> Well, I continue to not really agree. First and foremost, as said before,
>> the common (exclusive?) case is going to be that with "x86: use MOV
>> for PFN/PDX conversion when possible" no calls will exist at runtime at
>> all.
> 
> Taking this one step further, why don't we drop PDX entirely?
> 
> I seem to recall you saying that the one system it was introduced for
> never shipped, at which point, why bother keeping the code around?

For one, I don't know whether they or anyone else have plans to ship
something like this in the future. And then ARM uses PDX as well.
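
To recap for the archives what PDX actually does: it compresses the
PFN space by squeezing out a hole of always-unused address bits, so
that the frame table and the direct map don't need to cover the gap.
A minimal sketch of the mask-and-shift form, with a made-up hole at
PFN bits 28-31 (the real masks and shift get computed at boot from
the observed memory map):

#include <stdint.h>

/* Hypothetical hole covering PFN bits 28-31; the real values are
 * derived at boot. */
static const unsigned int pfn_pdx_hole_shift  = 4;
static const uint64_t     pfn_pdx_bottom_mask = (1ULL << 28) - 1;
static const uint64_t     pfn_top_mask        = ~((1ULL << 32) - 1);

/* Squeeze the hole out: low bits stay put, high bits shift down. */
static inline uint64_t pfn_to_pdx(uint64_t pfn)
{
    return (pfn & pfn_pdx_bottom_mask) |
           ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
}

/* Re-insert the hole: the inverse transformation. */
static inline uint64_t pdx_to_pfn(uint64_t pdx)
{
    return (pdx & pfn_pdx_bottom_mask) |
           ((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
}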

> A separate point which has only just occurred to me is the humongous
> pipeline stall which occurs when mixing legacy and VEX SSE instructions
> on SandyBridge and later hardware.  I severely doubt that a single
> transformation from ALU operations to PDEP/PEXT is going to make up for
> the pipeline stall if the guest is using legacy SSE, although given how
> common the PDX conversions are, I could easily believe that the net is
> in the same ballpark.

I think you're mixing up SIMD and VEX-encoded GPR insns. I'm not
aware of the latter (the category PDEP and PEXT belong to) being
subject to that SIMD-register-related stall. Afaik the stall affects
only non-VEX-encoded SIMD insns, because they don't update the full
YMM / ZMM registers (and iirc it has largely been taken care of in
newer hardware).
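
For reference, the transformation the patch performs maps naturally
onto these insns: PEXT gathers the bits selected by a mask into a
contiguous low-order field, and PDEP scatters such a field back out.
A sketch using the compiler intrinsics, reusing the made-up hole
placement from above (BMI2 required, e.g. gcc -mbmi2):

#include <stdint.h>
#include <immintrin.h> /* _pext_u64 / _pdep_u64 */

/* A 1 bit for every address bit actually in use, i.e. everything
 * outside the (hypothetical) hole at PFN bits 28-31. */
static const uint64_t pfn_keep_mask = ~(0xfULL << 28);

static inline uint64_t pfn_to_pdx(uint64_t pfn)
{
    /* Gather the kept bits into a contiguous field: hole squeezed out. */
    return _pext_u64(pfn, pfn_keep_mask);
}

static inline uint64_t pdx_to_pfn(uint64_t pdx)
{
    /* Scatter the field back to the kept positions: hole re-inserted. */
    return _pdep_u64(pdx, pfn_keep_mask);
}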

>> At that point all function instances could collectively be purged just
>> like .init.text, if we cared enough. And then, for this particular case,
>> leaving the compiler the widest possible choice of register allocation
>> still seems pretty desirable to me. I'd agree with your "register
>> renames at compile time are free" only if there weren't special uses of
>> quite a few of the registers.
>>
>> As perhaps a prime example, consider the case where the
>> transformation here gets done in the course of setting up another
>> function's arguments. The compiler would have to avoid certain
>> registers (or generate extra MOVs) if it had to first pass the input
>> in a fixed register to the helper here (which then additionally also
>> needs to be assumed to clobber several other ones).
> 
> Once again, I think you are focusing on the wrong aspect, and ending up
> with something which is worse overall.
> 
> First, is there a single example here where the compiler sets up
> registers before a function call? Given the sequence point, it is
> distinctly non-trivial to optimise around.

First and foremost:

#define __map_domain_page(pg)        map_domain_page(page_to_mfn(pg))

static inline void *__map_domain_page_global(const struct page_info *pg)
{
    return map_domain_page_global(page_to_mfn(pg));
}

Plus efi_rs_enter() has

    switch_cr3_cr4(virt_to_maddr(efi_l4_pgtable), read_cr4());

and gnttab_unpopulate_status_frames() has

                     : guest_physmap_remove_page(d, gfn,
                                                 page_to_mfn(pg), 0);

just to give a few examples; while looking for them I've already
skipped printk() invocations, init-only code, and the like.
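
To make the register-allocation point concrete, here's a minimal
sketch (all names and the directmap base made up): with the
conversion inlined, the compiler is free to compute the result
straight into the next call's argument register, while an
out-of-line helper with a fixed input register plus extra clobbers
forces MOVs or spills around the surrounding call.

#include <stdint.h>

extern void consume(uint64_t maddr, unsigned int flags);

/* Hypothetical inline conversion: the subtraction (or a single PDEP)
 * can target consume()'s first argument register directly, and
 * 'flags' can stay wherever it already lives. */
static inline uint64_t virt_to_maddr_inline(const void *va)
{
    return (uintptr_t)va - 0xffff830000000000ULL; /* made-up base */
}

/* Hypothetical out-of-line stub: input and output pinned to one
 * register, several more assumed clobbered. */
extern uint64_t virt_to_maddr_stub(const void *va);

void argument_setup_inline(const void *va, unsigned int flags)
{
    consume(virt_to_maddr_inline(va), flags);
}

void argument_setup_stub(const void *va, unsigned int flags)
{
    /* 'va' first has to reach the stub's fixed input register, and
     * 'flags' must avoid (or be spilled around) the stub's clobbers. */
    consume(virt_to_maddr_stub(va), flags);
}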

> Irrespective, a couple of extra movs (which are handled during register
> renaming) is far less overhead than hitting a cold icache line, and
> having 256 variations of this stub function is a very good way to hit a
> lot of cold icache lines.

Why do you continue to think that the called functions would be any
colder than the call sites? I'm not going to claim I know all the
details of when and how prefetching works on the various CPU models,
but direct calls ought to be pretty easy for the hardware to predict
and prefetch.

> Ultimately, real performance numbers are the only way to say for sure,
> but I expect you'll be surprised by the results you'd see.

I don't think I would, because for there to be any surprises I'd need
hardware where I actually end up using the stub functions (or I'd
have to revive/reconstruct the patch I once had for faking memory
holes).

Jan

