
Re: [Xen-devel] [PATCH for-4.12 v2 17/17] xen/arm: Track page accessed between batch of Set/Way operations


On 12/4/18 8:26 PM, Julien Grall wrote:
At the moment, the implementation of Set/Way operations will go through
all the entries of the guest P2M and flush them. However, this is very
expensive and may render a guest OS using them unusable.

For instance, Linux 32-bit will use Set/Way operations during secondary
CPU bring-up. As the implementation is really expensive, it may be possible
to hit the CPU bring-up timeout.

To limit the Set/Way impact, we track which pages of the guest have been
accessed between batches of Set/Way operations. This is done using
bit[0] (aka valid bit) of the P2M entry.

This patch introduces a new per-arch helper to perform actions just
before the guest is first unpaused. This will be used to invalidate the
P2M to track access from the start of the guest.

Signed-off-by: Julien Grall <julien.grall@xxxxxxx>
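
To make the mechanism concrete, here is a minimal Python model of the tracking described above. It is only a sketch and not the Xen code: the names (P2MModel, invalidate_all, access, set_way_flush) are made up for illustration, and a real P2M is a multi-level page table rather than a flat list.

```python
class P2MModel:
    """Toy model: one valid bit per guest page (hypothetical, not Xen code)."""

    def __init__(self, nr_pages):
        # True models bit[0] (valid bit) being set in the P2M entry.
        self.valid = [True] * nr_pages

    def invalidate_all(self):
        # Clear every valid bit so that the next access to a page traps.
        self.valid = [False] * len(self.valid)

    def access(self, gfn):
        """Model a guest access. Returns True if it trapped."""
        if not self.valid[gfn]:
            # First access since the last invalidation: the hypervisor
            # handles the fault and sets the valid bit again.
            self.valid[gfn] = True
            return True
        return False

    def set_way_flush(self):
        """Flush only the pages accessed since the previous batch."""
        accessed = [gfn for gfn, v in enumerate(self.valid) if v]
        # Restart tracking for the next batch of Set/Way operations.
        self.invalidate_all()
        return accessed


p2m = P2MModel(nr_pages=8)
p2m.invalidate_all()           # done just before the guest is first unpaused
p2m.access(2)                  # traps once
p2m.access(2)                  # no trap, entry is valid again
p2m.access(5)                  # traps once
print(p2m.set_way_flush())     # pages to flush for this batch
```

In this model a Set/Way batch only has to flush the pages whose valid bit was set again since the previous batch, at the price of one trap per entry after each invalidation.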


While we can spread d->creation_finished all over the code, the per-arch
helper to perform actions just before the guest is first unpaused can
bring a lot of benefit for both architectures. For instance, on Arm, the
flush to the instruction cache could be delayed until the domain is
first run. This would greatly improve the performance of guest creation.

I am still benchmarking whether a command line option is worth having.
I will provide numbers as soon as I have them.

I remember Stefano suggested looking at the impact on boot. This is a bit tricky to do, as there are many kernel configurations and not all the mappings may be touched during boot.

Instead, I wrote a tiny guest [1] that zeroes roughly 1GB of memory. Because the toolstack will always try to allocate with the biggest mapping, I had to hack the toolstack a bit to be able to test with different mapping sizes (but not a mix). The guest has only one vCPU, with a dedicated pCPU.
        - 1GB: 0.03% slower when starting with valid bit unset
        - 2MB: 0.04% faster when starting with valid bit unset
        - 4KB: ~3% slower when starting with valid bit unset

The performance difference with 1GB and 2MB mappings is pretty much insignificant because the number of traps is very limited (1 and 513 respectively). With 4KB mappings, there is a much more significant drop because you have more traps (~262700), as the P2M contains more entries.
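
One reading of these trap counts, not stated explicitly in this mail, is one trap per P2M entry, counting intermediate table entries as well as leaves. The Python sketch below checks that arithmetic under two assumptions: 512 entries per table (4KB granule) and a walk that starts at level 1, where a single entry covers 1GB.

```python
# Assumption: 4KB granule, so each translation table holds 512 entries.
ENTRIES_PER_TABLE = 512

# Assumption: level at which each mapping size is a leaf entry.
LEAF_LEVEL = {"1GB": 1, "2MB": 2, "4KB": 3}


def entries_for_1gb(mapping):
    """Number of P2M entries (tables and leaves) covering a 1GB region."""
    leaf_level = LEAF_LEVEL[mapping]
    # Level 1 contributes 1 entry, level 2 contributes 512,
    # level 3 contributes 512 * 512.
    return sum(ENTRIES_PER_TABLE ** (level - 1)
               for level in range(1, leaf_level + 1))


for mapping in ("1GB", "2MB", "4KB"):
    print(mapping, entries_for_1gb(mapping))
```

This gives 1, 513 and 262657 entries, which lines up with the 1, 513 and ~262700 traps quoted above.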

However, having many 4KB mappings in the P2M is pretty unlikely, as the toolstack will always try to get a bigger mapping. In the real world, you should only have 4KB mappings when your guest memory is not aligned with a bigger mapping. If you end up with many 4KB mappings, then you are already going to see a performance impact in the long run because of TLB pressure.
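
To illustrate why 4KB mappings should be rare, here is a small Python model of a "biggest mapping first" allocation. It is a sketch of the general behaviour only, not the toolstack code: the function name split_into_mappings and the greedy strategy are assumptions made for the example.

```python
# Mapping sizes from biggest to smallest.
SIZES = [("1GB", 1 << 30), ("2MB", 2 << 20), ("4KB", 4 << 10)]


def split_into_mappings(start, size):
    """Greedily cover [start, start + size) with the biggest aligned blocks."""
    counts = {name: 0 for name, _ in SIZES}
    addr, end = start, start + size
    while addr < end:
        for name, block in SIZES:
            # A block is usable only if the address is aligned to it
            # and the whole block fits in the remaining range.
            if addr % block == 0 and addr + block <= end:
                counts[name] += 1
                addr += block
                break
        else:
            raise ValueError("start and size must be 4KB-aligned")
    return counts


# Exactly 1GB at a 1GB-aligned address needs a single mapping.
print(split_into_mappings(0x40000000, 1 << 30))
# 8KB of extra memory at the end is what forces 4KB mappings.
print(split_into_mappings(0x40000000, (1 << 30) + (8 << 10)))
```

In this model, 4KB entries only show up for the part of the guest memory that does not line up with a 2MB or 1GB boundary.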

Overall, I would not recommend introducing a command line option until we have figured out a use case where the traps cause a slowdown.



[1]

    b       _start                  /* branch to kernel start, magic */
    .long   0                       /* reserved */
    .quad   0x0                     /* Image load offset from start of RAM */
    .quad   0x0                     /* XXX: Effective Image size */
    .quad   2                       /* kernel flags: LE, 4K page size */
    .quad   0                       /* reserved */
    .quad   0                       /* reserved */
    .quad   0                       /* reserved */
    .byte   0x41                    /* Magic number, "ARM\x64" */
    .byte   0x52
    .byte   0x4d
    .byte   0x64
    .long   0                       /* reserved */

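    mrs     x0, CNTPCT_EL0          /* x0 = counter before the loop */

    adrp    x2, _end                /* x2 = address of the page containing _end */
    ldr     x3, =(0x40000000 + (1 << 30))   /* x3 = 0x80000000, where zeroing stops */
1:  str     xzr, [x2], #8           /* zero 8 bytes, then advance x2 by 8 */
    cmp     x2, x3
    b.lo    1b                      /* loop while x2 < x3 */

    mrs     x1, CNTPCT_EL0          /* x1 = counter after the loop */
    hvc     #0xffff                 /* trap to the hypervisor */
1:  b       1b                      /* spin forever */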

Julien Grall
