
Re: [Xen-devel] [PATCH RESEND v5 4/6] xen/arm: Implement virtual-linear page table for guest p2m mapping in live migration



> -----Original Message-----
> From: Ian Campbell [mailto:Ian.Campbell@xxxxxxxxxx]
> Sent: Wednesday, November 13, 2013 12:59 AM
> To: Jaeyong Yoo
> Cc: xen-devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] [PATCH RESEND v5 4/6] xen/arm: Implement virtual-
> linear page table for guest p2m mapping in live migration
> 
> On Fri, 2013-11-08 at 16:50 +0900, Jaeyong Yoo wrote:
> > Implement vlpt (virtual-linear page table) for fast access to the 3rd-level
> > PTE of guest p2m.
> > When creating a mapping for vlpt, just copy the 1st level PTE of guest
> > p2m to Xen's 2nd level PTE. Then the mapping becomes the following:
> >     xen's 1st PTE -->
> >     xen's 2nd PTE (which is the same as 1st PTE of guest p2m) -->
> >     guest p2m's 2nd PTE -->
> >     guest p2m's 3rd PTE (the memory contents where the vlpt points)
> > For more info about vlpt, see:
> > http://www.technovelty.org/linux/virtual-linear-page-table.html
> >
> > This function is used in dirty-page tracing: when domU write-fault is
> > trapped by xen, xen can immediately locate the 3rd PTE of guest p2m.
> > The following link shows the performance comparison for handling a
> > dirty-page between vlpt and typical page table walking.
> > http://lists.xen.org/archives/html/xen-devel/2013-08/msg01503.html
> >
> > Changes from v4:
> > 1. When restoring the vlpt, use the __foo variants without barriers for
> >    write_pte and flush_xen_data_tlb_range_va.
> > 2. Support two consecutive pages for guest's first level page table.
> >
> > Signed-off-by: Jaeyong Yoo <jaeyong.yoo@xxxxxxxxxxx>
> > ---
> >  xen/arch/arm/domain.c            |   5 ++
> >  xen/arch/arm/mm.c                | 112 +++++++++++++++++++++++++++++++++++++++
> >  xen/include/asm-arm/arm32/page.h |  41 ++++++++++----
> >  xen/include/asm-arm/config.h     |   5 ++
> >  xen/include/asm-arm/domain.h     |   7 +++
> >  xen/include/asm-arm/mm.h         |  15 ++++++
> >  6 files changed, 174 insertions(+), 11 deletions(-)
> >
> > diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> > index cb0424d..c0b5dd8 100644
> > --- a/xen/arch/arm/domain.c
> > +++ b/xen/arch/arm/domain.c
> > @@ -509,6 +509,11 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
> >      /* Default the virtual ID to match the physical */
> >      d->arch.vpidr = boot_cpu_data.midr.bits;
> >
> > +    d->arch.dirty.second_lvl_start = 0;
> > +    d->arch.dirty.second_lvl_end = 0;
> > +    d->arch.dirty.second_lvl[0] = NULL;
> > +    d->arch.dirty.second_lvl[1] = NULL;
> > +
> >      clear_page(d->shared_info);
> >      share_xen_page_with_guest(
> >          virt_to_page(d->shared_info), d, XENSHARE_writable);
> > diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
> > index 3801f07..bf13993 100644
> > --- a/xen/arch/arm/mm.c
> > +++ b/xen/arch/arm/mm.c
> > @@ -1328,6 +1328,118 @@ void get_gma_start_end(struct domain *d, paddr_t *start, paddr_t *end)
> >          *end = GUEST_RAM_BASE + ((paddr_t) d->max_pages << PAGE_SHIFT);
> >  }
> >
> > +/* flush the vlpt area */
> > +void flush_vlpt(struct domain *d)
> > +{
> > +    int flush_size;
> > +    flush_size = (d->arch.dirty.second_lvl_end -
> > +                  d->arch.dirty.second_lvl_start) << SECOND_SHIFT;
> > +    /* flushing the 3rd level mapping */
> > +    flush_xen_data_tlb_range_va(d->arch.dirty.second_lvl_start << SECOND_SHIFT,
> > +                                flush_size);
> 
> All these << SECOND_SHIFT, can they be put into a macro
> VLPT_OFFSET(guest_vaddr) perhaps (assuming that's the right semantics)
> 

VLPT_OFFSET sounds fine.
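
Something along these lines, perhaps (only a sketch; the current call sites
convert a second-level slot index into a Xen virtual address, so the exact
name and argument are still up for discussion):

    /* Xen virtual address covered by second-level slot 'idx' of xen_second.
     * Illustrative only; could equally be keyed on a guest vaddr as Ian
     * suggests. */
    #define VLPT_OFFSET(idx)    ((vaddr_t)(idx) << SECOND_SHIFT)

    /* flush_vlpt() would then read roughly: */
    flush_xen_data_tlb_range_va(VLPT_OFFSET(d->arch.dirty.second_lvl_start),
                                VLPT_OFFSET(d->arch.dirty.second_lvl_end) -
                                VLPT_OFFSET(d->arch.dirty.second_lvl_start));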

> > +}
> > +
> > +/* restore the xen page table for vlpt mapping for domain d */
> > +void restore_vlpt(struct domain *d)
> 
> I guess this is called on context switch? Why not in this patch?

OK. 
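
For reference, the hook would presumably end up looking something like this
in ctxt_switch_to(); the guard condition is an assumption on my side, based
on second_lvl_start/end both being initialised to 0:

    static void ctxt_switch_to(struct vcpu *n)
    {
        /* ... existing restore of the incoming vcpu's state ... */

        /* Re-install the VLPT slots if the incoming domain currently has
         * a VLPT range set up (i.e. is being live-migrated). */
        if ( n->domain->arch.dirty.second_lvl_start !=
             n->domain->arch.dirty.second_lvl_end )
            restore_vlpt(n->domain);
    }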

> 
> > +{
> > +    int i;
> > +    dsb();
> > +    for ( i = d->arch.dirty.second_lvl_start;
> > +          i < d->arch.dirty.second_lvl_end;
> > +          ++i )
> > +    {
> > +        int k = i % LPAE_ENTRIES;
> > +        int l = i / LPAE_ENTRIES;
> > +
> > +        if ( xen_second[i].bits != d->arch.dirty.second_lvl[l][k].bits )
> > +        {
> > +            __write_pte(&xen_second[i], d->arch.dirty.second_lvl[l][k]);
> > +            __flush_xen_data_tlb_range_va(i << SECOND_SHIFT,
> > +                                          1 << SECOND_SHIFT);
> > +        }
> > +    }
> > +    dsb();
> > +    isb();
> > +}
> > +
> > +/* setting up the xen page table for vlpt mapping for domain d */
> > +int prepare_vlpt(struct domain *d)
> > +{
> > +    int xen_second_linear_base;
> > +    int gp2m_start_index, gp2m_end_index;
> > +    struct p2m_domain *p2m = &d->arch.p2m;
> > +    struct page_info *second_lvl_page;
> > +    paddr_t gma_start = 0;
> > +    paddr_t gma_end = 0;
> > +    lpae_t *first[2];
> > +    int i;
> > +    uint64_t required, avail = VIRT_LIN_P2M_END - VIRT_LIN_P2M_START;
> > +
> > +    get_gma_start_end(d, &gma_start, &gma_end);
> > +    required = (gma_end - gma_start) >> LPAE_SHIFT;
> > +
> > +
> > +    if ( required > avail )
> 
> avail is the number of bytes of virtual address space in the linear p2m
> area.
> 
> required is the number of pages which the guest has. Does this comparison
> correctly account for each page being represented by an 8-byte lpae_t
> entry?

Yes, it does. Note that LPAE_SHIFT shifts by only 9 bits. Shifting by
PAGE_SHIFT (12 bits) would give the number of guest pages, and each page
needs an 8-byte lpae_t for its third-level entry, so shifting by 9 bits
directly gives the number of bytes required. Since the single LPAE_SHIFT
is admittedly confusing, we can make it explicit with something like
required = ((gma_end - gma_start) >> PAGE_SHIFT) * sizeof(lpae_t).
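
For illustration, the check could then read as below (same names as in the
quoted patch; just a sketch of the explicit form, which is numerically
identical because PAGE_SHIFT - 3 == LPAE_SHIFT):

    uint64_t avail = VIRT_LIN_P2M_END - VIRT_LIN_P2M_START;
    /* One 8-byte lpae_t of third-level PTE space per guest page. */
    uint64_t required = ((gma_end - gma_start) >> PAGE_SHIFT) * sizeof(lpae_t);

    if ( required > avail )
        return -ENOMEM;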

> 
> > +    {
> > +        dprintk(XENLOG_ERR, "Available VLPT is small for domU guest"
> > +                            "(avail: %llx, required: %llx)\n",
> > +                            avail, required);
> > +        return -ENOMEM;
> > +    }
> > +
> > +    xen_second_linear_base = second_linear_offset(VIRT_LIN_P2M_START);
> > +
> > +    gp2m_start_index = gma_start >> FIRST_SHIFT;
> > +    gp2m_end_index = (gma_end >> FIRST_SHIFT) + 1;
> > +
> > +    if ( xen_second_linear_base + gp2m_end_index >= LPAE_ENTRIES * 2 )
> > +    {
> > +        dprintk(XENLOG_ERR, "xen second page is small for VLPT for domU");
> > +        return -ENOMEM;
> > +    }
> > +
> > +    second_lvl_page = alloc_domheap_pages(NULL, 1, 0);
> 
> There'd be no harm in allocating the two pages separately I think, and
> avoiding the possiblilty of an order 1 allocation failing.
> 
> But if you do allocate them contiguously then d->arch.dirty.second_lvl can
> be a simple lpae_t * and you can just access offsets 0..1023 without
> worrying about managing the array of two pointers.
> 
> As it stands I think you have the worst of both worlds.
> 
> We could also just consider allocating all firstlevel p2m pages from the
> xenheap instead of the domheap.

OK, xenheap allocation sounds good. I was not sure whether the xenheap could
be used for something that looks domain-specific. If that is acceptable, it
would make the code much more readable.
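
Roughly, for the two pages above, that would become something like the
following (alloc_xenheap_pages/free_xenheap_pages are the standard
interfaces; collapsing second_lvl into a single lpae_t * is the
simplification Ian mentions, so treat the field change as an assumption):

    /* Xenheap pages are permanently mapped, so no map_domain_page_global()
     * is needed and second_lvl can be a plain lpae_t pointer. */
    d->arch.dirty.second_lvl = alloc_xenheap_pages(1, 0); /* order 1 = 2 pages */
    if ( d->arch.dirty.second_lvl == NULL )
        return -ENOMEM;

    /* ... and in cleanup_vlpt(): */
    free_xenheap_pages(d->arch.dirty.second_lvl, 1);
    d->arch.dirty.second_lvl = NULL;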

> 
> > +    if ( second_lvl_page == NULL )
> > +        return -ENOMEM;
> > +
> > +    /* First level p2m is 2 consecutive pages */
> > +    d->arch.dirty.second_lvl[0] = map_domain_page_global(
> > +                                           page_to_mfn(second_lvl_page) );
> > +    d->arch.dirty.second_lvl[1] = map_domain_page_global(
> > +                                           page_to_mfn(second_lvl_page+1) );
> > +
> > +    first[0] = __map_domain_page(p2m->first_level);
> > +    first[1] = __map_domain_page(p2m->first_level+1);
> > +    for ( i = gp2m_start_index; i < gp2m_end_index; ++i )
> 
> Can't this just be a loop over 0..1023 and avoid all this modular
> arithmetic:

Sure.

> > +    {
> > +        int k = i % LPAE_ENTRIES;
> > +        int l = i / LPAE_ENTRIES;
> > +        int k2 = (xen_second_linear_base + i) % LPAE_ENTRIES;
> > +        int l2 = (xen_second_linear_base + i) / LPAE_ENTRIES;
> > +
> > +        write_pte(&xen_second[xen_second_linear_base+i], first[l][k]);
> > +
> > +        /* we copy the mapping into domain's structure as a reference
> > +         * in case of the context switch (used in restore_vlpt) */
> 
> Do you keep this up to date when changing the p2m?
> 
> In fact, why aren't you just mapping the actual first level pages? Why
> does it have to be a copy?

Oh right, that is much better! Then there is no need to maintain the copy. Thanks!
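
For the record, the direct-mapping version would look roughly like this
(still using the field names from this patch, and it would make the separate
second_lvl_page allocation above unnecessary; only an illustration of the
idea, not the final code):

    /* Map the two real first-level p2m pages once, globally, so that
     * restore_vlpt() can re-install them on context switch without
     * keeping a private copy in sync with the p2m. */
    d->arch.dirty.second_lvl[0] = map_domain_page_global(
                                      page_to_mfn(p2m->first_level));
    d->arch.dirty.second_lvl[1] = map_domain_page_global(
                                      page_to_mfn(p2m->first_level + 1));

    for ( i = gp2m_start_index; i < gp2m_end_index; ++i )
        write_pte(&xen_second[xen_second_linear_base + i],
                  d->arch.dirty.second_lvl[i / LPAE_ENTRIES][i % LPAE_ENTRIES]);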

> 
> > +        d->arch.dirty.second_lvl[l2][k2] = first[l][k];
> > +    }
> > +    unmap_domain_page(first[0]);
> > +    unmap_domain_page(first[1]);
> > +
> > +    /* storing the start and end index */
> > +    d->arch.dirty.second_lvl_start = xen_second_linear_base + gp2m_start_index;
> > +    d->arch.dirty.second_lvl_end = xen_second_linear_base + gp2m_end_index;
> > +
> > +    flush_vlpt(d);
> > +    return 0;
> > +}
> > +
> > +void cleanup_vlpt(struct domain *d)
> > +{
> > +    /* First level p2m is 2 consecutive pages */
> > +    unmap_domain_page_global(d->arch.dirty.second_lvl[0]);
> > +    unmap_domain_page_global(d->arch.dirty.second_lvl[1]);
> > +}
> >  /*
> >   * Local variables:
> >   * mode: C
> > diff --git a/xen/include/asm-arm/arm32/page.h b/xen/include/asm-arm/arm32/page.h
> > index cf12a89..0a4e115 100644
> > --- a/xen/include/asm-arm/arm32/page.h
> > +++ b/xen/include/asm-arm/arm32/page.h
> > @@ -5,20 +5,28 @@
> >
> >  /* Write a pagetable entry.
> >   *
> > - * If the table entry is changing a text mapping, it is responsibility
> > - * of the caller to issue an ISB after write_pte.
> > + * All necessary barriers are responsibility of the caller
> >   */
> > -static inline void write_pte(lpae_t *p, lpae_t pte)
> > +static inline void __write_pte(lpae_t *p, lpae_t pte)
> >  {
> >      asm volatile (
> > -        /* Ensure any writes have completed with the old mappings. */
> > -        "dsb;"
> >          /* Safely write the entry (STRD is atomic on CPUs that support LPAE) */
> >          "strd %0, %H0, [%1];"
> > -        "dsb;"
> >          : : "r" (pte.bits), "r" (p) : "memory");
> >  }
> >
> > +/* Write a pagetable entry with dsb barriers.
> > + *
> > + * If the table entry is changing a text mapping, it is responsibility
> > + * of the caller to issue an ISB after write_pte.
> > + */
> > +static inline void write_pte(lpae_t *p, lpae_t pte)
> > +{
> > +    dsb();
> 
> Please retain the comments from the original version.

OK.

> 
> > +    __write_pte(p, pte);
> > +    dsb();
> > +}
> > +
> >  /* Inline ASM to flush dcache on register R (may be an inline asm operand) */
> >  #define __flush_xen_dcache_one(R) STORE_CP32(R, DCCMVAC)
> >
> > @@ -57,18 +65,28 @@ static inline void flush_xen_data_tlb(void)
> >  }
> >
> >  /*
> > - * Flush a range of VA's hypervisor mappings from the data TLB. This is not
> > - * sufficient when changing code mappings or for self modifying code.
> > + * Flush a range of VA's hypervisor mappings from the data TLB.
> > + * All necessary barriers are responsibility of the caller
> >   */
> > -static inline void flush_xen_data_tlb_range_va(unsigned long va, unsigned long size)
> > +static inline void __flush_xen_data_tlb_range_va(unsigned long va, unsigned long size)
> >  {
> >      unsigned long end = va + size;
> > -    dsb(); /* Ensure preceding are visible */
> >      while ( va < end ) {
> >          asm volatile(STORE_CP32(0, TLBIMVAH)
> >                       : : "r" (va) : "memory");
> >          va += PAGE_SIZE;
> >      }
> > +}
> > +
> > +/*
> > + * Flush a range of VA's hypervisor mappings from the data TLB with barriers.
> > + * This barrier is not sufficient when changing code mappings or for self
> > + * modifying code.
> > + */
> > +static inline void flush_xen_data_tlb_range_va(unsigned long va, unsigned long size)
> > +{
> > +    dsb(); /* Ensure preceding are visible */
> > +    __flush_xen_data_tlb_range_va(va, size);
> >      dsb(); /* Ensure completion of the TLB flush */
> >      isb();
> >  }
> > diff --git a/xen/include/asm-arm/config.h b/xen/include/asm-arm/config.h
> > index 5b7b1a8..15ad56d 100644
> > --- a/xen/include/asm-arm/config.h
> > +++ b/xen/include/asm-arm/config.h
> > @@ -87,6 +87,7 @@
> >   *   0  -   8M   <COMMON>
> >   *
> >   *  32M - 128M   Frametable: 24 bytes per page for 16GB of RAM
> > + * 128M - 256M   Virtual-linear mapping to P2M table
> >   * 256M -   1G   VMAP: ioremap and early_ioremap use this virtual address
> >   *                    space
> >   *
> > @@ -124,7 +125,9 @@
> >  #define CONFIG_SEPARATE_XENHEAP 1
> >
> >  #define FRAMETABLE_VIRT_START  _AT(vaddr_t,0x02000000)
> > +#define VIRT_LIN_P2M_START     _AT(vaddr_t,0x08000000)
> >  #define VMAP_VIRT_START  _AT(vaddr_t,0x10000000)
> > +#define VIRT_LIN_P2M_END       VMAP_VIRT_START
> >  #define XENHEAP_VIRT_START     _AT(vaddr_t,0x40000000)
> >  #define XENHEAP_VIRT_END       _AT(vaddr_t,0x7fffffff)
> >  #define DOMHEAP_VIRT_START     _AT(vaddr_t,0x80000000)
> > @@ -157,6 +160,8 @@
> >
> >  #define HYPERVISOR_VIRT_END    DIRECTMAP_VIRT_END
> >
> > +/*TODO (ARM_64): define VIRT_LIN_P2M_START VIRT_LIN_P2M_END */
> > +
> >  #endif
> >
> >  /* Fixmap slots */
> > diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
> > index 67bfbbc..4f366f1 100644
> > --- a/xen/include/asm-arm/domain.h
> > +++ b/xen/include/asm-arm/domain.h
> > @@ -112,6 +112,13 @@ struct arch_domain
> >          spinlock_t                  lock;
> >      } vuart;
> >
> > +    /* dirty-page tracing */
> > +    struct {
> > +        volatile int second_lvl_start;   /* for context switch */
> > +        volatile int second_lvl_end;
> 
> Are these virtual addresses, or pages or something else? We mostly have
> specific types (e.g. vaddr_t) for such things.
> 

OK

> volatile is almost always wrong to use. Why do you want it here? If there
> is an actual coherency issue you will need proper barriers, not volatile.

Ah, OK.
Barriers would be more explicit.
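
For concreteness, I am thinking of something like the following (whether
start/end are best kept as second-level indices or as virtual addresses is
still open, so the vaddr_t choice here is an assumption on my part):

    /* dirty-page tracing */
    struct {
        vaddr_t second_lvl_start;   /* VLPT range re-installed on context
                                     * switch; no volatile, the context-switch
                                     * path already issues dsb()/isb(). */
        vaddr_t second_lvl_end;
        lpae_t *second_lvl[2];      /* may collapse to a single lpae_t * */
    } dirty;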

> 
> Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

