
Re: [Xen-devel] [PATCH] ARM: cache coherence problem in guestcopy.c

On Tue, 2013-06-18 at 14:03 +0900, Jaeyong Yoo wrote:
> I've encountered a rather unusual bug while implementing live
> migration on an Arndale board.
> After resume, the domU kernel starts invoking hypercalls, and at some
> point the hypercall parameters delivered to Xen are corrupted.
> After some debugging (with the help of a HW debugger), I found that
> cache pollution happens; here is the detailed sequence.
> 1) The domU kernel allocates a local variable struct xen_add_to_physmap xatp,
>    and the GVA of xatp is 0xc785fe38 (note that it is not cache-line aligned)
>    (see the gnttab_map function in linux/drivers/xen/grant-table.c).
> 2) The GVA of xatp is mapped into the Xen page table in raw_copy_from_guest;
>    the Xen VA is 0xae48ee38 and its contents are cached.
> 3) The domU kernel reuses xatp to invoke a second hypercall with different
>    parameters.
> 4) The GVA of xatp is mapped again at the same Xen VA, and the data cached
>    at step 2) (the first hypercall's parameters) is loaded instead.
> The patch below prevents this problem.

Ouch! I wonder what other random unexplained events could be attributed
to this!

> As a better solution, I'm wondering: should unmap_domain_page
> invalidate the cache before unmapping the page?

I think that would be better: the domain_page region mappings are
reference counted, so you could avoid a flush on every unmap and issue
one only when the count hits zero.

But actually the domain_page code is lazy -- it won't unmap a
refcount==0 slot until it comes to reuse it, which means that if you
unmap and then remap the same address you end up reusing the same
mapping. So we only need to remap on reuse, not when the count hits
zero; that's why there are TLB flushes in map_domain_page but not in
unmap_domain_page. As you have discovered, we evidently need some data
cache flushes in there as well.

So I think we probably actually need the dcache flush in
map_domain_page at the "/* Commandeer this 2MB slot */" point. In that
context I don't think we can do better than flushing the complete 2MB
mapping. Does this work for you too?

The laziness of the remappings makes me wonder though. Do you know if
the slot is reused between step #2 and #3? Otherwise I'd expect us to
reuse the existing mapping with the cache intact. The caches are PIPT so
I wouldn't expect the address aliasing to be an issue. Unless the
mapping is reused for something else I'm not too sure where the cache
pollution is coming from.


> Signed-off-by: Jaeyong Yoo <jaeyong.yoo@xxxxxxxxxxx>
> ---
>  xen/arch/arm/guestcopy.c |    3 +++
>  1 file changed, 3 insertions(+)
> diff --git a/xen/arch/arm/guestcopy.c b/xen/arch/arm/guestcopy.c
> index d146cd6..9c4a7e4 100644
> --- a/xen/arch/arm/guestcopy.c
> +++ b/xen/arch/arm/guestcopy.c
> @@ -25,6 +25,7 @@ unsigned long raw_copy_to_guest(void *to, const void *from, unsigned len)
>          p += offset;
>          memcpy(p, from, size);
> +        flush_xen_dcache_va_range(p, size);
>          unmap_domain_page(p - offset);
>          len -= size;
>          from += size;
> @@ -55,6 +56,7 @@ unsigned long raw_clear_guest(void *to, unsigned len)
>          p += offset;
>          memset(p, 0x00, size);
> +        flush_xen_dcache_va_range(p, size);
>          unmap_domain_page(p - offset);
>          len -= size;
>          to += size;
> @@ -84,6 +86,7 @@ unsigned long raw_copy_from_guest(void *to, const void __user *from, unsigned le
>          memcpy(to, p, size);
> +        flush_xen_dcache_va_range(p, size);
>          unmap_domain_page(p);
>          len -= size;
>          from += size;

Xen-devel mailing list