
Re: [Xen-devel] Assertion 'l1e_get_pfn(MAPCACHE_L1ENT(hashent->idx)) == hashent->mfn' failed at domain_page.c:203



>>> On 02.12.13 at 21:33, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
> (XEN) ----[ Xen-4.4-unstable  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    6
> (XEN) RIP:    e008:[<ffff82d08016187b>] map_domain_page+0x1fb/0x4af
> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
> (XEN) rax: 0000000000244dbd   rbx: ffff83042cb59000   rcx: ffff810000000000
> (XEN) rdx: 000000f820060006   rsi: 0000004100200090   rdi: 0000000000000000
> (XEN) rbp: ffff83042cb67db8   rsp: ffff83042cb67d78   r8:  00000000deadbeef
> (XEN) r9:  00000000deadbeef   r10: ffff82d08023d160   r11: 0000000000000246
> (XEN) r12: ffff8300ba712000   r13: 0000000000244dbd   r14: 0000000000000012
> (XEN) r15: 0000000000000005   cr0: 0000000080050033   cr4: 00000000000406f0
> (XEN) cr3: 00000002e03c2000   cr2: 000000370d4de180
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff83042cb67d78:
> (XEN)    0000000000000f2a 0000000000000286 0000003c6c3d8dea 0000000000244dbd
> (XEN)    ffff82e00489b7a0 0000000000000000 ffff880026625c60 0000000000000000
> (XEN)    ffff83042cb67ef8 ffff82d08017b69f ffff83042cb67dd8 ffff82d08015cc0b
> (XEN)    ffff83042cb67e38 ffff82d080160a8b 0000000000000000 0000000000000000
> (XEN)    0000000000000000 ffff83042cb67ea8 0000000000000000 0000000000244dbd
> (XEN)    ffff8300ba712000 0000000000000000 0000000000000000 ffff820040069240
> (XEN)    00007ff000000000 0000000000000000 ffff82e00489b7a0 ffff83042cb59000
> (XEN)    ffff83042cb67eb8 ffff83042cb60000 ffff83042cb60000 0000000500000000
> (XEN)    ffff83042cb59000 ffff8300ba712000 ffff83042cb59000 0000000500000001
> (XEN)    ffff83042cb67f08 0000000000000000 ffff83042cb67f18 00000000ba712000
> (XEN)    0000000244dbd6f0 0000000417a0e025 ffff83042cb67f08 ffff8300ba712000
> (XEN)    ffff88011a98f6f0 0000000417a0e025 0000000000000000 0000000417a0e025
> (XEN)    00007cfbd34980c7 ffff82d0802248db ffffffff8100102a 0000000000000001
> (XEN)    0000000001e097f8 0000000001dc2010 0000000001dc77e0 0000000000000000
> (XEN)    ffff880026625c98 00000000000006f0 0000000000000246 0000000000007ff0
> (XEN)    ffffea00044a41dc 0000000000000000 0000000000000001 ffffffff8100102a
> (XEN)    0000000000000000 0000000000000001 ffff880026625c60 0001010000000000
> (XEN)    ffffffff8100102a 000000000000e033 0000000000000246 ffff880026625c48
> (XEN)    000000000000e02b ffffffffffffbeef ffffffffffffbeef ffffffffffffbeef
> (XEN)    ffffffffffffbeef ffffffff00000006 ffff8300ba712000 00000033ac85d080
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08016187b>] map_domain_page+0x1fb/0x4af
> (XEN)    [<ffff82d08017b69f>] do_mmu_update+0x6cb/0x19aa
> (XEN)    [<ffff82d0802248db>] syscall_enter+0xeb/0x145
> (XEN) 
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 6:
> (XEN) Assertion 'l1e_get_pfn(MAPCACHE_L1ENT(idx)) == mfn' failed at 
> domain_page.c:94
> (XEN) ****************************************

This second one provides more information than the first one,
and makes clear that the assertion indeed caught some (earlier)
corruption. The relevant piece of code from map_domain_page()
is (with actual value annotations):

FFFF82D08016183A        mov     esi, r14d               ; R14=00000012
FFFF82D08016183D        shl     rsi, 0C                 ; RSI=00012000
FFFF82D080161841        mov     rdx, FFFF820040000000
FFFF82D08016184B        add     rsi, rdx                ; RSI=FFFF820040012000
FFFF82D08016184E        shl     rsi, 10                 ; RSI=8200400120000000
FFFF82D080161852        shr     rsi, 19                 ; RSI=4100200090
FFFF82D080161856        mov     rdx, 000FFFFFFFFFF000
FFFF82D080161860        mov     rcx, FFFF810000000000   ; LINEAR_PT_VIRT_START
FFFF82D08016186A        and     rdx, [rsi+rcx]          ; RSI=4100200090 RCX=ffff810000000000 -> ffff814100200090
FFFF82D08016186E        shr     rdx, 0C
FFFF82D080161872        cmp     rax, rdx                ; RAX=00244dbd RDX=f820060006
                                                        ; dcache->garbage = FFFF820060006000
FFFF82D080161875        je      FFFF82D080161AF5
FFFF82D08016187B ***    ud2

This means that something copied dcache->garbage (a linear
address) into __linear_l1_table[]. Since there's only a single
l1e_write() in domain_page.c that writes anything other than
l1e_empty(), and since that code (judging by the disassembly)
clearly uses nothing but the passed-in value, I cannot in any
way see how this could be happening. Yet with the value being
one only ever used in domain_page.c, it's almost 100% certain
that it's the code here that does something wrong under some
specific condition.
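
For reference, here is the disassembled check written back out in C.
This is purely illustrative: check_hashent() is not a real function,
FFFF820040000000 is assumed to be MAPCACHE_VIRT_START, and the real
code is just the one-line assertion quoted in the crash, expanded by
hand to show the address arithmetic the compiler emitted:

/*
 * Illustrative only: hand-expanded form of
 *   ASSERT(l1e_get_pfn(MAPCACHE_L1ENT(hashent->idx)) == hashent->mfn);
 * check_hashent() does not exist in domain_page.c.
 */
static void check_hashent(const struct vcpu_maphash_entry *hashent)
{
    /* idx -> VA of the mapcache slot (the shl 0C / add above). */
    unsigned long va = MAPCACHE_VIRT_START +
                       ((unsigned long)hashent->idx << PAGE_SHIFT);

    /* Read the slot's L1 entry through the linear page tables (the
     * shl 10 / shr 19 pair turns the VA into the byte offset of its
     * l1e from LINEAR_PT_VIRT_START). */
    l1_pgentry_t l1e = __linear_l1_table[l1_linear_offset(va)];

    /* Mask off the flag bits and compare the mapped frame against the
     * cached MFN - here f820060006 vs. 244dbd, hence the ud2. */
    ASSERT(l1e_get_pfn(l1e) == hashent->mfn);
}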

The first crash does an unmap for the exact same MFN that the
mapping above is being done for, but, having occurred on a
different CPU, it necessarily uses a different cache entry and
hence a different slot in the linear L1 table. With _both_ slots
being corrupted, there must have been more than a single bogus
write earlier on.

The only debugging I see possible right now would be to
sanity check the whole involved linear L1 table range both on
entry to and exit from {,un}map_domain_page(). But that
would likely have a severe performance impact, possibly hiding
the problem...
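
For illustration, such a check might look roughly like this.
Entirely hypothetical - mapcache_sanity_check() doesn't exist, and
the loop bound, locking and the exact "sane entry" criterion would
need more thought:

/*
 * Hypothetical debugging aid only, sketching the check suggested
 * above: walk every L1 entry backing this domain's mapcache and make
 * sure no present entry maps an insane frame (e.g. one that is really
 * a stray linear address such as dcache->garbage).  Would be called
 * on entry to and exit from map_domain_page()/unmap_domain_page().
 */
static void mapcache_sanity_check(const struct mapcache_domain *dcache)
{
    unsigned int i;

    for ( i = 0; i < dcache->entries; i++ )
    {
        l1_pgentry_t l1e = MAPCACHE_L1ENT(i);

        if ( !(l1e_get_flags(l1e) & _PAGE_PRESENT) )
            continue;

        /* A corrupted slot would fail mfn_valid() for values like
         * f820060006 that are really linear addresses. */
        if ( !mfn_valid(l1e_get_pfn(l1e)) )
            panic("mapcache slot %u corrupted: %" PRIpte "\n",
                  i, l1e_get_intpte(l1e));
    }
}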

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

