
Re: [Xen-devel] page faults on machines with > 4TB memory



On Thu, Jul 23, 2015 at 06:01:45PM +0100, Andrew Cooper wrote:
> On 23/07/15 17:35, Elena Ufimtseva wrote:
> > Hi
> >
> > While working on boot-time bugs on a large Oracle X4-8 server, we hit
> > a problem booting Xen on large machines with > 4TB of memory.
> > The page fault initially occurred while uploading Xen PM info into the
> > hypervisor (you can see it in the attached serial log named
> > 4.4.2_no_mem_override).
> > Tracing the issue down shows that the page fault occurs in the timer.c
> > code while getting the heap size.
> >
> > Here is the original call trace:
> > rocessor: Uploading Xen processor PM info 
> > @ (XEN) ----[ Xen-4.4.3-preOVM  x86_64  debug=n  Tainted:    C ]---- 
> > @ (XEN) CPU:    0 
> > @ (XEN) RIP:    e008:[<ffff82d08022e747>] add_entry+0x27/0x120 
> > @ (XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor 
> > @ (XEN) rax: ffff82d080513a20   rbx: ffff83808e802300   rcx: 00000000000000e8 
> > @ (XEN) rdx: 00000000000000e8   rsi: 00000000000000e8   rdi: ffff83808e802300 
> > @ (XEN) rbp: ffff82d080513a20   rsp: ffff82d0804d7c70   r8:  ffff8840ffdb5010 
> > @ (XEN) r9:  0000000000000017   r10: ffff83808e802180   r11: 0200200200200200 
> > @ (XEN) r12: ffff82d080533080   r13: 0000000000000296   r14: 0100100100100100 
> > @ (XEN) r15: 00000000000000e8   cr0: 0000000080050033   cr4: 00000000001526f0 
> > @ (XEN) cr3: 00000100818b2000   cr2: ffff8840ffdb5010 
> > @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008 
> > @ (XEN) Xen stack trace from rsp=ffff82d0804d7c70: 
> > @ (XEN)    ffff83808e802300 ffff82d080513a20 ffff82d08022f59b ffff82d080533080 
> > @ (XEN)    ffff82d080532f50 00000000000000e8 ffff83808e802328 0000000000000000 
> > @ (XEN)    ffff82d080513a20 ffff83808e8022c0 ffff82d080533200 00000000000000e8 
> > @ (XEN)    00000000000000f0 ffff82d0805331c0 ffff82d0802458e2 0000000000000000 
> > @ (XEN)    00000000000000e8 ffff83808e802334 ffff8384be7979b0 ffff82d0804d7d78 
> > @ (XEN)    0000000000000000 ffff8384be77c700 ffff82d0804d7d78 ffff82d080513a20 
> > @ (XEN)    ffff82d080246207 00000000000000e8 00000000000000e8 ffff8384be7979b0 
> > @ (XEN)    ffff82d08024518a ffff82d080533080 0000000000000070 ffff82d080533da8 
> > @ (XEN)    00000001000000e8 ffff8384be797a00 000000e800000001 002ab980002abd68 
> > @ (XEN)    0000271000124f80 002abd6800124f80 00000000002ab980 ffff82d0803753e0 
> > @ (XEN)    0000000000010101 0000000000000001 ffff82d0804d7e18 ffff881fb4afbc88 
> > @ (XEN)    ffff82d0804d0000 ffff881fb28a4400 ffff82d0804fca80 ffffffff819b7080 
> > @ (XEN)    ffff82d080266c16 ffff83808fb46ba8 ffff82d080208a82 ffff83006bddd190 
> > @ (XEN)    0000000000000292 0300000100000036 00000001000000f6 000000000000000f 
> > @ (XEN)    0000007f000c0082 0000000000000000 0000007f000c0082 0000000000000000 
> > @ (XEN)    000000000000000a ffff881fb28a4400 0000000000000005 0000000000000000 
> > @ (XEN)    0000000000000000 00000000000000fe 0000000000000001 0000000000000001 
> > @ (XEN)    0000000000000000 0000000000000000 ffff82d08031f521 0000000000000000 
> > @ (XEN)    0000000000000246 ffffffff810010ea 0000000000000000 ffffffff810010ea 
> > @ (XEN)    000000000000e030 0000000000000246 ffff83006bddd000 ffff881fb4afbd48 
> > @ (XEN) Xen call trace: 
> > @ (XEN)    [<ffff82d08022e747>] add_entry+0x27/0x120 
> > @ (XEN)    [<ffff82d08022f59b>] set_timer+0x10b/0x220 
> > @ (XEN)    [<ffff82d0802458e2>] cpufreq_governor_dbs+0x1e2/0x2f0 
> > @ (XEN)    [<ffff82d080246207>] __cpufreq_set_policy+0x87/0x120 
> > @ (XEN)    [<ffff82d08024518a>] cpufreq_add_cpu+0x24a/0x4f0 
> > @ (XEN)    [<ffff82d080266c16>] do_platform_op+0x9c6/0x1650 
> > @ (XEN)    [<ffff82d080208a82>] evtchn_check_pollers+0x22/0xb0 
> > @ (XEN)    [<ffff82d08031f521>] do_iret+0xc1/0x1a0 
> > @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> > @ (XEN) 
> > @ (XEN) Pagetable walk from ffff8840ffdb5010: 
> > @ (XEN)  L4[0x110] = 00000100818b3067 00000000000018b3 
> > @ (XEN)  L3[0x103] = 0000000000000000 ffffffffffffffff 
> > @ (XEN) 
> > @ (XEN) ****************************************
> >
> >    0xffff82d08022e720 <add_entry>:      movzwl 0x28(%rdi),%edx
> >    0xffff82d08022e724 <add_entry+4>:    push   %rbp
> >    0xffff82d08022e725 <add_entry+5>:    lea    0x2e52f4(%rip),%rax        # 0xffff82d080513a20 <__per_cpu_offset>
> >    0xffff82d08022e72c <add_entry+12>:   lea    0x30494d(%rip),%r10        # 0xffff82d080533080 <per_cpu__timers>
> >    0xffff82d08022e733 <add_entry+19>:   push   %rbx
> >    0xffff82d08022e734 <add_entry+20>:   add    (%rax,%rdx,8),%r10
> >    0xffff82d08022e738 <add_entry+24>:   movl   $0x0,0x8(%rdi)
> >    0xffff82d08022e73f <add_entry+31>:   movb   $0x3,0x2a(%rdi)
> >    0xffff82d08022e743 <add_entry+35>:   mov    0x8(%r10),%r8
> >    0xffff82d08022e747 <add_entry+39>:   movzwl (%r8),%ecx
> >
> > And this points to
> > int sz = GET_HEAP_SIZE(heap);
> > in add_to_heap(), which is inlined into add_entry() in timer.c.
> >
> > static int add_entry(struct timer *t)
> > {
> > ffff82d08022cad3:   53                      push   %rbx
> >     struct timers *timers = &per_cpu(timers, t->cpu);
> > ffff82d08022cad4:   4c 03 14 d0             add    (%rax,%rdx,8),%r10
> >     int rc;
> >
> >     ASSERT(t->status == TIMER_STATUS_invalid);
> >
> >     /* Try to add to heap. t->heap_offset indicates whether we succeed. */
> >     t->heap_offset = 0;
> > ffff82d08022cad8:   c7 47 08 00 00 00 00    movl   $0x0,0x8(%rdi)
> >     t->status = TIMER_STATUS_in_heap;
> > ffff82d08022cadf:   c6 47 2a 03             movb   $0x3,0x2a(%rdi)
> >     rc = add_to_heap(timers->heap, t);
> > ffff82d08022cae3:   4d 8b 42 08             mov    0x8(%r10),%r8
> >
> > /* Add new entry @t to @heap. Return TRUE if new top of heap. */
> > static int add_to_heap(struct timer **heap, struct timer *t)
> > {
> >     int sz = GET_HEAP_SIZE(heap);
> > ffff82d08022cae7:   41 0f b7 08             movzwl (%r8),%ecx
> >
> >     /* Fail if the heap is full. */
> >     if ( unlikely(sz == GET_HEAP_LIMIT(heap)) )
> >
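> > For context, the heap's bookkeeping lives in the first two u16 fields
> > of the heap array. Paraphrased from memory of common/timer.c (so treat
> > this as a sketch and check the exact macros in your tree):
> >
> > /* The heap's current size and limit are stored as 16-bit fields at
> >  * the start of the heap array, so GET_HEAP_SIZE() is a 16-bit load
> >  * through the heap pointer -- the faulting movzwl (%r8),%ecx above.
> >  * Note that cr2 == r8, i.e. timers->heap itself is the bad pointer. */
> > #define GET_HEAP_SIZE(_h)   ((int)(((u16 *)(_h))[0]))
> > #define GET_HEAP_LIMIT(_h)  ((int)(((u16 *)(_h))[1]))
> >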
> > But checking the values of nr_cpumask_bits, nr_cpu_ids and NR_CPUS did
> > not provide any clues as to why it fails here.
> >
> > After disabling the Xen cpufreq driver in Linux, the page fault did not
> > appear, but creating a new guest caused another fatal page fault:
> >
> > @ (XEN) CPU:    0 
> > @ (XEN) RIP:    e008:[<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> > @ (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor 
> > @ (XEN) rax: 0000000000000000   rbx: 00000000ffdb53c0   rcx: 0000000000000004 
> > @ (XEN) rdx: ffff82d080513a20   rsi: 00000000000000f0   rdi: ffff8840ffdb53c0 
> > @ (XEN) rbp: 00000000000000e9   rsp: ffff82d0804d7d88   r8:  0000000000000000 
> > @ (XEN) r9:  0000000000000000   r10: 0000000000000017   r11: 0000000000000000 
> > @ (XEN) r12: ffff8381875ee3e0   r13: ffff82d0804d7e98   r14: 00000000000000e9 
> > @ (XEN) r15: 00000000000000f0   cr0: 0000000080050033   cr4: 00000000001526f0 
> > @ (XEN) cr3: 0000008174093000   cr2: ffff8840ffdb53c0 
> > @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008 
> > @ (XEN) Xen stack trace from rsp=ffff82d0804d7d88: 
> > @ (XEN)    00000000000000e7 ffff82d080206030 000000cf7d47d0a2 00000000000000e9 
> > @ (XEN)    00000000000000f0 0000000000000002 ffff83808fb6ffd0 ffff82d080533db8 
> > @ (XEN)    0000000000000000 ffff82d080532f50 ffff82d0804d0000 ffff82d080533db8 
> > @ (XEN)    00007fa8c83e5004 ffff82d0804d7e08 ffff82d080533db8 ffff83818b4e5000 
> > @ (XEN)    000000090000000f 00007fa8c8390001 00007fa800000002 00007fa8ae7f8eb8 
> > @ (XEN)    0000000000000002 00007fa898004170 000000000159c320 00000034ccc6cffe 
> > @ (XEN)    00007fa8c83e5000 0000000000000000 000000000159c320 fffffc73ffffffff 
> > @ (XEN)    00000034ccf6e920 00000034ccf6e920 00000034ccf6e920 00000034ccc94298 
> > @ (XEN)    00007fa898004170 00000034ccc94220 ffffffffffffffff ffffffffffffffff 
> > @ (XEN)    ffffffffffffffff 000000ffffffffff 00000034ca0e08c7 0000000000000100 
> > @ (XEN)    00000034ca0e08c7 0000000000000033 0000000000000246 ffff83006bddd000 
> > @ (XEN)    ffff8808456f1e98 00007fa8ae7f8d90 ffff88084ad1d900 0000000000000001 
> > @ (XEN)    00007fa8ae7f8d90 ffff82d0803243a9 00000000ffffffff 0000000001d0085c 
> > @ (XEN)    00007fa8c84549c0 00007fa898004170 ffff8808456f1e98 00007fa8ae7f8d90 
> > @ (XEN)    0000000000000282 00000000019c9998 0000000000000003 0000000001d00a49 
> > @ (XEN)    0000000000000024 ffffffff8100148a 00007fa898004170 00007fa8ae7f8ed0 
> > @ (XEN)    00007fa8c83e5004 0001010000000000 ffffffff8100148a 000000000000e033 
> > @ (XEN)    0000000000000282 ffff8808456f1e40 000000000000e02b 0000000000000000 
> > @ (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000 
> > @ (XEN)    ffff83006bddd000 0000000000000000 0000000000000000 
> > @ (XEN) Xen call trace: 
> > @ (XEN)    [<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> > @ (XEN)    [<ffff82d080206030>] do_domctl+0x12b0/0x13d0 
> > @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> > @ (XEN) 
> > @ (XEN) Pagetable walk from ffff8840ffdb53c0: 
> > @ (XEN)  L4[0x110] = 00000080818b3067 00000000000018b3
> >
> > Booting upstream Xen on the same server (with the same command line as
> > in the other cases) causes yet another page fault (see the attached
> > upstream_no_mem_override.log).
> >
> > We remembered that there is another open bug about a problem when
> > starting with more than 4 TB of memory. The workaround for that was to
> > override mem on the Xen command line. We tried this, and with both
> > upstream Xen and 4.4.3 with the Linux cpufreq driver enabled, the
> > problem disappears. See the attached logs upstream_with_mem_override.log
> > and 4.4.3_with_mem_overrride.log.
> >
> > Any information on what the issue might be, or any other pointers,
> > would be very helpful.
> > I will provide additional info if needed.
> >
> > Thank you
> > Elena
> 
> This is an issue we have found in XenServer as well.
> 
> Observe that ffff8840ffdb53c0 is actually a pointer into the 64-bit PV
> virtual region, because the xenheap allocator has wandered off the top
> of the directmap region.  This is a direct result of passing NUMA node
> information to alloc_xenheap_page(), which overrides the check that
> keeps the allocation inside the directmap region.

Thanks Andrew.
Ok, that explains why the address looked odd.
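Doing the arithmetic on the faulting address backs this up. Below is a
minimal standalone check; the layout constants are the Xen 4.5 x86-64
ones from asm-x86/config.h, so treat it as a sketch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* cr2 from the second crash above. */
    uint64_t cr2 = 0xffff8840ffdb53c0ULL;
    /* Xen 4.5 x86-64 layout: the directmap covers ffff830000000000-
     * ffff87ffffffffff (5TiB); the 64-bit PV guest area starts at
     * ffff880000000000. */
    uint64_t dm_start = 0xffff830000000000ULL;
    uint64_t pv_start = 0xffff880000000000ULL;

    /* Bits 39-47 of a virtual address select the L4 (PML4) slot. */
    printf("cr2 L4 slot       = 0x%lx\n", (unsigned long)((cr2 >> 39) & 0x1ff));
    printf("directmap L4 slot = 0x%lx\n", (unsigned long)((dm_start >> 39) & 0x1ff));
    printf("PV area L4 slot   = 0x%lx\n", (unsigned long)((pv_start >> 39) & 0x1ff));
    return 0;
}

This prints 0x110, 0x106 and 0x110 respectively: the faulting address
lands in the first L4 slot past the directmap, which matches the
L4[0x110] line in the pagetable walks above.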
> 
> I have worked around in XenServer with
> 
> diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
> index 3c64f19..715765a 100644
> --- a/xen/arch/x86/e820.c
> +++ b/xen/arch/x86/e820.c
> @@ -15,7 +15,7 @@
>   * opt_mem: Limit maximum address of physical RAM.
>   *          Any RAM beyond this address limit is ignored.
>   */
> -static unsigned long long __initdata opt_mem;
> +static unsigned long long __initdata opt_mem = GB(5 * 1024);
>  size_param("mem", opt_mem);
>  
>  /*
> 
> This causes Xen to ignore any RAM above the top of the directmap region,
> which happens to be 5TiB on Xen 4.5.

Yes, looks like mem override is a current workaround in our case too.
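For anyone else hitting this: since the diff above wires opt_mem to the
"mem" option via size_param(), the same cap can be applied from the boot
loader without patching Xen. A hedged example (GRUB syntax varies by
distro; 5120G mirrors the 5TiB directmap limit Andrew mentions):

    multiboot /boot/xen.gz mem=5120G <other Xen options>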
> 
> In some copious free time, I was going to look into segmenting the
> directmap region by NUMA node, rather than having it linear from 0, so
> that xenheap pages can still be properly NUMA-located.

Thanks Andrew.
> 
> ~Andrew
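
To make the directmap-segmentation idea a little more concrete, here is
one purely illustrative shape it could take. None of these names or
constants exist in Xen today except DIRECTMAP_VIRT_START; this is a
sketch of the layout change, not a patch:

/* Hypothetical: give each NUMA node a fixed 1TiB window inside the
 * directmap, so node-local xenheap allocations always land at a
 * mappable virtual address no matter how high their machine
 * addresses sit. */
#define DIRECTMAP_NODE_SHIFT  40   /* invented: 1TiB stride per node */

static inline void *node_directmap_base(unsigned int node)
{
    return (void *)(DIRECTMAP_VIRT_START +
                    ((unsigned long)node << DIRECTMAP_NODE_SHIFT));
}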

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

