
Re: [Xen-devel] page faults on machines with > 4TB memory



On Thu, Jul 23, 2015 at 06:01:45PM +0100, Andrew Cooper wrote:
> On 23/07/15 17:35, Elena Ufimtseva wrote:
> > Hi
> >
> > While working on boot-time bugs on a large Oracle X4-8 server, we hit
> > a problem booting Xen on large machines with > 4TB of memory.
> > The page fault initially occurred while uploading Xen PM info into the
> > hypervisor (you can see it in the attached serial log named
> > 4.4.2_no_mem_override).
> > Tracing the issue down shows that the page fault occurs in the timer.c
> > code while getting the heap size.
> >
> > Here is the original call trace:
> > rocessor: Uploading Xen processor PM info 
> > @ (XEN) ----[ Xen-4.4.3-preOVM  x86_64  debug=n  Tainted:    C ]---- 
> > @ (XEN) CPU:    0 
> > @ (XEN) RIP:    e008:[<ffff82d08022e747>] add_entry+0x27/0x120 
> > @ (XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor 
> > @ (XEN) rax: ffff82d080513a20   rbx: ffff83808e802300   rcx: 00000000000000e8 
> > @ (XEN) rdx: 00000000000000e8   rsi: 00000000000000e8   rdi: ffff83808e802300 
> > @ (XEN) rbp: ffff82d080513a20   rsp: ffff82d0804d7c70   r8:  ffff8840ffdb5010 
> > @ (XEN) r9:  0000000000000017   r10: ffff83808e802180   r11: 0200200200200200 
> > @ (XEN) r12: ffff82d080533080   r13: 0000000000000296   r14: 0100100100100100 
> > @ (XEN) r15: 00000000000000e8   cr0: 0000000080050033   cr4: 00000000001526f0 
> > @ (XEN) cr3: 00000100818b2000   cr2: ffff8840ffdb5010 
> > @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008 
> > @ (XEN) Xen stack trace from rsp=ffff82d0804d7c70: 
> > @ (XEN)    ffff83808e802300 ffff82d080513a20 ffff82d08022f59b ffff82d080533080 
> > @ (XEN)    ffff82d080532f50 00000000000000e8 ffff83808e802328 0000000000000000 
> > @ (XEN)    ffff82d080513a20 ffff83808e8022c0 ffff82d080533200 00000000000000e8 
> > @ (XEN)    00000000000000f0 ffff82d0805331c0 ffff82d0802458e2 0000000000000000 
> > @ (XEN)    00000000000000e8 ffff83808e802334 ffff8384be7979b0 ffff82d0804d7d78 
> > @ (XEN)    0000000000000000 ffff8384be77c700 ffff82d0804d7d78 ffff82d080513a20 
> > @ (XEN)    ffff82d080246207 00000000000000e8 00000000000000e8 ffff8384be7979b0 
> > @ (XEN)    ffff82d08024518a ffff82d080533080 0000000000000070 ffff82d080533da8 
> > @ (XEN)    00000001000000e8 ffff8384be797a00 000000e800000001 002ab980002abd68 
> > @ (XEN)    0000271000124f80 002abd6800124f80 00000000002ab980 ffff82d0803753e0 
> > @ (XEN)    0000000000010101 0000000000000001 ffff82d0804d7e18 ffff881fb4afbc88 
> > @ (XEN)    ffff82d0804d0000 ffff881fb28a4400 ffff82d0804fca80 ffffffff819b7080 
> > @ (XEN)    ffff82d080266c16 ffff83808fb46ba8 ffff82d080208a82 ffff83006bddd190 
> > @ (XEN)    0000000000000292 0300000100000036 00000001000000f6 000000000000000f 
> > @ (XEN)    0000007f000c0082 0000000000000000 0000007f000c0082 0000000000000000 
> > @ (XEN)    000000000000000a ffff881fb28a4400 0000000000000005 0000000000000000 
> > @ (XEN)    0000000000000000 00000000000000fe 0000000000000001 0000000000000001 
> > @ (XEN)    0000000000000000 0000000000000000 ffff82d08031f521 0000000000000000 
> > @ (XEN)    0000000000000246 ffffffff810010ea 0000000000000000 ffffffff810010ea 
> > @ (XEN)    000000000000e030 0000000000000246 ffff83006bddd000 ffff881fb4afbd48 
> > @ (XEN) Xen call trace: 
> > @ (XEN)    [<ffff82d08022e747>] add_entry+0x27/0x120 
> > @ (XEN)    [<ffff82d08022f59b>] set_timer+0x10b/0x220 
> > @ (XEN)    [<ffff82d0802458e2>] cpufreq_governor_dbs+0x1e2/0x2f0 
> > @ (XEN)    [<ffff82d080246207>] __cpufreq_set_policy+0x87/0x120 
> > @ (XEN)    [<ffff82d08024518a>] cpufreq_add_cpu+0x24a/0x4f0 
> > @ (XEN)    [<ffff82d080266c16>] do_platform_op+0x9c6/0x1650 
> > @ (XEN)    [<ffff82d080208a82>] evtchn_check_pollers+0x22/0xb0 
> > @ (XEN)    [<ffff82d08031f521>] do_iret+0xc1/0x1a0 
> > @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> > @ (XEN) 
> > @ (XEN) Pagetable walk from ffff8840ffdb5010: 
> > @ (XEN)  L4[0x110] = 00000100818b3067 00000000000018b3 
> > @ (XEN)  L3[0x103] = 0000000000000000 ffffffffffffffff 
> > @ (XEN) 
> > @ (XEN) ****************************************
> >
> >    0xffff82d08022e720 <add_entry>:      movzwl 0x28(%rdi),%edx
> >    0xffff82d08022e724 <add_entry+4>:    push   %rbp
> >    0xffff82d08022e725 <add_entry+5>:    lea    0x2e52f4(%rip),%rax        # 0xffff82d080513a20 <__per_cpu_offset>
> >    0xffff82d08022e72c <add_entry+12>:   lea    0x30494d(%rip),%r10        # 0xffff82d080533080 <per_cpu__timers>
> >    0xffff82d08022e733 <add_entry+19>:   push   %rbx
> >    0xffff82d08022e734 <add_entry+20>:   add    (%rax,%rdx,8),%r10
> >    0xffff82d08022e738 <add_entry+24>:   movl   $0x0,0x8(%rdi)
> >    0xffff82d08022e73f <add_entry+31>:   movb   $0x3,0x2a(%rdi)
> >    0xffff82d08022e743 <add_entry+35>:   mov    0x8(%r10),%r8
> >    0xffff82d08022e747 <add_entry+39>:   movzwl (%r8),%ecx
> >
> > And this points to
> > int sz = GET_HEAP_SIZE(heap);
> > in add_to_heap(), which is inlined into add_entry() in timer.c.
> >
> > static int add_entry(struct timer *t)
> > {
> > ffff82d08022cad3:   53                      push   %rbx
> >     struct timers *timers = &per_cpu(timers, t->cpu);
> > ffff82d08022cad4:   4c 03 14 d0             add    (%rax,%rdx,8),%r10
> >     int rc;
> >
> >     ASSERT(t->status == TIMER_STATUS_invalid);
> >
> >     /* Try to add to heap. t->heap_offset indicates whether we succeed. */
> >     t->heap_offset = 0;
> > ffff82d08022cad8:   c7 47 08 00 00 00 00    movl   $0x0,0x8(%rdi)
> >     t->status = TIMER_STATUS_in_heap;
> > ffff82d08022cadf:   c6 47 2a 03             movb   $0x3,0x2a(%rdi)
> >     rc = add_to_heap(timers->heap, t);
> > ffff82d08022cae3:   4d 8b 42 08             mov    0x8(%r10),%r8
> >
> > /* Add new entry @t to @heap. Return TRUE if new top of heap. */
> > static int add_to_heap(struct timer **heap, struct timer *t)
> > {
> >     int sz = GET_HEAP_SIZE(heap);
> > ffff82d08022cae7:   41 0f b7 08             movzwl (%r8),%ecx
> >
> >     /* Fail if the heap is full. */
> >     if ( unlikely(sz == GET_HEAP_LIMIT(heap)) )
> >
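> > For context, the heap's bookkeeping lives in the first two u16 fields
> > of the heap array. Paraphrased from memory of common/timer.c (so treat
> > this as a sketch and check the exact macros in your tree):
> >
> > /* The heap's current size and limit are stored as 16-bit fields at
> >  * the start of the heap array, so GET_HEAP_SIZE() is a 16-bit load
> >  * through the heap pointer -- the faulting movzwl (%r8),%ecx above.
> >  * Note that cr2 == r8, i.e. timers->heap itself is the bad pointer. */
> > #define GET_HEAP_SIZE(_h)   ((int)(((u16 *)(_h))[0]))
> > #define GET_HEAP_LIMIT(_h)  ((int)(((u16 *)(_h))[1]))
> >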
> > But checking the values of nr_cpumask_bits, nr_cpu_ids and NR_CPUS did
> > not provide any clues as to why it fails here.
> >
> > After disabling the Xen cpufreq driver in Linux, the page fault did not
> > appear, but creating a new guest caused another fatal page fault:
> >
> > @ (XEN) CPU:    0 
> > @ (XEN) RIP:    e008:[<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> > @ (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor 
> > @ (XEN) rax: 0000000000000000   rbx: 00000000ffdb53c0   rcx: 0000000000000004 
> > @ (XEN) rdx: ffff82d080513a20   rsi: 00000000000000f0   rdi: ffff8840ffdb53c0 
> > @ (XEN) rbp: 00000000000000e9   rsp: ffff82d0804d7d88   r8:  0000000000000000 
> > @ (XEN) r9:  0000000000000000   r10: 0000000000000017   r11: 0000000000000000 
> > @ (XEN) r12: ffff8381875ee3e0   r13: ffff82d0804d7e98   r14: 00000000000000e9 
> > @ (XEN) r15: 00000000000000f0   cr0: 0000000080050033   cr4: 00000000001526f0 
> > @ (XEN) cr3: 0000008174093000   cr2: ffff8840ffdb53c0 
> > @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008 
> > @ (XEN) Xen stack trace from rsp=ffff82d0804d7d88: 
> > @ (XEN)    00000000000000e7 ffff82d080206030 000000cf7d47d0a2 00000000000000e9 
> > @ (XEN)    00000000000000f0 0000000000000002 ffff83808fb6ffd0 ffff82d080533db8 
> > @ (XEN)    0000000000000000 ffff82d080532f50 ffff82d0804d0000 ffff82d080533db8 
> > @ (XEN)    00007fa8c83e5004 ffff82d0804d7e08 ffff82d080533db8 ffff83818b4e5000 
> > @ (XEN)    000000090000000f 00007fa8c8390001 00007fa800000002 00007fa8ae7f8eb8 
> > @ (XEN)    0000000000000002 00007fa898004170 000000000159c320 00000034ccc6cffe 
> > @ (XEN)    00007fa8c83e5000 0000000000000000 000000000159c320 fffffc73ffffffff 
> > @ (XEN)    00000034ccf6e920 00000034ccf6e920 00000034ccf6e920 00000034ccc94298 
> > @ (XEN)    00007fa898004170 00000034ccc94220 ffffffffffffffff ffffffffffffffff 
> > @ (XEN)    ffffffffffffffff 000000ffffffffff 00000034ca0e08c7 0000000000000100 
> > @ (XEN)    00000034ca0e08c7 0000000000000033 0000000000000246 ffff83006bddd000 
> > @ (XEN)    ffff8808456f1e98 00007fa8ae7f8d90 ffff88084ad1d900 0000000000000001 
> > @ (XEN)    00007fa8ae7f8d90 ffff82d0803243a9 00000000ffffffff 0000000001d0085c 
> > @ (XEN)    00007fa8c84549c0 00007fa898004170 ffff8808456f1e98 00007fa8ae7f8d90 
> > @ (XEN)    0000000000000282 00000000019c9998 0000000000000003 0000000001d00a49 
> > @ (XEN)    0000000000000024 ffffffff8100148a 00007fa898004170 00007fa8ae7f8ed0 
> > @ (XEN)    00007fa8c83e5004 0001010000000000 ffffffff8100148a 000000000000e033 
> > @ (XEN)    0000000000000282 ffff8808456f1e40 000000000000e02b 0000000000000000 
> > @ (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000 
> > @ (XEN)    ffff83006bddd000 0000000000000000 0000000000000000 
> > @ (XEN) Xen call trace: 
> > @ (XEN)    [<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> > @ (XEN)    [<ffff82d080206030>] do_domctl+0x12b0/0x13d0 
> > @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> > @ (XEN) 
> > @ (XEN) Pagetable walk from ffff8840ffdb53c0: 
> > @ (XEN)  L4[0x110] = 00000080818b3067 00000000000018b3
> >
> > Booting upstream Xen on the same server (with the same command line as
> > in the other cases) causes yet another page fault (see the attached
> > upstream_no_mem_override.log).
> >
> > We remembered that there is another open bug about a problem when
> > starting with more than 4 TB of memory. The workaround for that was to
> > override mem on the Xen command line. We tried this, and with both
> > upstream Xen and 4.4.3 with the Linux cpufreq driver enabled, the
> > problem disappears. See the attached logs upstream_with_mem_override.log
> > and 4.4.3_with_mem_overrride.log.
> >
> > Any information on what the issue might be, or any other pointers,
> > would be very helpful.
> > I will provide additional info if needed.
> >
> > Thank you
> > Elena
> 
> This is an issue we have found in XenServer as well.
> 
> Observe that ffff8840ffdb53c0 is actually a pointer into the 64-bit PV
> virtual region, because the xenheap allocator has wandered off the top
> of the directmap region.  This is a direct result of passing NUMA node
> information to alloc_xenheap_page(), which overrides the check that
> keeps the allocation inside the directmap region.

Thanks Andrew.
Ok, that explains why the address looked odd.
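Doing the arithmetic on the faulting address backs this up. Below is a
minimal standalone check; the layout constants are the Xen 4.5 x86-64
ones from asm-x86/config.h, so treat it as a sketch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* cr2 from the second crash above. */
    uint64_t cr2 = 0xffff8840ffdb53c0ULL;
    /* Xen 4.5 x86-64 layout: the directmap covers ffff830000000000-
     * ffff87ffffffffff (5TiB); the 64-bit PV guest area starts at
     * ffff880000000000. */
    uint64_t dm_start = 0xffff830000000000ULL;
    uint64_t pv_start = 0xffff880000000000ULL;

    /* Bits 39-47 of a virtual address select the L4 (PML4) slot. */
    printf("cr2 L4 slot       = 0x%lx\n", (unsigned long)((cr2 >> 39) & 0x1ff));
    printf("directmap L4 slot = 0x%lx\n", (unsigned long)((dm_start >> 39) & 0x1ff));
    printf("PV area L4 slot   = 0x%lx\n", (unsigned long)((pv_start >> 39) & 0x1ff));
    return 0;
}

This prints 0x110, 0x106 and 0x110 respectively: the faulting address
lands in the first L4 slot past the directmap, which matches the
L4[0x110] line in the pagetable walks above.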
> 
> I have worked around in XenServer with
> 
> diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
> index 3c64f19..715765a 100644
> --- a/xen/arch/x86/e820.c
> +++ b/xen/arch/x86/e820.c
> @@ -15,7 +15,7 @@
>   * opt_mem: Limit maximum address of physical RAM.
>   *          Any RAM beyond this address limit is ignored.
>   */
> -static unsigned long long __initdata opt_mem;
> +static unsigned long long __initdata opt_mem = GB(5 * 1024);
>  size_param("mem", opt_mem);
>  
>  /*
> 
> This causes Xen to ignore any RAM above the top of the directmap region,
> which happens to be 5TiB on Xen 4.5.

Yes, looks like mem override is a current workaround in our case too.
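For anyone else hitting this: since the diff above wires opt_mem to the
"mem" option via size_param(), the same cap can be applied from the boot
loader without patching Xen. A hedged example (GRUB syntax varies by
distro; 5120G mirrors the 5TiB directmap limit Andrew mentions):

    multiboot /boot/xen.gz mem=5120G <other Xen options>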
> 
> In some copious free time, I was going to look into segmenting the
> directmap region by NUMA node, rather than having it linear from 0, so
> that xenheap pages can still be properly NUMA-located.

Thanks Andrew.
> 
> ~Andrew
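
To make the directmap-segmentation idea a little more concrete, here is
one purely illustrative shape it could take. None of these names or
constants exist in Xen today except DIRECTMAP_VIRT_START; this is a
sketch of the layout change, not a patch:

/* Hypothetical: give each NUMA node a fixed 1TiB window inside the
 * directmap, so node-local xenheap allocations always land at a
 * mappable virtual address no matter how high their machine
 * addresses sit. */
#define DIRECTMAP_NODE_SHIFT  40   /* invented: 1TiB stride per node */

static inline void *node_directmap_base(unsigned int node)
{
    return (void *)(DIRECTMAP_VIRT_START +
                    ((unsigned long)node << DIRECTMAP_NODE_SHIFT));
}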

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

