
RE: [Xen-devel] State of current Xen debugger



I am using Xen 3.4.2 with some modifications.

I added printks to nmi_watchdog_tick as shown below.  I don't break the console lock, but I am convinced that the printk lock is not the problem: I also tested with printk replaced by an empty routine and it still hangs, so it felt pretty safe not to break the lock.  I also tried console_start_sync/console_end_sync to make sure I was seeing all the messages when it hung.


void nmi_watchdog_tick(struct cpu_user_regs * regs)
{
    unsigned int sum = this_cpu(nmi_timer_ticks);

    if ( (this_cpu(last_irq_sums) == sum) &&
         !atomic_read(&watchdog_disable_count) )
    {
        /* Debug: the timer tick count has not moved since the last NMI. */
        if ( sum > 20 )
        {
            // console_start_sync();
            printk("**** CPU%d, counter=%d, last_sum=%d, curr_sum=%d, hz=%d, nmis=%d\n",
                   smp_processor_id(), this_cpu(alert_counter),
                   this_cpu(last_irq_sums), sum, 5*nmi_hz,
                   nmi_count(smp_processor_id()));
            // console_end_sync();
        }

        /*
         * Ayiee, looks like this CPU is stuck ... wait a few IRQs (5 seconds)
         * before doing the oops ...
         */
        this_cpu(alert_counter)++;
        if ( this_cpu(alert_counter) == 5*nmi_hz )
        {
            console_force_unlock();
            printk("Watchdog timer detects that CPU%d is stuck!\n",
                   smp_processor_id());
            fatal_trap(TRAP_nmi, regs);
        }
    }
    else
    {
        /* Debug: the timer tick count advanced, so the CPU is making progress. */
        if ( sum > 20 )
        {
            // console_start_sync();
            printk("*CPU%d, counter=%d, last_sum=%d, curr_sum=%d, nmis=%d\n",
                   smp_processor_id(), this_cpu(alert_counter),
                   this_cpu(last_irq_sums), sum,
                   nmi_count(smp_processor_id()));
            // console_end_sync();
        }

        this_cpu(last_irq_sums) = sum;
        this_cpu(alert_counter) = 0;
    }
}

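To make the stuck-CPU test in the function above easier to reason about, here is a minimal standalone sketch of the same counting logic outside Xen.  The names wd_state, watchdog_tick and nmis_until_stuck are hypothetical stand-ins, not Xen symbols, and the per-CPU variables are replaced by a plain struct.

```c
/* Hypothetical stand-in for the per-CPU watchdog variables; not Xen's definitions. */
struct wd_state {
    unsigned int last_irq_sum;   /* timer tick count seen at the previous NMI */
    unsigned int alert_counter;  /* consecutive NMIs with no tick progress */
};

/*
 * One watchdog NMI.  'sum' is the current timer tick count.
 * Returns 1 when the CPU would be declared stuck, 0 otherwise.
 */
static int watchdog_tick(struct wd_state *s, unsigned int sum, unsigned int nmi_hz)
{
    if (s->last_irq_sum == sum) {
        /* No timer ticks since the last NMI: count it against the CPU. */
        s->alert_counter++;
        if (s->alert_counter == 5 * nmi_hz)
            return 1;
    } else {
        /* Ticks advanced: remember the new count and reset the alert. */
        s->last_irq_sum = sum;
        s->alert_counter = 0;
    }
    return 0;
}

/*
 * Deliver up to max_nmis NMIs while the tick count stays frozen at
 * frozen_sum.  Returns the NMI number at which the watchdog fires,
 * or 0 if it never fires within max_nmis.
 */
static unsigned int nmis_until_stuck(unsigned int frozen_sum, unsigned int nmi_hz,
                                     unsigned int max_nmis)
{
    struct wd_state s = { frozen_sum, 0 };
    unsigned int n;

    for (n = 1; n <= max_nmis; n++)
        if (watchdog_tick(&s, frozen_sum, nmi_hz))
            return n;
    return 0;
}
```

The point of the sketch: the watchdog can only fire if NMIs keep arriving while the tick count is frozen.  If the NMIs themselves stop, alert_counter never reaches 5*nmi_hz and nothing is reported.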

My messages stop printing and I get a hard hang.  The performance-counter NMI appears to come once every 4 seconds.  However, I have observed instances where they are about 10 seconds apart.  I am not sure what is making the NMIs come in at uneven intervals.  As a test, I turned on SpeedStep and power management functions in the BIOS and it still hangs.

(XEN) *CPU0, counter=0, last_sum=974, curr_sum=977, nmis=391
(XEN) *CPU0, counter=0, last_sum=977, curr_sum=979, nmis=392
(XEN) *CPU0, counter=0, last_sum=979, curr_sum=981, nmis=393
(XEN) *CPU0, counter=0, last_sum=981, curr_sum=984, nmis=394
(XEN) *CPU0, counter=0, last_sum=984, curr_sum=986, nmis=395
(XEN) *CPU0, counter=0, last_sum=986, curr_sum=988, nmis=396
(XEN) *CPU0, counter=0, last_sum=988, curr_sum=991, nmis=397
(XEN) *CPU0, counter=0, last_sum=991, curr_sum=993, nmis=398
(XEN) *CPU0, counter=0, last_sum=993, curr_sum=995, nmis=399
(XEN) *CPU0, counter=0, last_sum=995, curr_sum=997, nmis=400
(XEN) *CPU0, counter=0, last_sum=997, curr_sum=1000, nmis=401
(XEN) *CPU0, counter=0, last_sum=1000, curr_sum=1002, nmis=402
(XEN) *CPU0, counter=0, last_sum=1002, curr_sum=1005, nmis=403
(XEN) *CPU0, counter=0, last_sum=1005, curr_sum=1008, nmis=404
(XEN) *CPU0, counter=0, last_sum=1008, curr_sum=1010, nmis=405
(XEN) *CPU0, counter=0, last_sum=1010, curr_sum=1013, nmis=406
(XEN) *CPU0, counter=0, last_sum=1013, curr_sum=1015, nmis=407
(XEN) *CPU0, counter=0, last_sum=1015, curr_sum=1018, nmis=408
(XEN) *CPU0, counter=0, last_sum=1018, curr_sum=1020, nmis=409
(XEN) *CPU0, counter=0, last_sum=1020, curr_sum=1023, nmis=410
(XEN) *CPU0, counter=0, last_sum=1023, curr_sum=1026, nmis=411
(XEN) *CPU0, counter=0, last_sum=1026, curr_sum=1029, nmis=412
(XEN) *CPU0, counter=0, last_sum=1029, curr_sum=1031, nmis=413
(XEN) *CPU0, counter=0, last_sum=1031, curr_sum=1033, nmis=414
(XEN) *CPU0, counter=0, last_sum=1033, curr_sum=1035, nmis=415
(XEN) *CPU0, counter=0, last_sum=1035, curr_sum=1038, nmis=416
(XEN) *CPU0, counter=0, last_sum=1038, curr_sum=1041, nmis=417
(XEN) *CPU0, counter=0, last_sum=1041, curr_sum=1043, nmis=418
(XEN) *CPU0, counter=0, last_sum=1043, curr_sum=1046, nmis=419
(XEN) *CPU0, counter=0, last_sum=1046, curr_sum=1049, nmis=420
(XEN) *CPU0, counter=0, last_sum=1049, curr_sum=1051, nmis=421
(XEN) *CPU0, counter=0, last_sum=1051, curr_sum=1055, nmis=422
(XEN) *CPU0, counter=0, last_sum=1055, curr_sum=1058, nmis=423
(XEN) *CPU0, counter=0, last_sum=1058, curr_sum=1061, nmis=424
(XEN) *CPU0, counter=0, last_sum=1061, curr_sum=1064, nmis=425
(XEN) *CPU0, counter=0, last_sum=1064, curr_sum=1067, nmis=426
(XEN) *CPU0, counter=0, last_sum=1067, curr_sum=1070, nmis=427
(XEN) *CPU0, counter=0, last_sum=1070, curr_sum=1073, nmis=428
(XEN) *CPU0, counter=0, last_sum=1073, curr_sum=1076, nmis=429
 __  __            _____ _  _    ____
 \ \/ /___ _ __   |___ /| || |  |___ \
  \  // _ \ '_ \    |_ \| || |_   __) |
  /  \  __/ | | |  ___) |__   _| / __/
 /_/\_\___|_| |_| |____(_) |_|(_)_____|
                                     
(XEN) Xen version 3.4.2 (rcruz@) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) Mon Sep 13 23:06:17 UTC 2010
(XEN) Latest ChangeSet: Mon Sep 13 16:12:14 2010 -0400 132:a499dd8fcb55
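
The log lines above can be checked mechanically rather than by eye.  The following is a small sketch, assuming the exact line format produced by the "*CPU" printk; wd_sample, parse_sample and ticks_since_last_nmi are hypothetical helper names.  The difference curr_sum - last_sum is the number of timer ticks counted between two consecutive NMIs.  If that timer runs at a fixed period (an assumption, not something confirmed in this thread), the difference is a rough proxy for how far apart the NMIs were.

```c
#include <stdio.h>

/* One parsed "*CPU" debug line; hypothetical helper type. */
struct wd_sample {
    int cpu;
    unsigned int counter, last_sum, curr_sum, nmis;
};

/* Parse one console line.  Returns 1 on success, 0 if the line does not match. */
static int parse_sample(const char *line, struct wd_sample *s)
{
    /* A '*' outside a % conversion is an ordinary literal in a scanf format. */
    return sscanf(line,
                  "(XEN) *CPU%d, counter=%u, last_sum=%u, curr_sum=%u, nmis=%u",
                  &s->cpu, &s->counter, &s->last_sum, &s->curr_sum,
                  &s->nmis) == 5;
}

/* Timer ticks counted between the previous NMI and this one. */
static unsigned int ticks_since_last_nmi(const struct wd_sample *s)
{
    return s->curr_sum - s->last_sum;
}
```

Feeding each captured line through parse_sample and printing ticks_since_last_nmi gives a per-NMI series that should make any change in spacing just before the hang easier to spot.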


-----Original Message-----
From: Tim Deegan [mailto:Tim.Deegan@xxxxxxxxxx]
Sent: Tue 9/14/2010 11:20 AM
To: Roger Cruz
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] State of current Xen debugger

At 15:56 +0100 on 14 Sep (1284479787), Roger Cruz wrote:
> I had a pretty good inkling that one of you hardcore developers would
> say that :-) Yes, it is pretty well wedged.  I can cause the problem
> more rapidly by dropping to a single CPU.  When the hang happens, the
> Xen console is completely dead.  None of the special keys work.

If the 'd' key doesn't work then the serial irq isn't getting handled,
so the CPU is wedged at a higher TPR (at least).  Usually in that case
the CPU is spinning so the NMI watchdog timer kicks in OK; possibly if
it was idle with a high TPR it wouldn't.
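
As a rough illustration of the TPR point above, here is a simplified sketch of the local APIC priority check.  prio_class and irq_deliverable are hypothetical names, and the model ignores in-service interrupts, which also raise the effective priority.  A fixed interrupt is only delivered when the priority class of its vector (the upper four bits) is strictly higher than the class held in the TPR.  NMIs are not subject to this check, which is why the watchdog can normally still fire on a CPU that is spinning at a high TPR.

```c
/* Priority class: the upper four bits of a vector or of the TPR value. */
static unsigned int prio_class(unsigned int v)
{
    return (v >> 4) & 0xf;
}

/*
 * Simplified delivery test for a fixed interrupt.  Assumes no interrupt
 * is currently in service, so the TPR alone sets the threshold.
 * Returns 1 if the interrupt would be delivered, 0 if it is held pending.
 */
static int irq_deliverable(unsigned int vector, unsigned int tpr)
{
    return prio_class(vector) > prio_class(tpr);
}
```

For example, with a made-up serial vector of 0x31, irq_deliverable(0x31, 0x00) is 1 but irq_deliverable(0x31, 0xf0) is 0: the interrupt stays pending and the 'd' key is never seen.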

What version of Xen are you using? 

It might be worth trying a boot with MSI disabled (there were reports at
one stage of MSIs not being EOI'd because the timer interrupt that would
remind Xen to EOI them was at a lower priority than the MSI).

> I do have hopes a BIOS upgrade could fix this as a last resort but I
> want to see if at least I can understand the problem.  We have a few
> different machines that are exhibiting similar symptoms so I have to
> see if I can find a work-around without requiring every user to
> upgrade their BIOS :-(
>
> Just in case, what debugger have you been using?  Are there recent
> instructions on how to set it up that you can point me to?

I don't use a debugger on Xen.  I usually find that by the time the
debugger kicks in it's too late to help, so I end up finding bugs by
code inspection and printks. :)

Mukesh Rathor at Oracle has done some debugger work, though, including
an in-Xen debugger.  There's a gdb stub too but I suspect it's rotted
quite badly.

Cheers,

Tim.

> Thanks
> Roger
>
>
> -----Original Message-----
> From: Tim Deegan [mailto:Tim.Deegan@xxxxxxxxxx]
> Sent: Tue 9/14/2010 10:30 AM
> To: Roger Cruz
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] State of current Xen debugger
>
> Hi,
>
> At 15:22 +0100 on 14 Sep (1284477779), Roger Cruz wrote:
> > I am trying to debug a problem where the hypervisor is hanging hard.
> > Not even the NMI watchdog is triggering a reboot.  So I wanted to hook
> > up a debugger.
>
> Sorry to bring a counsel of despair but if the NMI watchdog isn't
> working then your chances of getting a working debugger are slim.  It's
> likely that at least one CPU is very very stuck.  Does the 'd' debug key
> work on the serial line when the machine is wedged?
>
> On a more cheerful note, I've twice seen hard hangs like this that
> turned out to be hardware issues, fixable with BIOS upgrades.
>
> Cheers,
>
> Tim.
>
> > What is the state of the current debuggers out there?
> > Any input on how I should set it up (kdb, gdb, etc) and pointers to a
> > good wiki page are much appreciated.  I did perform a Google search
> > and found some links but I want to hear from the current developers as
> > to what is most stable and useful for debugging this type of hard
> > hang.  I only have a serial port PCI-express card to use as the laptop
> > has no built in port.
>
> --
> Tim Deegan <Tim.Deegan@xxxxxxxxxx>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)
>
> No virus found in this incoming message.
> Checked by AVG - www.avg.com
> Version: 9.0.851 / Virus Database: 271.1.1/3119 - Release Date: 09/14/10 02:35:00
>

--
Tim Deegan <Tim.Deegan@xxxxxxxxxx>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

