
Re: [Xen-devel] Instability with Xen, interrupt routing frozen, HPET broadcast



On Wed, Sep 29, 2010 at 08:34:28PM +0100, Andrew Lyon wrote:
> On Wed, Sep 29, 2010 at 7:08 PM, Andreas Kinzler <ml-xen-devel@xxxxxx> wrote:
> > On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
> >>>
> >>>  I have been talking with Jan (via email) for a while now to track down
> >>> the following problem, and he suggested that I report it on xen-devel:
> >>>
> >>> Jul  9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
> >>> Jul  9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
> >>> Jul  9 01:49:10 virt kernel: Calling adapter init
> >>> Jul  9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
> >>> Jul  9 01:49:49 virt kernel: Acquiring adapter information
> >>> Jul  9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
> >>> Jul  9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
> >>> Jul  9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing problem;
> >>> Jul  9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
> >>> Jul  9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)
> >>>
> >>> After the VMs have been running for a while, the aacraid driver reports a
> >>> non-responding RAID controller. Most of the time the NIC also stops
> >>> working.
> >>> I tried nearly every combination of dom0 kernel (pvops0, xenified SUSE
> >>> 2.6.31.x, xenified SUSE 2.6.32.x, xenified SUSE 2.6.34.x) with Xen
> >>> hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, unstable.
> >>> No success in two months. Sooner or later, every combination showed the
> >>> problem above. I did extensive tests to make sure that the
> >>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
> >>>
> >>> Jan suggested trying the fix in c/s 22051, but it did not help. My answer
> >>> to him:
> >>>
> >>>> In the meantime I did try xen-unstable c/s 22068 (contains staging c/s
> >>>> 22051) and it did not fix the problem at all. I was able to fix a
> >>>> problem with the serial console and so I got some debug info that is
> >>>> attached to this email. The following line looks suspicious to me
> >>>> (irr=1, delivery_status=1):
> >>>>
> >>>> (XEN)     IRQ 16 Vec216:
> >>>> (XEN)       Apic 0x00, Pin 16: vector=216, delivery_mode=1, dest_mode=logical,
> >>>>             delivery_status=1, polarity=1, irr=1, trigger=level, mask=0, dest_id:1
> >>>>
> >>>> IRQ 16 is the aacraid controller which after some while seems to be
> >>>> unable to receive interrupts. Can you see from the debug info what is
> >>>> going on?
> >>>
> >>> I also applied a small patch which disables HPET broadcast. The machine
> >>> has now been running for 110 hours without a crash, while normally it
> >>> crashes within a few minutes. Is there something wrong (race, deadlock)
> >>> with HPET broadcasts in relation to blocked interrupt reception (see
> >>> above)?
> >>
> >> What kind of hardware does this happen on?
> >
> > It is a Supermicro X8SIL-F, Intel Xeon 3450 system.
> >
> >> Should this patch be merged?
> >
> > Not easy to answer. I spent more than 10 weeks searching nearly full time
> > for the cause of the stability issues. Finally I was able to track it down
> > to the HPET broadcast code.
> >
> > We need to find the developer of the HPET broadcast code, who should then
> > try to fix it. I consider it quite a severe bug as it renders Xen
> > nearly useless on affected systems. That is why I (and my boss who pays me)
> > spent so much time (developing/fixing Xen is not really my core job) and
> > money (buying an E5620 machine just for testing Xen).
> >
> > I think many people on affected systems are having problems. See
> > http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html
> >
> > Regards Andreas
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@xxxxxxxxxxxxxxxxxxx
> > http://lists.xensource.com/xen-devel
> >
> 
> I will test that patch on my Supermicro X7DWA-N based dual Xeon
> workstation. I always use a Xenified kernel rather than pv_ops as it
> supports some features that I need and is compatible with nvidia
> binary drivers, but I've always had problems with very occasional

<hint> The PVOPS kernel works with the nouveau driver</hint>
Look at http://wiki.xensource.com/xenwiki/XenPVOPSDRM for details.
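
For anyone who wants to scan a Xen IO-APIC dump for entries like the IRQ 16
line quoted above, here is a minimal Python sketch. It assumes the entry has
been joined back into a single line, and the names parse_rte and looks_stuck
are made up for illustration. The check only encodes the suspicion raised in
this thread (level-triggered, unmasked, irr=1 and delivery_status=1 at the
same time); it is not a confirmed diagnosis.

```python
import re

# Matches key=value and key:value pairs such as "irr=1" or "dest_id:1".
FIELD_RE = re.compile(r"(\w+)[=:](\w+)")


def parse_rte(text):
    """Return the fields of one IO-APIC entry dump as a dict of strings."""
    return dict(FIELD_RE.findall(text))


def looks_stuck(fields):
    """Heuristic from this thread only: a level-triggered, unmasked entry
    that shows irr=1 and delivery_status=1 at the same time."""
    return (
        fields.get("trigger") == "level"
        and fields.get("mask") == "0"
        and fields.get("irr") == "1"
        and fields.get("delivery_status") == "1"
    )


# The entry quoted in the thread, joined into one line.
SAMPLE = (
    "(XEN)       Apic 0x00, Pin 16: vector=216, delivery_mode=1, "
    "dest_mode=logical, delivery_status=1, polarity=1, irr=1, "
    "trigger=level, mask=0, dest_id:1"
)

if __name__ == "__main__":
    fields = parse_rte(SAMPLE)
    print(fields["vector"], looks_stuck(fields))
```

A single snapshot with irr=1 can be perfectly normal while an interrupt is
being serviced, so it makes more sense to compare several dumps taken some
time apart and see whether the same entry stays flagged.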
