
Re: [Xen-users] Dom0 crashes without logging lately on Debian Stretch with Xen 4.8



On 11/6/18 7:10 PM, John Naggets wrote:
Thanks to both of you for the detailed information. Since neither of you has the intel-microcode package installed, that can't be the issue. I do not use that package myself either. So what is left? Well, it looks like I am running on older hardware, at least 5 years old, and who knows if this has some kind of influence. It might be interesting to get in touch with the hardware manufacturer (DELL?) and ask them if they have other customers with this issue. The only problem here is that as soon as you mention Debian they will stop listening to you :( If I remember correctly, they only take support cases for supported commercial Linux distributions, which basically boils down to RHEL and SLES... Maybe the DELL forums would be a better alternative. I would definitely recommend filing a bug report with Debian and maybe even Xen... If you have some kind of stack trace, that would also be interesting to see.
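
A rough sketch of one way such a trace might be captured when the machine resets without logging anything, assuming a serial port (com1 at 115200 baud is only an example value) or a serial-over-LAN console on the management controller is available to record the output. The file location is the usual Debian one and is an assumption; the options would need to be merged with whatever is already set there, and none of this helps if the reset is triggered by the hardware itself.

```shell
# /etc/default/grub (assumed location), Xen hypervisor options:
# log Xen and guest messages at full verbosity to the serial console,
# and do not reboot automatically after a hypervisor crash
GRUB_CMDLINE_XEN_DEFAULT="loglvl=all guest_loglvl=all com1=115200,8n1 console=com1,vga noreboot"

# dom0 kernel options: send kernel messages through the Xen console
GRUB_CMDLINE_LINUX_DEFAULT="console=hvc0"
```

After editing the file, update-grub and a reboot would be needed for the options to take effect.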


Hi all.

We also use Xen on Debian Stretch; here is the info.


Server 1: DELL T330, 4 CPUs, about 2.5 years old, with Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz

Latest Xen package from Debian, intel-microcode 3.20180807a.1~deb9u1, with kernel 4.9.110-3+deb9u6. DomUs are a mix of stretch and jessie with kernels 3.16.59-1 and 4.9.110-3+deb9u6.

This one is stable.


Server 2: DELL R740, 6 months old, with Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

Latest Xen package from Debian, intel-microcode 3.20180807a.1~deb9u1, with kernel 4.9.110-3+deb9u4. DomUs are a mix of stretch and jessie with kernels 3.16.59-1 and 4.9.110-3+deb9u6.

This one is stable.


Server 3: LENOVO RD650, about 4 years old, with Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, with kernel 4.9.110-3+deb9u4

Latest Xen package from Debian, intel-microcode 3.20180703.2~deb9u1. DomUs are a mix of stretch and jessie with kernels 3.16.59-1 and 4.9.110-3+deb9u6, plus CentOS with kernel 4.10.

This one is stable.


On all Xen Dom0 servers we have set GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=2048M,max:2048M" and the sched-credit weight of dom0 to 512.


xl sched-credit
Name                                ID Weight  Cap
Domain-0                             0    512    0
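
A minimal sketch of how these two settings might be applied on a Debian dom0. The file path and the update-grub step are the usual Debian ones and are assumptions here, not taken from the servers above. The weight set with xl is a runtime setting (the default is 256), so it is assumed that it has to be reapplied after each boot.

```shell
# /etc/default/grub (assumed location): pin dom0 memory at 2048M
GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=2048M,max:2048M"

# Regenerate the GRUB configuration, then reboot for dom0_mem to take effect
update-grub

# Raise the credit scheduler weight of dom0 to 512 (runtime only)
xl sched-credit -d Domain-0 -w 512
```

Running xl sched-credit without arguments afterwards should show the weight of Domain-0, as in the output above.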


Best regards Johnny




J.

On Tue, Nov 6, 2018 at 9:37 AM Roalt Zijlstra | webpower <roalt.zijlstra@xxxxxxxxxxx <mailto:roalt.zijlstra@xxxxxxxxxxx>> wrote:

    Hi John,

    Yes, we are using PV only and we only run Debian Linux on the
    servers. We still have some DomU Jessie servers running with the
    stock kernel. We did update our Dells to the latest firmware so it
    does include more recent Intel microcode. But on Debian we have not
    yet enabled the intel-microcode package: we had so much instability
    and so many parameters that could be the culprit that we did not
    want to add another.
    If your server is very busy, I think the chance to have a crash is
    higher. We have seen crashes on our active MySQL databases whereas
    the slave MySQL database server did not crash that quickly,
    however after using the slave MySQL database as primary database
    for a while (because we were debugging the crashed master
    database) it could very well happen that the slave would crash too.

    We have done tests with downgrading firmware of Dell (which also
    means using an older intel microcode) but that did not help. So
    having the latest firmware is okay.
    We are now testing a few scenarios:

      *  one server with an older kernel (4.9.0-4-amd64), with DomU
        3.16 kernel, which has been running for 16 days now
      *  one server with the updated kernel (4.9.0-8-amd64), with
        DomU 3.16 kernel, which surprisingly has been running for 28 days now
      *  one server with the updated kernel (4.9.0-8-amd64), and all
        DomUs on the backported 4.9 kernel.

    It all doesn't really make much sense. We do have the expectation
    that the older kernel will keep on running and that the 4.9 DomUs
    will help to keep the servers alive.
    We have tested with 4.14 and 4.16 kernels (from backports) but
    that did not make a difference in stability.

    Best regards,

    Roalt Zijlstra
    Teamleader Infra & Deliverability

    Email: roalt.zijlstra@xxxxxxxxxxx
    Phone: +31 342 423 262
    Skype: roalt.zijlstra
    Web:   https://www.webpower-group.com
        
        




    On Mon, 5 Nov 2018 at 18:24, John Naggets
    <hostingnuggets@xxxxxxxxx <mailto:hostingnuggets@xxxxxxxxx>> wrote:

        It could be as you mention... are your DomUs PV? I am
        using paravirtualization exclusively and on this specific
        server have the following CPU:

        Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

        Do you have the intel-microcode Debian package from the
        non-free repo installed on your servers? I currently don't...

        J.


        On Mon, Nov 5, 2018 at 3:04 PM Roalt Zijlstra | webpower
        <roalt.zijlstra@xxxxxxxxxxx
        <mailto:roalt.zijlstra@xxxxxxxxxxx>> wrote:

            Hi John,

            It could very well be that it is also restricted to some
            CPUs, but I am inclined to believe that the DomU kernels
            in use can influence stability.  We did have a pretty
            busy SSL offloader running on a 3.16 kernel, which might
            have caused the crashes.

            Just for reference, we have the following two CPUs causing
            us trouble, but I am not sure if it matters.
            Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
            Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz

            Roalt


            On Mon, 5 Nov 2018 at 10:45, John Naggets
            <hostingnuggets@xxxxxxxxx <mailto:hostingnuggets@xxxxxxxxx>> wrote:

                Hi,

                Thanks for your feedback. I was wondering because I
                have just upgraded a Debian 9 server to the latest
                kernel with the latest Xen packages from the official
                Debian repo. The only difference is that I have an
                older IBM server which is already ~7 years old patched
                with the latest BIOS/UEFI and so far so good no crash.
                The uptime is 6 days for now. Here are the details
                about my kernel and xen packages.

                ii  xen-hypervisor-4.8-amd64   4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10  amd64  Xen Hypervisor on AMD64
                ii  linux-image-4.9.0-8-amd64  4.9.110-3+deb9u6                          amd64  Linux 4.9 for 64-bit PCs

                Regards,
                J.


                On Fri, Nov 2, 2018 at 7:57 PM Volker Janzen
                <volker@xxxxxxxxxx> wrote:

                    Hi John,

                    the problem is that I cannot provide any metrics
                    or logfiles showing an error. I can only tell that
                    dom0 is rebooting for a reason that is not logged.
                    I have no physical access to the server. I got one
                    other report about this kind of issue.

                    My assumption that the backported patches are the
                    cause is based on the current 16-day uptime. Until
                    16 days ago the server rebooted every 3-5 days. It
                    won’t be a useful bug report from my point of view.

                    The other thing is that my two servers are now
                    running upstream Xen and kernel, and I might not go
                    back to the old versions in Debian stretch. The
                    other server has always run upstream versions and
                    never had a problem; that’s why I updated the
                    affected one, too.


                    Best regards
                        Volker


                    On 02.11.2018 at 17:23, John Naggets
                    <hostingnuggets@xxxxxxxxx
                    <mailto:hostingnuggets@xxxxxxxxx>> wrote:

                    I was wondering if any of you guys reported this
                    bug/issue/problem back to the Debian community,
                    for example on their bugs.debian.org web site?

                    On Thu, Nov 1, 2018 at 1:47 PM Volker Janzen
                    <volker@xxxxxxxxxx <mailto:volker@xxxxxxxxxx>> wrote:

                        Hi,

                        I had these crash problems with the Xen
                        version in Debian stretch, too. After 3 to 7
                        days the Xen server rebooted without a log
                        entry or anything else to observe. The
                        problems started when the first patches were
                        applied by Debian. Some updates made it
                        better, the last one worse again. I checked
                        hard drives and RAM, and closely monitored
                        metrics for a possible cause.

                        My solution after no longer suspecting a
                        hardware fault: build upstream Xen 4.11 for
                        Debian stretch. I am currently running this
                        setup with my own build of kernel 4.19. The
                        machines are now stable again.


                            Volker


                        On 29.10.2018 at 13:13, Roalt Zijlstra
                        | webpower <roalt.zijlstra@xxxxxxxxxxx
                        <mailto:roalt.zijlstra@xxxxxxxxxxx>> wrote:

                        Hi there,

                        Ever since all the Meltdown and Spectre
                        kernel updates and possibly also Xen 4.8
                        updates, we experience crashes of the Dom0
                        just out of the blue. Sometimes after 1 day,
                        sometimes after a few days or even 14 days,
                        completely random.

                        We have two Dell P730 servers and two Dell
                        P720 servers with this behaviour. One thing
                        is that we updated these machines to the
                        latest available firmware, because that is
                        the most secure way. Then we installed
                        Debian Stretch with Xen 4.8 support.

                        We have done several installs; 4 servers
                        seem to crash pretty fast and others don't.
                        In the end we think that we can trace it back
                        to the xen-4.8.4-pre version being stable
                        and the xen-4.8.5-pre version being unstable.
                        This was more or less independent of the kernel
                        we were using, 4.14 or 4.9.0-8-amd64. This is
                        of course all Debian package numbering.

                        As a last resort, on one server we updated
                        all DomU kernels of the Jessie guests on this
                        Dom0 to 4.9.0 from backports instead of the
                        3.16 kernel. For now that seems to work, but
                        the crashes are random so it could happen
                        again at any time. The idea is that the 3.16
                        kernels are completely Spectre and Meltdown
                        unaware and might cause trouble in Xen's
                        kernel support. I am not sure if this is
                        true at all, but we are pretty lost as to what
                        the actual cause is.

                        We also tested with CentOS and we also had
                        these crashes there with certain
                        combinations of kernel/Xen. The most recent
                        updates seem to be more stable, though. The
                        most frustrating part is that there are
                        absolutely no logs to be found. No kernel
                        oops or anything; the server just resets and
                        boots again.

                        Are there others experiencing problems like
                        this? Do you see more frequent server/kernel
                        crashes on production servers?

                        Best regards,

                        Roalt Zijlstra

                        _______________________________________________
                        Xen-users mailing list
                        Xen-users@xxxxxxxxxxxxxxxxxxxx
                        <mailto:Xen-users@xxxxxxxxxxxxxxxxxxxx>
                        https://lists.xenproject.org/mailman/listinfo/xen-users

 

