
Re: [Xen-users] Dom0 crashes without logging lately on Debian Stretch with Xen 4.8


  • To: xen-users@xxxxxxxxxxxxxxxxxxxx
  • From: Michael <delajamal@xxxxxx>
  • Date: Tue, 6 Nov 2018 10:08:02 +0100
  • Delivery-date: Tue, 06 Nov 2018 09:09:04 +0000
  • List-id: Xen user discussion <xen-users.lists.xenproject.org>
  • Openpgp: preference=signencrypt

Hello,


I had the same issues.
In my case I tried
Ubuntu 18.04 with Xen 4.9, and kernel version 4.15.9 was the only one that managed to start the DomU.

Tested on AMD Ryzen 1800X and Intel 8700.

In my case I got random system freezes, with uptimes between 7 and 30 days.

Older and newer kernels won't run.
This problem is still present; I am going to switch all services to Docker...

Regards,
Michael





On 06.11.2018 at 09:37, Roalt Zijlstra | webpower wrote:
Hi John,

Yes, we are using PV only and we only run Debian Linux on the servers. We still have some DomU Jessie servers running with the stock kernel. We did update our Dells to the latest firmware, so that includes more recent Intel microcode. But on Debian we have not yet enabled the intel-firmware, since we had so much instability and so many parameters that could be the culprit that we did not want to add another.
If your server is very busy, I think the chance of a crash is higher. We have seen crashes on our active MySQL databases, whereas the slave MySQL database server did not crash as quickly. However, after using the slave MySQL database as the primary database for a while (because we were debugging the crashed master database), the slave could very well crash too.

We have done tests with downgrading the Dell firmware (which also means using older Intel microcode), but that did not help. So having the latest firmware is okay.
We are now testing a few scenarios:
  •  one server with an older kernel (4.9.0-4-amd64), with DomU 3.16 kernel, which has been running for 16 days now
  •  one server with the updated kernel (4.9.0-8-amd64), with DomU 3.16 kernel, which surprisingly has been running for 28 days now
  •  one server with the updated kernel (4.9.0-8-amd64), and all DomUs on the backported 4.9 kernel.
None of it really makes much sense. We do expect that the older kernel will keep running and that the 4.9 DomUs will help keep the servers alive.
We have tested with 4.14 and 4.16 kernels (from backports) but that did not make a difference in stability.

Best regards,
 
  Roalt Zijlstra
    Teamleader Infra & Deliverability
     
  roalt.zijlstra@xxxxxxxxxxx
  +31 342 423 262
  roalt.zijlstra
  https://www.webpower-group.com
 

 

Barcelona | Barneveld | Beijing | Chengdu | Guangzhou
Hamburg | Shanghai | Shenzhen | Stockholm
 



On Mon, 5 Nov 2018 at 18:24, John Naggets <hostingnuggets@xxxxxxxxx> wrote:
It could be as you mention... Are your domUs PV? I am using paravirtualization exclusively, and this specific server has the following CPU:

Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

Do you have the intel-microcode Debian package from the non-free repo installed on your servers? I currently don't...

J.


On Mon, Nov 5, 2018 at 3:04 PM Roalt Zijlstra | webpower <roalt.zijlstra@xxxxxxxxxxx> wrote:
Hi John,

It could very well be that it is also restricted to some CPUs, but I am inclined to believe that the DomU kernels in use can influence stability. We did have a pretty busy SSL offloader running on a 3.16 kernel, which might have caused the crashes.

Just for reference, we have the following two CPUs causing us trouble, but I am not sure if it matters.
Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz

Roalt


On Mon, 5 Nov 2018 at 10:45, John Naggets <hostingnuggets@xxxxxxxxx> wrote:
Hi,

Thanks for your feedback. I was wondering because I have just upgraded a Debian 9 server to the latest kernel with the latest Xen packages from the official Debian repo. The only difference is that I have an older IBM server, already ~7 years old, patched with the latest BIOS/UEFI, and so far so good: no crash. The uptime is 6 days for now. Here are the details about my kernel and Xen packages.

ii  xen-hypervisor-4.8-amd64       4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10 amd64        Xen Hypervisor on AMD64
ii  linux-image-4.9.0-8-amd64      4.9.110-3+deb9u6                         amd64        Linux 4.9 for 64-bit PCs

Regards,
J.


On Fri, Nov 2, 2018 at 7:57 PM Volker Janzen <volker@xxxxxxxxxx> wrote:
Hi John,

the problem is that I cannot provide any metrics or logfiles showing an error. I can only tell that dom0 is rebooting for a reason that is not logged. I have no physical access to the server. I got one other report about this kind of issue.

My assumption that the backported patches are the cause is based on the current 16-day uptime. Until 16 days ago, the server rebooted every 3-5 days. From my point of view, that would not make a useful bug report.

The other thing is that my two servers are now running upstream Xen and kernel, and I might not go back to the old versions of both in Debian stretch. The other server had always been running upstream versions and never had a problem; that's why I updated the other one, too.


Best regards
    Volker


On 02.11.2018 at 17:23, John Naggets <hostingnuggets@xxxxxxxxx> wrote:

I was wondering if any of you guys reported this bug/issue/problem back to the Debian community? For example, on their bugs.debian.org web site?

On Thu, Nov 1, 2018 at 1:47 PM Volker Janzen <volker@xxxxxxxxxx> wrote:
Hi,

I had these crash problems with the Xen version in Debian stretch, too. After 3 to 7 days the Xen server rebooted without a log entry or anything else to observe. The problems started when the first patches were applied by Debian. Some updates made it better, the last one made it worse again. I checked hard drives and RAM, and closely monitored metrics for what might be the cause.

My solution after no longer suspecting a hardware fault: build upstream Xen 4.11 for Debian stretch. I am currently running this setup with my own build of kernel 4.19. The machines are now running stably again.


    Volker


On 29.10.2018 at 13:13, Roalt Zijlstra | webpower <roalt.zijlstra@xxxxxxxxxxx> wrote:

Hi there,

Ever since all the Meltdown and Spectre kernel updates and possibly also Xen 4.8 updates, we experience crashes of the Dom0 just out of the blue. Sometimes after 1 day, sometimes after a few days or even 14 days, completely random.

We have two Dell P730 servers and two Dell P720 servers with this behaviour. One thing is that we updated these machines to the latest available firmware, because that is the most secure way. Then we installed Debian Stretch with Xen 4.8 support.

We have done several installs; 4 servers seem to crash pretty fast and others don't. In the end we think we can trace it back to the xen-4.8.4-pre version being stable and the xen-4.8.5-pre version being unstable. This was more or less independent of the kernel we were using, 4.14 or 4.9.0-8-amd64. This is of course all Debian package numbering.

As a last resort, on one server we updated all DomU kernels of our Jessie servers on this Dom0 to 4.9.0 from backports instead of the 3.16 kernel. For now that seems to work, but the crashes are random, so it could happen again at any time. The idea is that the older 3.16 kernels are completely Spectre and Meltdown unaware and might cause trouble in Xen kernel support. I am not sure if this is true at all, but we are pretty lost as to what the actual cause is.

We also tested with CentOS and had these crashes there too, with certain combinations of kernel/Xen. The most recent updates seem to be more stable, though. The most frustrating part is that there are absolutely no logs to be found. No kernel oops or anything; the server just resets and boots again.

Are there others experiencing problems like this? Do you see more frequent server/kernel crashes on production servers?  

Best regards,
 
Roalt Zijlstra

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users

 

