Xen project Mailing List

Re: domU memory exceeded =?=> spontaneous reboots

From: Mike <debian@xxxxxxxxxxxxxxxxxxxxx>

Date: Fri, 6 Dec 2024 06:18:09 +0000

Cc: Paul Leiber <paul@xxxxxxxxxxxxxxxx>, xen-users@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Fri, 06 Dec 2024 06:24:06 +0000

List-id: Xen user discussion <xen-users.lists.xenproject.org>

Elliott Mitchell wrote:
> > The command that I
> > ran in it that triggered the reboot was `kubectl delete -f` of a Deployment
> > that was already running from an `apply`.
>
> Okay, do you have a full list of what this command does?

I'm sort of counting on people to know what Kubernetes does. I know that the command will contact the k8s API Server on the control node, then that will cause some cascade of communication among the k8s services running on the control node, eventually leading the API Server to contact the worker node's Kubelet to delete the objects.

> Might it cause a crucial Xen domain to panic (domain 0) and this in turn
> cause Xen to panic?

The only way that I could see the dom0 involved is maybe via the network stack? The communication between the domUs is going through the xenbr0, of course.

> How much free memory does Xen have?

265. It was sthg like 313 (= 265 + 2048 - 2000) before I moved more memory to the control node.

> Might be 0 if Xen is ballooning memory from domain 0 to handle
> allocations. If ballooning memory from domain 0 has been disabled this
> should stay above 50 so Xen can allocate memory to handle activity.

No ballooning:

dom0_mem=4G;max:4G dom0_max_vcpus=4 dom0_vcpus_pin xpti=dom0=false,domu=true no-real-mode edd=off

> > The domUs are a k8s control and worker node,
>
> Is either of these also domain 0? Domain 0 exhausting its free memory
> and panicing might cause the issue you're describing.

No, they are domUs.

> > Intel Core i9-14900T
>
> Apparently there is a major issue with 14900K processors. I've been
> reading mentions of other Intel 13xxx and 14xxx chips reputedly having
> failures at a lower rates.

The K and KS processors are designed to allow them to run at high speed, leading to temps of 90-100C that destroy the silicon. The T series are low power. Stress testing never went above 60C core temps in my configuration.

> Right now there could still be configuration issues, but I would keep an
> eye out for hardware failure.

It's been running since June or July. I'm pretty confident that the hardware has been shaken down. The BIOS upgrade is the only thing I have my eye on.
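
For anyone wanting to watch the free-memory figure discussed above, here is a minimal Python sketch that reads it from `xl info`-style output. It assumes the output consists of `key : value` lines with a `free_memory` field reported in MiB. The helper names, the sample text, and the use of 50 as a default floor (taken from the quoted suggestion) are illustrative assumptions, not part of any Xen tooling.

```python
def parse_xl_info(text):
    """Parse 'key : value' lines (as printed by `xl info`) into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split at the first colon only, so values containing ':' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def free_memory_mib(text):
    """Return the hypervisor's free memory, assuming the field is in MiB."""
    return int(parse_xl_info(text)["free_memory"])


def is_low(text, floor_mib=50):
    """True if free memory is below the floor (50 is the figure quoted in the thread)."""
    return free_memory_mib(text) < floor_mib


# Illustrative sample only: a single line carrying the value reported in this thread.
SAMPLE = "free_memory            : 265\n"

if __name__ == "__main__":
    now = free_memory_mib(SAMPLE)
    # Figure before 2048 was swapped for 2000, per the arithmetic in the message.
    before = now + 2048 - 2000
    print(now, before, is_low(SAMPLE))
```

On a real host the text would come from running `xl info` in dom0 (for example via `subprocess.run(["xl", "info"], capture_output=True, text=True).stdout`) rather than from the hard-coded sample.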
