
Re: domU memory exceeded =?=> spontaneous reboots



Elliott Mitchell wrote:
> > The command that I
> > ran in it that triggered the reboot was `kubectl delete -f` of a Deployment
> > that was already running from an `apply`.
> 
> Okay, do you have a full list of what this command does?

I'm sort of counting on people to know what Kubernetes does.  In outline, the
command contacts the k8s API Server on the control node, which triggers a
cascade of communication among the k8s services running on the control node,
eventually leading the API Server to contact the worker node's Kubelet to
delete the objects.

> Might it cause a crucial Xen domain to panic (domain 0) and this in turn
> cause Xen to panic?

The only way I can see dom0 being involved is via the network stack.  The
communication between the domUs goes through the xenbr0 bridge, of course.

> How much free memory does Xen have?

265.  It was something like 313 (= 265 + 2048 - 2000) before I moved more
memory to the control node.
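
As a rough sketch of that arithmetic, assuming the figures are in MiB as
printed on the free_memory line of `xl info`, and reading 2000 and 2048 as the
control node's old and new allocations (my reading of the sum above, not
something stated outright).  The function names and the sample text are
illustrative only:

```python
import re


def parse_free_memory_mib(xl_info_text: str) -> int:
    """Extract the free_memory value from text shaped like `xl info` output."""
    match = re.search(r"^free_memory\s*:\s*(\d+)\s*$", xl_info_text, re.MULTILINE)
    if match is None:
        raise ValueError("no free_memory line found")
    return int(match.group(1))


def free_before_resize(free_now: int, new_alloc: int, old_alloc: int) -> int:
    """Free memory before a domU grew from old_alloc to new_alloc.

    Growing a domU takes (new_alloc - old_alloc) out of the hypervisor's
    free pool, so the earlier figure is the current one plus that difference.
    """
    return free_now + new_alloc - old_alloc


# Illustrative sample: only the one line of interest, using the value quoted above.
SAMPLE = "free_memory            : 265\n"

free_now = parse_free_memory_mib(SAMPLE)
print(free_now, free_before_resize(free_now, new_alloc=2048, old_alloc=2000))
```

Either figure is well above the 50 mentioned in the question below.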

> Might be 0 if Xen is ballooning memory from domain 0 to handle
> allocations.  If ballooning memory from domain 0 has been disabled this
> should stay above 50 so Xen can allocate memory to handle activity.

No ballooning:
dom0_mem=4G;max:4G dom0_max_vcpus=4 dom0_vcpus_pin xpti=dom0=false,domu=true 
no-real-mode edd=off

> > The domUs are a k8s control and worker node,
> 
> Is either of these also domain 0?  Domain 0 exhausting its free memory
> and panicking might cause the issue you're describing.

No, they are domUs.

> > Intel Core i9-14900T
> 
> Apparently there is a major issue with 14900K processors.  I've been
> reading mentions of other Intel 13xxx and 14xxx chips reputedly having
> failures at lower rates.

The K and KS processors are designed to run at high speed, leading to
temps of 90-100C that destroy the silicon.  The T series are low power.
Stress testing never went above 60C core temps in my configuration.

> Right now there could still be configuration issues, but I would keep an
> eye out for hardware failure.

It's been running since June or July.  I'm pretty confident that the
hardware has been shaken down.  The BIOS upgrade is the only thing I have
my eye on.



 

