
Re: Debian 10, xen 4.11 reliability



On 7/16/20 2:34 PM, Hans van Kranenburg wrote:

You're not running Debian Xen packages apparently, so I can't say much
about that part.

But, is that Linux 4.9 in the dom0? Begin by eliminating that.

We've been running Linux 4.9 for a long time, though we plan to upgrade soon.

The timing does not correlate, and far less than one percent of our users are 
having issues.

Our mileage may vary, but at work, we skipped from Jessie to Buster (well,
actually to our own stretch-backports) because I really could not get
anything working with Linux 4.9 as dom0 kernel after the whole
Spectre/Meltdown stuff unfolded. We never got to the bottom of it, due
to a big lack of time and kernel debugging knowledge/experience, but
what I have seen is random Oopses, disk corruption and other things.

There were panics in the dom0 which I traced to a network driver, and I fixed 
it.

This is the first time we've had complaints of file system corruption.

Are you using live migration?

Not so recently that it would have affected the two systems with problems.


So, why not get those dom0s to latest Xen 4.11 packages from Debian and
Linux 4.19? It's flying here, with several clusters of dozens of servers
and a few dozen TiB of mems, running thousands of domUs, without any
problem.

Are your dom0s running the latest kernel version? Are they running ext3? What 
uptime have they had?

What about the domUs?


I agree with Ben that using ext3 nowadays should be discouraged because
of the amount of usage and testing decreasing.

Yes. I think Debian and Ubuntu are the only distributions where we might have users who are using an old file system with a new kernel, which is why I'm focused on ext3. But I can't say for certain.

But, I might have the luxury of working with a setup where we manage all
of it and have customers look at some GUI and have no idea about the
actual underlying systems. Having customers run anything they want is a
different slice of bread...

It very much is.


Anyway, the above is just some thinking out loud. I know that it's very
difficult to debug these kinds of things, because you need more failures
happening to be able to correlate, and a reliable reproduction scenario
would be the ultimate thing as a start to figure out what's actually
going wrong, but these are really difficult time consuming tasks.

We're trying.

Thanks, Sarah



 

