Xen project Mailing List

[Xen-users] How are you measuring CPU usage?

From: Andy Smith <andy@xxxxxxxxxxxxxx>

Date: Sun, 16 Sep 2012 12:23:26 +0000

Delivery-date: Sun, 16 Sep 2012 12:24:55 +0000

List-id: Xen user discussion <xen-users.lists.xen.org>

Openpgp: id=BF15490B; url=http://strugglers.net/~andy/pubkey.asc

Hello,

Recently I've received some complaints that there is excessive but intermittent latency of network traffic to domUs on some of my servers. Upon investigation it seems that on some servers, traffic is indeed occasionally delayed by up to 140ms where something like 5ms RTT would be expected. The average RTT is not unusual; since only occasional packets are affected, it shows up only in the worst case and the standard deviation.

What seems likely is that these servers are overloaded for CPU. I have tried the various tweaks of the credit scheduler, but the fact remains that the credit scheduler has a 30ms time slice. So I believe that when the server is so loaded that domUs are competing for CPU time, I could expect a CPU hog to get the CPU for 30ms before handing over to a non-hog, which then takes a 30ms penalty on an RTT measurement.

Clearly the answer is to not overload the servers, and it's one I completely agree with. However, I am not sure how best to measure this. I do not have control over what the domUs do, so their CPU usage profile can change. I need to monitor this in order to know when I need to move or restrict a domU. I need to know when actual overloading is taking place, without having to measure network traffic RTT.

For a long while I've been measuring CPU usage for the entirety of a physical piece of hardware by watching the CPU time counters as displayed by "xm list --long". Feeding that into stats software like MRTG or Cacti gives me the time used per period, which in turn gives me the percentage of CPU used by every domU and the dom0.

Using the above method, one of my possibly overloaded servers shows about 87% average CPU usage. Up until now, I thought that was acceptable. I think the problem is that this is based on 5 minute averages. Notably, the problems most often happen at the top of the hour and on 5 minute intervals, and also at 4am. Sounds like typical cron job frequencies, right?

It's not caused by cron jobs on the dom0 (there are almost none, and I disabled them all to verify). I'm thinking that at the top of the hour and sometimes at 5 minute intervals there are several domUs competing for CPU for a short amount of time, and not getting it. This is being averaged away over the 5 minute span so as to appear reasonable even though it's actually causing some problems.

So, how are other people monitoring their CPU usage? Are you polling more frequently, such as every minute or even more often than that?

Is there a better way to read a domU's CPU time counter than parsing the output of "xm list --long"? Is it available cheaply from somewhere in /sys or is there an API or anything?

Cheers,
Andy

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users
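To illustrate the counter-delta calculation described in the message above, and how a 5 minute average can hide a short burst, here is a minimal Python sketch. It is not taken from the original post: the function names (cpu_percent, synthetic_counter) and the sample data are hypothetical, and the counter is a made-up series rather than real "xm list --long" output. It assumes only that each sample pairs a wall-clock timestamp with a cumulative CPU-seconds counter.

```python
from typing import List, Tuple

# One sample = (wall-clock seconds, cumulative CPU seconds used so far).
Sample = Tuple[float, float]


def cpu_percent(samples: List[Sample], ncpus: int = 1) -> List[float]:
    """Return the CPU usage percentage for each interval between samples.

    Usage is the growth of the cumulative counter divided by the CPU time
    that was available in that interval (elapsed time * number of CPUs).
    """
    result = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            raise ValueError("timestamps must be strictly increasing")
        result.append(100.0 * (c1 - c0) / (elapsed * ncpus))
    return result


def synthetic_counter(duration: int = 300, burst: int = 30) -> List[Sample]:
    """Build a made-up counter sampled once per second (illustration only).

    The pretend workload saturates one CPU for the first `burst` seconds
    and then uses half a CPU for the rest of the window.
    """
    samples = [(0.0, 0.0)]
    cpu_seconds = 0.0
    for second in range(1, duration + 1):
        cpu_seconds += 1.0 if second <= burst else 0.5
        samples.append((float(second), cpu_seconds))
    return samples


samples = synthetic_counter()

# Polling once per 5 minutes sees only the first and last counter values.
five_minute_view = cpu_percent([samples[0], samples[-1]])

# Polling once per second sees every interval, including the burst.
one_second_view = cpu_percent(samples)

print("5 minute average: %.1f%%" % five_minute_view[0])
print("1 second peak:    %.1f%%" % max(one_second_view))
```

With this invented data the single 5 minute interval works out to 55% while the per-second view peaks at 100%, which is the same averaging effect suspected above. Polling more often narrows the window over which a burst is smeared, at the cost of reading the counters more frequently.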
