We are running into a strange behavior where we see high CPU utilization but quite low load average.
The behavior is best illustrated by the following graphs from our monitoring system.
At about 11:57 the CPU utilization goes from 25% to 75%. The load average is not significantly changed.
We run servers with 12 cores, each with 2 hyperthreads, so the OS sees this as 24 CPUs.
The CPU utilization data is collected by running /usr/bin/mpstat 60 1 each minute. The chart above shows the %usr column of the "all" row. I am certain this is the average per-CPU figure, not the "stacked" utilization: while the chart shows 75% utilization, top shows a single process using about 2000% "stacked" CPU.
The load average figure is taken from /proc/loadavg each minute.
uname -a gives:
Linux ab04 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
The Linux distribution is Red Hat Enterprise Linux Server release 6.3 (Santiago).
We run a couple of Java web applications under fairly heavy load on the machines, think 100 requests/s per machine.
If I interpret the CPU utilization data correctly, when we have 75% CPU utilization it means that our CPUs are executing a process 75% of the time, on average. However, if our CPUs are busy 75% of the time, shouldn't we see higher load average? How could the CPUs be 75% busy while we only have 2-4 jobs in the run queue?
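For intuition, this is the arithmetic that made us suspicious (a sketch; the 24-CPU and 75% figures are from above, and "expected load ≈ number of busy CPUs" is only the usual rule of thumb, not an exact formula):

```java
public class LoadExpectation {
    public static void main(String[] args) {
        int cpus = 24;             // 12 cores x 2 hyperthreads
        double utilization = 0.75; // average %usr across all CPUs

        // If the CPUs are busy 75% of the time, then on average
        // 0.75 * 24 = 18 of them are executing a task at any instant.
        // A run queue that merely keeps the CPUs fed would therefore
        // suggest a load average around 18, not the 2-4 we observe.
        double expectedLoad = cpus * utilization;
        System.out.println("Naive expected load: " + expectedLoad);
    }
}
```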
Are we interpreting our data correctly? What can cause this behavior?
Answer
While Matthew Ife's answer was very helpful and led us in the right direction, it was not exactly what caused the behavior in our case. In our case, we have a multi-threaded Java application that uses thread pooling, so no work is done creating the actual tasks.
However, the actual work the threads do is short-lived and includes I/O waits or synchronization waits. As Matthew mentions in his answer, the load average is sampled by the OS, so short-lived tasks can be missed.
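To illustrate how sampling can miss this pattern, here is a purely arithmetic sketch (no real threads). The 5-second interval corresponds to the kernel's LOAD_FREQ sampling period; the 25 ms burst length is a made-up number, and the key assumption — that thread wakeups and the load sampler stay aligned so the sampling period is an exact multiple of the task cycle — is a deliberate simplification of what happens when both are driven by the same timer tick:

```java
public class SampledLoadSim {
    public static void main(String[] args) {
        final int sleepMs = 100;          // like Thread.sleep(100) in our reproducer
        final int runMs = 25;             // short CPU burst after waking (made up)
        final int cycle = sleepMs + runMs;    // 125 ms task cycle
        final int sampleEveryMs = 5000;       // kernel samples the run queue every 5 s

        int samples = 0, runnableAtSample = 0;
        long busyMs = 0;
        for (int t = 0; t < 60_000; t++) {    // one simulated minute, 1 ms steps
            // The task sleeps for the first 100 ms of each cycle, then runs.
            boolean runnable = (t % cycle) >= sleepMs;
            if (runnable) busyMs++;
            if (t % sampleEveryMs == 0) {
                samples++;
                if (runnableAtSample(runnable)) runnableAtSample++;
            }
        }
        // 5000 is an exact multiple of 125, so every sample lands at the same
        // phase of the cycle — here, inside the sleep window — and the task
        // is never counted, even though it is on-CPU 20% of the time.
        System.out.println("true utilization: " + (double) busyMs / 60_000);
        System.out.println("sampled load: " + (double) runnableAtSample / samples);
    }

    private static boolean runnableAtSample(boolean runnable) {
        return runnable;
    }
}
```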
I made a Java program that reproduces the behavior. The following Java class generates a CPU utilization of 28% (650% stacked) on one of our servers, while the load average stays around 1.3. The key here is the sleep() inside the thread; without it the load calculation is correct.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class MultiThreadLoad {

    // 200 pool threads; when the queue fills up, the submitting thread
    // runs the task itself (CallerRunsPolicy), which throttles submission.
    private ThreadPoolExecutor e = new ThreadPoolExecutor(200, 200, 0L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<Runnable>(1000), new ThreadPoolExecutor.CallerRunsPolicy());

    public void load() {
        while (true) {
            e.execute(new Runnable() {
                @Override
                public void run() {
                    // Sleep first, then do a short burst of CPU work.
                    // Removing the sleep makes the load average correct.
                    sleep100Ms();
                    for (long i = 0; i < 5000000L; i++)
                        ;
                }

                private void sleep100Ms() {
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
    }

    public static void main(String[] args) {
        new MultiThreadLoad().load();
    }
}
To summarize, the theory is that the threads in our application idle a lot and then perform short-lived work, which is why the tasks are not correctly sampled by the load average calculation.