I'm not sure whether this would be better titled "Why would Nagios need to monitor a load reaching 30".
Situation:
I am setting up Nagios for our network and have reached the stage of setting up NRPE on the *nix boxes. I had already (on paper) gotten a rough idea of where I wanted notifications set up. For a particular server, as an example, it looks like this:
1 minute: warn at 90%, crit at 100%
5 minutes: warn at 80%, crit at 90%
15 minutes: warn at 60%, crit at 70%
The server runs two virtual CPUs, so I plan to use the -r parameter to get a per-CPU result (yeah, I know this isn't really per CPU; it's the load for all of them divided by the number of them, and I am OK with that).
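To make that concrete, here is a minimal Python sketch of the division as I understand it. The function name per_cpu_load is made up for illustration, and this is only my reading of what -r does, not the plugin's actual code. The sample inputs are the raw 1-minute values I typically see on this two-CPU box.

```python
def per_cpu_load(loads, ncpus):
    """Divide each raw load average by the CPU count.

    Assumption: this mirrors what check_load -r reports. It is my
    understanding of the flag, not code taken from the plugin.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return tuple(load / ncpus for load in loads)


if __name__ == "__main__":
    # Raw 1-minute averages at the quiet and busy ends of what I usually see.
    print(per_cpu_load((0.8, 1.1), ncpus=2))
```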
So I was absolutely ready to set this up when I saw the defaults in the NRPE config file:
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
This put me off, and I started wondering whether I really understand load averages. I see that the -r parameter is not used, so load averages above 1 are normal, but does this suggest the default is meant for a 30-CPU system? I saw this question, for which the answer suggests using [number of CPUs] * 10 for the critical 5-minute notification (one minute, maybe?), which further supports using values far higher than I planned. Without seeing the defaults there, I would have gone with
command[check_load]=/usr/lib/nagios/plugins/check_load -r -w 0.9,0.8,0.6 -c 1.0,0.9,0.7
but now I am doubtful. I know that no one on the internet can tell me the correct values for our situation, and I do not expect anyone to. I would be very thankful if someone could tell me whether I grossly misunderstand load and need to start my detective work on useful values again. For what it is worth, I got those values from running top every once in a while on the server in question over the past 6 months. The 1-minute average usually sits between 0.4 per CPU (0.8 total) and 0.55 per CPU (1.1 total).
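To show the gap between the two sets of thresholds, here is a rough Python sketch that classifies my busiest typical 1-minute load against both. The helper name state is hypothetical, and treating a load equal to the threshold as a breach is my assumption; I have not checked how check_load handles the boundary.

```python
def state(load, warn, crit):
    """Classify one load value against warning and critical thresholds.

    Assumption: a load equal to a threshold counts as a breach.
    """
    if load >= crit:
        return "CRITICAL"
    if load >= warn:
        return "WARNING"
    return "OK"


NCPUS = 2           # two virtual CPUs on this server
observed_raw = 1.1  # busiest 1-minute average I usually see in top

# My planned per-CPU thresholds (with -r): warn 0.9, crit 1.0
print(state(observed_raw / NCPUS, warn=0.9, crit=1.0))  # prints OK

# The NRPE defaults (no -r): warn 15, crit 30
print(state(observed_raw, warn=15, crit=30))            # prints OK

# Per-CPU load at which the default 1-minute critical would fire here
print(30 / NCPUS)                                       # prints 15.0
```

Both sets pass at my normal load, but the default critical only fires at 15 per CPU on this box, which is 15 times the 1.0 I had planned.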