Watch out for slow host check commands!
We had a customer where we were installing Opsview into a new datacentre while it was still being built. Consequently, lots of hosts were down.
Opsview was configured as a slave to receive SNMP traps. However, the slave server eventually crashed because there were too many nagios processes running.
This was a sample ps -ef | grep nagios:
UID      PID   PPID  C STIME TTY   TIME     CMD
nagios   1997  1     0 10:04 ?     00:00:00 /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
nagios   6176  1     0 10:21 ?     00:00:00 /usr/sbin/snmpd -u nagios -Lsd -Lf /dev/null -p /var/run/snmpd.pid
root     10684 1     0 15:01 ?     00:00:27 /usr/sbin/snmptrapd -t -m ALL -M /usr/share/snmp/mibs:/usr/local/nagios/snmp/load -p /var/run/snmptrapd.pid
nagios   26437 1     0 15:57 ?     00:00:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   28364 1     0 15:58 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   28720 1     0 15:58 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   29159 1     0 15:59 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   29574 1     0 15:59 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   30038 1     0 16:00 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   30420 1     0 16:00 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   30890 1     0 16:01 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   31331 1     0 16:02 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   31642 1     0 16:02 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   32352 1     0 16:03 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   498   1     0 16:03 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   1008  1     0 16:04 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   1096  1     0 16:04 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   1629  1     0 16:04 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   2365  1     0 16:05 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   2737  1     0 16:05 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   3057  26437 0 16:06 ?     00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   3058  3057  0 16:06 ?     00:00:00 /usr/local/nagios/libexec/check_ping -H monitoredhost53.XXX.XX.XXXXXX -w 3000.0,80% -c 5000.0,100% -p 6
nagios   3059  3058  0 16:06 ?     00:00:00 /bin/ping -n -U -c 6 monitoredhost53.XXX.XX.XXXXXX
root     3117  10684 0 16:06 ?     00:00:00 /usr/bin/perl /usr/local/nagios/bin/snmptrap2nagios
root     3118  2785  0 16:06 pts/0 00:00:00 ps -ef
Notice the high number of nagios processes. They were being created at the rate of 2 per minute. If we changed the service_check_timeout parameter in nagios.cfg to 15, then we would have a rate of 4 a minute.
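Process ID 26437 is the main Nagios process: it is the parent of 3057, which was forked to run a host check (check_ping, and in turn ping). The remaining nagios processes, the ones piling up, all show a parent process ID of 1.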
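If you would rather not count lines by eye, a short script can tally the Nagios daemon processes by parent. This is only a sketch: the function name is ours, and it assumes the standard `ps -ef` column order and the /usr/local/nagios/bin/nagios path seen in the listing. In that listing only 26437 is the real daemon, so a count above one for parent ID 1 is the symptom described here.

```python
def summarise_nagios_processes(ps_output):
    """Count Nagios daemon processes in `ps -ef` text, grouped by parent.

    Assumes the usual column order: UID PID PPID C STIME TTY TIME CMD.
    """
    counts = {"parent_is_init": 0, "other_parent": 0}
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command keeps its spaces.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or something we cannot parse
        ppid, cmd = fields[2], fields[7]
        # Path taken from the listing; adjust if Nagios lives elsewhere.
        if not cmd.startswith("/usr/local/nagios/bin/nagios "):
            continue
        if ppid == "1":
            counts["parent_is_init"] += 1
        else:
            counts["other_parent"] += 1
    return counts
```

Feed it the text captured from `ps -ef` and watch whether the parent-is-init figure keeps climbing between runs.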
If we stopped and restarted Nagios, all the other processes disappeared, but after about 10 minutes they would build up again.
We ran an strace on one of the processes and got this:
[root@opsviewserver scripts]# strace -p 30038
Process 30038 attached - interrupt to quit
write(6, "firewall01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 496
Process 30038 detached
firewall01 happened to be a firewall server which was sending lots of SNMP traps with events about blocked ports. Our theory was that Nagios couldn't process the traps quickly enough. There is a portion of the Nagios code which says that if it cannot pass information back to the main Nagios process via the parent/child pipe, it should back off and try again. These were the spinning Nagios processes which were being created.
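To make the "back off and try again" idea concrete, here is a small stand-alone illustration. It is not the Nagios source, just a sketch of the general pattern under our own assumptions: a pipe that nobody is reading fills up, and a writer that retries after a short sleep stays stuck until the reader drains it. The function names, retry count and delay are ours; the 496-byte payload simply mirrors the write size in the strace.

```python
import os
import time

def write_with_backoff(fd, payload, attempts=3, delay=0.01):
    """Try to write payload to a non-blocking fd.

    When the pipe is full, sleep briefly and retry. Returns True once the
    write succeeds, False if every attempt found the pipe full.
    """
    for _ in range(attempts):
        try:
            os.write(fd, payload)
            return True
        except BlockingIOError:
            time.sleep(delay)  # back off, then try again
    return False

def demo():
    """Fill a pipe with no reader, show the writer gets stuck, then drain it."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    message = b"x" * 496  # same size as the write seen in the strace
    try:
        written = 0
        # The "parent" is busy and never reads, so the pipe fills up.
        while write_with_backoff(write_fd, message, attempts=1, delay=0):
            written += 1
        stuck = not write_with_backoff(write_fd, message)
        # Once the parent finally reads, the child can make progress again.
        os.read(read_fd, 65536)
        recovered = write_with_backoff(write_fd, message)
        return written, stuck, recovered
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

The point of the toy is only that the writers are not the problem: they pile up because the reader on the other end of the pipe is busy doing something else.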
So where was our bottleneck? Because this datacentre had lots of servers down, we thought that Nagios must be processing a lot of host checks. Our host check command was set up to try a check_ping -p 6 to see if the server was alive. That means ping sends 6 packets, by default one second apart, before check_ping returns a result to Nagios. This takes over 5 seconds to run.
Here was the bottleneck. In Nagios 2, host checks are serialised so other parts of Nagios stop until a host result is determined.
We had a massive performance boost when we changed our host check command to check_ping -p 1. This runs in less than a tenth of a second.
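To put rough numbers on this, here is a back-of-the-envelope calculation. The figure of 100 queued host checks is purely hypothetical, and the per-check timings are the approximate ones quoted in this post (about 5 seconds for -p 6, about 0.1 seconds for -p 1). Retries are ignored.

```python
def serial_blocking_seconds(host_checks, seconds_per_check):
    """Total time spent blocked when host checks run strictly one after another."""
    return host_checks * seconds_per_check

# Assumption for illustration only, not a figure from the customer site.
HYPOTHETICAL_HOST_CHECKS = 100

slow = serial_blocking_seconds(HYPOTHETICAL_HOST_CHECKS, 5.0)  # check_ping -p 6
fast = serial_blocking_seconds(HYPOTHETICAL_HOST_CHECKS, 0.1)  # check_ping -p 1

print(f"-p 6: about {slow:.0f} s blocked; -p 1: about {fast:.0f} s blocked")
```

Under those assumptions that is roughly 500 seconds of everything else waiting, against roughly 10.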
The Nagios docs even warn you about this. Shame we didn't appreciate it earlier.
So the lesson is: make sure the host check command executes quickly.
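One way to check is simply to time the command as the nagios user would run it. The helper below is a minimal sketch, not an Opsview tool: the function name and the 30 second timeout are our own choices, and `subprocess.run` will raise `subprocess.TimeoutExpired` if the command runs longer than that.

```python
import subprocess
import time

def time_command(argv, timeout=30.0):
    """Run a command once and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,  # we only care about timing and exit code
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return completed.returncode, time.monotonic() - start

# Example call, left commented out because it needs the plugin installed.
# HOSTNAME is a placeholder; the thresholds are the ones from the listing.
# code, seconds = time_command([
#     "/usr/local/nagios/libexec/check_ping",
#     "-H", "HOSTNAME", "-w", "3000.0,80%", "-c", "5000.0,100%", "-p", "1",
# ])
```

Try it against a host you know is down as well as one that is up, since the down case is the one that hurt us.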
In Nagios 3.0, host checks are going to be parallelised. In fact, this is already in CVS.
There are alternatives to check_ping. Some people swear by check_icmp, others by check_fping. We personally prefer check_ping. Just make sure you use the -p 1 option.