The problem
For one customer, we had a major scaling issue with distributed monitoring and NSCA. The initial setup was one master and 5 slaves using send_nsca to send passive service check results back to the master. This is the standard setup, with an ocsp_command like the submit_check_result script.
But we started to see some bad figures in the Nagios performance statistics. The average Check Latency was showing 9.5 seconds, which seemed far too long. On the master, we could see 50+ nsca daemon processes, though they didn't appear to be doing anything.
The revelation
The revelation came when we looked on the slave. At any one time, there was only one send_nsca running! So even though the service checks were being run in parallel, it looked like ocsp_commands were being sent serially. This had to be our bottleneck.
The solution
So we wrote a script called send_nsca_cached to cache the passive check results. The idea is that the script takes the results as usual, but writes them to a cache file instead of running send_nsca. The cache file holds a start time, so if a result arrives after the start time + cache period, send_nsca is invoked and sends all the cached passive results at once.
We put the script on the slave and could see that the cache file would fill in spurts - 10 entries looked to be written within half a second, but then nothing for a few seconds. Nagios does some tricks to try and spread the service check load, but I wonder if the "traffic jam" of sending the uncached way was causing the services to be bunched up together.
When we checked again an hour later, the maximum Check Latency had dropped to under 1 second and the master had only 9 nsca daemons. And I guess it is much better for network load as well to send a whole bunch of data at once, rather than a single message at a time.
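As a rough illustration of the caching idea described above, here is a minimal Python sketch. It is not the send_nsca_cached script itself (which is Perl). The function name submit_cached and the send callback are hypothetical; send stands in for piping the batch to send_nsca. The sketch does no file locking, so it assumes invocations do not overlap.

```python
import os
import time


def submit_cached(result_line, cache_file, cache_time, send, now=None):
    """Append one result to the cache, or flush the whole batch via send().

    send is a hypothetical callable that receives the batched text; in a
    real setup it would pipe that text to send_nsca.
    Returns True if a batch was sent, False if the result was only cached.
    """
    now = time.time() if now is None else now

    if not os.path.exists(cache_file):
        # New cache: the first line records when this batch window started.
        with open(cache_file, "w") as fh:
            fh.write(f"{int(now)}\n")

    with open(cache_file) as fh:
        started = int(fh.readline())
        cached = fh.read()

    if now - started < cache_time:
        # Still inside the cache period: just remember this result.
        with open(cache_file, "a") as fh:
            fh.write(result_line)
        return False

    # Cache period exceeded: send everything in one go, then reset the timer.
    send(cached + result_line)
    with open(cache_file, "w") as fh:
        fh.write(f"{int(now)}\n")
    return True
```

Note that a flush only happens when the function is called, which is the same property that the warnings below describe for the real script.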
The warnings
There had to be some bad points.
- This script is only for Nagios 2.0+ because of the use of environment variables
- We don't support passive host checks. Not sure if this is a good or bad thing
- Do not use this if your slave is not busy. As send_nsca_cached needs to be invoked in order to send results, if your ocsp_command is only invoked once every minute, then the quickest you will get a batch result sent to the master is every minute, regardless of your cache time. So only use this script on a busy slave. You could use a cache time of 0 to be the same as sending immediately
- Don't make the cache time too large. The results have no timestamps, so when Nagios on the master receives the results, it will process them as if the checks happened just then. Also, if there is too much data being sent, you could fill the command pipe on the Nagios master
- On that point, make sure the master Nagios server has command_check_interval=-1 in nagios.cfg, so that the command pipe is read as quickly as possible. There are known limitations that if the pipe is filled, processes writing to the pipe will hang until more space is available
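The warning about slaves that are not busy can be made concrete with a small simulation. This is a hedged Python sketch, not part of the script: flush_times is a hypothetical helper that takes the times (in seconds) at which the ocsp_command is invoked and returns the times at which a batch would be sent, assuming the "flush when start time + cache period is exceeded" rule described earlier.

```python
def flush_times(invocation_times, cache_time):
    """Return the invocation times at which a batch would be sent."""
    flushes = []
    window_start = None
    for now in invocation_times:
        if window_start is None:
            # First invocation creates the cache and starts the timer.
            window_start = now
        if now - window_start >= cache_time:
            # Cache period exceeded: send the batch and restart the timer.
            flushes.append(now)
            window_start = now
    return flushes


# Quiet slave: one result per minute, cache time of 5 seconds.
quiet = flush_times([0, 60, 120, 180], cache_time=5)

# Busy slave: one result per second for 20 seconds, same cache time.
busy = flush_times(list(range(0, 21)), cache_time=5)
```

Under these assumptions the quiet slave only sends once a minute (quiet is [60, 120, 180]), while the busy slave sends every 5 seconds (busy is [5, 10, 15, 20]). With a cache time of 0, every invocation sends immediately.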
The future
That last point about the command pipe is being (partially) addressed in Nagios 3.0. Ethan said at the Nagios Conference in Germany that there will be a new external command called PROCESS_FILE. The idea is that nsca can drop a file containing passive check results onto the master, and then only one command is put into the pipe, which will process that entire batch.
The real solution to point (3), the slave that is not busy, is to let the caching be done at Nagios, rather than externally, and that is also on the radar for Nagios 3.0. So there is lots to look forward to there. But if you want something now, check out our script. It's not a perfect script because it's hard coded in various places and you will need to customise the send_nsca command, but we hope it helps you regardless. Enjoy!
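To show roughly what the PROCESS_FILE idea looks like, here is a hedged Python sketch that formats a batch of results and the single command that would go into the pipe. The helper names batch_lines and process_file_command are hypothetical, and the command syntax (PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output and PROCESS_FILE;file_name;delete) is an assumption based on the external command format as later documented for Nagios 3, so check it against your version.

```python
import time


def batch_lines(results, now=None):
    """Format (host, service, state, output) tuples as external commands.

    These lines would be written to a batch file on the master.
    """
    now = int(time.time() if now is None else now)
    return [
        f"[{now}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}"
        for host, service, state, output in results
    ]


def process_file_command(batch_path, delete=True, now=None):
    """Build the one command that tells Nagios to read the batch file.

    A non-zero delete flag asks Nagios to remove the file afterwards
    (assumed semantics, see the lead-in).
    """
    now = int(time.time() if now is None else now)
    return f"[{now}] PROCESS_FILE;{batch_path};{1 if delete else 0}"


# Example: two results become two lines in the file, one line in the pipe.
lines = batch_lines(
    [("web1", "HTTP", 0, "HTTP OK"), ("web1", "Disk", 2, "DISK CRITICAL")],
    now=1000,
)
command = process_file_command("/tmp/nsca_batch.cmd", now=1000)
```

The point of the design is that the command pipe only ever sees one short line per batch, however many results the batch file holds. The path /tmp/nsca_batch.cmd is just a placeholder.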
The end?
Not quite. At the Nagios Conference, Ethan was talking to two guys who were complaining that their distributed setup had huge slowdowns. I overheard and the symptoms looked exactly the same, so I gave them a copy of the script. Apparently it helped, but they had some lock ups in Nagios, which they attributed to our script - so caveat emptor. They have since reverted to using the standard uncached mechanism.
We haven't had any issues for our customers, so we're interested in what you find. If you have a distributed environment with similar symptoms and you are thinking of using this script, please take a note of your Check Latency and the number of nsca daemons and add a comment to this blog with some before and after statistics. We'd love to know if this works elsewhere. Good luck!
Always nice to see solutions to this problem. If I may, I would like to suggest a couple of other workarounds.
First, there is the OCSP Sweeper (hosted at nagiosexchange). It's been around for ages, and lets you bulk send send_nsca checks at the FIFO limit and/or at a time interval.
If your distributed nagios solution is really big, then one idea is to send only state changes to your central monitoring server. You can always do your perfdata parsing on your slaves, and make a drilldown link to that data on your central server with a quick apache proxy definition.
Posted by: Andre Bergei | January 26, 2007 at 02:39 PM
Andre,
Thanks for the pointer to OCSP Sweeper. We discounted options using another daemon - we didn't really want to add another daemon onto our system. Ideally, the "caching" should be done at the Nagios level so if we feel there is a need for a persistent caching mechanism, we'll probably look at amending Nagios (though I think Nagios 3 should cater for this - we haven't looked in detail yet).
I like the idea re: sending only state changes - we have to give it some more thought. At the moment, there is an assumption that the master knows every state, not just failures, so this would be a major change for Opsview. Also, we've designed distributed monitoring so that only a single SSH port is required between the master and slaves. Redirecting perf data to the slave would require an HTTP port open as well.
We're also betting heavily on NDOUtils, which has other techniques for status of slaves (send to a single database or a database per slave) which we haven't investigated fully yet.
Posted by: tonvoon | January 29, 2007 at 11:27 AM
Hi, great script and great idea. I had the idea as well after I saw the terribly serial behaviour of the OCSP and OCHP commands, but then I found your script so I didn't have to spend days writing my own solution! Thank you!
I made a few changes you might find handy, like enabling the script to be used for both the OCSP and OCHP commands, not just for services. Here's my updated version:
#!/usr/bin/perl
#
# SYNTAX:
# send_nsca_cached [cache_time]
#
# DESCRIPTION:
# Used to pass passive results. Caches results and submits at 5 second
# intervals by default. The cache time can be specified on
# command line - 0 to send immediately
#
# Requires Nagios 2.0+
#
# Warning: this script needs to be invoked for a send_nsca to occur, so
# if you only have 1 service on a slave that is run every minute, the
# minimum time between sends is 1 minute, regardless of the cache_time setting.
# So you should only use on a busy slave.
#
# Warning 2: Do not use a cache time that is too large. Even a cache time of
# 1 second will help performance dramatically on a busy slave.
#
# AUTHORS:
# Copyright (C) 2006 Altinity Limited
#
# This file is part of Opsview
#
# Opsview is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# Opsview is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Opsview; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
#
# CHANGELOG:
# v1.0.1 - Oliver Hookins, Anchor Systems, 07/03/2008
# - Replaced hard coded nsca command with a single variable
# - Changed file paths to reflect RHEL standards
# - Altered output sub to handle both OCSP and OCHP commands
#
# v1.0.0 - Altinity Limited, 07/03/2008
# - original downloaded version from http://altinity.blogs.com/dotorg/send_nsca_cached
use strict;
my $cache_time = shift @ARGV;
$cache_time = 5 unless defined $cache_time;
my $nsca_command = "/usr/sbin/send_nsca -H nagios-server -c /etc/nagios/send_nsca.cfg";
# A cache time of 0 means send this result immediately
if ($cache_time == 0) {
    open SEND_NSCA, "| $nsca_command";
    print SEND_NSCA &output;
    close SEND_NSCA;
    exit;
}
my $cache_file = "/var/log/nagios/send_nsca.cache";
my $now = time;
my $last_updated;
if (-e $cache_file) {
    open CACHE, "+<", $cache_file;
    $last_updated = <CACHE>;    # first line holds the cache start time
    #print "Last updated: ", scalar localtime $last_updated, $/;
} else {
    open CACHE, "+>", $cache_file;
    print CACHE $now, $/;
    $last_updated = time;
    #print "New cache",$/;
}
if ($now - $last_updated < $cache_time) {
    seek CACHE, 0, 2; # Goto end
    print CACHE &output;
} else {
    # Send the rest of the cache file plus this result in one go
    open SEND_NSCA, "| $nsca_command";
    print SEND_NSCA <CACHE>, &output;
    close SEND_NSCA;
    #print "Will send:", $/;
    #print <CACHE>;
    #close CACHE;
    #print "Plus this one:", &output;
    # Reset time
    open CACHE, ">", $cache_file;
    print CACHE time, $/;
    # Update send_nsca status
    my $status_file = "/var/log/nagios/ocsp.status";
    open STATUS, ">", $status_file;
    if ($? == 0) {
        print STATUS "0";
    } else {
        print STATUS "2";
    }
    close STATUS;
}
close CACHE;
exit;
# Build one tab-separated result line: host result if no service is set
sub output {
    if ($ENV{NAGIOS_SERVICEDESC} eq "") {
        return "$ENV{NAGIOS_HOSTNAME}\t$ENV{NAGIOS_HOSTSTATEID}\t$ENV{NAGIOS_HOSTOUTPUT}\n";
    } else {
        return "$ENV{NAGIOS_HOSTNAME}\t$ENV{NAGIOS_SERVICEDESC}\t$ENV{NAGIOS_SERVICESTATEID}\t$ENV{NAGIOS_SERVICEOUTPUT}\n";
    }
}
Posted by: Oliver Hookins | March 7, 2008 at 06:18 AM
I've tried this in a large system (1000 hosts, 5000 services) and it works fine!
Some comments:
* we need to cache services and hosts
* sending all the data takes some time, so to avoid stalling Nagios scheduling, it is better to fork another process.
So here is another version:
#!/usr/bin/perl
#
# SYNTAX:
# send_nsca_cached [cache_time]
#
# DESCRIPTION:
# Used to pass passive results. Caches results and submits at 5 second
# intervals by default. The cache time can be specified on
# command line - 0 to send immediately
#
# Requires Nagios 2.0+
#
# Warning: this script needs to be invoked for a send_nsca to occur, so
# if you only have 1 service on a slave that is run every minute, the
# minimum time between sends is 1 minute, regardless of the cache_time setting.
# So you should only use on a busy slave.
#
# Warning 2: Do not use a cache time that is too large. Even a cache time of
# 1 second will help performance dramatically on a busy slave.
#
# AUTHORS:
# Copyright (C) 2006 Altinity Limited
#
# This file is part of Opsview
#
# Opsview is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# Opsview is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Opsview; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
#
# CHANGELOG:
# v1.0.2 - cedric.cabessa@uperto.com, 13/01/2009
# - fork before sending data
# v1.0.1 - Oliver Hookins, Anchor Systems, 07/03/2008
# - Replaced hard coded nsca command with a single variable
# - Changed file paths to reflect RHEL standards
# - Altered output sub to handle both OCSP and OCHP commands
#
# v1.0.0 - Altinity Limited, 07/03/2008
# - original downloaded version from http://altinity.blogs.com/dotorg/send_nsca_cached
use strict;
my $cache_time = shift @ARGV;
$cache_time = 5 unless defined $cache_time;
my $nsca_command = "/usr/sbin/send_nsca -H nagios-server -c /etc/nagios/send_nsca.cfg";
# A cache time of 0 means send this result immediately
if ($cache_time == 0) {
    open SEND_NSCA, "| $nsca_command";
    print SEND_NSCA &output;
    close SEND_NSCA;
    exit;
}
my $cache_file = "/var/log/nagios/send_nsca.cache";
my $now = time;
my $last_updated;
if (-e $cache_file) {
    open CACHE, "+<", $cache_file;
    $last_updated = <CACHE>;    # first line holds the cache start time
    #print "Last updated: ", scalar localtime $last_updated, $/;
} else {
    open CACHE, "+>", $cache_file;
    print CACHE $now, $/;
    $last_updated = time;
    #print "New cache",$/;
}
if ($now - $last_updated < $cache_time) {
    seek CACHE, 0, 2; # Goto end
    print CACHE &output;
} else {
    #child send_data, father exit
    my $pid=fork();
    if (not defined $pid) {
        print STDERR "FATAL cannot fork \n";
    }elsif ($pid==0){
        # Send the rest of the cache file plus this result in one go
        open SEND_NSCA, "| $nsca_command";
        print SEND_NSCA <CACHE>, &output;
        close SEND_NSCA;
        #print "Will send:", $/;
        #print <CACHE>;
        #close CACHE;
        #print "Plus this one:", &output;
        # Reset time
        open CACHE, ">", $cache_file;
        print CACHE time, $/;
        # Update send_nsca status
        my $status_file = "/var/log/nagios/ocsp.status";
        open STATUS, ">", $status_file;
        if ($? == 0) {
            print STATUS "0";
        } else {
            print STATUS "2";
        }
        close STATUS;
        close CACHE;
    }
}
exit;
# Build one tab-separated result line: host result if no service is set
sub output {
    if ($ENV{NAGIOS_SERVICEDESC} eq "") {
        return "$ENV{NAGIOS_HOSTNAME}\t$ENV{NAGIOS_HOSTSTATEID}\t$ENV{NAGIOS_HOSTOUTPUT}\n";
    } else {
        return "$ENV{NAGIOS_HOSTNAME}\t$ENV{NAGIOS_SERVICEDESC}\t$ENV{NAGIOS_SERVICESTATEID}\t$ENV{NAGIOS_SERVICEOUTPUT}\n";
    }
}
Posted by: Cédric | January 13, 2009 at 02:58 PM