With Opsview, one of the big features is the simple distributed monitoring - you just select a drop down to associate a host with a slave server and then when you hit the Opsview reload button, all the Nagios configurations are generated as you'd expect (slaves monitoring, master with freshness checking, automatic distribution to slaves, synchronized reloading). It works amazingly well.
But one of the niggly issues we have is that some services go stale before we think they should. So we've been tweaking some of the algorithms for setting the freshness_threshold.
One situation we found was that a busy master can lose some slave results while it is being restarted (due to the infamous limitation of the command pipe filling up). So when the master comes back, it may have missed one polling cycle's result from the slave and will mark the service as stale before the slave has had a chance to send the next result.
So we patched it, by adding the freshness_threshold to the program_start time instead of the service's last check time. And we sent an email to the nagios-devel mailing list to let them know. This was accepted into Nagios 2.1. And we got fewer stale results - hooray!
Roll forward a year. Michelle Craft then discovered that this patch caused a problem - if you set a passive service to have a freshness_threshold of 1 day, but you restart Nagios every day, then the service never expires its freshness threshold. That's a bad bug, and I'm quite ashamed that it slipped through.
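To make the two rules concrete, here is a minimal sketch in C. It is not the actual Nagios source: the struct and the function names are made up for illustration, and we assume the patched rule only counts from program_start when the restart happened after the last check.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a Nagios service record. */
struct demo_service {
    time_t last_check;        /* when the last passive result arrived */
    int freshness_threshold;  /* seconds a result is considered fresh */
};

/* Original rule: the result expires a fixed interval after the last check. */
time_t expiry_from_last_check(const struct demo_service *svc)
{
    assert(svc != NULL);
    return svc->last_check + svc->freshness_threshold;
}

/* Patched rule as described above (assumption: program_start is only
 * used as the base when the restart is later than the last check). */
time_t expiry_from_program_start(const struct demo_service *svc,
                                 time_t program_start)
{
    time_t base;

    assert(svc != NULL);
    base = (program_start > svc->last_check) ? program_start
                                             : svc->last_check;
    return base + svc->freshness_threshold;
}

/* A result is stale once the current time has passed its expiry. */
int is_stale(time_t expiry, time_t now)
{
    return now > expiry;
}
```

With a threshold of 300 seconds, a last check at 1000 and a restart at 1290, the original rule calls the service stale at 1310 while the patched rule does not, which is the improvement we wanted. With a threshold of one day and a restart within every 24 hours, though, the patched expiry always lands in the future, so the service can never go stale.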
Fortunately, we had a solution. Ethan wrote a patch very quickly for Nagios 3, but we wanted something a bit more robust.
At Altinity, we're big fans of testing. This is not because we like to test - heck, we hate testing as much as the next developer. But we hate regression and unintended consequences more. With the Nagios Plugins, there's a really large set of tests that get run for every nightly build, with a nice web page that displays the state. One of the tools that makes it happen is libtap, a library written by Nik Clayton. This is a way of testing C code with output in Perl's test format (TAP, the Test Anything Protocol). Apparently, a lot of FreeBSD tests are being written in libtap to prove there are no regressions.
There are some instructions on the Nagios Plugins site for installing libtap on your development servers.
So we've fixed this problem now by moving the freshness calculation algorithm into a separate file and then writing a small C program with dummy services and hosts to test that the right thresholds are being returned. The benefits were immediate - I found I had put a wrong bracket around an if statement when one of the tests failed.
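The shape of such a test looks roughly like the sketch below. It is not the test from the patch: the dummy service struct, freshness_expiry() and the tiny tap_ok() helper are all hypothetical, and the helper only imitates TAP output so the example compiles without libtap installed. With libtap itself you would use its plan_tests(), ok() and exit_status() calls instead.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical dummy service, standing in for the real Nagios struct. */
struct dummy_service {
    time_t last_check;
    int freshness_threshold;  /* seconds */
};

/* Hypothetical function under test: the calculation pulled out into
 * its own unit so a test program can call it directly. */
static time_t freshness_expiry(const struct dummy_service *svc,
                               time_t program_start)
{
    time_t base;

    assert(svc != NULL);
    base = svc->last_check;
    if (program_start > base)
        base = program_start;
    return base + svc->freshness_threshold;
}

static int tests_run = 0;
static int tests_failed = 0;

/* Minimal TAP-style reporter: prints "ok N - name" or "not ok N - name". */
static void tap_ok(int passed, const char *name)
{
    tests_run++;
    if (!passed)
        tests_failed++;
    printf("%s %d - %s\n", passed ? "ok" : "not ok", tests_run, name);
}

/* Runs the checks against dummy services; returns the failure count. */
int run_freshness_tests(void)
{
    struct dummy_service svc = { 1000, 300 };

    tests_run = 0;
    tests_failed = 0;
    printf("1..3\n");  /* TAP plan: three tests follow */

    tap_ok(freshness_expiry(&svc, 500) == 1300,
           "restart before last check: expiry counts from last check");
    tap_ok(freshness_expiry(&svc, 1200) == 1500,
           "restart after last check: expiry counts from program start");
    svc.freshness_threshold = 0;
    tap_ok(freshness_expiry(&svc, 500) == 1000,
           "zero threshold: expiry equals last check");

    return tests_failed;
}
```

Calling run_freshness_tests() from a main() and returning non-zero on failure is enough for make test to notice a regression. The point is that the calculation no longer needs a running Nagios to be exercised.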
The patch, which consists of a patch file, a new freshness.c file and a tarball for the new test directory, applies cleanly onto Nagios 2.9. You need to run autoconf afterwards. ./configure will detect the existence of libtap and compile the test executables. Then when you run make test, it should execute the test and make sure it works properly (you may need to export LD_LIBRARY_PATH=/usr/local/lib to get the libtap library detected properly at runtime).
Tests are hard to do, but worthwhile in the long run. I see it as making sure things still continue to work the same way you expect. And that has to be a good thing.
Hopefully this can be the start of some automated testing for our favourite open source monitoring system!