18 July 2007

SMS alerting via AQL

We came across AQL by accident. They came to us because they were interested in Opsview and we looked into what their company was about. They provide SMS messaging services: you buy a prepaid amount of credits and then you can send SMS text messages via their website in a variety of ways.

[Screenshot: systempreferences.png]

Our sales director thought it would be a good idea to integrate their service with Opsview. We agreed and thought there was a nice synergy about it.

So we've now integrated AQL's messaging through our UI. In the upcoming 2.8 release, there's a new screen: System Preferences. Here you can sign up at AQL and then enter the username and password. We even give you a little Check credits AJAX button for you to test your connectivity.

[Screenshot: mobilenumber.png]

Then on the Contacts screen, you enter your mobile phone number for SMS (with JavaScript validation so that it is in the correct format) and you can even send a test SMS to make sure this works correctly.

Finally, when Nagios is ready to alert, we send the notification via SMS instead of email or RSS. Simple!

Actually, it was quite hard. We just like to make it look easy.

To communicate with AQL's servers, there are various methods: HTTP/HTTPS, XMPP, SOAP and a few others that made my eyes water. We just wanted a nicely encapsulated module to send a message.

And we found one on CPAN. SMS::AQL is a Perl module written by David Precious. It works over HTTP and worked a treat. However, we initially worked with version 0.02 and the tests there were failing because they tried to contact AQL's servers. This caused us some problems in our automated Perl install.

So we set to work enhancing the module. First thing was to update the tests. Using Test::MockObject, we were able to reply to SMS::AQL's HTTP calls as if they were being returned from AQL's servers. This allowed some really intensive testing. Using Devel::Cover, we got 91% coverage in our testing! We found lots of inconsistencies in the API, which we fixed as well. Finally, we cleaned up the messages so there is a single lookup table now.
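The mocking idea is not specific to Perl. Here is a minimal sketch of the same technique in Python using unittest.mock rather than Test::MockObject. Everything in it is hypothetical: the GatewayClient class, the placeholder URL and the "CREDITS=<number>" reply format are made up for illustration and are not SMS::AQL's interface or AQL's real protocol.

```python
from unittest import mock


class GatewayClient:
    """Hypothetical SMS gateway client (not the SMS::AQL interface)."""

    def __init__(self, http_get):
        # http_get is any callable taking (url, params) and returning the body text.
        self.http_get = http_get

    def credit(self, username, password):
        body = self.http_get(
            "https://gateway.example/credit",  # placeholder URL, an assumption
            {"username": username, "password": password},
        )
        # Assumed reply format, for illustration only: "CREDITS=<number>"
        key, _, value = body.strip().partition("=")
        if key != "CREDITS" or not value.isdigit():
            raise ValueError("unexpected reply: %r" % body)
        return int(value)


# In a test, a Mock replaces the network call and returns a canned reply,
# so nothing ever leaves the machine running the tests.
fake_get = mock.Mock(return_value="CREDITS=42\n")
client = GatewayClient(fake_get)
credits = client.credit("user", "secret")
```

Because the canned replies are under the test's control, error replies can be exercised just as easily as successful ones, which is the kind of thing that pushes coverage up.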

The guys at AQL have been very helpful in providing us with technical information. And David Precious has updated the perl module with our changes. And he's written a blog post too!

It's a symbiotic way of working - we didn't start from scratch working on an interface with AQL's systems, but we've managed to contribute back to existing code and move it up another level. Everybody wins! (Well, except for other monitoring system companies that want to be international conglomerates.) So now everyone can use the CPAN module to get SMS alerting.

But if you want a quick way of sending alerts, you can download our script here. This is the script that will be distributed with the 2.8 release soon. Just add that onto your server and put a notification command entry into Nagios like:

define command {
	command_name service-notify-by-sms
	command_line /usr/local/nagios/bin/submit_sms_aql -u aql_username -p aql_password -n $CONTACTPAGER$ -t "$SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$: $SERVICEOUTPUT$ ($SHORTDATETIME$)"
}

To be honest, I can't remember all the associations with the contact definitions - check out the Nagios documentation to set it up. I just use Opsview because it makes Nagios easier to administer. And now, Opsview makes SMS alerting easier too.

21 June 2007

Tweaking the freshness checking algorithm

With Opsview, one of the big features is the simple distributed monitoring - you just select a drop down to associate a host with a slave server and then when you hit the Opsview reload button, all the Nagios configurations are generated as you'd expect (slaves monitoring, master with freshness checking, automatic distribution to slaves, synchronized reloading). It works amazingly well.

But one of the niggly issues we have is that some services go stale before we think they should. So we've been tweaking some of the algorithms for setting the freshness_threshold.

One situation we found was that when the master was being restarted, a busy master can lose some slave results during the reload (due to the infamous command pipe being full limitation). So when the master comes back, it could lose one polling cycle's result from the slave and mark the service as stale before the slave has had a chance to send the next result.

So we patched it, by adding the freshness_threshold to the program_start time instead of the service's last check time. And we sent an email to the nagios-devel mailing list describing the change. This was accepted into Nagios 2.1. And we got fewer stale results - hooray!

Roll forward a year. Michelle Craft then discovered that this patch caused a problem - if you set a passive service to have a freshness_threshold of 1 day, but you restart Nagios every day, then the service never expires its freshness threshold. That's a bad bug, and I'm quite ashamed that slipped through.
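To make the two behaviours concrete, here is a small Python sketch of the arithmetic. It is not the Nagios C code: the function names are made up, and it assumes the first patch counts the threshold from whichever is later, the last check or the program start.

```python
DAY = 24 * 60 * 60  # seconds


def expiry_from_last_check(last_check, threshold, program_start):
    """Original rule: a result expires `threshold` seconds after the last check."""
    return last_check + threshold


def expiry_from_program_start(last_check, threshold, program_start):
    """First patch (as assumed here): count from the restart if it came later."""
    return max(last_check, program_start) + threshold


def is_stale(expiry, now):
    return now > expiry


def stale_after_daily_restarts(expiry_rule, days, threshold=DAY):
    """Simulate a passive service whose only result arrived at t=0 while the
    daemon restarts once a day; report staleness just before each next restart."""
    last_check = 0
    verdicts = []
    for day in range(1, days + 1):
        program_start = day * DAY
        now = program_start + DAY - 1
        expiry = expiry_rule(last_check, threshold, program_start)
        verdicts.append(is_stale(expiry, now))
    return verdicts
```

With the original rule the service is reported stale on every simulated day. With the patched rule each restart pushes the expiry one threshold further out, so a service that has heard nothing since t=0 is never flagged, which is the bug described above.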

Fortunately, we had a solution. Ethan wrote a patch very quickly for Nagios 3, but we wanted something a bit more robust.

At Altinity, we're big fans of testing. This is not because we like to test - heck, we hate testing as much as the next developer. But we hate regression and unintended consequences more. With the Nagios Plugins, there's a really large set of tests that get run for every nightly build, with a nice web page that displays the state. One of the tools that makes it happen is libtap, a library written by Nik Clayton. This is a way of testing C code with output in the same format as Perl tests. Apparently, a lot of FreeBSD tests are being written in libtap to prove there are no regressions.

There are some instructions on the Nagios Plugins site for installing libtap on your development servers.

So we've fixed this problem now by moving the freshness calculation algorithm into a separate file and then writing a small C program with dummy services and hosts to test that the right thresholds are being returned. The benefits were immediate - I found I had put a wrong bracket around an if statement when one of the tests failed.

The patch, which consists of a patch file, a new freshness.c file and a tarball for the new test directory, applies cleanly onto Nagios 2.9. You need to run autoconf afterwards. ./configure will detect the existence of libtap and compile the test executables. Then when you run make test, it should execute the test and make sure it works properly (you may need to export LD_LIBRARY_PATH=/usr/local/lib to get the libtap library detected properly at runtime).

Tests are hard to do, but worthwhile in the long run. I see it as making sure things still continue to work the same way you expect. And that has to be a good thing.

Hopefully this can be the start of some automated testing for our favourite open source monitoring system!

26 January 2007

The importance of being earnestly tested

We ran across a problem with NSCA 2.6 yesterday. It turned out that running the nsca daemon in single mode only worked for the first packet of data from send_nsca and hung for subsequent calls.

This was actually first discovered by Rudolf van der Leeden and it looks like it has been with us since April 2006 when NSCA 2.6 was first released, through to the current NSCA 2.7. We never picked it up until running it on a customer site which was tuned to use --single.

The fix is as Rudolf suggests - uncommenting the if statement that was removed. Our patch is here.

How do we know it works? Well, we've written a series of test scripts for NSCA.

We've always been big fans of testing. We love using the Test Anything Protocol (TAP) in Perl. CPAN encourages you to write good tests to make sure your Perl modules run, which is why we know that modules we've uploaded to CPAN continue to work while we've been updating them. And we've provided quite a few fixes to CPAN modules where the tests fail (and some just suggest that we have a broken version of Perl).

Here are the test scripts for NSCA. They are more like functional tests - they check that the daemon can start up and accept messages, and compare the output in the dummy nagios.cmd file with the sent data. Unit testing is a bit more tricky to do for C code - though libtap is being used for the Nagios Plugins.

To use the test scripts, drop them into the top level of the NSCA directory after you've compiled NSCA, then cd into nsca_tests and run prove *.t. You will require several CPAN modules: Test::More, Class::Struct, Clone and Parallel::Forker, though most will come with your Perl distribution.
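The comparison at the heart of such a functional test can be sketched in a few lines. This is an illustration in Python, not the Perl test scripts themselves, and the function names are made up. It relies on two formats: send_nsca reads tab-separated lines of host, service, return code and plugin output, and passive service results appear in the command file as PROCESS_SERVICE_CHECK_RESULT lines prefixed with a timestamp.

```python
import re

# "[<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<code>;<output>"
CMD_RE = re.compile(
    r"^\[(\d+)\] PROCESS_SERVICE_CHECK_RESULT;([^;]*);([^;]*);(\d+);(.*)$"
)


def parse_sent(text):
    """Parse send_nsca input: host<TAB>service<TAB>return_code<TAB>output."""
    results = []
    for line in text.splitlines():
        if not line:
            continue
        host, service, code, output = line.split("\t", 3)
        results.append((host, service, int(code), output))
    return results


def parse_cmd_file(text):
    """Parse command-file lines, dropping the timestamp, which will differ per run."""
    results = []
    for line in text.splitlines():
        match = CMD_RE.match(line)
        if match:
            results.append(
                (match.group(2), match.group(3), int(match.group(4)), match.group(5))
            )
    return results


def all_received(sent_text, cmd_text):
    """True when every sent result arrived exactly once, in any order."""
    return sorted(parse_sent(sent_text)) == sorted(parse_cmd_file(cmd_text))
```

Sorting before comparing is a deliberate choice in this sketch: when many senders run at once, the order in which results land in the file is not guaranteed.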

There are 3 tests at the moment:


  • basic - just sends a few passive checks and makes sure that the nagios.cmd file receives them

  • multiple - runs the same as basic, but several times to check the daemon can handle multiple requests

  • simultaneous - runs lots of send_nscas at the same time (well, nearly). Uses Parallel::Forker to set up all the sends then executes them all at once. Expect about 200 extra processes to hit your server!

You'll find that the multiple and simultaneous tests fail with the stock NSCA 2.6 and 2.7. But when the patch is applied, all the tests pass.

The tests can obviously be extended, but this is a start and covers this basic functionality.

We hope Ethan will look into adding this to the NSCA distribution.

We're upset that something like this got to one of our customers, but we're more upset with ourselves for not catching this much earlier. This should be a good step towards better QA of future NSCA releases.

Update: Ethan has updated NSCA to 2.7.1 to fix this problem. And the tests are included as well!