03 February 2009

Opsview: The Next Chapter

Back in October 2008, Opsera acquired Altinity on the strength of the Opsview product and our customer and user base. It was a great marriage, as Opsera were providing IT consultancy and wanted to expand into products; Opsview has now joined Ops Mail Manager in that line-up. And there are some really smart people working at Opsera, which is always a good thing!

Since then, we've been really busy integrating Altinity's IT systems and procedures with the rest of Opsera, taking the best practices from both companies. And with Opsera being a larger organisation, we've had more resources to handle the large number of new customers and work requests.

So it gives us great pleasure to announce the availability of Opsview 3. We've been working really hard over the last few months to get this release out.

The number one question we had towards the middle of last year was: "When will you use Nagios 3?" This was quite funny because we've been involved in getting lots of our changes into Nagios 3, as we've worked with the community on enhancing the core software. But there have been some other great changes in Nagios 3, so it's good to get back to the edge of the Nagios development cycle.

(As an aside, we used to have 39 patches that we made to Nagios 2.10, but against Nagios 3.0.6 we only have 26 patches - our aim is to have as few as possible. We also found one problem during our testing; we contributed the fix back and it has been included in the main code. We've got some more patches to Nagios which are good for general use, which we'll blog about soon.)

So, Opsview 3 is here now, based on Nagios 3 and lots of other goodies. And it is all released under GPL.

You can see more about Opsview and our unique take on status views and configuration in our new Quick Start guide.

05 August 2008

Enhancing NRPE for large output

NRPE is great for getting plugin information from a remote host. We wanted to use it to get passive data regarding events, such as syslog entries that SEC had highlighted. This meant we needed two things: multi-line support and larger amounts of output.

Multi-line is already in NRPE 2.12 - this was added by Matthias Flacke last year. However, the limit for data is 1K.

We wanted to be able to bump that figure up to 16K. There's a common.h variable which is called MAX_PACKETBUFFER_LENGTH which is set to 1024. We found we could increase this value and then more data was returned. But there were two problems with it:


  • it broke backwards compatibility

  • it increased the size of each packet

The second had an impact on the network. Instead of 1K packets being sent between client and server, we now got 16K packets sent, even if the data contained was small.

The first was worse: it meant you needed to update the client (check_nrpe) and the server (nrpe) at the same time, otherwise you'd get lots of NRPE errors in Nagios if only one of them had changed.

So we've designed a compatible way: we've added a new packet type called RESPONSE_PACKET_WITH_MORE.

The idea is that check_nrpe will see if the packet returned is of the type RESPONSE_PACKET_WITH_MORE. If so, it will read subsequent packets and append that to the existing data, until it gets a RESPONSE_PACKET. So to read 16K worth of data, check_nrpe reads 16 x 1K packets. Of course, only updated nrpe daemons will send this, so this remains fully backwards compatible with existing nrpe daemons.
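As a rough sketch of the client-side loop described above (not the actual check_nrpe code, which is C): the numeric packet type values and the representation of a packet as a (type, payload) tuple are assumptions made purely for illustration.

```python
# Illustrative values only - assumed for this sketch, not taken from the patch.
RESPONSE_PACKET = 2
RESPONSE_PACKET_WITH_MORE = 3


def read_full_response(packets):
    """Join the payloads of a sequence of (packet_type, payload) tuples.

    Keeps reading while the type is RESPONSE_PACKET_WITH_MORE and stops
    after the first plain RESPONSE_PACKET. A daemon that only ever sends a
    single RESPONSE_PACKET is therefore handled by the same loop.
    """
    chunks = []
    for packet_type, payload in packets:
        chunks.append(payload)
        if packet_type == RESPONSE_PACKET:
            break  # final packet: nothing more to read
        if packet_type != RESPONSE_PACKET_WITH_MORE:
            raise ValueError("unexpected packet type: %r" % (packet_type,))
    else:
        # The loop ran out of packets without seeing a final RESPONSE_PACKET.
        raise ValueError("stream ended before a final RESPONSE_PACKET")
    return "".join(chunks)
```

A reply from an unpatched daemon is just the one-packet case, which is why updating check_nrpe first is safe.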

The patch is here. We've also cleaned up some of the graceful_close calls.

Now the process to update your NRPE agents would be:


  1. update the central check_nrpe, then

  2. update your agents at your leisure


And you won't get any alerts during this period!

Note: during testing, we found that the limit for returned data on some Linux kernels was 4K, even though nrpe was coded with 16K as the limit. This is due to kernel limitations in using pipe() for the interprocess communication.

13 March 2008

NDOutils on Solaris 10

Michael Prochaska was having trouble with compiling NDOutils on Solaris 10. Since we have an interest in getting Opsview working on Solaris (the upcoming 2.12 release will add Solaris 10 as a supported platform), we offered to help. So this is the result of his company, Bacher Systems, sponsoring our work.


15 January 2008

Monitoring Cisco Netflow Data

Netflow is a great feature of Cisco IOS that allows you a view into the traffic that flows over your Cisco network devices, what that traffic is, where it came from and where it is going.

We wanted to make good use of this information and so we started looking for a way for Opsview to monitor it.

With a little configuration of IOS and some open source magic we achieved just that. Now our Opsview servers are keeping tabs on the data moving across our Cisco devices.

So true to our open source way of life we published our setup as part of the Opsview documentation.

08 January 2008

NSCA's aggregate writing

In our continual task to try and speed up Opsview, we found a bug in NSCA's handling of aggregate writes when run in --single mode.

The specific failure scenario is this:


  1. NSCA and Nagios are told to start up
  2. A send_nsca request is received by NSCA before Nagios has created the nagios.cmd command pipe
  3. NSCA tries to open the command file, but sees it is not there
  4. NSCA opens the alternate dump file instead

Now when Nagios does create the nagios.cmd file, NSCA uses that ... unless aggregate mode is on and daemon mode is --single. In this case, it continues to use the alternate dump file, thus Nagios doesn't see the results from the slaves.
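The intended behaviour can be sketched like this. It is an illustration of the idea rather than the patch itself (NSCA is written in C), and the function name is hypothetical: the point is simply that the choice of destination is re-evaluated on every write instead of being remembered from the first attempt.

```python
import os


def pick_output_path(command_file, alternate_dump_file):
    """Return the file a result should be written to right now.

    The existence check is repeated for every write, so a command pipe
    that appears after startup is picked up even if an earlier write had
    to fall back to the alternate dump file.
    """
    if os.path.exists(command_file):
        return command_file
    return alternate_dump_file
```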

Here's the patch, which we've also added into our source for Opsview.

As we are very keen on good testing, we've managed to recreate the failing behaviour in a test script. You also need a test configuration file and a patch to the test framework. If you run this test, it will show the error and then after the patch is applied, the test should pass.

29 September 2007

Nagios Patch Day!

With Nagios 3 rapidly approaching and Opsview celebrating being a full open source project (GPL licensed, source code repository online, Sourceforge project), we think it is time to share some of our Nagios patches.

These are the latest patches you can find for Opsview within our code repository. Some are Opsview specific, but a lot can be incorporated into the core code - we'll say which is which. You can see all these on our SVN site (we've even tagged the current version so this will stay in our repository), but here's the lowdown:

Freshness checking, with separated file and tests

If there's a patch we definitely want to have applied to the core code, it will be this one. Not because of freshness checking per-se (though we'll explain why later), but because of the included libtap tests.

As much as we love Nagios, we're always a bit concerned that regressions may occur. We have complete faith in Ethan, but he's human and unintended effects may occur. In fact, we made one when we originally suggested this freshness patch, so with testing, future changes should hopefully not cause regressions.

It requires work to add in tests and to separate out files, and we intend to stick to our commitment to add new tests in. But the framework needs to be put in place to encourage other tests, otherwise the overhead for Altinity is too high.

Think of tests like this: the code is the generalised form; the test is under specific conditions. The key is to try and get more and more conditions to prove that things work as expected.

Have a look at the test - we think it is easy to see what is being tested here (to get it from our svn repository, you need to extract the tarball). And note how comprehensive it is - we think every case is considered. A change in logic anywhere will be immediately spotted.

We've refactored how the calculated freshness_threshold is arrived at so that we can run tests against it.

There's also an arbitrary 15 seconds added to the freshness threshold. We've made that a new variable called additional_freshness_latency in nagios.cfg, so you can tweak it without recompiling Nagios.
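To make the role of the new variable concrete, here is a minimal sketch of a threshold calculation that is easy to test in isolation. The formula for the automatic case (check interval plus latency plus the extra allowance) is our reading of how Nagios derives a threshold when none is configured, so treat it as an assumption; only the 15 second default and the additional_freshness_latency name come from the text above.

```python
def freshness_threshold(configured_threshold, check_interval, interval_length,
                        latency, additional_freshness_latency=15):
    """Return the freshness threshold in seconds.

    An explicit non-zero threshold is used as-is. Otherwise one is derived
    from the check interval (in interval_length-second units), the observed
    latency and the configurable extra allowance.
    """
    if configured_threshold != 0:
        return configured_threshold
    return check_interval * interval_length + latency + additional_freshness_latency
```

For example, a 5 minute check with 2 seconds of latency gives 5 * 60 + 2 + 15 = 317 seconds with the default allowance.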

Could be applied to core. Please :)


More freshness tweaks

Another thing we found was that Nagios is very fast in reading 10,000 services (5 seconds), but slows down dramatically with NDOutils integrated (2 minutes). It appears to be reading configuration and then sending to the broker modules. Since NDOUtils is synchronously updated, Nagios is waiting while MySQL is running the necessary SQL. We've updated the freshness code by introducing a new variable called monitoring_start. This is when Nagios actually starts monitoring, as opposed to program_start, which is the HUP time. This gives us a better idea of how long it takes Nagios to start up.

We've written a little plugin that returns performance data about the startup time.

Also, we've pushed the threshold forward a little bit more to include the max_host_check_spread/max_service_check_spread, which is important for new services.

We've updated the tests to reflect the changes. Patches on top of other patches get really hard to maintain, which is why we need the libtap tests integrated into the core code.

Could be applied to core.

Initial passive state as OK

This is one where we change the Nagios CGIs to show passive states as OK. We just like everything green.

We don't expect this in core.

Issue commands

This has been applied to Nagios 3.

Status link to Nagiosgraph

This helps with our integration to Nagiosgraph.

We don't expect this in core.

Passive checks do not check host

We've discussed this before.

We don't expect this in core.

Ignore certain retained data

We've mentioned this before on the nagios-devel mailing list. Ethan has made changes to Nagios 3 to support this behaviour.

Adding a time=X to the statusmap

With the AJAX goodies we have in Opsview, we found the statusmap wasn't updating correctly. It appears that some browsers try to use cached data in an XMLHttpRequest if the URL is the same. We've added this to the URL so that it is always unique.
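The cache-busting idea looks roughly like this. It is a sketch only: the real change lives in the CGI and JavaScript, the function name is hypothetical, and using whole epoch seconds as the value is an assumption (two requests within the same second would still share a URL).

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_time_param(url, now=None):
    """Return url with a time=<epoch seconds> query parameter appended.

    Any existing time parameter is replaced so the URL does not keep
    growing when the function is applied repeatedly.
    """
    if now is None:
        now = int(time.time())
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "time"]
    query.append(("time", str(now)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```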

This is AJAX specific, so we don't expect this in core.

W3 validation: history.cgi

We're big on valid HTML. Partly this is because we wrap the CGI output and remove the use of framesets in Opsview. However, it means the HTML has to be valid. We found several problems in the validity of history.cgi and other CGIs below.

A great tool is HTML Validator, which runs as a Firefox plugin - this tells you if your HTML is valid.

Could be applied to core.

Escalation via notification levels

This is an extra field in the contact stanza where you can specify that the contact will receive notifications only after the Nth notification. This makes it an easy way of doing escalations.
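A small sketch of the filtering this implies, assuming "after the Nth notification" means "from the Nth notification onwards". The function names and the {name: level} mapping are hypothetical and only stand in for the contact field.

```python
def should_notify(contact_notification_level, current_notification_number):
    """True once the notification number has reached the contact's level."""
    return current_notification_number >= contact_notification_level


def recipients(contacts, current_notification_number):
    """Filter a {name: level} mapping down to the contacts due an alert."""
    return sorted(name for name, level in contacts.items()
                  if should_notify(level, current_notification_number))
```

With contacts {"oncall": 1, "manager": 3}, only oncall is alerted for the first two notifications and the manager joins in from the third.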

We've spotted an issue where if no notifications are sent, the notification number doesn't get incremented. Maybe this is best as a different macro.

Could be applied to core, but requires a bit more thinking.

Documentation patches for validation

The use of markup caused problems for us, so we've fixed some of the docs.

Could be applied to core.

W3 validation: extinfo.c

We've fixed some validation errors with divs. Have you seen HTML Validator? :)

Could be applied to core.

Trust authentication

This patch stops the Author box from being altered by the logged in user. Ethan has applied something similar to Nagios 3.

Already in Nagios 3.

Slice services within hosts

This patch allows a contact in a contactgroup to only see a subset of services. Normally, a contact to a host sees all the services, but this allows the contact to only see the services specified.

This is possible by setting the contact to not have the host in the contactgroup, but then that stops the contact from taking action on that host.

This could be applied to core, but is a (relatively) major change to the use of contactgroups.

Extinfo icon links to service notes

We find that users click on the extinfo icon and then get a bit worried when nothing happens. We make it a clickable link. There are also a few validation fixes here too (should really be separated out).

Could be applied to core.

Object dump

This is a good one! As you know, we love tests. One thing we do with Opsview is make sure that the configuration being generated is the same after we've tinkered with the rules. We tried to find a good way of doing this - initially we thought about using Nagios::Object to read the config data and then do a diff to find the changes. However, this didn't take into account all the relationships.

What we really wanted was some expanded form of the config files.

It then hit us - Nagios already does this! It uses the object.cache file as an expanded version of the configuration objects for the CGIs to use. So we've patched the core nagios executable so -o will now output to stdout this cache file and then exit. It works great in our testing.
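Once the expanded configuration is available as text, checking for unexpected changes is an ordinary diff. A minimal sketch, assuming the two dumps have already been captured as strings (for example from the stdout of the patched nagios -o before and after a change); the function name is hypothetical.

```python
import difflib


def config_changes(before_dump, after_dump):
    """Return unified-diff lines between two expanded config dumps.

    An empty list means the generated configuration is unchanged.
    """
    return list(difflib.unified_diff(
        before_dump.splitlines(),
        after_dump.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    ))
```

A test can then simply assert that the list is empty after a refactoring that is not supposed to alter the output.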

Could be applied to core.

Retain status file over a reload

In our quest to make Nagios more friendly, there's nothing worse than getting the dreaded "Nagios is not running" screen on your browser. This patch adds a new command line option -F, for fast-reload.

It does two things:


  • It doesn't delete the status file on a HUP signal. This gives the impression that Nagios is still running even though no new status information is being updated. We think this is acceptable - after all, CGIs are displaying the "latest" data, it just so happens that there is no update at this precise moment. The status file age doesn't change, so nagiostat will show that the data is getting older, but it removes that scary screen

  • We ignore the pre-flight check. As part of Opsview, we validate the config before we send a HUP signal, so this is redundant. Along with the long startup times for Nagios, we find this makes Nagios a lot more responsive for large scale systems

Could be applied to core, possibly as two different command line options.

Check command by time period

This is a nice feature which we've discussed before. We have customers asking to run a different command based on a timeperiod. The most obvious use is altering the thresholds for the load of a server - a server may run batch work overnight thus increasing its load.

Could be applied to core.

Using relative path names for config files

We run tests internally on new versions of Opsview, trying to prove that our generated config files do not change unexpectedly. One thing we hit was the use of full path names in nagios.cfg. This meant we either had to change the path on the fly or move directories around.

This patch allows the use of a relative path. The path is taken as relative to the directory that holds nagios.cfg. We find this works really well.
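The resolution rule can be sketched as follows. This mirrors the behaviour described above rather than the C code in the patch, and the function name is hypothetical.

```python
import os.path


def resolve_config_path(main_config_file, path):
    """Resolve path relative to the directory that holds the main config.

    Absolute paths are returned untouched, so existing configurations
    keep working.
    """
    if os.path.isabs(path):
        return path
    base_dir = os.path.dirname(os.path.abspath(main_config_file))
    return os.path.normpath(os.path.join(base_dir, path))
```

So with nagios.cfg in /usr/local/nagios/etc, an entry of objects/hosts.cfg resolves to /usr/local/nagios/etc/objects/hosts.cfg, and the whole directory can be moved without editing the file.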

There is a dependency on dirname(), which will probably have to be changed to a cross platform implementation.

Could be applied to core.

Making forcecheck option

By default, force check is on when you Reschedule an active check. In a distributed environment where you have a "set to stale" script as the active check, this is not wanted. We change it so that the form enables only if the field is passed through.

We then alter some of the links so that the field is off by default based on whether the service is actively checked.

Could be applied to core.

Add hosts to hostgroup in same order

We make the members field in the contactgroups stanza optional. What this means is that we can add the members of a contactgroup via the contact instead. This turns out to be significantly faster in our configuration generation scripts. Thus we also remove the error in the nagios configuration about the stanza information.

We also add the contacts into the list in the same order as they are processed. When it was added in reverse order, our tests were failing because the order was not preserved.

Could be applied to core.

Handle initial state

In NDO, if a service starts up in an error state, a state change is recorded. However, if a service starts up in OK, a state change is not. This patch will cause a state change to occur.

Technically it is a state change from a PENDING to an OK, so it should be recorded. This helps us in the NDO nagios_statehistory table, which we'll discuss more in a future blog post.

Could be applied to core.

Validation error in statusmap cgi

An incorrectly placed </form> caused problems with our AJAX screens. This fixes it. Did we mention HTML Validator?

Could be applied to core.

Latency values for passive checks

While working on freshness checking, we discovered that the latency values were incorrect. In fact, looking in the NDO db told us this. This fixes the calculation.

Could be applied to core.

Do not resend retained status to NDO

On startup, Nagios writes all the current host/service status to NDO. However, the database already knows this. This causes problems on large scale systems.

A side effect is that if NDO is switched on after Nagios is running for a long time, each object needs to have a new status result before NDO sees it, but this is probably acceptable.

Another impact is that other future broker modules might want the retained status information, so maybe this is best implemented at the broker level, but we couldn't see an easy way of passing only this particular case.

This also has an impact to NDO, so there's a patch required there.

Could be applied to core.


Segfault when processing no output


We had a big problem with a customer's system where it was crashing occasionally. We had to analyse coredumps and eventually found the problem: on the master server, if the plugin output is only "|" for a passive host check, then sometimes a segfault would occur.

We think this is related to parsing the plugin output, but only if passive checks are processed with a backtrace from check_host.

Anyway, we've fixed it by changing the algorithm for parsing the plugin output. Our guess is that strtok is causing the problem, but we really don't understand why. Sigh.
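For illustration, here is one way to split plugin output that treats the degenerate "|" case like any other input. This is not the algorithm in the patch, just a sketch of the edge case; the function name is hypothetical. For what it is worth, C's strtok returns NULL when the string consists only of delimiters, which is one plausible way an input of "|" could trip up code that assumes a token comes back, but that is a guess and not a diagnosis.

```python
def split_plugin_output(raw):
    """Split raw plugin output into (text, perfdata) at the first '|'.

    Both parts are always strings, possibly empty, so callers never have
    to handle a missing value.
    """
    text, _, perfdata = raw.partition("|")
    return text.strip(), perfdata.strip()
```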

With this patch, our customer's Nagios has not crashed for a month - so we're safe!

Could be applied to core.


Returning passive latency values in nagiostats


With the fix to the passive latency values, we then want to find out what the values are for passive latency over a long period of time.

This patch updates nagiostats.

Could be applied to core.


Is that all?


Yes, for now! We've made lots of changes to Nagios over the last 12 months, which we think are suitable for core. Sorry for not publishing them sooner.

If you want to have an Altinity compiled version of Nagios, just do this:


cd /tmp
svn export http://svn.opsview.org/opsview/trunk/opsview-base
cd opsview-base
make nagios

This will patch Nagios and run ./configure with our usual settings (a few dependencies - autoconf, automake - are required, but we'll leave that for you to work out!). You'll get the exact version of Nagios that we use in Opsview - in fact, you'll get it before our customers do!

We'll do a similar Patch Day for NDO soon and talk about some of the performance tuning we've been doing for our large customers.

Enjoy!

18 July 2007

SMS alerting via AQL

We came across AQL by accident. They came to us because they were interested in Opsview and we looked into what their company was about. They provide SMS messaging services: you buy a prepaid amount of credits and then you can send SMS text messages via their website in a variety of ways.


Our sales director thought it would be a good idea to integrate their service with Opsview. We agreed and thought there was a nice synergy about it.

So we've now integrated AQL's messaging through our UI. In the upcoming 2.8 release, there's a new screen: System Preferences. Here you can sign up at AQL and then enter the username and password. We even give you a little Check credits AJAX button for you to test your connectivity.


Then on the Contacts screen, you enter your mobile phone number for SMS (with JavaScript validation so that it is in the correct format) and you can even send a test SMS to make sure this works correctly.

Finally, when Nagios is ready to alert, we send the notification via the SMS instead of email or RSS. Simple!

Actually, it was quite hard. We just like to make it look easy.

To communicate with AQL's servers, there are various methods: HTTP/HTTPS, XMPP, SOAP and a few others that made my eyes water. We just wanted a nicely encapsulated module to send a message.

And we found one on CPAN. SMS::AQL is a perl module written by David Precious. It works over HTTP and worked a treat. However, we initially worked with version 0.02 and the tests there were failing because it was trying to contact AQL's servers to do testing. This caused us some problems in our automated perl install.

So we set to work enhancing the module. First thing was to update the tests. Using Test::MockObject, we were able to reply to SMS::AQL's HTTP calls as if they were being returned from AQL's servers. This allowed some really intensive testing. Using Devel::Cover, we got 91% coverage in our testing! We found lots of inconsistencies in the API, which we fixed as well. Finally, we cleaned up the messages so there is a single lookup table now.
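The same mocking technique, sketched in Python rather than Perl. Everything here is hypothetical: SmsClient is a stand-in class and the URL and "credits=<number>" response format are made up for the example, not AQL's API or SMS::AQL's interface. The point is only that the HTTP call is replaced by a canned reply, so the test never touches the network.

```python
from unittest import mock


class SmsClient:
    """Hypothetical stand-in for a module that talks to an SMS gateway."""

    # Made-up URL; .invalid is a reserved name that never resolves.
    CREDIT_URL = "https://sms.example.invalid/credit"

    def __init__(self, http_get):
        self.http_get = http_get  # callable taking a URL, returning the body

    def credit(self):
        body = self.http_get(self.CREDIT_URL)
        # Assumed response format for this sketch: "credits=<number>"
        return int(body.split("=", 1)[1])


def test_credit_is_parsed_without_network():
    fake_get = mock.Mock(return_value="credits=42")
    client = SmsClient(http_get=fake_get)
    assert client.credit() == 42
    fake_get.assert_called_once_with(SmsClient.CREDIT_URL)
```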

The guys at AQL have been very helpful in providing us with technical information. And David Precious has updated the perl module with our changes. And he's written a blog post too!

It's a symbiotic way of working - we didn't start from scratch working on an interface with AQL's systems, but we've managed to contribute back to existing code and move it up another level. Everybody wins! (Well, except for other monitoring system companies that want to be international conglomerates.) So now everyone can use the CPAN module to get SMS alerting.

But if you want a quick way of sending alerts, you can download our script here. This is the script that will be distributed with the 2.8 release soon. Just add that onto your server and put a check command entry into Nagios like:

define command {
	command_name service-notify-by-sms
	command_line /usr/local/nagios/bin/submit_sms_aql -u aql_username -p aql_password -n $CONTACTPAGER$ -t "$SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$: $SERVICEOUTPUT$ ($SHORTDATETIME$)"
}

To be honest, I can't remember all the associations with the contact definitions - check out the Nagios documentation to set it up. I just use Opsview because it makes Nagios easier to administer. And now, Opsview makes SMS alerting easier too.

21 June 2007

Tweaking the freshness checking algorithm

With Opsview, one of the big features is the simple distributed monitoring - you just select a drop down to associate a host with a slave server and then when you hit the Opsview reload button, all the Nagios configurations are generated as you'd expect (slaves monitoring, master with freshness checking, automatic distribution to slaves, synchronized reloading). It works amazingly well.

But one of the niggly issues we have is that some services go stale before we think they should. So we've been tweaking some of the algorithms for setting the freshness_threshold.

One situation we found was that when the master was being restarted, a busy master can lose some slave results during the reload (due to the infamous command pipe being full limitation). So when the master comes back, it could lose one polling cycle's result from the slave and mark the service as stale before the slave has had a chance to send the next result.

So we patched it, by adding the freshness_threshold to the program_start time instead of the service's last check time. And we sent an email to the nagios-devel mailing list to let people know. This was accepted into Nagios 2.1. And we got fewer stale results - hooray!

Roll forward a year. Michelle Craft then discovered that this patch caused a problem - if you set a passive service to have a freshness_threshold of 1 day, but you restart Nagios every day, then the service never expires its freshness threshold. That's a bad bug, and I'm quite ashamed that slipped through.
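The bug is easy to reproduce on paper. The sketch below follows the description above (threshold added to program_start instead of the last check time); the function names are hypothetical and the real logic is C code inside Nagios.

```python
DAY = 24 * 60 * 60


def stale_by_last_check(now, last_check, threshold):
    """Original behaviour: expiry is measured from the last check result."""
    return now > last_check + threshold


def stale_by_program_start(now, program_start, threshold):
    """First patch as described: expiry is measured from program start."""
    return now > program_start + threshold


def ever_stale_with_daily_restarts(days, threshold=DAY):
    """Simulate a service that never reports while Nagios restarts daily."""
    for day in range(days):
        program_start = day * DAY
        just_before_next_restart = program_start + DAY - 1
        if stale_by_program_start(just_before_next_restart, program_start, threshold):
            return True
    return False
```

With a one day threshold and a restart every day, ever_stale_with_daily_restarts returns False no matter how many days pass, whereas measuring from the last check flags the service as soon as a day has gone by.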

Fortunately, we had a solution. Ethan wrote a patch very quickly for Nagios 3, but we wanted something a bit more robust.

At Altinity, we're big fans of testing. This is not because we like to test - heck, we hate testing as much as the next developer. But we hate regression and unintended consequences more. With the Nagios Plugins, there's a really large set of tests that get run for every nightly build, with a nice web page that displays the state. One of the tools that makes it happen is LibTap, a library written by Nik Clayton. This is a way of testing C code with output in a perl test format. Apparently, a lot of FreeBSD tests are being written in libtap to prove there are no regressions.

There are some instructions on the Nagios Plugins site for installing libtap on your development servers.

So we've fixed this problem now by moving the freshness calculation algorithm into a separate file and then writing a small C program with dummy services and hosts to test that the right thresholds are being returned. The benefits were immediate - I found I had put a wrong bracket around an if statement when one of the tests failed.

The patch, which consists of a patch file, a new freshness.c file and a tarball for the new test directory, applies cleanly onto Nagios 2.9. You need to run autoconf afterwards. ./configure will detect the existence of libtap and compile the test executables. Then when you run make test, it should execute the test and make sure it works properly (you may need to export LD_LIBRARY_PATH=/usr/local/lib to get the libtap library detected properly at runtime).

Tests are hard to do, but worthwhile in the long run. I see it as making sure things still continue to work the same way you expect. And that has to be a good thing.

Hopefully this can be the start of some automated testing for our favourite open source monitoring system!

27 April 2007

Changing a service check command depending on time of day

We have been asked by a customer if it is possible to change a check command for a service depending on the time of day.

Why would this be useful?

Well, suppose a server runs time-critical processes during the day and slow-running batch processes overnight. How can a service check report on CPU or memory usage without generating false alerts? Yes, you could write your own plugins that take account of the time and react accordingly, but these would have to be installed on each host for each service that needs them, the wealth of plugins from http://www.nagiosexchange.org/ could not easily be used, setting the system up would take longer, and it would all be much harder to maintain.

Instead, we have made changes to the service stanza within the Nagios configuration files to include a "check_timeperiod_command <timeperiod>,<command>" entry:

define service {
	host_name server1
	service_description Free Widgets
	check_command check_widget -w 40% -c 20%
	check_timeperiod_command nonworkhours,check_widget -w 5% -c 2%
	.....
}

You get the idea....

check_command provides the default check for the day. During the nonworkhours period, the alternative command and arguments are used instead.
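The selection logic amounts to something like the sketch below. It is a simplification under stated assumptions: a timeperiod is reduced to a single (start, end) pair of clock times, whereas real Nagios timeperiods are defined per weekday, and the function names are hypothetical.

```python
from datetime import time


def in_period(now, start, end):
    """True if now falls in [start, end), including periods that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def effective_command(now, default_command, period_command=None, period=None):
    """Pick the alternative command inside the period, else the default."""
    if period_command is not None and period is not None and in_period(now, *period):
        return period_command
    return default_command
```

Taking nonworkhours as 18:00 to 09:00, a check at 23:30 would use the relaxed thresholds and a check at midday would use the default ones.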

This seems far too useful to the community to keep to ourselves, so we offer the patch for Nagios 2.8 here, for peer review and comments (all of which are very welcome).

And here is a patch for ndoutils 1.4b2 that goes with it.

Enjoy!

Update: Patches for Nagios 3.0.6 and NDOutils 1.4b7 are available

02 April 2007

Better mysqlclient detection for NDOUtils

We've encountered some problems with mysql detection in NDOUtils - it doesn't work on one of our redhat servers. The specific problem is that the ceil function is not found, which is because -lm is missing from the list of libraries to add at link time:


utils.o(.text+0x14e): In function `ndo_dbuf_strcat':
: undefined reference to `ceil'
collect2: ld returned 1 exit status

Rather than adding that library in manually (along with the -lz library that we found earlier for Mac OS X), we should use information from mysql_config to construct the compile flags. However, this is a bit tricky because of the various permutations.

Fortunately, the Nagios Plugins have a solution already. They have an m4 file, called np_mysqlclient.m4, that is used to detect mysql_config and this returns data from mysql_config for configure to use.

So we've patched NDOUtils so that it uses this m4 file now. To use it, you have to apply the patch to configure.in, add a new m4/ directory to the top level and copy np_mysqlclient.m4 into m4/. Then run:

aclocal -I m4
autoconf
./configure --with-mysql=DIR

The detection is the same as in the Nagios Plugins: ./configure will try to find mysql_config in DIR/bin/mysql_config, otherwise it will look in the PATH.
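The lookup order can be sketched like this. It only illustrates the search described above, not the m4 macro itself, and the function name is hypothetical.

```python
import os
import shutil


def find_mysql_config(prefix=None):
    """Return the path to mysql_config, or None if it cannot be found.

    Looks in <prefix>/bin first when a prefix is given, then falls back
    to searching the PATH.
    """
    if prefix:
        candidate = os.path.join(prefix, "bin", "mysql_config")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shutil.which("mysql_config")
```

Once the program is found, its --cflags and --libs options supply the compile and link flags, which is what saves us from adding -lm or -lz by hand.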

The nice thing is that if the logic for detection needs to be enhanced, we can update the m4 file and propagate the changes back to the Nagios Plugins as well. So everyone wins!

There's also a patch for CFLAGS in src/Makefile.in (which were getting overridden - presumably for testing), a small header change in config.h.in and some Makefile.in changes because make errors were getting lost by the cd .. command.

We've tested this on a Mac OS X server, a Debian Etch server, and 32bit and 64bit Redhat, and it is looking good.

Unfortunately, it means deprecating the --with-mysql-inc and --with-mysql-lib configure options. Hopefully, you'll see why this way is so much nicer.

Here's the patch against CVS HEAD.

Update: Here's the patch, reworked for NDOutils 1.4b3

Update: You can get the tarball with just this patch here