15 January 2007

Lessons in .... SNMP trap handling, part 3

It has been some time since we last talked about SNMP trap handling, but there have been some major developments. Recall we use the perl module SNMP::Trapinfo to process an incoming trap. We think this works really well, but there was a major piece of functionality our customer wanted:


Complex calculation of whether a trap passes a test

And by complex, we mean complex. Here's an example trap:


dastardly.altinity.net
10.243.196.251
SNMPv2-MIB::sysUpTime.0 119:2:04:40.34
SNMPv2-MIB::snmpTrapOID.0 CERENT-454-MIB::remoteAlarmIndication
CERENT-454-MIB::cerent454NodeTime.0 20060814114937D
CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication notAlarmedNonServiceAffecting
CERENT-454-MIB::cerent454AlarmObjectType.9216.remoteAlarmIndication ds1
CERENT-454-MIB::cerent454AlarmObjectIndex.9216.remoteAlarmIndication 9216
CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication 2
CERENT-454-MIB::cerent454AlarmPortNumber.9216.remoteAlarmIndication port36
CERENT-454-MIB::cerent454AlarmLineNumber.9216.remoteAlarmIndication 0
CERENT-454-MIB::cerent454AlarmObjectName.9216.remoteAlarmIndication DS1-2-36-7
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 216.243.196.251

Our customer wanted to be able to say: "Give me a critical alert if cerent454AlarmState.9216.remoteAlarmIndication is not 'cleared' and the cerent454AlarmSlotNumber is greater than 5". Well, this was impossible with our previous setup. I still don't know why it is called Simple Network Management Protocol...

We sat down to think about this and then realised we probably need an arbitrary way of calculating an SNMP trap, but the last thing we wanted to do was write a syntax parser. That would involve a whole new language, all the parsing work involved, etc, etc. This would take months of work!

Looking for inspiration, we realised OpenNMS has claimed this type of functionality. We downloaded a copy and tried to install it, but hit loads of pre-requisites. We're very lazy - we should evaluate other technologies, but if it is too much of a pain to install, then we'll give up right away!

Undeterred, we went for the next best thing - their documentation! Searching around, we found the section on evaluating traps. It appears that OpenNMS have a table called events, which is a list of all the things that happened. Then there are various filters which evaluate against those events to work out whether something needs to be alerted on. SNMP traps are converted into this event format and dropped into that table.

(As an aside, Nagios holds no such processing logic. All that complicated processing is handled by the plugins. Nagios only cares about the result. This is a feature :) )

The beauty of OpenNMS' design then dawned on us: rules are expressed as SQL statements.

Let me repeat that again: rules are just SQL statements. If the SQL evaluates to 1, then an alert is raised, otherwise ignored. Fantastic! This does away with all the "design your own syntax" work, with a clear, recognised language! No duplication of work!

So the above requirement could be met with a rule in OpenNMS (we think! We haven't actually tried this!) that says:

(cerent454AlarmState != 'cleared') & (cerent454AlarmSlotNumber > 5)

which equates to a SQL statement like:

SELECT ipaddr
FROM ipinterface
WHERE ipaddr in (SELECT ipaddr FROM ipinterface, node
WHERE cerent454AlarmState != 'cleared'
AND ipinterface.nodeid =node.nodeid)
AND ipaddr in (SELECT ipaddr FROM ipinterface, snmpInterface
WHERE cerent454AlarmSlotNumber > 5
AND ipinterface.ipaddr = snmpInterface.ipaddr);

But we couldn't do that with SNMP::Trapinfo - no SQL database. Tacking on DBI.pm support would be terrible. But then it hit us - why not use Perl? Most sysadmins know perl syntax and it would allow useful functionality like regular expressions, which are not as powerful in SQL.

How do we express the SNMP trap variables? Well, we already have that in SNMP::Trapinfo - macros. ${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication} evaluates as notAlarmedNonServiceAffecting in the example trap, but instead of making it a line to display, wrap it up in some perl code:

"${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication}" eq "cleared"

(These Cerent devices also make it difficult to find a specific variable because they encode the object index number, 9216, into the OID name. Sigh - no one said SNMP had to be Simple or consistent. To overcome this, we introduced the idea of a wildcard for an OID tuple, so the above could be written as "${CERENT-454-MIB::cerent454AlarmState.*.remoteAlarmIndication}" eq "cleared". There are some issues if there are multiple OIDs which match this name, but we assume that only one matches...)
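To make the macro and wildcard idea concrete, here is a minimal sketch. It is written in Python rather than the perl that SNMP::Trapinfo actually uses, and it is not the module's code: the function name expand_macros, the "first match wins" behaviour and the empty string for unknown names are all assumptions made for illustration.

```python
import re

def expand_macros(template, variables):
    """Replace each ${NAME} in template with the matching trap value.

    A '*' in NAME stands for exactly one dot-separated OID component.
    Unknown names expand to an empty string (an assumption of this sketch).
    """
    def lookup(match):
        name = match.group(1)
        if name in variables:
            return variables[name]
        # Build a regex from the wildcard name: '*' matches one component.
        pattern = re.compile(
            r"[^.]+".join(re.escape(part) for part in name.split("*")) + r"\Z"
        )
        for key, value in variables.items():
            if pattern.match(key):
                return value  # first match wins
        return ""
    return re.sub(r"\$\{([^}]+)\}", lookup, template)

# Two variables taken from the example trap above.
trap_vars = {
    "CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication":
        "notAlarmedNonServiceAffecting",
    "CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication": "2",
}
rule = '"${CERENT-454-MIB::cerent454AlarmState.*.remoteAlarmIndication}" eq "cleared"'
print(expand_macros(rule, trap_vars))
# prints: "notAlarmedNonServiceAffecting" eq "cleared"
```

The wildcard only ever matches a single component here, which keeps a pattern like cerent454AlarmState.*.remoteAlarmIndication from accidentally matching a longer OID.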

There's a new method in SNMP::Trapinfo called eval. This evaluates the string as a snippet of perl code and gets the return code. There are three possible results that come back from the eval:


  • 1 = true - the perl snippet runs and evaluates true

  • 0 = false - the perl snippet evaluates as false

  • undef = error - the perl code did not run correctly (most likely because of a syntax error)

This last case is possible if the variable name does not exist. For instance, the expansion of '${CERENT-454-MIB::cerent454AlarmSlotNumber.*.remoteAlarmIndication} > 5' would convert to ' > 5' which is not valid perl code if the trap coming in did not contain the desired variable.
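The three outcomes can be illustrated with a small sketch. This is Python, not the perl eval method itself, so the expressions use Python syntax (!= and "and" instead of ne and &&); the function name evaluate_rule is hypothetical and None stands in for perl's undef.

```python
def evaluate_rule(expression):
    """Return 1 if the expression is true, 0 if false, None if it cannot run."""
    try:
        # No builtins are exposed, but this is NOT a real sandbox.
        result = eval(expression, {"__builtins__": {}}, {})
    except Exception:
        # Syntax errors and runtime errors both count as "error".
        return None
    return 1 if result else 0

# The rule after macro expansion against the example trap (slot number is 2).
print(evaluate_rule('"notAlarmedNonServiceAffecting" != "cleared" and 2 > 5'))  # 0
# What is left when the slot number variable is missing from the trap.
print(evaluate_rule(' > 5'))  # None
```

Keeping "false" and "error" apart matters: a rule that silently fails to parse should not look the same as a trap that simply did not match.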

So our way of expressing the rule required is:


"${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication}" ne "cleared" && ${CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication} > 5

We have a basic wrapper script that sends a passive check to Nagios if this code returns true.
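For reference, a passive service check is a single line written to the Nagios command file using the PROCESS_SERVICE_CHECK_RESULT external command. The sketch below (Python, not our wrapper script) only shows the line format; the service name "Alarms", the message text and the helper names are made up for the example, and the command file path is left to the caller because it depends on the installation.

```python
import time

# Nagios plugin return codes.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

def passive_check_line(host, service, state, message, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command."""
    timestamp = int(time.time() if now is None else now)
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}\n".format(
        timestamp, host, service, STATES[state], message)

def submit(line, command_file):
    """Write the command to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as handle:
        handle.write(line)

line = passive_check_line("dastardly.altinity.net", "Alarms", "CRITICAL",
                          "remoteAlarmIndication on slot 7", now=1168819200)
print(line, end="")
# prints: [1168819200] PROCESS_SERVICE_CHECK_RESULT;dastardly.altinity.net;Alarms;2;remoteAlarmIndication on slot 7
```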

One final thing: we have a front end application to configure the perl snippet of code. This is obviously tainted. We don't necessarily know what is contained in the code, so it could do things like "system('rm -fr $HOME')". We added on the Safe module, so now it is restricted to only running specific operators, like the comparison and regexps and mathematical functions. Good security lets us sleep at night :)
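The idea of only permitting specific operators can be sketched as follows. This is a Python analogue and not how perl's Safe module works internally: it parses the expression and rejects anything outside a whitelist of comparison, boolean and arithmetic nodes, so function calls and variable names never get through. It assumes Python 3.8 or later, leaves out regular expressions for brevity, and the name is_allowed is hypothetical.

```python
import ast

# Only literals, comparisons, boolean logic and basic arithmetic are allowed.
ALLOWED_NODES = (
    ast.Expression, ast.Constant,
    ast.BoolOp, ast.And, ast.Or,
    ast.UnaryOp, ast.Not, ast.USub,
    ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Mod,
    ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
)

def is_allowed(expression):
    """Return True only if every node in the expression is whitelisted."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return False
    return all(isinstance(node, ALLOWED_NODES) for node in ast.walk(tree))

print(is_allowed('"notAlarmedNonServiceAffecting" != "cleared" and 2 > 5'))  # True
print(is_allowed("system('rm -fr $HOME')"))  # False
```

A whitelist is the safer direction to err in: anything we did not think of is refused rather than run.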

SNMP::Trapinfo is now released on CPAN. We use this for our SNMP trap processing and we think it works fantastically well. And this continues our aim of making the base portions of Opsview as solid as possible.

07 June 2006

Lessons in .... SNMP trap handling, part 2

Last time we looked at how to get SNMPtraps received into Nagios. This time we'll show how Opsview handles the configuration of it.

Recall that the new design is:


  • SNMP packet received by snmptrapd

  • snmptrapd's traphandle calls snmptrap2nagios

  • snmptrap2nagios, if applicable, will write to the Nagios command file

In Opsview, we use a web interface to configure the traps we are interested in. On this screen, we define the traps we expect to receive.
list_traps.png


Each trap has an alert level and a message. The message can use macros which are supported by our perl module SNMP::Trapinfo. You define them on this screen.

define_trap.png


If desired, you can deny this trap, so when the trap gets to snmptrap2nagios, it will be discarded. It is possible to deny the trap at snmptrapd, but we haven't done that, as you normally have to be root to change snmptrapd's configuration file. However, this is a worthwhile enhancement if there are lots of traps received.

But that's not all! If a trap is allowed, you can then select to process it or ignore it at the host level. Here's the configuration screen you get for defining your service check. This service check is then associated with a host.
defining_servicecheck.png

What is the defer option? Well, we thought there are 3 possible actions when a trap is received:


  1. You want to process it

  2. You want to ignore it

  3. You weren't expecting it!

So defer means you haven't said either way. This is the default for any new traps.

When this servicecheck has been linked to a host, Opsview will then configure Nagios with a service check called Interface which will accept traps linkUp and linkDown. The state of this servicecheck in Nagios will change depending on the alert level defined for these traps.

So we have 2 levels of filtering:


  • globally deny a trap

  • ignore a trap on a host basis

What if, say, a Security trap arrives for a host that does not have the service associated with it? This is an exception, which needs manual intervention. Opsview has a table called snmptrapexceptions which stores all the traps that snmptrap2nagios hasn't been told what to do with. When we thought about it, there were 5 distinct error conditions:

  • A trap was received with no valid trapname
  • A trap was received where the trapname was not fully translated - this usually means a MIB file has not been loaded
  • A trap was not recognised - it has not been defined
  • A trap was received, but it was not expected for this host (defer)
  • A trap was received for a host that is not defined to Nagios
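Putting the two levels of filtering and the five exception conditions together, the decision could be sketched like this. It is a Python illustration and not the snmptrap2nagios code: the order of the checks, the data structures and the "numeric component means untranslated" heuristic are all assumptions.

```python
def classify_trap(trapname, host, defined_traps, known_hosts, host_actions):
    """Return 'discard', 'process', 'ignore' or an 'exception: ...' reason.

    defined_traps: trap name -> 'allow' or 'deny' (the global level)
    host_actions:  (host, trap name) -> 'process' or 'ignore' (the host level)
    """
    if not trapname:
        return "exception: no valid trapname"
    if host not in known_hosts:
        return "exception: host not defined to Nagios"
    # Assumed heuristic: leftover numeric components suggest a missing MIB.
    object_name = trapname.split("::")[-1]
    if any(part.isdigit() for part in object_name.split(".")):
        return "exception: trapname not fully translated"
    if trapname not in defined_traps:
        return "exception: trap not defined"
    if defined_traps[trapname] == "deny":
        return "discard"
    action = host_actions.get((host, trapname), "defer")
    if action == "defer":
        return "exception: trap not expected for this host"
    return action

defined = {"IF-MIB::linkDown": "allow", "IF-MIB::linkUp": "allow"}
hosts = {"cisco2611.lon.altinity"}
actions = {("cisco2611.lon.altinity", "IF-MIB::linkDown"): "process"}
print(classify_trap("IF-MIB::linkDown", "cisco2611.lon.altinity", defined, hosts, actions))
# prints: process
print(classify_trap("IF-MIB::linkUp", "cisco2611.lon.altinity", defined, hosts, actions))
# prints: exception: trap not expected for this host
```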

We have a screen which shows all the exceptions and then gives operational options on what actions to take next.
exceptions_list.png


Notice there is a Promote Mib button. When we distribute Opsview, we put all our known MIBs into /usr/local/nagios/snmp/all. However, there is a penalty with loading unnecessary MIBs. So we configure snmpd to only load MIBs in the default area and /usr/local/nagios/snmp/load.

When you click Promote Mib, we use a perl module called Net::Dev::MIBLoadOrder, which can tell you which MIB a specific OID belongs to. We then copy that MIB file into /usr/local/nagios/snmp/load and restart snmpd. This is one major administrative headache reduced!

Once you've redefined the actions you want, we tell Opsview to reprocess all the snmptrap exceptions based on the new rules (but no passive checks are submitted to Nagios). This will reduce the exceptions table so then an administrator would continue to set new rules until there are no exceptions left.

Astute readers may be wondering: "what happens if I receive a trap which is bad (linkDown) and then a good trap (linkUp) on the same service check"? The answer is that the bad trap will make the service go CRITICAL/WARNING, but the good trap will make the service OK. This means the bad trap may get lost. We've made a decision to accept this limitation, rather than use the is_volatile or stalking_options. However, we're in discussions with Ethan to see if we can enhance Nagios to cope with this type of event.

Our aim is to make Opsview as easy to use as possible, while we continue to improve Nagios and the general open source universe, through software or knowledge sharing. We hope this gives you an insight into how you can get SNMPtraps working with Nagios.

24 March 2006

Lessons in .... SNMP trap handling

SNMP is one of those useful, but misunderstood technologies. I think it doesn't help that the name is Simple Network Management Protocol, yet when you first start, you get hit by these ridiculous OIDs like .1.3.6.1.2.1.1.1.0 for the system description. It just doesn't feel simple. Sigh.

However, every networking device manufacturer supports it - and there's an open source network management system based on it - so we looked into how we could integrate SNMP into Opsview. Polling SNMP devices is already supported through active checks. The next step was receiving traps, which are passive by nature.

There's a good article in Sysadmin magazine by Francois Meehan, where he describes how to get SNMP traps integrated into Nagios. His design is:


  • snmptrap received by snmptrapd
  • snmptrapd calls snmptt (snmp trap translator)
  • snmptt defines what alert levels each trap should take and then writes to syslog
  • SEC can handle correlation of events, but in this case is configured to read syslog and then pass any single event to a custom python script called snmptraphandling.py
  • snmptraphandling.py then puts an entry on Nagios' command file based on the hostname and the alert level

That's a lot of layers! I'm a big fan of the KISS approach, so we went further into how these things worked.

Snmptrapd is from the Net-SNMP project. Though there are other (mainly commercial) implementations, this seems to be the most popular. You configure snmptrapd to invoke a command, called a traphandle, when it receives an SNMP trap. The interface to the traphandle is simple: just call any executable and pass it the following on stdin:


  1. the host name of the originating packet
  2. the ip of the originating packet
  3. the contents of the packet

An example packet:


cisco2611.lon.altinity
192.168.10.20
RFC1213-MIB::sysUpTime.0 0:18:14:45.66
SNMPv2-MIB::snmpTrapOID.0 IF-MIB::linkDown
RFC1213-MIB::ifIndex.2 2
RFC1213-MIB::ifDescr.2 "Serial0/0"
RFC1213-MIB::ifType.2 ppp
OLD-CISCO-INTERFACES-MIB::locIfReason.2 "administratively down"
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 192.168.10.20
SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "public"
SNMPv2-MIB::snmpTrapEnterprise.0 CISCO-SMI::ciscoProducts.186
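Reading that input is straightforward. Here is a minimal sketch of a parser, in Python for illustration (our own handler is perl); the function name parse_trap and the choice to strip surrounding quotes from values are assumptions.

```python
def parse_trap(text):
    """Split traphandle input into hostname, IP address and a variable dict."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    hostname, ip_address = lines[0], lines[1]
    variables = {}
    for line in lines[2:]:
        # Each remaining line is "<name> <value>"; the value may contain spaces.
        name, _, value = line.partition(" ")
        variables[name] = value.strip().strip('"')
    return hostname, ip_address, variables

packet = """cisco2611.lon.altinity
192.168.10.20
RFC1213-MIB::sysUpTime.0 0:18:14:45.66
SNMPv2-MIB::snmpTrapOID.0 IF-MIB::linkDown
RFC1213-MIB::ifDescr.2 "Serial0/0"
OLD-CISCO-INTERFACES-MIB::locIfReason.2 "administratively down"
"""
host, ip, variables = parse_trap(packet)
print(host, ip, variables["SNMPv2-MIB::snmpTrapOID.0"])
# prints: cisco2611.lon.altinity 192.168.10.20 IF-MIB::linkDown
```

In a real traphandle the text would come from sys.stdin.read() rather than a string literal.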

However, snmptt's documentation suggests that you run snmptrapd with the -On flag, which means "do not translate OIDs to names".

So the above equivalent would be received by snmptt as:


cisco2611.lon.altinity
192.168.10.20
.1.3.6.1.2.1.1.3.0 0:18:13:59.95
.1.3.6.1.6.3.1.1.4.1.0 .1.3.6.1.6.3.1.1.5.3
.1.3.6.1.2.1.2.2.1.1.2 2
.1.3.6.1.2.1.2.2.1.2.2 "Serial0/0"
.1.3.6.1.2.1.2.2.1.3.2 ppp
.1.3.6.1.4.1.9.2.2.1.1.20.2 "administratively down"
.1.3.6.1.6.3.18.1.3.0 192.168.10.20
.1.3.6.1.6.3.18.1.4.0 "public"
.1.3.6.1.6.3.1.1.4.3.0 .1.3.6.1.4.1.9.1.186

The reason for this is that snmptt has its configuration file indexed by OID. If you do not use the -On flag, snmptt will translate back into OIDs before finding the right entry.

In order for snmptt to know the OIDs, you have to import MIBs into snmptt and then define what the message and alert level is, using the OID as the key. It will then give you a set of macros which you can use to define your message.

Here's where we disagreed with snmptt's design - why bother importing MIBs? Obviously, snmptrapd needs to understand MIBs and it does a good job of translating OIDs. Giving snmptt that MIB information too means maintaining MIB importing in two places.

When I get stuck trying to understand the point of something, I ask myself: What is the custom data? This is important because this needs to be maintained and it leads to the answer of What is the value?

Snmptt's value is that lookup between the OID and the message and alert level (and the default message is not that helpful - it takes the 1st line of the description of the MIB and adds the arguments at the end). This is called the snmptt_conf_files in their language, but I'll call it the message catalogue.

But there is a performance impact with parsing the message catalogue. If snmptrapd calls a perl script which is reading this catalogue at every invocation, then there's going to be a hit if there are lots of traps being received. This is why snmptt has a daemon mode. The last thing we want is another daemon!

So then we thought: "What about leaving snmptrapd to do the translation?" Instead of indexing by OID, we could index by the trapname itself. This leaves all the MIB information at the snmptrapd level - removing our administrative nightmare - and our glue code would just be text parsing, which perl, our tool of choice, is ideally suited for.

This message catalogue is precisely the type of Nagios configuration data that we want Opsview to excel at. In fact, snmptt missed a trick in that it doesn't know which host/service to submit the passive check to. This is left to the snmptraphandling.py script, which just keys on the hostname, then the alert level (so every host has 3 and only 3 services with regards to snmptraps).

Our traphandle, which we call snmptrap2nagios, therefore needs to:


  • be fast - it could be invoked hundreds of times a minute
  • process the textual data to convert to a message and an alert level
  • know which service on which host wants this alert
  • submit a passive check to Nagios

Since snmptt has some useful code regarding macros, we need to emulate that. This is generic information and is not tied to the rest of Opsview, so we've written this as a perl module called SNMP::Trapinfo and we've published this on CPAN.

In Francois' design, SEC was not used for any filtering so we've removed it. This removes the need to write to syslog as well.

So now the architecture looks like this:

  • SNMP packet received by snmptrapd
  • snmptrapd's traphandle calls snmptrap2nagios
  • snmptrap2nagios, if applicable, will write to the Nagios command file

Much cleaner!

Stay tuned for the next post when we discuss how we handle filtering and exceptions.

Update: We forgot to credit Alex Burger for his work on SNMPTT, which lots of users appreciate. Also, Ethan has got a page on integration of SNMPtraps in the Nagios documentation which we didn't see until recently.

Update: Part 2 posted here.

24 February 2005

Monitoring filesystem usage with SNMP

Continuing our run of useful SNMP OIDs...

One of the most commonly monitored statistics is filesystem usage. Here is how you do it with SNMP. All OIDs listed sit under the mib-2 subtree (they belong to the storage table of the HOST-RESOURCES-MIB).

Note: <int> is an integer corresponding to the filesystem number. Most systems will have multiple partitions / filesystems.

Description

.1.3.6.1.2.1.25.2.3.1.3.<int>

Description of filesystem. On a Unix system examples would be / or /home. Under Windows expect C:/, D:/ etc.

Capacity

.1.3.6.1.2.1.25.2.3.1.5.<int>

Capacity of filesystem in blocks

Usage

.1.3.6.1.2.1.25.2.3.1.6.<int>

How many blocks are currently being used to store data

Blocksize

.1.3.6.1.2.1.25.2.3.1.4.<int>

Blocksize in bytes. Important because other stats are in blocks.

Maths

So to find capacity of filesystem in bytes you need to multiply the size in blocks with the block size. Same principle applies to calculating how much of the filesystem is in use.

If you want to display values in KB / MB / GB remember to divide by 1024 each time.
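As a worked example of the maths, here is a short Python sketch. The three input values are made-up numbers standing in for what the capacity, usage and blocksize OIDs would return; fetching them over SNMP is left out.

```python
def filesystem_usage(size_blocks, used_blocks, block_size):
    """Convert block counts into bytes and a percentage used."""
    capacity_bytes = size_blocks * block_size
    used_bytes = used_blocks * block_size
    percent_used = 100.0 * used_blocks / size_blocks if size_blocks else 0.0
    return capacity_bytes, used_bytes, percent_used

def to_gb(byte_count):
    """Divide by 1024 three times: bytes -> KB -> MB -> GB."""
    return byte_count / 1024 / 1024 / 1024

# Made-up sample: 2621440 blocks of 4096 bytes, 655360 of them in use.
capacity, used, percent = filesystem_usage(2621440, 655360, 4096)
print(to_gb(capacity), to_gb(used), percent)
# prints: 10.0 2.5 25.0
```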

23 February 2005

Monitoring APC UPS - Useful OIDs

If you feel moved to monitor your APC UPS, here are some of the OIDs you'll want to use. These OIDs definitely work with the APC AP9617 management card which plugs in the back of APC devices.

General

UPS Type             .1.3.6.1.4.1.318.1.1.1.1.1.1.0

String containing UPS model, eg: Smart-UPS 1000

Battery Information

Battery capacity             .1.3.6.1.4.1.318.1.1.1.2.2.1.0

Battery capacity as % of total

Battery temperature         .1.3.6.1.4.1.318.1.1.1.2.2.2.0

Battery temperature in Celsius or Fahrenheit - depending on how UPS is configured

Battery runtime remain         .1.3.6.1.4.1.318.1.1.1.2.2.3.0

Total battery runtime available based on current load.

Battery replace             .1.3.6.1.4.1.318.1.1.1.2.2.4.0

If result = 2 then battery needs replacing (1 = ok)

UPS Input

Input voltage             .1.3.6.1.4.1.318.1.1.1.3.2.1.0

Input voltage, to the UPS device

Input frequency             .1.3.6.1.4.1.318.1.1.1.3.2.4.0

Input frequency in Hz

Reason for last transfer         .1.3.6.1.4.1.318.1.1.1.3.2.5.0

String containing reason for last transfer to battery power

1  No events
2  High line voltage
3  Brownout
4  Loss of mains power
5  Small temporary power drop
6  Large temporary power drop
7  Small spike
8  Large spike
9  UPS self test
10  Excessive input voltage fluctuation

UPS Output

Output voltage             .1.3.6.1.4.1.318.1.1.1.4.2.1.0

Output voltage from the UPS

Output frequency             .1.3.6.1.4.1.318.1.1.1.4.2.2.0

Output frequency in Hz

Output load             .1.3.6.1.4.1.318.1.1.1.4.2.3.0

Output load expressed as % of capacity

Output current             .1.3.6.1.4.1.318.1.1.1.4.2.4.0

Output current in Amps

Diagnostics

Comms             .1.3.6.1.4.1.318.1.1.1.8.1.0

Whether SNMP agent is communicating with UPS device: 1 = yes, 2 = no

Last Self Test result         .1.3.6.1.4.1.318.1.1.1.7.2.3.0

Result of last self test as text string. eg: pass or fail.

Last Self Test date         .1.3.6.1.4.1.318.1.1.1.7.2.4.0

Date of last self test

22 January 2005

Monitoring Cisco devices - Useful SNMP OIDs

Monitoring load average

1 minute load average:    .1.3.6.1.4.1.9.2.1.57.0
5 minute load average:    .1.3.6.1.4.1.9.2.1.58.0

These return an integer corresponding to the % load average over 1 or 5 minutes

Monitoring memory usage

5 min memory used:           .1.3.6.1.4.1.9.9.48.1.1.1.5.1   
5 min memory free:            .1.3.6.1.4.1.9.9.48.1.1.1.6.1

These return an integer corresponding to memory in bytes based on a 5 minute average

Monitoring temperature

Temperature state:     .1.3.6.1.4.1.9.9.13.1.3.1.6.1

Some devices such as 3600 series Routers only return a temperature state

Results correspond to the following states:

1               normal
2               warning
3               critical
4               shutdown
5               not present

Inlet temperature:          .1.3.6.1.4.1.9.9.13.1.3.1.3.1
Outlet temperature:        .1.3.6.1.4.1.9.9.13.1.3.1.3.3

19 January 2005

Monitoring Switches and Routers using SNMP

This post assumes a basic knowledge of SNMP and describes MIB-II OIDs that are handy for monitoring network devices - mainly switches and routers. These OIDs should be present on all SNMP capable devices.

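These OIDs all sit under the MIB-II hierarchy .iso.org.dod.internet.mgmt.mib-2. The sys* entries are in the system branch, while the if* and ip* entries are in the neighbouring interfaces and ip branches.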

.sysName.0

String containing system name, if configured. Useful for working out which device you are querying.


.sysLocation.0

String containing system location, if configured. Again, useful for working out which device you are querying.


.sysUpTime.0

System uptime in 1/100 of a second. Useful for detecting recently restarted equipment. This counter is actually from the time SNMP was started but usually this is analogous to system uptime.
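Since the raw value is in hundredths of a second, a little arithmetic makes it readable. The Python sketch below converts a tick count into the days:hours:minutes:seconds form that Net-SNMP prints in trap output; the ten minute "recently restarted" threshold is an arbitrary assumption.

```python
def format_uptime(ticks):
    """Format hundredths of a second as days:hours:minutes:seconds.hundredths."""
    days, rest = divmod(ticks, 24 * 60 * 60 * 100)
    hours, rest = divmod(rest, 60 * 60 * 100)
    minutes, rest = divmod(rest, 60 * 100)
    seconds, hundredths = divmod(rest, 100)
    return "{}:{}:{:02d}:{:02d}.{:02d}".format(
        days, hours, minutes, seconds, hundredths)

def recently_restarted(ticks, threshold_minutes=10):
    """True if the agent has been up for less than the (assumed) threshold."""
    return ticks < threshold_minutes * 60 * 100

print(format_uptime(1028908034))
# prints: 119:2:04:40.34
print(recently_restarted(1028908034))
# prints: False
```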


NOTE:
For the following OIDs <int> is an integer corresponding to the interface number. So to find the description of interface three you need to query ifDescr.3


.ifDescr.<int>

String containing interface description, eg:

  • FastEthernet0/1
  • Serial0/2
  • Loopback0


.ifType.<int>

Similar to ifDescr, but gives more specific technical information on the interface. Eg:

  • ethernetCsmacd
  • frameRelay
  • softwareLoopback

A full list of interface types can be found here:
http://www.iana.org/assignments/ianaiftype-mib


.ifSpeed.<int>

Speed of interface in bits per second.


.ifOperStatus.<int>

Operational status of interface – up or down. Whether the interface is actually connected or not.


.ifAdminStatus.<int>

Administrative status of interface - whether the interface has been configured to be up or down. (For Cisco: shutdown / no shutdown)


.ifInUcastPkts.<int>

Number of inbound unicast packets received. An entry also exists for outbound packets: ifOutUcastPkts. For traffic statistics it is necessary to monitor the change in this value over time.
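Monitoring the change over time just means taking two readings and dividing the difference by the interval. One wrinkle is that ifInUcastPkts is a 32-bit counter, so it wraps back to zero after 4294967295. A minimal Python sketch, with made-up readings and assuming at most one wrap between polls:

```python
COUNTER32_MODULUS = 2 ** 32

def packets_per_second(previous, current, interval_seconds):
    """Rate between two Counter32 readings, allowing for a single wrap."""
    delta = (current - previous) % COUNTER32_MODULUS
    return delta / interval_seconds

# Normal case: 3000 packets in 300 seconds.
print(packets_per_second(1000, 4000, 300))        # 10.0
# The counter wrapped between the two readings: still 1000 packets.
print(packets_per_second(4294967000, 704, 100))   # 10.0
```

If the device restarts between polls the counter resets rather than wraps, so it is worth checking sysUpTime before trusting a large delta.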


.ifInErrors.<int>

Total packet errors for this interface. Again, an equivalent entry also exists for outbound packets: ifOutErrors.


.ipInReceives.0

Total number of received IP packets


.ipInHdrErrors.0

Inbound IP packets discarded because of errors in header.


.ipInAddrErrors.0

Inbound IP packets discarded because of addressing issues.


.ipInDiscards.0

Inbound IP packets discarded for other reasons (not header or address)


.ipOutNoRoutes.0

No route to host. High values indicate a routing issue.

And that is just about it...