It has been some time since we last talked about SNMP trap handling, but there have been some major developments. Recall that we use the perl module SNMP::Trapinfo to process an incoming trap. We think this works really well, but there was a major piece of functionality our customer wanted:
Complex calculation of whether a trap passes a test
And by complex, we mean complex. Here's an example trap:
dastardly.altinity.net
10.243.196.251
SNMPv2-MIB::sysUpTime.0 119:2:04:40.34
SNMPv2-MIB::snmpTrapOID.0 CERENT-454-MIB::remoteAlarmIndication
CERENT-454-MIB::cerent454NodeTime.0 20060814114937D
CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication notAlarmedNonServiceAffecting
CERENT-454-MIB::cerent454AlarmObjectType.9216.remoteAlarmIndication ds1
CERENT-454-MIB::cerent454AlarmObjectIndex.9216.remoteAlarmIndication 9216
CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication 2
CERENT-454-MIB::cerent454AlarmPortNumber.9216.remoteAlarmIndication port36
CERENT-454-MIB::cerent454AlarmLineNumber.9216.remoteAlarmIndication 0
CERENT-454-MIB::cerent454AlarmObjectName.9216.remoteAlarmIndication DS1-2-36-7
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 216.243.196.251
Our customer wanted to be able to say: "Give me a critical alert if cerent454AlarmState.9216.remoteAlarmIndication is not 'cleared' and the cerent454AlarmSlotNumber is greater than 5". Well, this was impossible with our previous setup. I still don't know why it is called Simple Network Management Protocol...
We sat down to think about this and realised we needed an arbitrary way of evaluating an SNMP trap, but the last thing we wanted to do was write a syntax parser. That would involve a whole new language, all the parsing work that goes with it, and so on. This would take months of work!
Looking for inspiration, we saw that OpenNMS claims this type of functionality. We downloaded a copy and tried to install it, but hit loads of prerequisites. We're very lazy: we should evaluate other technologies, but if something is too much of a pain to install, we give up right away!
Undeterred, we went for the next best thing - their documentation! Searching around, we found the section on evaluating traps. It appears that OpenNMS have a table called events, which is a list of all the things that happened. Then there are various filters which evaluate against those events to work out whether something needs to be alerted on. SNMP traps are converted into this event format and dropped into that table.
(As an aside, Nagios holds no such processing logic. All that complicated processing is handled by the plugins. Nagios only cares about the result. This is a feature :) )
Then the beauty of OpenNMS' design dawned on us: rules are expressed as SQL statements.
Let me repeat that: rules are just SQL statements. If the SQL evaluates to 1, then an alert is raised; otherwise the event is ignored. Fantastic! This does away with all the "design your own syntax" work by using a clear, recognised language! No duplication of work!
So the above requirement could be met with a rule in OpenNMS (we think! We haven't actually tried this!) that says:
(cerent454AlarmState != 'cleared') & (cerent454AlarmSlotNumber > 5)
which equates to a SQL statement like:
SELECT ipaddr
FROM ipinterface
WHERE ipaddr in (SELECT ipaddr FROM ipinterface, node
WHERE cerent454AlarmState != 'cleared'
AND ipinterface.nodeid = node.nodeid)
AND ipaddr in (SELECT ipaddr FROM ipinterface, snmpInterface
WHERE cerent454AlarmSlotNumber > 5
AND ipinterface.ipaddr = snmpInterface.ipaddr);
But we couldn't do that with SNMP::Trapinfo - no SQL database. Tacking on DBI.pm support would be terrible. But then it hit us - why not use Perl? Most sysadmins know perl syntax and it would allow useful functionality like regular expressions, which are not as powerful in SQL.
How do we express the SNMP trap variables? Well, we already have that in SNMP::Trapinfo - macros. ${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication} evaluates as notAlarmedNonServiceAffecting in the example trap, but instead of making it a line to display, wrap it up in some perl code:
"${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication}" eq "cleared"
(These Cerent devices also make it difficult to find a specific variable because they encode the object index number, 9216, into the OID name. Sigh - no one said SNMP had to be Simple or consistent. To overcome this, we introduced the idea of a wildcard for an OID tuple, so the above could be written as "${CERENT-454-MIB::cerent454AlarmState.*.remoteAlarmIndication}" eq "cleared". There are some issues if multiple OIDs match this name, but we assume that only one matches...)
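To make the macro and wildcard idea concrete, here is a minimal sketch of how such an expansion could work. It is not SNMP::Trapinfo's actual code: the helper names expand_macros and lookup are made up for illustration, the trap is assumed to be a plain hash of OID name to value, and when several OIDs match a wildcard it simply takes the first one in sorted order.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SNMP::Trapinfo): replace every ${name}
# in $text with the matching trap value, or '' if nothing matches.
sub expand_macros {
    my ($text, $vars) = @_;
    $text =~ s/\$\{([^}]+)\}/lookup($1, $vars)/ge;
    return $text;
}

# Hypothetical helper: exact lookup first, then wildcard lookup where
# each "*" stands for exactly one dot-separated OID component.
sub lookup {
    my ($name, $vars) = @_;
    return $vars->{$name} if exists $vars->{$name};
    return '' unless $name =~ /\*/;
    my $pattern = join '[^.]+', map { quotemeta($_) } split /\*/, $name, -1;
    for my $key (sort keys %$vars) {
        return $vars->{$key} if $key =~ /\A$pattern\z/;
    }
    return '';
}

# Two variables taken from the example trap above.
my %trap = (
    'CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication'
        => 'notAlarmedNonServiceAffecting',
    'CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication'
        => '2',
);

print expand_macros(
    '"${CERENT-454-MIB::cerent454AlarmState.*.remoteAlarmIndication}" eq "cleared"',
    \%trap,
), "\n";
# prints: "notAlarmedNonServiceAffecting" eq "cleared"
```

The expanded string is still only text at this point; turning it into a true or false answer is a separate step.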
There's a new method in SNMP::Trapinfo called eval. This evaluates the string as a snippet of perl code and returns the result. There are three possible results that come back from the eval:
- 1 = true - the perl snippet runs and evaluates as true
- 0 = false - the perl snippet runs and evaluates as false
- undef = error - the perl code did not run correctly (most likely because of a syntax error)
This last case is possible if the variable name does not exist. For instance, if the incoming trap did not contain the desired variable, '${CERENT-454-MIB::cerent454AlarmSlotNumber.*.remoteAlarmIndication} > 5' would expand to ' > 5', which is not valid perl code.
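The three outcomes can be illustrated with a small sketch. This is not the module's implementation: tri_state_eval is a hypothetical name, it takes a string whose macros have already been expanded, and it uses a plain string eval with no sandboxing, so it is only suitable for snippets you trust.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 (true), 0 (false) or undef (the snippet
# failed to compile or died while running).
sub tri_state_eval {
    my ($code) = @_;
    local $@;
    my $value = eval $code;    # unsandboxed string eval, trusted input only
    return undef if $@;
    return $value ? 1 : 0;
}

my @snippets = (
    '"notAlarmedNonServiceAffecting" ne "cleared" && 7 > 5',   # true
    '"notAlarmedNonServiceAffecting" ne "cleared" && 2 > 5',   # false
    ' > 5',                                                     # syntax error
);

for my $snippet (@snippets) {
    my $result = tri_state_eval($snippet);
    printf "%-60s => %s\n", $snippet, defined $result ? $result : 'undef';
}
```

Keeping undef distinct from 0 lets a caller tell "the rule did not match" apart from "the rule could not be evaluated for this trap".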
So our way of expressing the rule required is:
"${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication}" ne "cleared" && ${CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication} > 5
We have a basic wrapper script that sends a passive check result to Nagios if this code returns true.
One final thing: we have a front end application to configure the perl snippet of code. This code is obviously tainted: we don't necessarily know what it contains, so it could do things like "system('rm -fr $HOME')". We added the Safe module, so the snippet is now restricted to running only specific operators, such as comparisons, regexps and mathematical functions. Good security lets us sleep at night :)
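As a rough idea of what such a wrapper does, the sketch below builds a Nagios external command line for a passive service check. It is not our actual script: the helper name passive_check_line, the service name "SNMP Trap" and the output text are placeholders, and the location of the Nagios command file depends on the installation, so the sketch prints the line instead of writing it.

```perl
use strict;
use warnings;

# Hypothetical helper: format one PROCESS_SERVICE_CHECK_RESULT command.
# $status follows plugin conventions: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
sub passive_check_line {
    my ($time, $host, $service, $status, $output) = @_;
    return sprintf "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n",
        $time, $host, $service, $status, $output;
}

# Assumption: $rule_result holds the outcome of evaluating the rule
# against the incoming trap (1, 0 or undef).
my $rule_result = 1;

if ($rule_result) {
    my $line = passive_check_line(
        time(), 'dastardly.altinity.net', 'SNMP Trap', 2,
        'remoteAlarmIndication is not cleared',
    );
    # A real wrapper would append $line to the Nagios command file here.
    print $line;
}
```

The host name must match a host defined in Nagios, and the service must exist and accept passive checks, otherwise the result is discarded.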
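Here is a minimal sketch of the idea using the core Safe module. It is an illustration, not the code inside SNMP::Trapinfo: safe_rule_eval is a hypothetical name, and it relies on Safe's default operator mask, which is more permissive than the specific operator list described above but does not include system().

```perl
use strict;
use warnings;
use Safe;

# Hypothetical helper: evaluate a snippet inside a Safe compartment.
# Returns 1 (true), 0 (false) or undef (syntax error, runtime error,
# or an operator that the compartment does not permit).
sub safe_rule_eval {
    my ($code) = @_;
    my $compartment = Safe->new;    # default mask; system() is not permitted
    my $value = $compartment->reval($code);
    return undef if $@;
    return $value ? 1 : 0;
}

for my $snippet ('"notAlarmed" ne "cleared" && 7 > 5', 'system("true")') {
    my $result = safe_rule_eval($snippet);
    printf "%-40s => %s\n", $snippet, defined $result ? $result : 'undef';
}
```

A stricter setup could call the compartment's permit_only method with an explicit list of operators, so that anything not named is rejected.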
SNMP::Trapinfo is now released on CPAN. We use this for our SNMP trap processing and we think it works fantastically well. And this continues our aim of making the base portions of Opsview as solid as possible.