Netcool/Probes: SNMP (MTTrapd) probe dropping traps

Recently we had an issue in our Netcool production environment where the SNMP (MTTrapd) probe was dropping and delaying traps.

The firewall team confirmed that there were no firewall issues between Netcool and the device, and the device monitoring team confirmed that the traps were successfully leaving the device.

We finally identified the root cause and fixed it. The problem was DNS: the probe was performing a DNS lookup on each alarm, which was delaying alarms into Netcool. As described in the IBM support documentation, we set NoNameResolution to 1 in the probe properties file and the issue was resolved.
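For reference, the change looks something like this in the MTTrapd probe properties file (the exact file name and path depend on your installation; mttrapd.props under $OMNIHOME/probes/<arch>/ is typical, and the probe normally needs a restart to pick up the change):

    # Disable reverse DNS lookups on the trap source address
    NoNameResolution : 1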


Other possibilities to check are listed below (example commands and property settings for each point follow the list):
1. Trap source and destination. The majority of SNMP traffic is carried over UDP. The first check, especially for dropped packets, should be done at the data source and destination. Use a packet capture tool such as Wireshark or snoop to capture the SNMP traps as they leave the source and arrive at the destination. Correlating these two data sets will confirm whether there is a problem at the network layer. Possible causes would be dropped packets on routers or blocked packets on firewalls.
2. Number of entries in the probe's event queue. The probe stores events in a buffer prior to processing. You can make the probe log how many items are on the queue by setting the LogStatisticsInterval property to an appropriate number of seconds; at that interval the probe writes the total number of events sent to the queue to its log file. The log file entries will look like this:
Error: Trap queue size is xx
Error: Number of traps read in the last y seconds: xx
Error: Number of traps processed in the last y seconds: xx
... where xx is the number of traps in the buffer, and y is the number of seconds you have set the property to.
3. Events tokenised by the probe. Running the probe in raw capture mode enables you to correlate the traps seen in the packet capture with the traps being tokenised by the probe. Raw capture mode saves the complete stream of event data acquired by the probe to a file, without any processing by the rules file. This can be useful for auditing, recording, or debugging the operation of a probe.
You can enable raw capture mode using the -raw command line option or the RawCapture property. The RawCaptureFile, RawCaptureFileAppend, and MaxRawFileSize properties also control the operation of raw capture mode.
4. Events being sent from the probe to the ObjectServer. Running the probe at 'debug' message level instructs the probe to write an entry in the log file every time an event is sent to the ObjectServer. These entries end with the message: "Flushing events to object server".
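For point 1, a capture can be taken on both the sending device and the probe host and later opened in Wireshark. As a sketch, assuming the probe listens on the default trap port (UDP 162) and tcpdump is available on the hosts:

    # Capture SNMP traps on all interfaces into a pcap file for later correlation
    tcpdump -i any -n -w /tmp/traps.pcap udp port 162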
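For point 2, the interval is set in the probe properties file; the 60-second value below is only an example:

    # Write queue statistics to the probe log every 60 seconds
    LogStatisticsInterval : 60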
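For point 3, raw capture mode can be enabled in the properties file as well as on the command line; the file path below is illustrative, and the reference guide for your probe version documents the exact defaults:

    RawCapture : 1
    RawCaptureFile : "$OMNIHOME/var/mttrapd.cap"
    RawCaptureFileAppend : 1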
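For point 4, the message level can be raised in the properties file (or with the equivalent command-line option); debug logging is verbose, so it is best left on only while troubleshooting:

    MessageLevel : "debug"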
By capturing the full data set from the above four data points, you can fully correlate the traps being sent to the probe with the events generated by the probe. This should expose where traps are being dropped.
Load in relation to dropped or delayed traps

The most important factor when investigating load is to know the number of events over time. Using packet capture tools, it is relatively easy to correlate the number of traps being sent to the probe over time. Bear in mind that averaging these results over too long a time period may hide spikes in the traffic. If large numbers of traps are sent over very short periods, followed by long periods of low traffic, then averaging out the traps may hide these spikes, giving a false impression of the actual load the probe is under during those spikes.
Periods of high load may result in a delay between the trap being received and the event being sent to the ObjectServer, or the spike in traffic may be sufficient to cause some part of the data flow to overflow, resulting in dropped or lost messages. It is vital, then, that you use the data captured to profile the rate of incoming traffic, and plot that rate against the rates at which the probe is processing data at each data point. The resulting graphs will immediately illustrate any bottlenecks, whether at the internal queue (point 2), the rules file (see the discussion of load below), or the send queue (point 4).
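One way to turn a packet capture into a rate-over-time profile is tshark's I/O statistics; the one-second interval and the port filter below are assumptions to adjust for your environment:

    # Per-second counts of trap packets in the capture file
    tshark -r /tmp/traps.pcap -q -z io,stat,1,"udp.port == 162"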
The four data points discussed above will be useful in indicating load, specifically the packet capture data and the LogStatisticsInterval property. Complementary to these is a rules file statement that will log the load that the probe is under to the probe's log file; see the related information at the end of this article for more on the 'load' rules file command. This will assist you in determining, from the log file, how many events the probe is parsing.

Resolving the problem: ways to increase performance

Please note: you must first identify the cause of the problem before considering tuning the probe. If the probe is trying to process more traps than it is capable of, then increasing the queues in which the backlog is stored may only delay the onset of the problem. However, if what is needed is a suitably sized buffer in which to store traps while the probe processes them, and you are confident that the load will decrease for sufficient periods for these queues to drain, then there are two properties that you can use.
Probe property "SocketSize": The Size (in bytes) of the kernel buffer on the socket being used. This is set on a per-socket basis. A higher value increases the number of traps that the probe can handle. For UDP traps, this improves performance. The default is 8192. Note: The minimum value for the SocketSize property is 128 bytes - the default is 8192 bytes. In the majority of cases, the default size is recommended.
Probe property "TrapQueueMax": Maximum number of traps that can be queued for processing at any one time. The probe discards any traps received while the buffer is full. The default is 20000. Setting this to a larger amount increases the internal buffer to ensure that traps received during large bursts are not discarded. It is related more to service assurance than performance.