The Incident Manager within Oracle Enterprise Manager 13c is a powerful tool for monitoring a wealth of target types such as databases, hosts, middleware, or services in general, from a bird’s-eye view right down into the manifold details of a dedicated application. As with all powerful tools, where there is a lot of power, there is usually also a lot to get wrong (or even to employ counterproductively). This is why it is important to understand the Incident Manager concepts from the bottom up and to be able to identify the knobs and wheels to turn in order to meet a given requirement.
Basically, Incident Manager is a professional toolset that facilitates the management of non-critical and critical system alerts against metric values (registered as problems and incidents, see About Incidents and Problems) at larger quantities and time scales, along with alert management templating, alert assignment, and so forth.
However, at the heart of Incident Manager, as with any other alerting tool, there usually is a threshold probing engine in operation, comparing a given actual value to a set of prioritized metric thresholds (log levels are quite comparable to that, aren’t they?) and triggering (as well as persisting) an alert message, either immediately or after a number of consecutive occurrences. The art now lies in tuning the threshold and occurrence settings for any metric in such a way that you receive an optimal set of information in time, i.e. not being spammed, nothing getting lost, and not being alerted too late. On the other hand, it is a necessity to fully understand how your threshold probing engine at hand actually works, at best on a living example, at least once. The documentation states it like this (Using Metric Extensions), so let’s have a try:
Number of Occurrences Before Alert: The number of consecutive metric collections where the alert threshold is met, before an alert is raised.
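To make the rule above tangible, here is a minimal Python sketch of how such an occurrence-counting check might behave: an alert is only raised once the threshold has been met for N consecutive collections, and any single clean collection resets the count. The function name, values, and thresholds are illustrative assumptions for this sketch, not the actual EM 13c implementation.

```python
def should_alert(values, threshold, occurrences):
    """Return the index of the collection that raises the alert, or None.

    A hypothetical model of "Number of Occurrences Before Alert":
    only `occurrences` consecutive threshold violations trigger an alert.
    """
    streak = 0
    for i, value in enumerate(values):
        if value >= threshold:
            streak += 1
            if streak == occurrences:
                return i  # alert raised at this collection
        else:
            streak = 0  # one clean collection resets the counter
    return None

# With occurrences=3, two violations followed by a clean reading do not
# alert; only the later run of three consecutive violations does:
print(should_alert([9, 9, 1, 9, 9, 9], threshold=8, occurrences=3))  # → 5
```

This is exactly why a higher occurrence count lowers the probability of alerting during short workload spikes: a brief violation that recovers within fewer than N collections never surfaces.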
The following screenshot visualizes such an example in Incident Manager, where the warning and critical thresholds and occurrence counts of the file write time metric have been adapted (for boringly slow systems; the times are in centiseconds!), lowering the probability of alerting during higher workloads. Note the “Metric Value History” popup window, brought up by the “Table View” link, which shows the agent probing timestamps, every 10 minutes, resetting any previous alert status through an actual metric value of 0, at 12:49:18 am wall clock time, step #0. From there on, up to step #10, the given critical threshold has consistently been violated, triggering (and persisting) the alert as well as sending out an email notification. The alert status remains critical through step #15, until in step #16 the alert status is reset again, clearing up and sending out yet another email. The graph control shows that nicely in light red. It also shows a period of another alert status, this time only at warning level, just before last midnight, in yellow. A well done, comprehensible representation, I like it.
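The alert lifecycle described above can also be sketched as a small state machine: each collection classifies the value as clear, warning, or critical, and a notification goes out whenever the status changes (raise or clear). The thresholds and sample values below are made up for illustration; they are not EM defaults.

```python
def classify(value, warning, critical):
    """Map a metric value to an alert severity (assumed thresholds)."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "clear"

def run_collections(values, warning, critical):
    """Replay a series of collections; return (step, status) on each change.

    Each status change stands in for a persisted alert transition plus an
    email notification, as in the walkthrough above.
    """
    status, events = "clear", []
    for step, value in enumerate(values):
        new_status = classify(value, warning, critical)
        if new_status != status:
            events.append((step, new_status))
            status = new_status
    return events

# A value of 0 clears, a sustained run of high values raises a critical
# alert once, and recovery clears it again with one more notification:
print(run_collections([0, 90, 95, 92, 0], warning=40, critical=80))
# → [(1, 'critical'), (4, 'clear')]
```

Note that intermediate collections at the same severity produce no further events, which matches the screenshot: one email when the alert is raised, one when it clears, no spam in between.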