Alert fatigue is real. Indeni generates, on average, 60% more alerts than SNMP tools, but that does not translate into more noise. Indeni keeps noise levels down without letting problems escape you. We strike a balance between transparency, by notifying you of every potential issue, and restraint, so that you are not overwhelmed.
Sounds like a paradox, right? But we figured out a way to do it.
Our multi-pronged approach to minimizing noise:
- Apply analytics to intelligently suppress and delay alerts, or group alerts relating to a single event.
- Apply domain knowledge to every issue so we can alert intelligently.
- Be vigilant when it comes to alerting, but notify judiciously by providing a flexible alert-handling framework.
In this post, we’ll delve into the techniques behind each approach. We will also share best practices customers have adopted to address alert fatigue while maintaining alert detection efficacy.
Apply data analytics to combat alert fatigue
Data analytics gives us a more accurate view of the nature of alerts and their behavior. By analyzing real customer data, we developed effective mechanisms that improve the signal-to-noise ratio and result in high-fidelity alerts.
Deduplicating alerts with Cooldown
Alert deduplication is a common way to reduce alert noise. When Indeni identifies an issue, we notify you immediately. If the issue resolves itself, we move it to a “cooldown” phase instead of moving it to “resolved”. During this cooldown period, if the same issue occurs again, we do not open a new issue, thereby reducing the number of alerts.
The cooldown feature is the result of analyzing millions of data points collected by Indeni Insight across our installed base. We plotted many data points and created histograms of cool-off intervals. According to our analysis, the cooldown feature can reduce noisy issues by up to 75%.
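To make the cooldown idea concrete, here is a minimal Python sketch. It is an illustration only, not Indeni's implementation: the `CooldownTracker` class, the `cooldown_seconds` setting, and the issue keys are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CooldownTracker:
    """Tracks issues by key; a cleared issue waits in cooldown before it counts as resolved."""

    cooldown_seconds: float  # hypothetical setting, not an Indeni parameter name
    # issue key -> time the issue entered cooldown (None while the issue is open)
    _issues: dict = field(default_factory=dict)

    def trigger(self, key: str, now: float) -> bool:
        """Record an occurrence. Returns True only if a new alert should be sent."""
        if key in self._issues:
            cooled_at = self._issues[key]
            if cooled_at is None:
                return False  # issue is already open, nothing new to send
            if now - cooled_at <= self.cooldown_seconds:
                self._issues[key] = None  # reopen the existing issue silently
                return False
        # Either a brand-new issue or the cooldown has expired: alert again.
        self._issues[key] = None
        return True

    def clear(self, key: str, now: float) -> None:
        """The condition went away: enter cooldown instead of resolving."""
        if key in self._issues and self._issues[key] is None:
            self._issues[key] = now
```

With a ten-minute cooldown, an issue that clears and reappears five minutes later reuses the existing issue, while a recurrence long after the cooldown has expired produces a fresh alert.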
Grouping alerts with Aggregate Policy
When an issue occurs, it may have a cascading effect. Every issue, upstream and downstream, triggers an alert. These alerts are related, but there is no easy way to correlate them. Once again, we applied data analytics to the problem. By analyzing historical alerts from a subset of customers, we identified that a cluster failover event typically has this cascading effect. By grouping the High Availability alerts together, we were able to reduce alert noise effectively.
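One simple way to picture this kind of grouping is to bundle alerts from the same cluster that arrive close together in time. The sketch below assumes exactly that; the `cluster` and `time` fields and the 120-second window are hypothetical and are not taken from how Aggregate Policy actually works.

```python
def group_alerts(alerts, window_seconds=120.0):
    """Group alerts per cluster when they occur within window_seconds of the group's first alert.

    Each alert is a dict with at least 'cluster' and 'time' (seconds) keys.
    Returns a list of groups, each group being a list of alerts.
    """
    groups = []
    latest_group = {}  # cluster name -> most recent group for that cluster
    for alert in sorted(alerts, key=lambda a: a["time"]):
        current = latest_group.get(alert["cluster"])
        if current is not None and alert["time"] - current[0]["time"] <= window_seconds:
            current.append(alert)  # part of the same cascading event
        else:
            current = [alert]  # start a new group for this cluster
            latest_group[alert["cluster"]] = current
            groups.append(current)
    return groups
```

A failover on one cluster followed seconds later by interface and routing alerts on the same cluster would then surface as one group rather than three separate notifications.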
Suppressing flapping alerts with Early Symptom
When we analyzed historical data collected by Indeni Insight across many customers, we found that detecting and suppressing flapping is an additional way to reduce alert noise. The Early Symptom feature automatically identifies flapping issues by noting any issue that opens and resolves multiple times within a short time window. When a flapping issue is identified and it resolves itself within the grace period, you won’t be alerted. If it remains open after the grace period has expired, you’ll receive an alert. Based on our analysis, you can expect a reduction of up to 12% in alerts with the Early Symptom feature.
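The flapping logic described above can be sketched roughly as follows. This is a simplified illustration, not the Early Symptom implementation: the `FlapDetector` class and the `min_flaps`, `window_seconds`, and `grace_seconds` values are assumptions made for the example.

```python
from collections import deque


class FlapDetector:
    """Per-issue flap detection with a grace period before alerting."""

    def __init__(self, min_flaps=3, window_seconds=3600.0, grace_seconds=300.0):
        self.min_flaps = min_flaps            # resolves within the window that count as flapping
        self.window_seconds = window_seconds  # how far back to look for open/resolve cycles
        self.grace_seconds = grace_seconds    # how long a flapping issue may stay open silently
        self._resolved_at = deque()           # timestamps of recent resolves
        self._opened_at = None                # None while the issue is not open

    def open(self, now):
        self._opened_at = now

    def resolve(self, now):
        self._opened_at = None
        self._resolved_at.append(now)

    def is_flapping(self, now):
        # Drop open/resolve cycles that fall outside the look-back window.
        while self._resolved_at and now - self._resolved_at[0] > self.window_seconds:
            self._resolved_at.popleft()
        return len(self._resolved_at) >= self.min_flaps

    def should_alert(self, now):
        if self._opened_at is None:
            return False  # nothing is open
        if not self.is_flapping(now):
            return True   # ordinary issue: alert right away
        # Flapping issue: alert only once it has outlived the grace period.
        return now - self._opened_at > self.grace_seconds
```

The design choice here is that an issue with no flapping history alerts immediately, so the grace period only delays alerts for issues that have already shown a pattern of resolving themselves.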
Apply domain knowledge
Indeni draws on knowledge from vendors and security experts who have learned valuable lessons in the field. Their shared experience lets us make alerts relevant, actionable, and accurate.
Use domain expertise to prioritize alerts
Not all alerts are created equal. Alerts should have the right priority so that time-sensitive and high-impact alerts are investigated first. Combining priority with the right alert-handling framework (described later in this post) is essential to noise reduction. For example, if a permanent VPN tunnel is down, the alert is an error. If a dynamic VPN tunnel is down, the alert is informational. Another great example is events relating to clusters. If a cluster is down, that is considered a critical event. But if a single member of a cluster goes down, that should not be a critical event. The right priority helps you focus on the issues that matter most.
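The examples above amount to a small set of priority rules. Here is one way to express them in Python. The event field names are hypothetical, and two levels are assumptions of this sketch rather than statements from the text: the severity for a cluster member going down (the text only says it should not be critical) and the default for anything else.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    ERROR = 3
    CRITICAL = 4


def classify(event: dict) -> Severity:
    """Map an event to a severity using the domain rules from the examples above."""
    kind = event["kind"]
    if kind == "vpn_tunnel_down":
        # Permanent tunnels are expected to stay up; dynamic tunnels come and go.
        return Severity.ERROR if event.get("permanent") else Severity.INFO
    if kind == "cluster_down":
        return Severity.CRITICAL
    if kind == "cluster_member_down":
        return Severity.ERROR  # assumption: below critical, exact level not given in the text
    return Severity.WARNING    # assumption: default for events not covered above
```

Because `Severity` is an `IntEnum`, alerts can be sorted by severity so the most urgent ones are handled first.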
Leverage context to reduce noise
It is important to understand the context in order to avoid false positives. For instance, some High Availability configurations leave the standby unit interfaces in a “power-down” state on Palo Alto Networks firewalls. Those interfaces must be monitored once the unit becomes active, in case an interface goes or remains offline after a failover event. The ability to dynamically enable and disable alerting on interface state based on High Availability state is critical to keeping the noise level down. With many solutions, the best you can do is dynamically add interfaces to be monitored. However, this then causes false alerts for interfaces being down on the new passive unit. With Indeni, we understand whether the device is in an active or passive state. If the device is in a passive state, we do not alert on interfaces that are down, thereby avoiding false alarms.
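In code, the context check boils down to consulting the High Availability state before evaluating interface state. The sketch below is a simplified illustration; the shape of the `device` dictionary and its `ha_state` and `interfaces` fields are hypothetical.

```python
def interface_alerts(device: dict) -> list:
    """Return interface-down alerts, but only for a unit that is currently active.

    device is a dict with 'name', 'ha_state' ('active' or 'passive'),
    and 'interfaces' (interface name -> 'up' or 'down').
    """
    if device["ha_state"] != "active":
        # Interfaces on a passive unit may be powered down by design.
        return []
    return [
        f"{device['name']}: interface {name} is down"
        for name, state in sorted(device["interfaces"].items())
        if state == "down"
    ]
```

Because the state is checked on every evaluation, the same interfaces start being monitored automatically when the unit becomes active after a failover, and stop generating alerts on the unit that has just become passive.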
Apply domain knowledge to group alerts with Issue Item
We use the concept of issue items to group similar issues into a single alert. A great example is VPN tunnels. A firewall typically has many VPN tunnels connecting remote sites and users. Instead of alerting you on every VPN tunnel down event, we treat each tunnel as an issue item. When the first VPN tunnel goes down, we create a new issue. When a second VPN tunnel goes down, we add that event as an issue item to the same alert. Another good example is NTP servers. A firewall is typically configured with multiple NTP servers. If a second NTP server goes down, it is added to the existing alert as an issue item to reduce alert noise.
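A minimal sketch of the issue item idea follows. It keys each issue by device and issue type and appends further affected objects as items; the `IssueBoard` class and its method names are hypothetical and only meant to illustrate the grouping.

```python
class IssueBoard:
    """Keeps one issue per (device, issue type) and collects affected objects as issue items."""

    def __init__(self):
        self.issues = {}  # (device, issue_type) -> list of issue items

    def report(self, device: str, issue_type: str, item: str) -> bool:
        """Record an affected item. Returns True only when a new alert is created."""
        key = (device, issue_type)
        items = self.issues.get(key)
        if items is None:
            self.issues[key] = [item]  # first occurrence: open a new issue
            return True
        if item not in items:
            items.append(item)         # later occurrences: attach as an issue item
        return False
```

With this structure, ten tunnels going down on one firewall still produce a single alert, and the list of issue items tells you which tunnels are affected.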
Provide a flexible framework to handle alerts
While it’s important to have sophisticated mechanisms based on data analytics and domain expertise, they are not enough to combat alert fatigue. You also need a flexible framework to process alerts. Alerts differ in criticality and purpose. You want the right people and tools to process them effectively.
You can tune your alerts so that you receive fewer of them. As a starting point, target your alert notifications to only the relevant people. Give users the flexibility to opt in to receive certain alerts. Tuning should go beyond changing the severity or disabling alerts; it should give you granular control. Let’s take a look at a few interesting use cases where tuning alerts can reduce alert noise.
- You have a disaster recovery strategy in place. Under normal operations, many of the disaster recovery services are not available. For example, if the disaster recovery BGP peer is always down, you want to be able to exclude that peer from the “BGP peer(s) down” alert. You can define an exclusion pattern that matches the BGP peer for disaster recovery. We will not generate an alert for the disaster recovery peer.
- Security scanning is a common practice in many organizations. Every time a scan happens, it triggers the failed login attempt alert. You can pre-define the list of source IP addresses of scanning servers to suppress these alerts.
- You have multiple NTP servers in your environment. In some deployments, the secondary server is only reachable when the primary is not available. You can specify a threshold so that you only receive an alert if none of the NTP servers is reachable.
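The three use cases above can each be expressed as a small rule. The Python sketch below is illustrative only and does not reflect Indeni's rule configuration syntax; the function names, the wildcard pattern style, and the sample addresses are assumptions.

```python
import fnmatch
import ipaddress


def bgp_peers_to_alert(down_peers, exclude_patterns):
    """Drop peers whose name matches any exclusion pattern (shell-style wildcards)."""
    return [
        peer for peer in down_peers
        if not any(fnmatch.fnmatchcase(peer, pattern) for pattern in exclude_patterns)
    ]


def is_scanner(source_ip, scanner_networks):
    """True if a failed login came from a known security scanner address or subnet."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(net) for net in scanner_networks)


def ntp_alert(reachable_by_server):
    """Alert only when every configured NTP server is unreachable."""
    return bool(reachable_by_server) and not any(reachable_by_server.values())
```

For example, an exclusion pattern such as `dr-*` removes a peer named `dr-peer-1` from the BGP alert while leaving other peers monitored, and the NTP rule stays quiet as long as at least one server answers.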
Trouble ticket workflow integration
While it’s common practice to automatically generate trouble tickets for alerts, you should be very selective about which alerts you forward. We recommend that you forward only critical alerts to ticketing tools. This is a great way to ensure that time-sensitive and high-impact incidents are responded to immediately. We also recommend that you use reporting for alerts relating to security risks and compliance so they can be handled in bulk.
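As a rough sketch, this recommendation is a routing decision made per alert. The field names, category labels, and destination names below are hypothetical, and the fallback destination for all other alerts is an assumption of this example.

```python
def route(alert: dict) -> str:
    """Pick a destination for an alert: 'ticket', 'report', or 'console'."""
    if alert["severity"] == "critical":
        return "ticket"   # time-sensitive, high-impact: open a trouble ticket
    if alert.get("category") in {"security_risk", "compliance"}:
        return "report"   # handled in bulk through periodic reporting
    return "console"      # assumption: everything else stays in the monitoring console
```

Checking severity first means a critical security alert still opens a ticket instead of waiting for the next report.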
API integration with alert processing tools
You may be using a centralized alert processing tool to correlate events and deduplicate alerts in order to limit the number of alerts you receive. You can use our JSON-based API to retrieve Indeni alerts periodically. We also have out-of-the-box integrations with other IT operations tools, such as SIEM products and event correlation solutions like BigPanda, where you can consolidate events or alerts for processing.
We have looked at a range of tools and techniques, based on data analytics and domain knowledge, that minimize alert fatigue and false positives. They let you build a tuning and noise reduction strategy that fits your needs. When combined across a unified ecosystem that includes ticketing systems, reporting, alert processing, and event correlation tools, they can significantly reduce alert noise. With these integrations, tools, and techniques working together, you can be confident that the alerts you receive are of higher fidelity. We hope these practices help you reduce alert noise and improve your troubleshooting experience.