The Rules Tab is where you can modify and tune issues, adjust email notifications, and disable automation tasks. Note that all rules default to the Global Configuration and act on the Thresholds and Actions defined there. This means that a critical issue may be generated for a Hard Disk that is 80% full. This is not a mistake. These thresholds were set to show you the results of the script automation and let you adjust the rules from there.
Best Practices for Beginners
- We recommend downloading and installing Indeni in a lab environment to get a feel for how it operates and to practice tuning.
- We recommend suppressing Email notifications if you choose to deploy in a live production environment, especially if you are adding devices in bulk.
- We recommend tuning the system and leveraging device labels to get the most out of how the system automates issue messaging.
- We encourage you to participate, review, and ask questions on the Crowd if you have questions that are not answered in the guide.
Navigating Indeni Rules
Since many rules exist, it is best to filter by keyword or by All Categories to drill down to device-specific rules. It is also best to search for generic words, such as memory or CPU, which will return all rules in the system that contain those words.
You can keep track of which Rules have been modified: their status transitions from Unchanged to Changed.
If you select by Category, you can see the rules we have around a device. For example, you can see which Rules exist to check for CPU issues in a Radware Alteon. You can also select multiple categories by simply clicking on and highlighting them. Shift+left click will remove a single selection.
You can also add a search word to further filter the Category Selection.
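The filtering behavior described above can be pictured with a minimal Python sketch. It is only an illustration of combining a keyword with a category selection, not how Indeni implements the search. The `RuleEntry` type, the `filter_rules` helper, and the rule names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RuleEntry:
    # Hypothetical record for one row in the Rules list.
    name: str
    category: str
    status: str = "Unchanged"  # shown as "Changed" once the rule is modified

def filter_rules(rules, keyword="", categories=None):
    """Keep rules whose name contains the keyword (case-insensitive)
    and whose category is among the selected categories, if any."""
    needle = keyword.strip().lower()
    selected = set(categories) if categories else None
    return [
        rule for rule in rules
        if (not needle or needle in rule.name.lower())
        and (selected is None or rule.category in selected)
    ]

# Made-up rule names, used only to exercise the filter.
RULES = [
    RuleEntry("High CPU usage per core", "All Devices"),
    RuleEntry("High memory usage", "All Devices"),
    RuleEntry("Alteon CPU utilization high", "Radware Alteon", status="Changed"),
    RuleEntry("Hard disk usage high", "All Devices"),
]

cpu_rules = filter_rules(RULES, keyword="CPU")
alteon_cpu_rules = filter_rules(RULES, keyword="cpu", categories=["Radware Alteon"])
changed_rules = [rule for rule in RULES if rule.status == "Changed"]
```

Searching on a generic word first and then narrowing by category mirrors the order suggested in the text.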
Once you have selected a rule to configure, you will land on the Overview Sub-Tab, where you will want to make note of the Category. If the rule applies to All Devices, it will be marked as such. If it is a device-specific rule, you will see that indicated there. It is also good to read the Summary of what the rule is attempting to do and how it can help. Under the Name of the rule, you will see a check mark. This means that the Alert is Active. If you uncheck this option, it will Disable the Rule entirely.
Please Note: You cannot delete the Global Configuration. However, you can create a new Configuration by clicking on New, and new configurations will override the Global Configuration settings.
Not all rules are the same, but the structure will be. All rules will have Name (what is this), Thresholds (when should it trigger), Time Threshold (when you want to be notified), Actions (what notification to receive), Custom Instructions (what to do), and Severity (issue impact).
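As a rough mental model, the shared structure can be written down as a small Python data class. This is a sketch under assumptions, not Indeni's internal schema: the class name, the field names, and the lists of allowed severities and actions are hypothetical and only include the values mentioned in this guide.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Only the values named in this guide; the product may offer others.
SEVERITIES = ("Warning", "Error", "Critical")
ACTIONS = ("Email", "Log", "SNMP")

@dataclass
class RuleConfiguration:
    name: str                                     # what is this
    threshold: float                              # when should it trigger
    severity: str                                 # issue impact
    time_threshold_minutes: Optional[int] = None  # None: no time threshold
    actions: Set[str] = field(default_factory=lambda: {"Log", "SNMP"})
    custom_instructions: str = ""                 # what to do

    def __post_init__(self):
        # Reject values this sketch does not know about.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        unknown = set(self.actions) - set(ACTIONS)
        if unknown:
            raise ValueError(f"unknown actions: {sorted(unknown)}")

example = RuleConfiguration(
    name="High CPU usage",
    threshold=90.0,
    severity="Critical",
    time_threshold_minutes=10,
    actions={"Email", "Log", "SNMP"},
    custom_instructions="Validate the load on the device, then escalate.",
)
```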
Multiple Rule Configuration
You can create as many rules as you want by leveraging Labels and Devices. In fact, we advise clients that have large deployments to use labels to better manage and tune their systems. The reason you would want to have multiple rules is for situations where you would benefit from an escalating notification process, or require more nuanced rules to uncover issues.
CPU and Multiple Rule Configuration
The best example of this is CPU monitoring. Devices have different numbers of cores, which makes CPU notification more nuanced than a Hard Disk that is 96% full. So how would you want to handle this situation? We have seen the best results when end-users create labels for devices by number of cores, then adjust the CPU Threshold to something like 90%.
Here the Name of the issue was changed to Devices with 6 Cores, to make clear what this new alert is for. The Threshold was changed to 90.0 because that is the point at which someone should look into it. The number of cores was adjusted to 6, because 1 core is not concerning. The Time Threshold is unchanged, since 6 cores running at 90% for 10 minutes is a bit concerning.
Towards the bottom, move the Devices/Labels you want to automate from left to right in the configuration values. The severity is set to Critical. It is always good to include instructions on how to validate and escalate.
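One way to read these settings is "alert when at least 6 cores stay at or above 90% for 10 minutes". The Python sketch below illustrates that reading only. It assumes one sample per minute and a continuous breach across the whole window, which is an interpretation and not a statement of how Indeni evaluates the rule. The function names are hypothetical.

```python
def cores_over(sample, threshold_pct):
    """Count the cores in one sample whose usage is at or above the threshold."""
    return sum(1 for usage in sample if usage >= threshold_pct)

def sustained_breach(samples, threshold_pct=90.0, min_cores=6, window_minutes=10):
    """samples: one list of per-core usage percentages per minute, oldest first.

    Returns True only if every one of the last `window_minutes` samples has
    at least `min_cores` cores at or above `threshold_pct`.
    """
    if len(samples) < window_minutes:
        return False  # not enough history to cover the time threshold
    recent = samples[-window_minutes:]
    return all(cores_over(sample, threshold_pct) >= min_cores for sample in recent)

# Ten minutes with all six cores at 95%.
busy = [[95.0] * 6 for _ in range(10)]

# Same period, but only one core is busy.
one_hot_core = [[95.0, 20.0, 20.0, 20.0, 20.0, 20.0] for _ in range(10)]

# Busy, with a one-minute dip in the middle of the window.
with_dip = [[95.0] * 6 for _ in range(10)]
with_dip[5] = [50.0] * 6
```

Under this reading, `busy` would raise the issue, while a single hot core or a short dip would not.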
Please Note: New rule configurations will not work if you do not move devices or labels over.
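The relationship between a new configuration and the Global Configuration can be sketched as a simple lookup. This is a hypothetical model in Python: the names `CpuConfig` and `effective_config`, the label `6-cores`, the Global threshold value, and the "first match wins" ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class CpuConfig:
    name: str
    threshold_pct: float
    labels: Set[str] = field(default_factory=set)   # labels moved to the right
    devices: Set[str] = field(default_factory=set)  # devices moved to the right

def effective_config(device, device_labels, custom_configs, global_config):
    """Return the first custom configuration that lists the device or one of
    its labels; otherwise fall back to the Global Configuration."""
    for config in custom_configs:
        if device in config.devices or config.labels & set(device_labels):
            return config
    return global_config

GLOBAL = CpuConfig("Global Configuration", threshold_pct=70.0)  # arbitrary value

six_cores = CpuConfig("Devices with 6 Cores", threshold_pct=90.0, labels={"6-cores"})

# A configuration with no devices or labels assigned never matches anything.
unassigned = CpuConfig("Forgotten configuration", threshold_pct=50.0)

configs = [unassigned, six_cores]
fw_config = effective_config("fw-01", ["6-cores"], configs, GLOBAL)
lb_config = effective_config("lb-01", ["2-cores"], configs, GLOBAL)
```

Here `fw-01` picks up the 6-core configuration through its label, `lb-01` stays on the Global Configuration, and the unassigned configuration is never selected.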
Escalating Notification Process
Unlike the CPU, where it is helpful to create multiple rules for different types of devices, there are times when you want to be made aware of an issue as an FYI, but then need an alarm to go off because it is about to boil over into a critical status. We have seen great success in organizations averting an outage by creating an escalating notification process around Device Temperatures.
High Temperature Escalation Configuration
In this example, you can start by creating the minimum threshold you want the issue to trigger at, e.g., a Warning Notification when the devices reach 85% of the temperature at which they would probably shut down.
For a Warning, you might want to uncheck email to limit noise. We would suggest keeping Log and SNMP checked so you can leverage the reporting feature in the Current and Archive sub-tabs to create an audit around which devices are reaching that threshold. If you have a large number of devices to manage, this kind of quick audit reporting can be invaluable! You can start to see important trends that you would miss in your traditional vendor logs, such as:
“Is this happening in a particular data center all the time?”
“Is this happening on the same devices?”
“How long did it take for the temperature to come down?”
Please Note: This Rule does not have a time threshold because it triggers as soon as the device reaches the minimum threshold defined. If the temperature does not come down, the alert will stay unresolved until it falls below the thresholds you set.
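To make the "no time threshold" behavior and the audit questions above concrete, here is a small Python sketch. It assumes a list of hypothetical `(minute, percent_of_shutdown_temperature)` readings and reports when the issue would open and when it would resolve. It illustrates the idea only and is not the product's reporting feature.

```python
def breach_episodes(readings, threshold_pct):
    """readings: (minute, percent) pairs in time order.

    Returns a list of (opened_at, resolved_at) pairs. An issue opens on the
    first reading at or above the threshold and resolves on the first later
    reading below it. resolved_at is None while the issue is still open.
    """
    episodes = []
    opened_at = None
    for minute, percent in readings:
        if percent >= threshold_pct and opened_at is None:
            opened_at = minute                    # triggers immediately
        elif percent < threshold_pct and opened_at is not None:
            episodes.append((opened_at, minute))  # temperature came back down
            opened_at = None
    if opened_at is not None:
        episodes.append((opened_at, None))        # still unresolved
    return episodes

# Arbitrary readings for illustration.
readings = [(0, 80.0), (1, 86.0), (2, 88.0), (3, 84.0), (4, 90.0)]
episodes = breach_episodes(readings, threshold_pct=85.0)

# Minutes each resolved episode stayed open.
durations = [end - start for start, end in episodes if end is not None]
```

The resolved durations answer "How long did it take for the temperature to come down?", and grouping episodes by device or data center would address the other two questions.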
Next, create another rule to trigger an Error with an email message, so the agents are immediately notified when the device has not only breached the minimum acceptable heat threshold, but risen a further 5% beyond it. You could then add custom instructions for the agent to open an operations ticket to review the device, since they are receiving an email.
Finally, create a third rule to send a Critical email when the device has reached a temperature nearing shut down. You can change the custom instructions to have the agent call the Data Center directly to have the device reviewed immediately.
Please Note: Arbitrary numbers were used for this exercise so we would not recommend creating these exact rules in your live environment. Also, the All Devices label was used, but you can create labels based on data center location, device type, etc.
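The three escalating rules can be summarized in a short Python sketch. As in the exercise, the numbers are arbitrary. The 95% Critical tier, the reading of "5%" as five percentage points, the Log and SNMP actions on the Error and Critical tiers, and the `evaluate` helper itself are assumptions made for illustration.

```python
# (minimum percent of shutdown temperature, severity, actions, custom instructions)
# Ordered from most to least severe so the highest matching tier wins.
TIERS = [
    (95.0, "Critical", {"Email", "Log", "SNMP"},
     "Call the Data Center directly to have the device reviewed immediately."),
    (90.0, "Error", {"Email", "Log", "SNMP"},
     "Open an operations ticket to review the device."),
    (85.0, "Warning", {"Log", "SNMP"}, ""),  # no email, to limit noise
]

def evaluate(temperature, shutdown_temperature):
    """Return (severity, actions, instructions) for the highest tier reached,
    or None if the temperature is below the Warning threshold."""
    percent = 100.0 * temperature / shutdown_temperature
    for min_percent, severity, actions, instructions in TIERS:
        if percent >= min_percent:
            return severity, actions, instructions
    return None

# Arbitrary shutdown temperature of 100 degrees to keep the arithmetic obvious.
warning = evaluate(86.0, 100.0)
error = evaluate(91.0, 100.0)
critical = evaluate(96.0, 100.0)
normal = evaluate(80.0, 100.0)
```

Checking the tiers from most to least severe means a device at 96% raises only the Critical notification in this sketch. Whether overlapping rules in the product each raise their own issue is not covered here.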
Here you can disable the rule by Label or Device. This allows you to get even more granular in how you want the Rule to automate your devices.