Palo Alto Networks High Availability Health

High Availability, or HA, as most of you reading this know, is a method of configuring firewalls, routers or appliances for redundancy. HA plays a significant role in environments where system uptime is critical to business or customer traffic. Depending on the vendor, it can be set up in different modes, such as Active/Active or Active/Passive. Each mode offers different functions and benefits.

You rely on each benefit of HA in your environment functioning 100% of the time. So, how are you sure HA is going to work properly in the event of a device failure, network failure or even during a simple OS upgrade process? It starts with proper configuration based on best practices and verifying your HA design with testing and tweaking. It also depends greatly on monitoring the health of HA.

Here, I hope to cover important ways Indeni can ensure HA is configured to follow best practices and monitored for correct functionality. I’ll also cover how Indeni differentiates itself from other monitoring tools.

How Indeni Can Help:

Indeni combines knowledge from product experts with requests and comments crowdsourced from engineers on Indeni Crowd. These engineers have learned valuable lessons in the field and share their experiences in the Indeni Crowd. From this knowledge and experience we identify important processes that can be automated.

Typically, checking all of the configuration components and the state of HA health requires a combination of manual checks via the CLI and GUI, and perhaps some PAN-OS alerts when the HA state changes. Indeni consolidates and automates these otherwise manual tasks.

Indeni provides unique capabilities in monitoring HA device health. Sometimes just monitoring the HA health on its own is not enough.

Example:
What if the problem lies outside of the HA module itself? For instance, some HA configurations on Palo Alto Networks firewalls leave the standby unit's interfaces in a “power-down” state. Once that unit becomes active, those interfaces must be monitored in case one of them is, or remains, offline due to a switch failure or a similar problem. You want to know about any problem with those interfaces as soon as possible, right? Other monitoring solutions may not have the flexibility to enable and disable alerting on interface state and performance dynamically, based on the HA state. The best many solutions can do is dynamically add interfaces to be monitored. However, this then causes false alerts for interfaces that are down on the new passive unit.
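To make the idea concrete, here is a minimal Python sketch of interface alerting that depends on HA state. It is not Indeni's implementation; the Interface class, the interface_alerts function and the "active"/"passive" state strings are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Interface:
    """Hypothetical record of one firewall interface."""
    name: str
    link_up: bool


def interface_alerts(ha_state: str, interfaces: list[Interface]) -> list[str]:
    """Return alert messages for down interfaces, but only on the active unit."""
    # On a passive unit the data interfaces may be powered down on purpose,
    # so a "link down" there is expected and should not raise an alert.
    if ha_state != "active":
        return []
    return [f"{i.name} is down on the active unit"
            for i in interfaces if not i.link_up]
```

Because the HA state is evaluated on every check, a unit that becomes active after a failover starts alerting on its interfaces right away, and the newly passive unit stops producing false alerts.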

How Indeni Works:

Many components of HA configuration and health contribute to the data Indeni uses. Collecting this data, and determining what system administrators should be alerted on, requires a number of discovery scripts and carefully built rules.

I will now cover some of these important discovery scripts and alert rules used for Palo Alto Networks firewalls. I will be selecting a few that help demonstrate Indeni’s capabilities. For full detail on the HA scripts and rules for Palo Alto Networks devices and other vendors covered by Indeni please see the Community Knowledge Explorer.

1. RADIUS servers used do not match across cluster members-paloaltonetworks-panos

Description:
Indeni will identify when two devices are part of a cluster and alert if the RADIUS servers they use are different.

Remediation Steps:
Review the RADIUS configuration on each device to ensure they match.

How does this work?
This script pulls the Palo Alto Networks firewall’s active configuration and extracts the configured RADIUS servers from there.
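Once the server list has been extracted from each member, the comparison itself is simple. The following Python sketch shows one way to do it, assuming the extracted data is already available as a mapping from member name to server addresses. The server_mismatch function and the member names are hypothetical, not part of Indeni.

```python
from __future__ import annotations


def server_mismatch(servers_by_member: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each cluster member to the servers that other members have but it lacks."""
    # Union of every server configured anywhere in the cluster.
    all_servers = set().union(*servers_by_member.values())
    missing = {}
    for member, servers in servers_by_member.items():
        gap = all_servers - set(servers)
        if gap:
            missing[member] = gap
    return missing


if __name__ == "__main__":
    cluster = {
        "fw-a": ["10.0.0.5", "10.0.0.6"],
        "fw-b": ["10.0.0.5"],
    }
    # An empty result means the members agree; anything else is a mismatch.
    print(server_mismatch(cluster))
```

The same comparison applies unchanged to NTP or DNS server lists, since it only looks at sets of addresses.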

Why is this important?
Tracking the currently configured RADIUS servers on all devices is important to ensure consistent authentication and access.

Without Indeni, how would you find this?
An administrator may write a script to pull this data from devices and compare against a gold configuration.

Additional context:
RADIUS servers may accidentally be misconfigured or even missing between HA peers. However, in some architecturally specific use cases such as business continuity or disaster recovery, unique RADIUS servers could be expected. The primary site RADIUS servers may not become available in the new site/environment if they are not load balanced/mirrored.

*This is going to be a unique alert to the Indeni platform. If anyone knows of other systems that monitor this situation, please comment. Also, if you have a DR/BC scenario where you wish not to receive this alert you can simply set the alert to be ignored for those devices only.

2. NTP servers used do not match across cluster members-paloaltonetworks-panos

Description:
Indeni will identify when two devices are part of a cluster and alert if the NTP servers they use are different.

Remediation Steps:
Review the NTP configuration on each device to ensure they match.

How does this work?
This script pulls the Palo Alto Networks firewall’s active configuration and extracts the configured NTP servers from there.
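As an illustration of the extraction step, the Python sketch below pulls NTP server addresses out of an XML configuration. The sample XML is a simplified assumption made for this example and may not match the exact layout of a real PAN-OS configuration; the extract_ntp_servers function is a hypothetical name.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Simplified sample configuration. The element layout is an assumption
# made for illustration, not a copy of a real device configuration.
SAMPLE_CONFIG = """
<config>
  <system>
    <ntp-servers>
      <primary-ntp-server>
        <ntp-server-address>192.0.2.10</ntp-server-address>
      </primary-ntp-server>
      <secondary-ntp-server>
        <ntp-server-address>192.0.2.11</ntp-server-address>
      </secondary-ntp-server>
    </ntp-servers>
  </system>
</config>
"""


def extract_ntp_servers(config_xml: str) -> list[str]:
    """Return the sorted NTP server addresses found in the configuration."""
    root = ET.fromstring(config_xml.strip())
    return sorted(
        el.text.strip()
        for el in root.iter("ntp-server-address")
        if el.text and el.text.strip()
    )
```

Sorting the result makes the lists from two cluster members directly comparable, regardless of the order in which the servers were configured.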

Why is this important?
Tracking the currently configured NTP servers on all devices is important to ensure consistent time sync.

Without Indeni, how would you find this?
An administrator may write a script to pull this data from devices and compare against a gold configuration.

Additional context:
NTP servers may accidentally be misconfigured or even missing between HA peers. However, in some architecturally specific use cases such as business continuity or disaster recovery, unique NTP servers may be required. The primary site NTP servers may not become available in the new site if they are not load balanced/mirrored.

*This is going to be a unique alert to the Indeni platform. If anyone knows of other systems that monitor this situation, please comment. Also, if you have a DR/BC scenario where you wish not to receive this alert you can simply set the alert to be ignored for those devices only.

3. DNS servers used do not match across cluster members-paloaltonetworks-panos

Description:
Indeni will identify when two devices are part of a cluster and alert if the DNS servers used are different.

Remediation Steps:
Review the DNS configuration on each device to ensure they match.

How does this work?
This script pulls the Palo Alto Networks firewall’s active configuration and extracts the configured DNS servers from there.

Why is this important?
Tracking the currently configured DNS servers on all devices is important to ensure consistent name resolution.

Without Indeni, how would you find this?
An administrator may write a script to pull this data from devices and compare against a gold configuration.

Additional context:
DNS servers may accidentally be misconfigured or even missing between HA peers. However, in some architecturally specific use cases such as business continuity or disaster recovery, unique DNS servers may be required. The primary’s DNS servers may not become available in the new site if they are not load balanced but are instead synchronized redundantly.

*This is going to be a unique alert to the Indeni platform. If anyone knows of other systems that monitor this situation, please comment. Also, if you have a DR/BC scenario where you wish not to receive this alert you can simply set the alert to be ignored for those devices only.
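Ignoring an alert for specific devices amounts to a per-rule exclusion list. The Python sketch below shows the idea only; in Indeni this is done through the product itself, and the filter_ignored function, the alert dictionaries and the rule and device names here are all hypothetical.

```python
from __future__ import annotations


def filter_ignored(alerts: list[dict], ignored: dict[str, set[str]]) -> list[dict]:
    """Drop alerts whose rule has been set to ignored for that device."""
    return [
        alert for alert in alerts
        if alert["device"] not in ignored.get(alert["rule"], set())
    ]


if __name__ == "__main__":
    alerts = [
        {"rule": "dns-mismatch", "device": "fw-primary"},
        {"rule": "dns-mismatch", "device": "fw-dr"},
    ]
    # The DR firewall is expected to use different DNS servers.
    ignored = {"dns-mismatch": {"fw-dr"}}
    print(filter_ignored(alerts, ignored))
```

Keeping the exclusion per rule and per device means the same check still fires for every other cluster.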

4. Static routing table does not match across cluster members-paloaltonetworks-panos

Description:
Indeni will identify when two devices are part of a cluster and alert if their static routing tables are different.

Remediation Steps:
Ensure the static routing table matches across devices in a cluster.

How does this work?
This script uses the Palo Alto Networks API to retrieve the current routing table (the equivalent of running “show routing route” in CLI).

Why is this important?
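The script captures the route entries that are statically set on the device so they can be compared across cluster members. If a static route is missing on one member, traffic that reaches that member may be routed differently or dropped.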

Without Indeni, how would you find this?
An administrator would be able to poll this data through SNMP but additional external logic would be required to correlate the static routes table across cluster members.

Additional context:
This is extremely important in an Active/Active HA configuration. When in Active/Passive mode, this rule will not trigger. (see last line of the rule in Knowledge Explorer)

There may be certain network design scenarios where these would not match such as in a DR site. If HA spans Primary and Secondary Datacenters, they would only be identical with some type of datacenter interconnect to extend the IP subnets across sites like Cisco OTV or other failover equipment.
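A rough Python sketch of this comparison follows. It assumes the static routes have already been extracted from each member's routing table; the StaticRoute type, the static_route_diff function and the "active-active" mode string are hypothetical names used for illustration.

```python
from __future__ import annotations

from typing import NamedTuple


class StaticRoute(NamedTuple):
    """Hypothetical representation of one static route."""
    destination: str
    next_hop: str


def static_route_diff(
    ha_mode: str,
    local: list[StaticRoute],
    peer: list[StaticRoute],
) -> tuple[set[StaticRoute], set[StaticRoute]]:
    """Return (routes only on local, routes only on peer) for Active/Active pairs."""
    # Mirror the behavior described above: do nothing in Active/Passive mode.
    if ha_mode != "active-active":
        return set(), set()
    local_set, peer_set = set(local), set(peer)
    return local_set - peer_set, peer_set - local_set
```

Returning both directions of the difference shows which member is missing which route, which is what an administrator needs in order to fix the mismatch.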

5. Best Practice Check for Preemption on PAN-OS

Description:
Indeni will identify when preemption is enabled within the HA peer setup.

Remediation Steps:
Disable preemption on the HA pair. Leaving preemption off is considered a best practice.

How does this work?
This script uses the Palo Alto Networks API to retrieve HA state (the equivalent of running “show high availability all” in CLI).

Why is this important?
With preemption enabled, a bug, crash or other recurring failure on the primary may cause the HA pair to bounce back and forth between members, causing service interruptions.

Without Indeni, how would you find this?
An administrator would have to check the configuration details or manually check the output of “show high availability all”.
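The check could be sketched in Python as follows. The XML sample is a simplified assumption about what the HA state output might look like, and the element names in it, along with the preemption_enabled function, are used for illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified sample of HA state output. The element names are assumptions
# made for this example and may differ from what a real device returns.
SAMPLE_HA = """
<response status="success">
  <result>
    <group>
      <local-info>
        <state>active</state>
        <preemptive>yes</preemptive>
      </local-info>
    </group>
  </result>
</response>
"""


def preemption_enabled(ha_xml: str) -> bool:
    """Return True if the local HA information reports preemption as enabled."""
    root = ET.fromstring(ha_xml.strip())
    element = root.find(".//local-info/preemptive")
    # A missing element is treated as "not enabled".
    return element is not None and (element.text or "").strip().lower() == "yes"
```

A rule built on this would raise an issue whenever the function returns True.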

6. State Duration Too Long – panos-show-high-availability-state-duration

Description:
Indeni will identify when the HA state has been suspended for an extended period of time.

Remediation Steps:
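Check whether the firewall was suspended intentionally. A firewall that has been suspended for a long time usually means someone forgot to enable it again; if the suspension is no longer needed, return the firewall to service.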

How does this work?
This script uses the Palo Alto Networks API to retrieve HA state (the equivalent of running “show high availability state | match State:” in CLI).

Why is this important?
If the secondary is not available during a failover event it will cause a system outage.

Without Indeni, how would you find this?
An administrator would have to check the configuration details, manually check the output of “show high availability all”, or remember when and why the firewall was suspended.
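The logic behind such a rule can be sketched in a few lines of Python. This assumes the HA state and the time spent in that state are already known; the suspended_too_long function, the "suspended" state string and the 24-hour default threshold are hypothetical and are not Indeni's actual values.

```python
def suspended_too_long(
    state: str,
    state_duration_seconds: int,
    limit_seconds: int = 24 * 3600,  # hypothetical threshold: one day
) -> bool:
    """Return True if the unit has been suspended longer than the limit."""
    return state == "suspended" and state_duration_seconds > limit_seconds
```

A threshold is needed because a short suspension is normal during maintenance; only a suspension that outlasts any reasonable maintenance window suggests the unit was forgotten.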

7. Cluster configuration not synced-paloaltonetworks-panos

Description:
For devices that support full configuration synchronization, Indeni will trigger an issue if the configuration is out of sync.

Remediation Steps:
Log into the device and synchronize the configuration across the cluster.

How does this work?
This script uses the Palo Alto Networks API to retrieve the status of the high availability function of this cluster and specifically the status of the config synchronization.

Why is this important?
Normally, two Palo Alto Networks firewalls in a cluster work together to ensure their configurations are synchronized. Sometimes, due to connectivity or other issues, the configuration sync may be lost. In the event of a failover, the secondary member will take over but will be running a different configuration from the primary (the original active member). This can result in service disruption.

Without Indeni, how would you find this?
The status of configuration sync is visible in the web interface, as a widget on the main screen.
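A minimal Python sketch of the decision follows, assuming the sync status has already been retrieved from the device. The config_sync_issue function and the "synchronized" status string are assumptions made for illustration.

```python
def config_sync_issue(sync_enabled: bool, sync_state: str) -> bool:
    """Return True if config sync is enabled but the cluster is not in sync."""
    # Devices without config synchronization enabled are skipped, matching
    # the rule's scope of devices that support full synchronization.
    if not sync_enabled:
        return False
    return sync_state.strip().lower() != "synchronized"
```

Treating every status other than "synchronized" as an issue is a deliberate choice in this sketch: an unknown or in-between status is safer to surface than to hide.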

Summary

Indeni will alert based on best practices determined by the vendor, the community, and the knowledge experts writing the discovery scripts and rules. If you have special use cases it is best to make a request to the community or open a ticket with Indeni. If the rule is triggered on an environment where it isn’t necessarily an issue, feel free to disable the alert for that particular system.

I hope the information shared here sheds some light on the need for monitoring HA state and following best practices. I also hope it gives you a sense of the purpose and uniqueness Indeni brings to understanding your system health.