Every organization we work with sets up a cluster of firewalls at each critical location in the network so that there is no single point of failure. It normally works very well: failovers are smooth and traffic proceeds uninterrupted. Sometimes, however, a failover doesn’t go smoothly. If the failover is performed intentionally during a maintenance window, it’s usually easy to revert. But what if it occurs spontaneously at peak time because of an issue such as a power outage in your primary data center?
A bad cluster failover at peak time is one of the most stressful situations you can be in, so we took the time to prepare the checklist below. Hopefully, it will help you ensure that future failovers are smooth, even at peak times.
- Routing table differences – this happens a lot less now that routing tables are controlled via the SmartDashboard, but it is still one of the top causes of failover issues we’ve seen.
- CoreXL, SecureXL, kernel parameter differences – check that the configurations of CoreXL, SecureXL, fwkern.conf and any other .conf or .def file you may have manually changed are the same across cluster members.
- Lack of GARP support – happens more often than you’d think. If your outage only lasts 30 seconds (or so), this is probably your issue. The network equipment around your firewalls isn’t listening to the gratuitous ARPs sent out by the newly active cluster member. You may want to enable VMAC by following sk50840.
- Poor sync performance – if your sync network is slow or congested, which is especially common in DC-to-DC sync networks, your cluster members may have trouble keeping up. This gets worse the more features/blades you enable. We recommend disabling sync for short-lived connections such as HTTP, as described in sk23695.
- Wrong configuration of topology – take a close look at the results of “cphaprob -a if” on all cluster members and make sure they match. Don’t forget to make sure every member uses the same mode (multicast or broadcast).
- Clock mismatch – make sure the clocks are the same on all cluster members. We highly recommend using NTP.
- Hardware, software, license mismatch – you’d think this never happens, but it does. Don’t overlook this check: make sure the appliances are the same model, the software (including hotfixes and HFAs) is the same and the installed licenses are the same.
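Most of the checks above come down to collecting the same output from every cluster member and looking for differences. As a rough illustration only, here is a minimal Python sketch of that comparison. It assumes you have already gathered the text yourself (a routing table dump, a .conf file, the “cphaprob -a if” results) and it makes no attempt to run anything on the firewalls. The function name `find_mismatches`, the member names and the sample data are all made up for the example.

```python
from collections import defaultdict


def find_mismatches(outputs):
    """Return lines that are not present on every member.

    `outputs` maps a member name to the text collected from it.
    The result maps each differing line to the members that have it.
    """
    members = set(outputs)
    seen = defaultdict(set)
    for member, text in outputs.items():
        for line in text.splitlines():
            # Collapse runs of whitespace so spacing alone is not a mismatch.
            normalized = " ".join(line.split())
            if normalized:
                seen[normalized].add(member)
    return {line: sorted(found) for line, found in seen.items() if found != members}


# Hypothetical sample data, not real command output.
outputs = {
    "fw-a": "route 10.0.0.0/8 via 192.0.2.1\nroute 172.16.0.0/12 via 192.0.2.1\n",
    "fw-b": "route 10.0.0.0/8  via 192.0.2.1\n",
}

mismatches = find_mismatches(outputs)
for line, found_on in mismatches.items():
    print(f"only on {', '.join(found_on)}: {line}")
```

Some differences are expected (each member has its own physical IP addresses, for instance), so treat the result as a list of things to review rather than a list of faults.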
You can use indeni to identify all of the above issues and hundreds more. For us, it’s all about avoiding outages by pinpointing issues before they turn critical. It takes less than an hour to install (download now) and we’ll be happy to help you do it (contact our support).