RMA Do’s and Don’ts for Check Point Firewalls

While reviewing our customers’ Check Point firewalls, I’ve identified a pattern that keeps repeating itself: many issues tend to appear right after an RMA.

The pattern that we observe is the following:

1. Two members of a firewall cluster are monitored successfully, alerts being issued, etc.
2. At some point, one member disappears (which indeni issues an alert for, of course).
3. Later, a new machine suddenly appears and is clearly not the old one (different SSH host key, serial number, etc.).
4. This new machine joins the cluster but there is a whole set of configuration issues.

Since this keeps happening, I would like to point out some common mistakes made when RMAing a Check Point firewall and installing the replacement device into production.

  • Missing or wrong licenses – depending on which licenses you use, the licenses attached to the old device may not be applicable to the new one. Keep in mind that the out-of-the-box replacement is usually provisioned with a 15-day trial license that allows all services during that period. Once the trial period passes, the device will stop providing service. Make sure to run cplic print -x and validate that the licenses are what you expect them to be. Trial licenses will appear as blank output from cplic.
  • Device-level configurations missing/mismatching – I highly recommend that you go over each of the following and make sure they are either identical to the other cluster member or similar (where appropriate):
    • Routing tables (netstat -rn, show route)
    • CoreXL and SecureXL (fw ctl multik stat, fw ctl affinity, fwaccel stat)
    • Any .def and .conf files you may have manually edited, such as fwkern.conf, ipassignment.conf, etc.
    • OS-level files, such as /etc/hosts, NTP, DNS, etc.
    • Interfaces (IP addresses, subnet in use, etc.)
  • Mismatching Firewall Policy – sounds crazy, but we run into this more than we’d expect. Somehow new devices are added to a cluster while running a different policy from the currently active member. Remember that policy isn’t just the rule base; it’s also the IPS signatures, VPN settings, etc.
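Part of the device-level comparison above can be mechanized. The following is a minimal sketch, assuming you have already captured the text output of each check (for example fw ctl multik stat, fwaccel stat, or the contents of /etc/hosts) from both cluster members by whatever means you normally use. The function name and the dictionary layout are hypothetical and are not part of any Check Point or indeni tooling.

```python
import difflib


def compare_member_outputs(member_a, member_b):
    """Return {check_name: unified diff lines} for every check whose output differs.

    member_a / member_b map a check name (e.g. "fwaccel stat") to the raw text
    captured from that cluster member. Collecting the text is out of scope here.
    """
    differences = {}
    for check in sorted(set(member_a) | set(member_b)):
        a_lines = member_a.get(check, "").strip().splitlines()
        b_lines = member_b.get(check, "").strip().splitlines()
        if a_lines != b_lines:
            differences[check] = list(
                difflib.unified_diff(a_lines, b_lines, "member_a", "member_b", lineterm="")
            )
    return differences


if __name__ == "__main__":
    # Illustrative data only; real captures would come from the two devices.
    old_member = {"/etc/hosts": "127.0.0.1 localhost\n10.0.0.5 mgmt", "ntp": "server 10.0.0.9"}
    new_member = {"/etc/hosts": "127.0.0.1 localhost", "ntp": "server 10.0.0.9"}
    for check, diff in compare_member_outputs(old_member, new_member).items():
        print(f"MISMATCH in {check}")
        print("\n".join(diff))
```

Some differences are expected (interface IP addresses, for instance), so the result is a list for a human to review rather than a pass/fail verdict.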

While there are many backup solutions out there, including our own, backing up isn’t the entire solution. It is just the first step. A good backup makes sure you have the content you need in order to rebuild the box. However, no backup solution provides you with complete 1-click recovery. So please make sure to go over the above checklist.

Using indeni to identify the above, as well as hundreds of other possible issues, will ensure that your next RMA procedure goes flawlessly. For us, it’s all about avoiding outages by pin-pointing issues before they turn critical. It takes less than an hour to install indeni (download now) and we’ll be happy to help you do it (contact our support).

Check Point and F5© BIG-IP© LTM© Alert of the Week: RX traffic drastically reduced post fail over, possible ARP issue

NOTE: The alert detailed below is given with a Check Point ClusterXL example, although F5 BIG-IP LTM is covered for this issue as well (see SOL7332).

This is a real life sample alert from indeni

Description:

A fail over was identified at Device time: Jul 18 03:02 2014 UTC, indeni time: Jul 18 03:02 2014 UTC. This device is now the active member of the cluster and in the period immediately following the fail over (3 minutes more or less) it received 0 packets compared to 104462 packets that were received by jcnj-fw2 (10.10.10.2) in a similar amount of time immediately BEFORE the fail over. This indicates the possibility that the surrounding network equipment may not be aware of the fail over on the layer 2 level.

Manual Remediation Steps:

It is possible this is caused by the fact that during a fail over the responsibility for the virtual IPs moves from one cluster member to the other and the MAC addresses change. ClusterXL issues gratuitous arps to deal with this but it may not work with your equipment. Please review SK50840 for more information.

How does this alert work?

indeni monitors the traffic passing through all members of an HA cluster. If, after a failover, the newly active member sees nowhere near the level of traffic that the previously active member saw before the failover, the alert is triggered.
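The comparison described here can be sketched in a few lines. This is a hedged illustration only: the function name and the 10% threshold are assumptions made for the example, not indeni's actual logic or values.

```python
def traffic_dropped_after_failover(packets_before, packets_after, min_ratio=0.1):
    """Return True when post-failover RX looks drastically lower than pre-failover RX.

    packets_before: packets received by the previously active member in a window
                    just before the failover.
    packets_after:  packets received by the newly active member in a window of
                    similar length just after the failover.
    min_ratio:      assumed threshold, chosen for illustration only.
    """
    if packets_before <= 0:
        return False  # no baseline traffic, so there is nothing to compare against
    return packets_after / packets_before < min_ratio


if __name__ == "__main__":
    # Figures from the sample alert: 104462 packets before, 0 packets after.
    print(traffic_dropped_after_failover(104462, 0))
    # A healthy failover, with made-up numbers for contrast.
    print(traffic_dropped_after_failover(104462, 98000))
```

With the figures from the sample alert the function returns True; with a post-failover count close to the baseline it returns False.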

Software Version Mismatch Cluster Members. Palo Alto Networks Alert Guide

This is a sample from our indeni alert guide for Palo Alto Networks Firewall.

Did you know?

As part of the normal operation of your Palo Alto Networks firewall, it updates the anti-virus, application identification and threat databases. In a cluster, this is done for each member separately on their own schedule. As a result, the databases may be different at certain times.

This, in itself, is not a problem. The problem arises if one member updates regularly while the other doesn’t update at all, as described in DOC-5592. One way to identify that this is happening is to look at the High Availability widget in the firewall’s dashboard. If you don’t have the widget, add it using the Widgets button.

Interestingly, there is an SNMP trap that is sent out when this issue occurs. However, it is extremely noisy and appears even when everything is OK, as you can see in the comments to the DOC linked to above.

At indeni, we believe alert fatigue is a real danger – as it causes you to ignore what really matters. Therefore, we’ve added the ability to be alerted only when the discrepancy persists for more than 30 minutes, as described below.

This is what the alert would look like in indeni:

Description:

This device has software bundles at versions that differ from other members of the cluster. To ensure optimal operation of the cluster, as well as cluster synchronization and fail-over (if used), these must be the same.

Mismatching software versions:

  • app-version
    app-version is at version 489-2600 on this device while the other device is at 490-2616.
  • av-version
    av-version is at version 1502-1977 on this device while other device is at 1505-1980.
  • threat-version
    threat-version is at version 489-2600 on this device while other device is at 490-2616.

Manual Remediation Steps:

Acquire and update the software bundles to resolve this discrepancy. For more information, read DOC-5592. Note that indeni waits 30 minutes before alerting, to ensure the update on the second member really was not successful, rather than delayed (as discussed in the DOC).

How does this alert work?

indeni runs hundreds of checks around the clock on the versions of different software and update packages to identify discrepancies.
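The idea of alerting only after the mismatch has persisted for 30 minutes can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the version strings are taken from the sample alert above, and the timestamps are made up.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=30)  # the wait described in the text


def find_mismatches(local, peer):
    """Return {bundle: (local_version, peer_version)} for bundles that differ."""
    return {
        name: (local.get(name), peer.get(name))
        for name in sorted(set(local) | set(peer))
        if local.get(name) != peer.get(name)
    }


def should_alert(first_seen, now, grace=GRACE_PERIOD):
    """Alert only once a mismatch has persisted for longer than the grace period."""
    return first_seen is not None and now - first_seen > grace


if __name__ == "__main__":
    local = {"app-version": "489-2600", "av-version": "1502-1977", "threat-version": "489-2600"}
    peer = {"app-version": "490-2616", "av-version": "1505-1980", "threat-version": "490-2616"}
    first_seen = datetime(2014, 7, 18, 3, 0)  # made-up time the mismatch was first observed
    print(find_mismatches(local, peer))
    print(should_alert(first_seen, datetime(2014, 7, 18, 3, 10)))  # still inside the grace period
    print(should_alert(first_seen, datetime(2014, 7, 18, 3, 45)))  # grace period exceeded
```

Keeping the first-seen timestamp separate from the comparison makes it easy to reset the timer once the versions match again.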

NAT connections (fwx_alloc) table limit approaching or reached

This is a real life sample alert from indeni that identifies Check Point firewall issues.

Description:

There are 9210 NAT connections stored in the fwx_alloc kernel table while the limit is 10000. When the limit is reached, new connections may fail.

Manual Remediation Steps:

In many cases, a sudden spike in connections has been attributed to a worm or misbehaving application. If you have ruled this out, consider the solutions suggested in SK32224. Note that a higher limit may result in more memory being used, so it is recommended that changes are made gradually.

How does this alert work?

indeni constantly monitors the usage of hundreds of kernel tables. Different kernel tables are associated with different SK articles and best practices. When a kernel table nears its limit, the specific SK articles and best practices are included in an alert.
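The "nears its limit" check boils down to a usage ratio per table. Here is a minimal sketch, assuming the current entry count and the limit of each table have already been collected. The function name, the 80% warning threshold and the second table in the sample data are assumptions made for illustration.

```python
def table_usage_alerts(tables, warn_ratio=0.8):
    """Return (name, used, limit, ratio) for tables at or above warn_ratio of their limit.

    tables maps a kernel table name to a (current_entries, limit) tuple.
    warn_ratio is an assumed threshold chosen for illustration.
    """
    alerts = []
    for name, (used, limit) in sorted(tables.items()):
        if limit <= 0:
            continue  # treat a non-positive limit as "no fixed limit" (assumption)
        ratio = used / limit
        if ratio >= warn_ratio:
            alerts.append((name, used, limit, ratio))
    return alerts


if __name__ == "__main__":
    # fwx_alloc figures come from the sample alert; the second entry is made up.
    sample = {"fwx_alloc": (9210, 10000), "example_table": (1200, 25000)}
    for name, used, limit, ratio in table_usage_alerts(sample):
        print(f"{name}: {used}/{limit} entries ({ratio:.1%} of limit)")
```

Only fwx_alloc is reported for this sample data, since 9210 of 10000 entries is above the assumed 80% threshold while the made-up table is far below it.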

Two Check Point Cluster Members Routing Tables Differ

How many times have you run into an outage caused by the secondary cluster member, now active, missing a route?

Want to avoid routing table mismatches from happening again? Here’s a sample of an alert you’d get with indeni:

Description:

The routing tables for the following two cluster members do not match: they show different static routes. This could cause problems during failover or under load sharing.

indeni will re-check this alert every 1 minute. If indeni determines the issue has been resolved, it will automatically be flagged as such.

Missing Routes:

  • 10.1.2.0/255.255.255.0 is routable from this device but not from CPG_01 (10.3.1.70)
  • 172.1.2.0/255.255.255.0 is routable from this device but not from CPG_01 (10.3.1.70)

Manual Remediation Steps:

Review the routing tables of both cluster members and resolve any discrepancies.
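If you want to do that review yourself, a set difference over the destination networks is a reasonable starting point. The sketch below assumes the routes of each member have already been extracted as network/netmask strings, as in the alert above. The function names are hypothetical, the third route in the sample data is made up, and next hops are deliberately ignored to keep the example short.

```python
import ipaddress


def normalize(routes):
    """Convert 'network/netmask' or 'network/prefix' strings into ip_network objects."""
    return {ipaddress.ip_network(route) for route in routes}


def missing_routes(this_device, peer):
    """Return routes present on this_device but absent from peer, sorted."""
    return sorted(normalize(this_device) - normalize(peer))


if __name__ == "__main__":
    # The first two routes come from the sample alert; 10.3.1.0/24 is made up.
    this_device = ["10.1.2.0/255.255.255.0", "172.1.2.0/255.255.255.0", "10.3.1.0/255.255.255.0"]
    cpg_01 = ["10.3.1.0/24"]
    for net in missing_routes(this_device, cpg_01):
        print(f"{net} is routable from this device but not from CPG_01")
```

Normalizing through ipaddress.ip_network means that 10.3.1.0/255.255.255.0 and 10.3.1.0/24 compare as the same route, so differences in notation between the two members are not reported as mismatches.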

ALERT HEADLINE: FIREWALL KERNEL TABLE LIMIT APPROACHING OR REACHED

This is an example alert from indeni.

Description:

Some kernel tables are approaching their capacity. This may result in unexpected behavior and even network traffic loss. The list of tables is provided below.

indeni will re-check this alert every 1 minute. If indeni determines the issue has been resolved, it will automatically be flagged as such.

Affected Kernel Tables:

  • pdp_sessions.
    There are 90000 items in the table while the limit is 90000.

Manual Remediation Steps:

Read SK101288.

Check Point Firewall Clusters Healthy Checklist

Each and every organization we work with goes through the trouble of setting up a cluster of firewalls in every single critical location in the network. The cluster is there to ensure that there is no single point of failure. It normally works very well – fail overs are smooth and traffic proceeds uninterrupted. However, sometimes, a fail over doesn’t go smoothly. If the fail over is performed intentionally during a maintenance window then it’s usually easy to revert. But what if the fail over occurs spontaneously during peak-time due to an issue such as a power outage in your primary data center?

Knowing that a bad cluster fail over during peak time is one of the most stressful situations you can be in, we took the time to prepare the checklist below. Hopefully, it will help you ensure future fail overs are smooth, even during peak times.

  • Routing table differences – this happens a lot less now that routing tables are controlled via the SmartDashboard, but it is still one of the top causes of fail over issues we’ve seen.
  • CoreXL, SecureXL, kernel parameter differences – check that the configurations of CoreXL, SecureXL, fwkern.conf and any other .conf or .def file you may have manually changed are the same across cluster members.
  • Lack of GARP support – happens more often than you’d think. If your outage only lasts 30 seconds (or so), this is probably your issue. The network equipment around your firewalls isn’t listening to the gratuitous ARPs sent out by the newly active cluster member. You may want to enable VMAC by following SK50840.
  • Poor sync performance – if your sync network is slow or congested, which is especially common in DC-to-DC sync networks, your cluster members may have trouble keeping up. This gets worse the more features/blades you enable. We recommend disabling sync on short-lived connections, like HTTP. Follow SK23695.
  • Wrong configuration of topology – take a close look at the results of “cphaprob -a if” on all cluster members and make sure they are the same. Don’t forget to make sure the same *cast (multicast/broadcast) is used.
  • Clock mismatch – make sure the clocks are the same on all cluster members. We highly recommend using NTP.
  • Hardware, software, license mismatch – you’d think this never happens, but it does. Don’t overlook this check – make sure the appliances are of the same model, the software (including hot fixes and HFA) is the same and the licenses installed are the same.
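Several of these checks are simple comparisons once the data is collected. As one example, here is a hedged sketch of the clock mismatch check. It assumes you have already read each member's clock at roughly the same moment as a trusted reference clock. The function name, the member names, the timestamps and the 5-second tolerance are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def skewed_members(device_times, reference, max_skew=timedelta(seconds=5)):
    """Return {member: offset} for members whose clock is off by more than max_skew.

    device_times maps a member name to the time read from that member.
    reference is the trusted time taken at (roughly) the same moment.
    """
    return {
        name: reported - reference
        for name, reported in device_times.items()
        if abs(reported - reference) > max_skew
    }


if __name__ == "__main__":
    reference = datetime(2014, 7, 18, 3, 2, 0)
    members = {
        "fw1": datetime(2014, 7, 18, 3, 2, 1),   # 1 second off, within tolerance
        "fw2": datetime(2014, 7, 18, 3, 4, 30),  # 150 seconds ahead
    }
    for name, offset in skewed_members(members, reference).items():
        print(f"{name} clock differs from reference by {offset.total_seconds():.0f} seconds")
```

The tolerance has to absorb the time it takes to query each member, so a value that is too tight will produce false alarms.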

You can use indeni to identify all of the above issues and hundreds more. For us, it’s all about avoiding outages by pin-pointing issues before they turn critical. It takes less than an hour to install (download now) and we’ll be happy to help you do it (contact our support).

Top 5 Issues To Look For When Troubleshooting Your Check Point Firewalls

We’ve recently taken a snapshot of alerts across all the customers using our indeni Insight service. It’s amazing to see what indeni finds in different devices, made by different vendors. I’d like to take the opportunity to share what we’ve found for Check Point firewalls in this post.

So, if you own Check Point firewalls, here are the top 5 challenges you should look out for. We recommend printing this and taping it to the wall. You’ll need it in your next outage.

1. NTP misconfigured – it’s amazing how this small configuration can be wrong in so many devices. It’s quite simple really – when you configured the NTP server, everything worked flawlessly. Then somebody changed the NTP server’s IP, or a rule in the firewall, or a route in a router, or a… (you get the idea)…and it broke. The trouble is, you don’t know it’s broken. If you’re lucky, you find out about it in an audit. If you’re unlucky, you find yourself scratching your head wondering why the logs coming out of your firewall are completely off.

Our Recommendation: run periodic checks to make sure the clocks are correctly set on all of your devices.

2. Policy install resulting in high CPU and a cluster fail over – a policy installation is a CPU-intensive process in many cases. The high CPU that results from policy installation may in turn result in the ClusterXL functionality misbehaving. We recommend looking out for traffic loss and/or cluster fail overs during policy installations and considering following SK32488.

Our Recommendation: if you notice flaky network traffic behavior post a policy install, take a look at SK32488.

3. Communication issues between the gateways and management – these result in a variety of issues, from the loss of logs (and firewalls logging locally) to VPN tunnels being taken down due to the gateway’s inability to check the CRL (which is on the management server’s certificate authority).

Our Recommendation: place the communication between gateways and management/log-servers on a separate, dedicated network and ensure that network isn’t touched. If it’s not possible to create this network physically, a logical one that is well communicated within the organization would help too.

4. Differences in configurations across cluster members – Check Point has been generous enough to allow its users to tune and configure every little knob in its products. The complication this presents, however, is that some configurations must be copied manually across cluster members or set differently in different members. If someone makes a change in one member and forgets to change the other, this can break. We’ve also seen many occasions where an RMA resulted in such a situation as a new device was brought online.

Our Recommendation: don’t make changes to cluster members in the middle of the night 🙂 Seriously though, when clusters behave oddly, check routing, .def files, .conf files, kernel parameters, SecureXL configs, CoreXL configs, etc. and make sure the configurations match across the cluster.

5. Errors, drops, collisions and various traffic issues – while these are basic, you’d be surprised how easily they are missed. Errors normally result from wrong duplex settings, while drops result from bursty traffic or from a lack of resources to handle the traffic that’s flowing (NIC resources or CPU/IRQ resources).

Our Recommendation: monitor the various interface stats closely and identify increases promptly.
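Spotting increases in interface counters is easy to script if you take periodic samples. The following is a minimal sketch, assuming each sample is a dictionary of per-interface counters gathered by your own collection method. The function name and the counter field names are hypothetical and are not tied to any specific command output.

```python
def counter_increases(previous, current,
                      fields=("rx_errors", "tx_errors", "rx_dropped", "collisions")):
    """Return {interface: {field: delta}} for counters that grew between two samples.

    previous / current map an interface name to a dict of cumulative counters.
    Counters that went down (e.g. after a reboot reset them) are ignored.
    """
    increases = {}
    for iface, now in current.items():
        before = previous.get(iface, {})
        deltas = {
            field: now.get(field, 0) - before.get(field, 0)
            for field in fields
            if now.get(field, 0) > before.get(field, 0)
        }
        if deltas:
            increases[iface] = deltas
    return increases


if __name__ == "__main__":
    # Made-up samples taken a few minutes apart.
    earlier = {"eth0": {"rx_errors": 10, "rx_dropped": 0}, "eth1": {"rx_errors": 0}}
    later = {"eth0": {"rx_errors": 25, "rx_dropped": 0}, "eth1": {"rx_errors": 0}}
    print(counter_increases(earlier, later))
```

Working with deltas rather than absolute values matters because these counters are cumulative: a large number that has not moved in months is far less interesting than a small one that is growing right now.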

Alternatively, you can use indeni to identify all of the above issues and hundreds more. For us, it’s all about avoiding outages by pin-pointing issues before they turn critical. It takes less than 45 minutes to install, no agents (download now) and we’ll be happy to help you do it (contact our support).