Border Gateway Protocol (BGP) is the routing protocol underpinning the Internet. As networks connect with each other, they need a way to tell one another which destinations they can reach. BGP lets the networks that make up the Internet exchange this routing information.
BGP is complex enough that many network outages can be traced to BGP issues, ranging from unintentional misconfigurations to intentional BGP hijacks. BGP recently received a fair share of attention as a result of the Facebook outage in October 2021, when a BGP misconfiguration brought down Facebook, Instagram and WhatsApp for hours. Incidentally, we have also been working with several customers on BGP-related issues in their environments that resulted in loss of connectivity to the Internet. In this blog post, we’ll delve into the types of BGP issues and how Indeni helps you minimize their impact.
How to detect BGP State?
The first challenge is that Check Point secure gateways have no predefined SNMP OID for BGP state. This means your SNMP-based tools cannot monitor BGP peer relationships out of the box. To overcome this, Indeni runs the “show bgp peers” CLI command at regular intervals to retrieve the BGP state of each peer.
A BGP session moves through the following states of a finite state machine:
Idle: This is the initial state of a BGP connection. The BGP speaker is waiting for a start event. Once one occurs, it initiates a TCP connection to the peer and moves to the next state. While in the idle state, the device is not trying to establish a BGP session with its peer. Under normal conditions, the idle state should only be temporary. When severe error conditions persist, the session can remain idle indefinitely, which means the BGP session is effectively shut down. If the firewall is an active member in a cluster configuration, a prolonged idle state is indicative of a problem.
Connect: This is the connection phase, in which BGP waits for the TCP connection to complete. If the TCP connection completes, BGP sends an Open message and moves to the OpenSent state. If the connection does not complete, BGP goes to Active. This is a transient state and is not considered an issue.
Active: This indicates that the BGP speaker is still trying to create a peer relationship with the remote device. If this is successful, the BGP state goes to OpenSent. You’ll occasionally see a BGP connection flap between Active and Connect. This typically indicates an issue with the physical connection, such as the cable itself, or with the configuration.
OpenSent and OpenConfirm: In OpenSent, the BGP speaker has sent its Open message and is waiting for one from the peer. Once it has received an Open message from the peer, it moves to OpenConfirm and waits for a keepalive. Similar to the Connect state above, these are transient states.
Established: Once a keepalive is received in the OpenConfirm state, the state reaches Established. The peer relationship has been successfully established and the peers can exchange routing information.
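The transitions described above can be sketched as a small lookup table. This is a simplified illustration, not a full implementation of the BGP state machine: the event names are made up for readability, and any event that is not listed is treated as an error that drops the session back to Idle.

```python
from enum import Enum


class BgpState(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"


# Simplified transition table. Event names are illustrative assumptions,
# not protocol-defined identifiers.
TRANSITIONS = {
    (BgpState.IDLE, "start"): BgpState.CONNECT,
    (BgpState.CONNECT, "tcp_connected"): BgpState.OPEN_SENT,
    (BgpState.CONNECT, "tcp_failed"): BgpState.ACTIVE,
    (BgpState.ACTIVE, "tcp_connected"): BgpState.OPEN_SENT,
    (BgpState.ACTIVE, "retry_timer_expired"): BgpState.CONNECT,
    (BgpState.OPEN_SENT, "open_received"): BgpState.OPEN_CONFIRM,
    (BgpState.OPEN_CONFIRM, "keepalive_received"): BgpState.ESTABLISHED,
    (BgpState.ESTABLISHED, "keepalive_received"): BgpState.ESTABLISHED,
}


def next_state(state, event):
    """Return the next state; unlisted events fall back to Idle."""
    return TRANSITIONS.get((state, event), BgpState.IDLE)


def run(events, state=BgpState.IDLE):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state
```

For example, `run(["start", "tcp_connected", "open_received", "keepalive_received"])` ends in `BgpState.ESTABLISHED`, while `run(["start", "tcp_failed"])` ends in `BgpState.ACTIVE`.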
Monitoring cannot simply check for the established state. We must account for the transient states by delaying alert notifications, which keeps the noise level down. The idle state is particularly tricky. Besides being temporary under normal conditions, it also depends on High Availability. When a firewall is in passive or backup mode in a clustered environment, idle is the desired state. Alerting without understanding this context produces false positives. With cluster awareness, Indeni sends notifications only if an active gateway in a cluster is idle.
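As a rough sketch of this logic, the function below combines an alert delay with cluster awareness. It is not Indeni's actual implementation: the function name, its parameters and the 300-second default grace period are all assumptions chosen for illustration.

```python
def should_alert(state, seconds_in_state, is_active_member, grace_seconds=300):
    """Decide whether a BGP peer state deserves a notification.

    state            -- BGP state name, e.g. "Idle" or "Established"
    seconds_in_state -- how long the peer has been in that state
    is_active_member -- True if this gateway is the active cluster member
    grace_seconds    -- hypothetical delay that filters out transient states
    """
    if state == "Established":
        return False
    # Idle is the desired state on a passive or backup cluster member.
    if state == "Idle" and not is_active_member:
        return False
    # Anything else is only a problem once it outlasts the grace period.
    return seconds_in_state >= grace_seconds
```

With these assumptions, a standby member that sits in Idle for an hour stays quiet, a peer that spends ten seconds in Connect stays quiet, and an active member that has been Idle for ten minutes triggers a notification.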
More than just detecting BGP State
Establishing a peer relationship with a BGP peer is just part of the equation. We also need to ensure that BGP routes are active and not “hidden”. If the peer relationship is established but the routes are marked hidden, this may be a problem. A hidden route is one that the routing process knows about, because it learned it from the BGP peer, but does not pass along to the routing table of the secure gateway.
Indeni takes an extra step to ensure there are active BGP routes for all the active members in a cluster. It also checks for hidden routes by collecting the number of BGP routes hidden in the routing table using the clish command “show route bgp all”.
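A minimal sketch of such a check is shown below. It assumes the active and hidden route counts have already been extracted from the command output. The function name and the wording of the findings are hypothetical, and parsing the clish output itself is deliberately left out.

```python
def route_findings(peer_established, active_routes, hidden_routes):
    """Return a list of route-related findings for one active cluster member."""
    findings = []
    if not peer_established:
        # A peer that is down is reported by the state check instead.
        return findings
    if active_routes == 0:
        findings.append("no active BGP routes")
    if hidden_routes > 0:
        findings.append(f"{hidden_routes} hidden BGP route(s)")
    return findings
```

An established peer with healthy routes produces an empty list, so a notification is only needed when the list is non-empty.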
Common causes of hidden routes
Indeni presents the most common causes of hidden route issues as part of the standard recommended remediations to help you with troubleshooting. The common causes range from misconfigurations to complex BGP deployment scenarios.
- Missing inbound route filter configuration to accept routes from an Autonomous System. A routemap must be configured, and BGP must be configured to use that routemap. Review the sk87420 knowledge article for more information about BGP configuration.
- Hidden routes might occur when there are at least three BGP peers, two of which are adjacent and in the same Autonomous System, while the third BGP peer is in a different Autonomous System. Review the sk107544 knowledge article for more information and resolution.
- A routemap or inbound-filter-policy is configured to accept all routes from the BGP Autonomous System. The BGP peer advertises the routes but also prepends the local Autonomous System to the AS path, which looks like a routing loop and makes the route less preferable and in some cases unusable. Refer to the sk173204 knowledge article for more information and resolution.
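The last cause relates to standard BGP loop prevention: a speaker does not use a route whose AS path already contains its own AS number. The check itself is simple, as this sketch shows. The function name is hypothetical and the AS numbers come from the private range, purely for illustration.

```python
def contains_local_as(as_path, local_as):
    """Return True if the local AS number appears anywhere in the AS path."""
    return local_as in as_path


# Example with private AS numbers: the gateway is in AS 65001.
clean_path = [65010, 65020]
looped_path = [65010, 65001, 65020]
```

Here `contains_local_as(looped_path, 65001)` is true, so a route carrying that path would be treated as a loop, while a route with `clean_path` would not.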
Automatically Triage BGP Peer Down Issues
When a BGP peer down issue is detected, Indeni runs its own investigative steps, the same ones that are normally run manually. The steps gather additional contextual diagnostic information and perform in-depth analysis. Indeni automatically applies device-specific domain knowledge to accelerate root cause analysis. It follows a troubleshooting workflow with branches curated by industry experts. Applying domain knowledge is key to determining what relevant information needs to be collected while the problem is happening so an accurate diagnosis is possible.
This chart illustrates the troubleshooting workflow for BGP. Multiple conditions and scenarios must be considered, meaning the troubleshooting steps consist of different branches based on the configuration. As you can see from the workflow diagram, troubleshooting layer 2 BGP connectivity is different from troubleshooting layer 3 connectivity.
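One plausible way to choose between the two branches is to check whether the peer sits on a directly connected subnet. The sketch below rests on that assumption. It is an illustration only, not necessarily the criterion Indeni uses, and the function name and example addresses are hypothetical.

```python
import ipaddress


def pick_branch(peer_ip, local_interfaces):
    """Return "L2" if the peer is on a directly connected subnet, else "L3".

    peer_ip          -- BGP peer address as a string
    local_interfaces -- interface addresses in CIDR form, e.g. "192.0.2.1/30"
    """
    peer = ipaddress.ip_address(peer_ip)
    for cidr in local_interfaces:
        # ip_interface keeps the host address and exposes its network.
        if peer in ipaddress.ip_interface(cidr).network:
            return "L2"
    return "L3"
```

With an interface configured as `192.0.2.1/30`, a peer at `192.0.2.2` would follow the layer 2 branch, while a peer at `198.51.100.7` would follow the layer 3 branch.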
L2 BGP Peer Troubleshooting
For layer 2 connectivity, Indeni performs various tests including testing for unicast packets and BGP port accessibility, checking for carrier counters and examining ARP table entries. Possible root causes that can be identified are:
- Interface errors due to drops or collisions
- Link flapping
- BGP port 179 is down
L3 BGP Peer Troubleshooting
For layer 3 connectivity, Indeni performs ping tests, unicast packet reachability tests and BGP port reachability tests. Possible root causes are:
- Routing issue such as missing static routes
- BGP port 179 is down
- Firewall blocking access
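The BGP port reachability test, which appears in both the layer 2 and layer 3 branches, can be approximated with a plain TCP connection attempt to port 179. This is a minimal sketch run from a management host, not the test Indeni performs on the gateway. The function name and timeout are assumptions, and a successful TCP handshake only shows that the port is open, not that a BGP session would form.

```python
import socket


def bgp_port_reachable(peer_ip, port=179, timeout=3.0):
    """Return True if a TCP connection to the peer's BGP port succeeds."""
    try:
        with socket.create_connection((peer_ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A refused or timed-out connection points toward the "BGP port 179 is down" or "Firewall blocking access" causes listed above, while a successful connection suggests looking elsewhere.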
Depending on the configuration and situation, Indeni walks down a different branch of the troubleshooting workflow. In some cases, we can identify the root cause and present prescriptive remediations for speedy resolution. Even if the root cause cannot be identified, you will have captured useful diagnostic information about the problem for escalation. With Auto-Triage, we have effectively automated the investigative steps, without human intervention, further reducing time to resolution and increasing efficiency.
Indeni is here to help you with any BGP-related issues, whether proactively to avoid bigger problems or reactively to get you up and running quickly. Let us know if you don’t see your favourite Auto-Detect elements or you want additional troubleshooting steps.
If you are new to Indeni, let us bootstrap your network automation initiatives. Download a free trial today.