F5 LTM Load Balancing Methods: How to Reset Device Trust.
The official F5 SOL13946 provides information on troubleshooting device clustering and configuration sync for version 11.x F5 load balancers and other products; however, it is rather long-winded. This guide is designed as a quick reference when troubleshooting device clustering or config sync. An overview of the config sync process for version 9.x and 10.x units can be found in F5 SOL7024.
Communication between machines occurs in the following manner to form a device cluster:
- The mcpd process on the local machine connects to the tmm process on the local machine on port 6699.
- The tmm process then contacts the peer’s config sync IP on port 4353.
- Once the peer receives the connection, its tmm contacts mcpd over port 6699 on its own local device.
- If this process fails, it is re-attempted every 5 seconds.
- If this process succeeds, there is a mesh between peer mcpd processes.
* local machine here refers to the self IP configured for config sync. Check it under Device Management > Devices > click on device > Device Connectivity > Config Sync, for example.
Configuration sync itself occurs over TCP port 443.
A unit decides it needs to sync after any change to a configuration file: the change updates the timestamp, which in turn updates the “Commit ID” value. The unit with the latest Commit ID is therefore considered to have the most recent version of the configuration.
Network failover occurs over UDP port 1026.
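The ports named in this overview can be gathered into a small probe script. The following is a minimal sketch, not an F5-supplied tool: the function names are made up for illustration, the peer address is a placeholder you must supply, and it assumes bash and the `timeout` utility are available on the box you run it from. Only the TCP ports are actually probed, since a simple UDP probe cannot tell an open port from a filtered one.

```shell
#!/bin/bash
# Hypothetical helpers; names are illustrative only.

# Ports as listed in this guide: proto, port, role.
cluster_ports() {
    printf '%s\n' 'tcp 443 config-sync' 'tcp 4353 mesh' 'udp 1026 network-failover'
}

# Return 0 if a TCP connection to host ($1) port ($2) opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so plain sh will not work here.
tcp_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Walk the port list for one peer address and print a result per port.
check_peer() {
    peer=$1
    cluster_ports | while read -r proto port role; do
        if [ "$proto" = tcp ]; then
            if tcp_open "$peer" "$port"; then
                echo "OK      $proto/$port ($role)"
            else
                echo "FAILED  $proto/$port ($role)"
            fi
        else
            # UDP is not probed; verify it with a packet capture instead.
            echo "SKIPPED $proto/$port ($role)"
        fi
    done
}

# Example (placeholder address): check_peer 192.0.2.10
```

A FAILED line only tells you that the connection did not open from where you ran the script; it does not say whether the peer, a switch, or a firewall is at fault.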
F5 LTM Load Balancers Troubleshooting Methods:
- Identify the exact problem. Does config sync work but failover does not? Or does failover work but sync fails?
- If network failover is not enabled and no hardware cable is in place for failover, expect both units to be active.
- Double check the configuration, in particular the config sync IPs and management IPs; a mistake here will prevent anything from working. You can check the config sync IP with:
# tmsh list /cm device configsync-ip
Then try pinging it, for instance.
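If there are several devices, the list-then-ping check can be scripted. This is a rough sketch under a stated assumption: that the tmsh output puts each address on a line of the form `configsync-ip <address>` (with `none` where no address is set). The function names are hypothetical, so adjust the parsing if your version prints something different.

```shell
#!/bin/bash
# Hypothetical helpers, not F5-supplied tools.

# Read tmsh-style text on stdin and print one config sync IP per line.
# ASSUMPTION: addresses appear as "configsync-ip <address>"; "none" is skipped.
configsync_ips() {
    awk '$1 == "configsync-ip" && $2 != "none" { print $2 }'
}

# Ping each address once (2-second timeout) and report the result.
ping_configsync_ips() {
    configsync_ips | while read -r ip; do
        if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
            echo "reachable   $ip"
        else
            echo "unreachable $ip"
        fi
    done
}

# On the device: tmsh list /cm device configsync-ip | ping_configsync_ips
```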
- All devices in the cluster must be running the same software and hotfix (HF) version. If you are in the middle of an upgrade and the units are therefore on different versions, do not attempt to sync the configurations; you should not be making configuration changes during an upgrade!
- Any devices between the units must be checked. For example, are switch ports down? Is there an intermediary firewall dropping traffic? Remember, device group members should be able to communicate over ports TCP 443 (config sync), TCP 4353 (mesh) and UDP 1026 (network failover).
- Perform an action (whether it be config sync or failover) then check logs:
# tail -f /var/log/ltm
# grep -i configsync /var/log/ltm
# grep -i cmi /var/log/ltm
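The two grep commands can be folded into one filter so that matching lines stay in their original order. A minimal sketch follows; the function name is invented for illustration and the keywords are simply the two used above.

```shell
#!/bin/bash
# Hypothetical helper: keep only lines mentioning config sync or CMI,
# matched case-insensitively, in the order they appear in the log.
sync_log_lines() {
    grep -iE 'configsync|cmi'
}

# On the device: sync_log_lines < /var/log/ltm | tail -n 50
```

Bear in mind that "cmi" is a short pattern and may match unrelated words, so treat the output as a starting point rather than a complete picture.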
- Ensure necessary daemons mcpd, sod, devmgmtd and tmm are running:
# bigstart status
If any are not running, you can try to start or restart them as follows (change the daemon name as appropriate):
# bigstart start tmm
# bigstart restart mcpd
Note that restarting any of these processes will likely be traffic-affecting!
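Scanning the full status output by eye is easy to get wrong, so the check for the four daemons can be sketched as a filter. It assumes each line of `bigstart status` starts with the daemon name followed by its state, for example `mcpd run (pid 1234) ...` or `tmm down, ...`. That layout is an assumption, and the function name is hypothetical, so verify both against your own output.

```shell
#!/bin/bash
# Daemons this guide names as necessary for clustering.
REQUIRED_DAEMONS="mcpd sod devmgmtd tmm"

# Read status text on stdin and print each required daemon whose state is
# not "run" (including daemons missing from the output altogether).
# ASSUMPTION: field 1 is the daemon name, field 2 is its state.
daemons_not_running() {
    awk -v want="$REQUIRED_DAEMONS" '
        BEGIN { n = split(want, names, " ") }
        { state[$1] = $2 }
        END {
            for (i = 1; i <= n; i++)
                if (state[names[i]] != "run") print names[i]
        }'
}

# On the device: bigstart status | daemons_not_running
```

The sketch only reports; it deliberately does not restart anything, given the traffic impact noted above.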
- Check the devices are listening on port 6699:
# netstat -pan | grep -E 6699
- Ensure times are accurate on both machines, as variance can cause failures with device trust or config sync. See F5 SOL3122 on how to add an NTP server via the GUI.
- If everything looks like it should be working but isn’t, proceed to reset device trust as below:
- Force the standby unit of the pair offline via Device Management > Devices > (select device) > Force Offline.
- Go to Device Management > Device Trust > Reset Device Trust.
- Choose to create a new self-signed certificate (you can re-use the old one, but creating a new one is recommended). Create a CSR here if necessary (when using CA-signed certificates).
- Perform the above two steps on both units (reset device trust, create new certificate).
- Go to Device Management > Device Trust > Peer List > Add and enter the details of the other unit in the pair (you need its management address and the username/password for the GUI). Click ‘Finished’ when done. It is recommended to do this from the active unit.
- Go to Device Management > Device Groups and select the relevant device group. Then move the second device from ‘available’ to ‘includes’ and click ‘Update’. Repeat this step on the other unit if necessary. It shouldn’t be, but do check!
- Remove the ‘Offline’ status from the offline device, and ensure one unit remains ‘active’ and the other ‘standby’ (force a device offline/standby if it does not).
- Go to Device Management > Overview, select one of the units (usually self), choose sync to/from group as appropriate, then click ‘Sync’ to perform the sync.
Note that depending on the exact version you are running, there might be minor differences in the GUI, for example menu items may be in a slightly different place, but the above steps still apply.
The above steps also assume two devices in the cluster; amend appropriately if you have more than two.
The following steps are for version 10.x. They also apply to version 9.x, but if you are running that version, please upgrade. An upgrade guide is available here.
There is no ‘device trust’ to reset as such in version 10, and you are limited to a two-device High Availability pair. Configuration mistakes are the most common cause of config sync or failover not working.
- Ensure the management IP and unicast/multicast addresses are correctly set. Commonly, unicast is used on a self IP in a separate VLAN, or on the internal VLAN self IP. It is set under System > High Availability > Network Failover.
- Ensure communication is possible between the config sync IP addresses over port 443. As a first check, ping between the devices.
- Ensure the management port IPs can communicate with each other.
- Ensure communication is possible between the failover IPs on port 1026 (UDP).
- Ensure the ‘redundancy state preference’ (see System > High Availability > Redundancy) is set correctly. For an active/standby pair, ensure one device is set to ‘active’ and the other to ‘standby’, or both are set to ‘none’. No other combination is correct.
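The rule for valid ‘redundancy state preference’ combinations is short enough to write down as a check. The sketch below is illustrative only: the function name is made up, and the two values have to be read off each unit’s GUI by hand, since nothing here queries the devices.

```shell
#!/bin/bash
# Hypothetical helper: return 0 when the two units' redundancy state
# preferences form one of the combinations this guide lists as correct,
# and 1 for anything else.
valid_preference_pair() {
    case "$1/$2" in
        active/standby|standby/active|none/none) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   valid_preference_pair active standby && echo "preferences are consistent"
```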