The Cisco Nexus family includes a wide range of switch models to meet the demands of any Data Center (DC) environment. Indeni supports a broad range of Nexus Series switches and Cisco Data Center technologies.
See below for background information and best practices for minimizing the CPU and memory impact on a Nexus switch during Indeni's discovery, analysis, and automation phases.
What you need to know about Cisco Nexus Devices & Indeni
Are you a Network Administrator, System Engineer, Software Engineer, Indeni Knowledge Expert (IKE), or Indeni Software Developer? If so, read on: this overview is for you. To run its full set of intelligent knowledge checks on a Nexus switch, Indeni needs SSH access (TCP port 22) to the device.
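Before adding a switch, it can be useful to confirm that TCP port 22 is reachable from the monitoring host. The following is a minimal sketch using only the Python standard library. The function name `tcp_port_open` and the address in the usage comment are hypothetical and not part of Indeni; the check only verifies that a TCP connection can be opened, not that SSH authentication will succeed.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Usage (hypothetical management address from the documentation range):
#   tcp_port_open("192.0.2.10")
```

A `False` result can mean a firewall, an ACL, a disabled SSH feature, or simply an unreachable address, so it is only a first diagnostic step.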
For a limited number of NX-OS versions, the discovery, monitoring, and automation of a Nexus switch can be CPU intensive and can degrade the overall performance of the switch. The cause is related to SSH, the protocol that network monitoring and automation systems such as Indeni use to connect to a network device such as a Nexus switch. Cisco is aware of the issue, and the relevant bugs have been officially published in the Cisco bug repository portal. The bug affects only a small number of NX-OS versions, and Cisco has already published several NX-OS versions that resolve the problem. For reference, NX-OS is the operating system of the Cisco Nexus Series switches: a Linux-based, next-generation operating system with high availability, modularity, resiliency, and serviceability at its foundation.
Indeni & Nexus Lab Environment
Hardware and software information about the network devices used in the lab for the Proof of Concept can be found below.
|License Type: Enterprise|
The following network diagram illustrates the Network topology as well as the private IP address allocation of the remote Network Admin user and Indeni.
A user named “indeni” has been created on the Nexus switch, and its privilege level is configured as network-admin. There are two active sessions to the Nexus switch: the first SSH session was initiated by Indeni and the second by the Network Administrator.
Nexus 3048 NX-OS upgrade
First, the Nexus 3048 is upgraded to the current recommended NX-OS release listed on the Cisco Software Portal. The following screen capture shows that 7.0(3)I4(7) is currently the NX-OS image with the fewest bugs and caveats.
This Cisco NX-OS version has been installed on the Nexus 3048, as illustrated below:
Nexus 3048 Discovery by Indeni
Before the discovery phase, the logs, CPU utilization, and memory utilization of the Nexus 3048 were reviewed, and everything appeared normal. CPU utilization is below 10% and only one remote session to the Nexus 3048 is active. That session is used by the Network Administrator.
The Nexus switch is discovered by following the simple procedure of adding it to Indeni as a new device. The interrogation completes successfully and the Nexus 3048 switch is added to the Indeni platform.
Information about the installed software version of the switch and its active users is also available through Indeni Liveconfig.
A new SSH session to the Nexus switch has been created. This time the SSH session is initiated by Indeni, as shown in the next output (red font).
A couple of hours after the discovery phase, the number of sshd processes starts to increase, which has a major impact on CPU utilization. In particular, the SSH daemon hangs after a short period of time and consumes more than 48% of the CPU. This process is highlighted in red.
The number of active SSH sessions remains unchanged at two. However, the dcos_sshd process has not terminated normally.
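The comparison described above (more dcos_sshd processes than active sessions) can be sketched in a few lines of Python. The sample text below is made up for illustration and is not the capture from the lab; its column layout (PID first, a CPU value ending in `%`, process name last) is an assumption about the process listing, and the names `parse_process_lines` and `excess_processes` are hypothetical. Treat the difference as a rough hint, not a definitive count of hung sessions.

```python
from typing import List, NamedTuple


class ProcSample(NamedTuple):
    pid: int
    cpu_percent: float
    name: str


def parse_process_lines(output: str, name: str = "dcos_sshd") -> List[ProcSample]:
    """Collect rows whose last column equals `name` (assumed layout, see lead-in)."""
    samples = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[-1] != name:
            continue
        cpu_fields = [f for f in fields if f.endswith("%")]
        if not fields[0].isdigit() or not cpu_fields:
            continue
        samples.append(ProcSample(int(fields[0]), float(cpu_fields[0].rstrip("%")), name))
    return samples


def excess_processes(samples: List[ProcSample], active_sessions: int) -> int:
    """Processes beyond the number of active sessions; a rough hint of hung sessions."""
    return max(0, len(samples) - active_sessions)


# Made-up listing for illustration only (not real device output).
SAMPLE_OUTPUT = """\
PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
 4012        91234    55012     12  48.50%  dcos_sshd
 4377          310      402      1   0.00%  dcos_sshd
 4490          295      377      1   0.00%  dcos_sshd
 5120         1200     9001      3   1.20%  netstack
"""

procs = parse_process_lines(SAMPLE_OUTPUT)
busy = [p for p in procs if p.cpu_percent > 40.0]  # threshold is an arbitrary choice
print(len(procs), excess_processes(procs, active_sessions=2), [p.pid for p in busy])
```

With two active sessions and three matching processes, the sketch reports one process that no session accounts for, which mirrors the situation described in the text.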
Further investigation shows that the SSH session hangs because it is waiting for the “close_ack” packet before it can terminate.
After a couple of days, the number of hung SSH sessions on the Nexus 3048 exceeds 60. CPU utilization on the Nexus switch is at 100% and the switch becomes unstable.
The capture below shows CPU utilization on the switch at 100% over the last 72 hours.
The total number of hung SSH sessions reached 64.
The Nexus switch is barely operational and Indeni loses communication with it. Attempts to rediscover the device fail because the CPU resources are exhausted and the switch cannot handle Indeni's requests.
NX-OS Hung SSH Bug
The Cisco bug repository was searched for officially reported NX-OS bugs relevant to this behavior. The search across the Cisco Prime and Nexus switch entries shows that there is a known bug with ID CSCui76897.
The workaround published by Cisco does not involve a configuration change or software upgrade on the Prime platform side (which in this case could equally be Indeni). Instead, it recommends upgrading the NX-OS software release (see the capture below).
Further investigation in the Cisco bug repository (full bug details are accessible only with a CCO account) shows that this issue has officially affected several Nexus Series switches, including N5k, N7k, and N9k models. In all cases, the resolution to this serious problem (ranked as high as Severity 2) is either a software upgrade on the Nexus switches or a workaround that disables the SSH service. The workaround can only be a temporary solution, since SSH is a secure protocol and an essential service for managing Nexus switches.
A few indicative bugs related to this issue are summarized in the following table, which lists the bug information per device model.
|Device||Bug Description||Bug ID||Link|
|Nexus 7k||SSH sessions are seen if client is not sending close ack.||CSCue74597||https://bst.cloudapps.cisco.com/bugsearch/bug/CSCue74597|
High cpu on N7K due to dcos_sshd
High CPU caused by dcos_sshd process
Stale SSH sessions are seen if client is not sending close ack
High CPU on N5K from dcos_sshd process
CA is not cleaning up the CLI Session with N7k & N5k
In addition, several active threads on the Cisco Support Community discuss this issue on Nexus switches.
NX-OS SSH Workaround
As a temporary workaround, Cisco recommends disabling the SSH protocol. The workaround is implemented and only the telnet service is enabled for remote access to the Nexus 3048.
CPU utilization drops sharply and the CPU pattern becomes stable and normal. The SSH feature is then enabled again, and CPU utilization of the switch remains normal for only a couple of hours.
The switch is successfully discovered by Indeni, which uses the SSH protocol to discover the Nexus switch.
CPU utilization stays low for only a couple of hours, until the problem with hung sessions reappears. The next capture, collected by Indeni, shows that it took less than 3 hours for the problem to reappear.
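The time until the problem reappears can also be estimated from a series of timestamped CPU readings. The sketch below is one hedged way to do this. The function name `time_to_sustained_high_cpu`, the 80% threshold, the three-sample window, and all sample values are assumptions chosen for illustration; they are not measurements from the lab.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def time_to_sustained_high_cpu(
    samples: Iterable[Tuple[datetime, float]],
    start: datetime,
    threshold: float = 80.0,
    min_consecutive: int = 3,
) -> Optional[timedelta]:
    """Elapsed time from `start` to the first run of `min_consecutive`
    readings at or above `threshold`; None if no such run exists."""
    run_start = None
    run_length = 0
    for timestamp, cpu in samples:
        if cpu >= threshold:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            if run_length >= min_consecutive:
                return run_start - start
        else:
            # A single low reading resets the run, so short spikes are ignored.
            run_length = 0
            run_start = None
    return None


# Hypothetical readings taken every 30 minutes after SSH was re-enabled.
start = datetime(2024, 1, 1, 9, 0)
cpu_values = [8, 9, 85, 10, 12, 90, 95, 99]  # made-up CPU percentages
samples = [(start + timedelta(minutes=30 * i), cpu) for i, cpu in enumerate(cpu_values)]
print(time_to_sustained_high_cpu(samples, start))  # prints 2:30:00
```

Requiring several consecutive high readings is a design choice that keeps one-off spikes, such as the isolated 85% reading in the sample data, from being counted as the problem returning.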
Another workaround was also tested on the lab Nexus switch: the switch was configured to time out any SSH session after 5 minutes of inactivity. NX-OS did not behave as expected this time either, and the hung sessions still exist. The relevant config can be found below:
Conclusion & Recommendations
Cisco NX-OS appears to have an issue with SSH sessions that hang after a short period of time. The bug causes high CPU utilization and can affect various NX-OS versions and switch models. Several Cisco bugs and community threads report the same behavior with different network monitoring platforms (e.g. Nagios, Cisco Prime) that use SSH to collect information from Nexus switches, resulting in hung SSH sessions and high CPU utilization on the switches. The solution Cisco officially proposes does not concern the network monitoring system, whether that is Prime, Nagios, or Indeni, but is an upgrade of the NX-OS software on the switch. As a temporary solution, Cisco also proposes disabling SSH (!) on the Nexus switch. On the N3k lab switch, SSH does not seem to work properly even when sessions are configured to time out after 5 minutes of inactivity.
If Indeni is deployed to analyze a Cisco Data Center network where the customer already runs another NMS that uses SSH to collect statistics and metrics, this bug should not be a concern for the Indeni SE. Most large Cisco Data Center networks already in production have at least one NMS (e.g. Nagios, Cisco Prime DCNM) that uses SSH to collect information from the Nexus switches. More caution is needed if Indeni would be the first analysis or automation platform in a Data Center network to collect metrics via SSH. Even then, the concern is not Indeni itself but the Nexus switches, in particular those running an NX-OS image affected by the aforementioned bug.
Thank you to Vasileios Bouloukos for his contributions to this article.