Network engineers tend to believe that outages, when they happen, come from unexpected causes. After all, if an outage can be predicted, it can be prevented, right? Not quite. The cause of your next outage is probably already present in your network, and you can stop the outage if you can find that cause.
In an episode of the podcast “Cautionary Tales”, Tim Harford shared two principles of system failure. First, adding redundancy can actually make a system less stable. Galileo’s last published work, Discourses and Mathematical Demonstrations Relating to Two New Sciences, contains a story of a marble column that was in storage for a building project. It had been propped up in three places: one at each end, and one in the middle. The middle support was there to prevent the column from sagging and breaking under its own weight. However, one of the end supports crumbled, so that end sagged, creating upward pressure in the middle. The column cracked in the middle, as had been feared, but for an unexpected reason: adding redundancy caused the very problem it was designed to prevent.
Second, Harford related a principle put forth by sociologist Charles Perrow in his 1984 book, Normal Accidents: failures are unavoidable in systems that are both complex and tightly coupled. While Perrow’s book examines Three Mile Island, it’s easier to think of a line of dominoes. Each additional domino makes the system slightly more complex, and placing it is one more opportunity to knock over the one next to it. Tight coupling means that knocking over one domino cascades until the rest of them fall.
The mention of complex systems brings to mind a short treatise from 1998 by Dr. Richard I. Cook, “How Complex Systems Fail”. He observed that any complex system has guards against single points of failure, so multiple failures must occur together to cause an overall system outage. As a result, remediation of small problems tends to get deferred, since they don’t affect production on their own. The organization prioritizes work that does affect production, like moves/adds/changes, and rewards engineers who find faster ways to get things done. Over time, as both the system and the staff change, the list of known small problems fades from living memory and becomes increasingly inaccurate. Without periodic re-examination, new and potentially hazardous interactions go undetected. When an outage does occur, an organization looking for a root cause can always claim operator error, since the cause was something known that had been ignored. The outage is also inevitable, since well-performing systems are usually given additional workload without any additional workforce.
Combining these principles gives the following conclusions:
- Large system failures come from the interaction of small problems.
- Adding redundancy necessarily adds complexity.
- The more complex the system, the more likely it is that there are existing small problems, and the less likely it is that their interaction can be understood.
- A system that seems to be performing well will be pushed harder until it eventually breaks.
- When the outage occurs, many small problems will be discovered or rediscovered. Many of them will be unrelated to the actual cause or solution, but each one will be a distraction that adds to the time needed to restore the overall system to operation.
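To put rough numbers on the complexity argument, here is a minimal Python sketch. It is only an illustration under loud assumptions: every component is assumed to carry a latent small problem independently with the same probability `p`, and the value 0.01 is invented for the example, not measured anywhere. The function names are hypothetical as well.

```python
from math import comb


def pairwise_interactions(n: int) -> int:
    """Count the distinct pairs among n components that could interact."""
    return comb(n, 2)


def p_any_latent_problem(n: int, p: float) -> float:
    """Probability that at least one of n components has a latent problem.

    Assumes (for illustration only) that problems are independent and that
    every component has the same probability p of carrying one.
    """
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # p = 0.01 is an invented figure, used only to show the trend.
    for n in (10, 50, 200):
        pairs = pairwise_interactions(n)
        chance = p_any_latent_problem(n, 0.01)
        print(f"{n} components: {pairs} possible pairs, "
              f"chance of a latent problem {chance:.2f}")
```

The number of possible pairs grows roughly with the square of the component count, while the chance that everything is problem-free shrinks toward zero. Real networks are not this uniform, but the trend matches the conclusions above: more components mean more existing small problems and more interactions than anyone can keep in mind.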
Here’s how Indeni avoids the trap of a complex cascade:
- Indeni is loosely coupled to the devices it checks for errors, so it does not contribute to system failure.
- Indeni uses multi-variable context in its detection elements, based on expert knowledge of complex systems, so it can detect issues that are likely to cascade.
- When Indeni detects symptoms of an issue, its Auto-Triage feature digs deeper to provide a specific diagnosis of the root cause.
- Each issue summary has an explanation of potential impact, with priority based on that potential rather than the current state, so it’s easy to identify which small issues should be fixed.
- Each issue summary includes a recommended remediation, so fixes can be done quickly.