Announcing the future of infrastructure health

Today I’m excited to announce our platform for infrastructure health. Before I go into what we’ve just done, let me explain why.

What’s the current status of infrastructure health?

What exactly is broken in infrastructure operations? Why are enterprises around the world still grappling with downtime?

Our research, as well as that of others, points to the human element. Over 70% of all outages are caused by human error. This is baffling – the people responsible for running the infrastructure are some of the smartest people out there. I meet them regularly, and they know their jobs well. Many of them have a decade or more of experience in what they do. Still, mistakes occur. Why is that?

A few years ago, the Idaho National Laboratory, which is a US Department of Energy laboratory, published a short paper regarding human error in the energy industry. It is amazing to see the resemblance between human error in different industries. They point to a few critical issues causing human error:

  • Lack of time to adequately complete the task.
  • Stress – physical and mental.
  • Complexity of the task at hand.
  • Experience and training.
  • Missing or insufficient procedures and work processes.
  • Poor human-machine interaction (think – user interface).
  • Physical fitness (sick? tired?)

When we speak with those responsible for the smooth running of the world’s digital infrastructure, we are amazed to find out how many different components they are responsible for. For example, one admin maintains over 200 components spread over the entire US, all by herself. All she has at her disposal is a basic monitoring system that alerts her if any of the components have failed. To make matters even more difficult, some of these components have a slow internet connection and fall off the grid for no reason. Every day is a firefight for her, trying to figure out what’s broken and fix it as quickly as possible. She is stressed, short on time and quite probably tired, and she has to contend with high complexity, insufficient training, missing procedures and (in some products) a poor user interface.

So, it’s clear why humans could make errors. The question is – how do you help them?


Many software companies have tried to answer this in the past with monitoring systems, log management and even configuration management. Unfortunately, these systems are not adequate. They don’t make a real dent in the problem of service disruption and downtime. This is due to three reasons:

  1. Most of these solutions focus on just one element of the health of infrastructure components – running statistics (monitoring), logs, or configuration. They don’t pull in all three.
  2. A handful of solutions do pull in all three elements of information, but do a poor job of correlating them. Did a specific configuration cause an issue that was visible in a log that would have helped anticipate a device failure? They don’t help the user find that out. Not ahead of time and not even in retrospect.
  3. None of the solutions, except indeni’s, actually know what to look for across these three elements of information. They are great at presenting the data they collect, but leave it to the (already overworked and thinly spread) human to make sense of it.

What did we do up to now?

A few years ago, the team at indeni came up with a solution to this. We built a software product capable of pulling running statistics, logs and configs out of infrastructure components and making sense of them. The product uses its on-board knowledge to identify what requires attention and alerts the user to what needs to be done. As one of our customers recently said: “indeni finds stuff before it happens, and tells me about it.” All of a sudden, the humans operating the world’s digital infrastructure become proactive. They have a fighting chance to reduce that 70% I mentioned earlier and get it closer to zero. Another customer told me in our recent survey: “Before indeni, my infrastructure was full of issues. I’d get calls daily. With indeni, I barely get any calls and can handle issues before my boss knows. It freed up so much time. I’m now an architect after being an engineer previously.”

So, the experiment worked. Our technology delivered noticeable results. Customers are buying into it and are expanding their usage of the product at an average rate of 60% per year. Not only that, every single customer of ours told us we do not have competitors. Our technology is unrivaled in its ability to identify complicated issues before they occur.

indeni tackles the main sources of human error, as described above:

  • We free up time, in some cases hundreds of man hours per month per person, to focus on the important projects at hand.
  • We drastically reduce mental stress by solving issues before the rest of the organization knows about them.
  • We reduce the complexity of tasks by pointing to the source of an issue and the correct course of action.
  • We increase training – by pointing out issues the user has potentially never heard of in the past and showing them how to resolve them.
  • We improve work processes – especially the interaction between engineering and operations teams, as well as different levels of escalation within operations teams. Information is handed off in a smoother way.
  • We watch the admin’s back – so if they’re making a critical change at 2AM, due to maintenance window constraints, we’re there to identify if a mistake was made by a tired admin.

But all this was limited to Check Point firewalls; Cisco routers, switches and firewalls; F5 load balancers; and Palo Alto Networks firewalls. That’s a small subset of the entire world’s infrastructure.


Time to expand

A single company, no matter how big, can never tackle the problem of covering all of the possible types of infrastructure components on its own. The world is vast. Trillions of dollars worth of equipment is deployed out there. It’s impossible to even imagine how big it is.

So, we decided to build a platform. We’ve taken all of our expertise and know-how in correctly collecting the three elements of data (running statistics, logs and configuration) and correlating them. We built a robust platform full of the capabilities we ourselves needed to build our first product. It is complicated on the inside, but the beauty of it is the simplicity with which one can write on top of it. Today, writing a check for a specific possible error on a specific infrastructure component type takes no more than a few hours on our new platform. Over the next six months, we’ll shorten that to 30 minutes.
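
To give a feel for what a check that correlates the three elements might look like, here is a minimal sketch in Python. It is not the platform's actual API: the `DeviceSnapshot` class, the `check_connection_table` function, the field names and the log message are all hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceSnapshot:
    """Hypothetical bundle of the three data elements for one device."""
    name: str
    stats: dict = field(default_factory=dict)   # running statistics
    logs: list = field(default_factory=list)    # recent log lines
    config: dict = field(default_factory=dict)  # parsed configuration


def check_connection_table(snapshot, warn_ratio=0.8):
    """Return an alert string when the connection count is near the
    configured limit AND the logs already show drops; otherwise None."""
    limit = snapshot.config.get("max_connections")
    current = snapshot.stats.get("connections")
    if not limit or current is None:
        return None  # not enough data to decide

    near_limit = current / limit >= warn_ratio
    drops_logged = any(
        "connection table full" in line.lower() for line in snapshot.logs
    )
    if near_limit and drops_logged:
        return (
            f"{snapshot.name}: {current}/{limit} connections in use and "
            "drops already logged; consider raising max_connections"
        )
    return None


# Made-up sample data for a single firewall.
firewall = DeviceSnapshot(
    name="fw-01",
    stats={"connections": 45000},
    logs=["Jan 10 02:14:07 kernel: Connection table full, dropping packet"],
    config={"max_connections": 50000},
)
alert = check_connection_table(firewall)
print(alert)
```

The point of the sketch is that the alert only fires when a statistic, a log line and a configuration value agree with each other, and that it carries a suggested course of action rather than just raw data.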

In addition, our machine learning team is building automated systems capable of writing their own checks on top of the same platform. Computers writing code! Mind boggling.

This means that on top of our platform, the world will now be able to write highly intelligent checks for any piece of infrastructure out there. Whether privately owned, or part of the public cloud.


With our platform, human error in infrastructure operations will be a thing of the past.

Our end goal

Our goal is to have all of the world’s largest enterprises use our platform to ensure the uptime of their entire infrastructure – network, security, virtualization, compute, storage and more. To get there, we will be working closely with our customers, our partners and the wider community. This is a global effort and we’d love for you to join us by emailing the relevant contact below:

With all of us working together, infrastructure operations will never be the same.

FAQ: What happens to the current product?

The current product, capable of identifying issues in Check Point firewalls, Cisco routers & switches, and other devices, will be migrated to operate on top of the platform over the coming year. Existing customers will be able to migrate to the platform, initially without the ability to write on top of it. If you are an existing indeni customer and are interested in working with the platform, please reach out to your indeni contact person.