A modern-day blessing for Site Reliability Engineers (SREs) entreats, “May the queries flow, and the pager stay silent.” This is because SREs, DevOps engineers, or support staff are constantly stressed about responding to their alert channels while keeping an eye on operational and performance dashboards to ensure their users have a good experience. Many frontline engineers are glued to dashboard monitor screens laid out in front of them. Assessing and responding to alerts is a top priority.
This approach involves both the observability tool and the observer, and they both have crucial roles. While various golden signals are continually monitored on the observability dashboards, it is up to the observer to provide the evaluation and intelligence to piece together details and know when and how to respond. This is especially apparent when there is some kind of problem. The observer has to determine what to drill down on and then where to go next in order to find the root cause. The observer is decidedly not automated, and there are finite limits to what they can take in and consider in their observations to develop proper context, validation, and, ultimately, to understand the root cause of a problem.
Concepts such as root cause automation, machine learning, and AI have become standard fare in tool vendors’ marketing campaigns, and each term is often used loosely and with widely varying meanings. These terms are popular for a good reason. With so much riding on uptime and full functionality, the pressure to resolve problems and remedy outages as quickly as possible is greater than ever. At the same time, problems are growing in frequency and complexity, and technical teams are increasingly overworked and understaffed.
Overall, application monitoring is experiencing unprecedented change in attempts to address this situation amidst growing customer needs and demands. Of course, symptoms with widespread impact, such as latency problems that affect users, changes in overall traffic, or a spike in error counts, are well captured by monitoring dashboards. Many tools enhance telemetry collection and streamline workflows, but this falls short of solving the real problem, which is to find and fix the underlying issue as quickly and accurately as possible. Autonomous troubleshooting, based on root cause analysis, is becoming a critical capability for meeting SLAs and SLOs, but it is still not commonly used. The technology and practices for autonomous troubleshooting are now mature and accurate, but awareness and usage are still limited to cutting-edge early adopters.
Besides the situation facing technical teams already described, a big part of the challenge is determining the exact cause of a problem. Metrics dashboards on observability tools can be effective in showing that a problem exists and can help zoom in on when it occurred. These tools often provide tracing capabilities that can also effectively help you narrow down where the problem started (which software components, infrastructure, or microservices). The biggest challenge, however, is to understand the why—what was the cause of the issue and the services involved.
Finding the why—the root cause—generally requires hunting deep within a plethora of logs or delving into millions of log events. The sheer volume involved makes it really hard to pick out unusual patterns by eye. The difficulty is compounded because finding root cause generally involves more than a single log—the proverbial needle in the haystack. Often the answer is found by understanding a linear progression across log messages from different services or associating many logs together—multiple needles in a haystack—to piece together the big picture of what has transpired to cause the problem.
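To make the needle-in-a-haystack problem concrete, here is a minimal sketch of one simple way to surface unusual log lines: collapse each message into a template by masking its variable parts, then flag templates that occur only rarely. The masking rules, the `to_template` and `rare_templates` names, and the rarity threshold are all assumptions made for illustration; production systems typically rely on far more sophisticated unsupervised learning.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable tokens so that lines which
# differ only in IDs or numbers collapse into the same template.
# Hex values are masked first so "0x1f" is not split by the digit rule.
_MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(message: str) -> str:
    """Return the message with variable-looking tokens masked out."""
    for pattern, token in _MASKS:
        message = pattern.sub(token, message)
    return message


def rare_templates(messages, max_count=1):
    """Return the sorted templates that occur at most `max_count` times."""
    counts = Counter(to_template(m) for m in messages)
    return sorted(t for t, c in counts.items() if c <= max_count)


if __name__ == "__main__":
    logs = [f"request {i} served in {i * 2}ms" for i in range(50)]
    logs.append("connection pool exhausted after 3 retries")
    for template in rare_templates(logs):
        print(template)
```

Rarity alone is a weak signal, since many rare lines are harmless, which is one reason the events usually have to be considered together rather than one at a time.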
It is even harder to mentally correlate these unusual patterns and errors across hundreds (or even thousands) of log streams to construct a timeline of the problem. Because this is a hunt for an unknown set of events, the true limiting factor is the human brain (and eyeballs), not the speed or scalability of one’s tool set.
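The mechanical part of building such a timeline is easy to automate. The sketch below merges several per-service event streams into one time-ordered sequence and extracts the events surrounding a moment of interest. The `Event` record, the `build_timeline` and `window_around` helpers, and the assumption that every stream is already sorted and shares a synchronized clock are illustrative choices, not a description of any particular tool.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds; assumed comparable across services
    service: str
    message: str


def build_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Merge streams that are each already sorted by time into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


def window_around(timeline: List[Event], center: float, seconds: float) -> List[Event]:
    """Return the events within `seconds` of `center`, in timeline order."""
    return [e for e in timeline if abs(e.timestamp - center) <= seconds]


if __name__ == "__main__":
    db = [Event(10.0, "db", "slow query"), Event(12.5, "db", "connection reset")]
    api = [Event(11.0, "api", "retrying request"), Event(13.0, "api", "HTTP 500")]
    for event in build_timeline(db, api):
        print(f"{event.timestamp:6.1f}  {event.service:4}  {event.message}")
```

Ordering the events is the easy step. Deciding which of them belong to the same incident is the part that strains human attention, and it is where automation has the most to offer.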
This is particularly true for subtle issues, such as bugs that don’t cause catastrophic downtime but affect specific user actions. Software issues that gradually grow in severity over a period of time are also particularly difficult to investigate, even when the events involved occur just minutes apart. Such errors might start with innocuous-looking events, warnings, restarts, or retries and then escalate to cause damage or disruption, impacting users as downstream services start to fail. This makes finding the root cause all the more difficult.
Experienced technical personnel can successfully hunt for the root cause, but the process tends to be driven by intuition and a lot of iteration, making it slow and hard to automate. Because of increasing complexity and modern development and deployment practices, this manual expert approach is hurting efforts to drive down Mean Time to Resolution (MTTR), a key objective for engineering organizations. Additionally, the manual approach cannot scale and is inherently limited by what the human eye and brain can take in and process. Spreading this load across teams depends on “perfect,” timely communication among team members about what each of them has encountered. As complexity and operational data volume grow linearly, MTTR for new or unknown complex issues grows quadratically. It does not require an advanced mathematician to understand that such a model is not sustainable and that automation is crucial.
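One hedged way to see where quadratic growth can come from, offered as an illustrative assumption rather than a derivation of the figure above: suppose that diagnosing an unknown issue requires considering how each pair among $n$ log streams might be related. The number of pairs to consider is then

```latex
\binom{n}{2} = \frac{n(n-1)}{2} \approx \frac{n^{2}}{2}
\qquad\text{e.g.}\qquad
\binom{100}{2} = 4950, \quad \binom{200}{2} = 19900 .
```

Under this assumption, doubling the number of streams roughly quadruples the number of possible pairwise relationships a human investigator would have to rule in or out.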
Fortunately, both observability and the observer can be automated, and this automation is a very good fit for machine learning. Such automation can scale, even as volume and complexity grow. Ubiquitous access to fast compute resources and the advancement of machine learning technologies and unsupervised learning algorithms provide the means to address the challenge through automation. This in no way means replacing or reducing great technical teams. Experienced technical teams are still needed to validate or evaluate the findings and remediate the problems, and to handle the myriad other responsibilities they have in deploying, tuning, managing, and evolving sites and applications. Automated root cause functionality does not replace a technical professional but rather augments their capabilities, automating their repetitive brute-force tasks, giving them far greater scale, and increasing their productivity.
Unfortunately, the modern-day blessing above is really an expression of hope rather than a way of bringing clarity to the growing crisis facing organizations that need to maximize uptime and minimize issues. And the solution should not involve throwing more and more humans at the task of “observing” in order to understand the problem. Rather, the solution needs to be about automating the observer and, therefore, automating the process of root cause analysis.