Key Takeaways
- Large language models (LLMs) in observability excel at turning high-volume telemetry such as logs, traces, and metrics into concise human-readable narratives, but they lack structural system knowledge and struggle to isolate root causes in complex distributed architectures.
- Current LLM and agentic AI approaches are prone to hallucinating plausible but incorrect explanations, mistaking symptoms for causes, and ignoring event ordering, which leads to misdiagnosis and incomplete remediation.
- Causal reasoning models service and resource dependencies explicitly, accounts for event temporality, and supports inference under partial or noisy observations, enabling more accurate root cause identification.
- Causal graphs and Bayesian inference allow for counterfactual and probabilistic reasoning, which lets engineers evaluate remediation options and their likely impact before taking action.
- Integrating LLM-based interfaces with continuously updated causal models and abductive inference engines provides a practical path to reliable, explainable, and eventually autonomous incident diagnosis and remediation in cloud native systems.
The central goal of IT operations and site reliability engineering (SRE) is to maintain the availability, reliability, and performance of services while enabling safe and rapid delivery of changes. Achieving this requires a deep understanding of how systems behave during incidents and under operational stress. Observability platforms provide the foundation for this understanding by exposing telemetry data (logs, metrics, traces) that support anomaly detection, performance analysis, and root cause investigations. However, modern applications are increasingly difficult to manage as cross-service calls, event-driven workflows, and distributed data stores introduce complex and dynamic interactions.
For instance, in July 2024, a faulty configuration update in CrowdStrike’s Falcon sensor caused widespread crashes on millions of Windows systems across industries worldwide. In another case, the 2016 removal of the tiny but widely used left-pad package from npm briefly broke thousands of builds and disrupted major websites until it was restored, revealing the fragility of transitive dependencies at scale. Whether the trigger is an external contingency or a rare emergent interaction within a highly coupled system, modern IT infrastructure can experience widespread service outages due to complex cross-service dependencies. Implicit dependencies, asynchronous communication, and distributed state make it challenging to pinpoint the source of incidents or understand the chain of effects across the system.
A new class of AI-based observability solutions built on LLMs is gaining traction as they promise to simplify incident management, identify root causes, and automate remediation. These systems sift through high-volume telemetry, generate natural-language summaries based on their findings, and propose configuration or code-level changes. Additionally, with the advent of agentic AI, remediation workflows can be automated to advance the goal of self-healing environments. However, such tools remain fundamentally limited in their ability to perform root-cause analysis for modern applications. LLM-based solutions often hallucinate plausible but incorrect explanations, conflate symptoms with causes, and disregard event ordering, leading to misdiagnosis and superficial fixes.
Fundamentally, LLMs operating only on observed symptoms gleaned from telemetry are attempting to deduce root causes by traversing logs and shallow topologies. However, LLMs lack an a priori understanding of the environment as a dynamic system with evolving interdependencies. As a result, underlying issues will persist even if symptoms are partially remediated in the short term.
Effective root-cause analysis in complex, distributed systems requires understanding the causal structure of events, services, and resources. Causal knowledge and reasoning remain critical missing components in modern AI-based observability solutions. Causal knowledge is codified into causal graphs, which model inter-service and resource dependencies explicitly. By supporting counterfactual inquiry, causal inference enables root-cause isolation and systematic remediation analysis. Augmenting LLMs and agentic AI with continuously updated causal models and an abductive inference engine (which identifies the best explanation for observed symptoms using causal reasoning) offers a path toward autonomous service reliability.
In this article, we begin by outlining the strengths of LLMs and agentic AI in observability and incident management. We then examine their limitations in performing accurate root cause analysis and driving effective remediation. Next, we introduce how causal knowledge and inference engines provide the missing context for precise incident diagnosis and response. Finally, we discuss how combining causal reasoning with AI agents enables proactive incident prevention, automated remediation, and the path toward autonomous service reliability.
The Strengths and Promise of LLMs and Agentic AI
The term "AI" has become increasingly overloaded with marketing hype and public fascination. AI now applies to everything from threshold-based alerting scripts to autonomous agents capable of planning and acting across complex workflows. For simplicity, AI solutions in the observability space can be categorized as rule-based systems, LLM-based tools, and agentic AI systems. Rule-based systems include hand-crafted logic and statistical models configured to monitor baseline deviations, detect known signal patterns, and apply threshold-based alerting across logs, metrics, and traces.
LLM-based solutions leverage the generative and language-understanding capabilities of language models to support natural-language interactions with observability data. LLMs can process unstructured telemetry such as logs, traces, and alert descriptions to generate summaries, interpret errors, and create remediation plans. Agentic AI allows LLMs to act in a managed environment by providing multi-step planning, tool-assisted execution, and direct code and configuration changes. Next, we examine the specific strengths of LLMs and agentic AI in the context of observability.
LLMs are neural architectures pretrained on large-scale corpora of natural language, code, and other text-based resources. Despite being fundamentally next-token predictors trained to model the conditional probability of text, when scaled to billions of parameters and exposed to terabytes of diverse data, LLMs become highly effective at producing coherent language and supporting a wide range of language-based tasks. Modern LLMs have been further fine-tuned for instruction-following, factual recall, code generation, and domain-specific question answering, and are subsequently used to explain errors, answer technical questions, and generate code, scripts, and configuration changes.
In observability contexts, LLMs can interpret complex logs and trace messages, summarize high-volume telemetry, translate natural-language queries into structured filters, and synthesize scripts or configuration changes to support remediation. Most LLM solutions rely on proprietary providers such as OpenAI and Anthropic, whose training data is opaque and often poorly aligned with specific codebases or deployment environments. More fundamentally, LLMs can only produce text. They cannot observe system state, execute commands, or take action. These limitations gave rise to agentic systems that extend LLMs with tool use, memory, and control.
Agentic AI comes closest to delivering on the speculative promise of AI. In practice, agentic systems commonly follow the ReAct framework introduced by Yao et al. (2022), which integrates reasoning and action in an interleaved loop. In this setup, the LLM generates intermediate reasoning steps, selects actions such as querying tools or retrieving information, and incorporates feedback from those actions to inform the next step. This cycle of thought, action, and observation allows the system to iteratively plan, update context, and progress toward a goal. With these capabilities, LLMs are able to write applications, solve multi-step reasoning problems, generate code based on system feedback, and interact with external services to complete goal-directed tasks.
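To make the loop concrete, here is a minimal sketch of a ReAct-style cycle. The `call_llm` wrapper and the two diagnostic tools are hypothetical stubs introduced only for illustration; they do not refer to any real product API.

```python
# Minimal ReAct-style loop: the model alternates between a reasoning step
# ("thought"), a tool invocation ("action"), and the result of that invocation
# ("observation"), which is fed back into the next prompt.
# call_llm, query_metrics, and fetch_logs are hypothetical stand-ins.

def call_llm(prompt: str) -> dict:
    """Placeholder for a real LLM call returning a structured step."""
    return {"thought": "Latency at S5 suggests a downstream issue; check metrics.",
            "action": "query_metrics", "args": {"service": "S5"}}

TOOLS = {
    "query_metrics": lambda service: f"p99 latency on {service} is 4x baseline",
    "fetch_logs": lambda service: f"{service}: repeated upstream timeout errors",
}

def react_loop(goal: str, max_steps: int = 5) -> list[dict]:
    context: list[dict] = [{"goal": goal}]
    for _ in range(max_steps):
        step = call_llm(str(context))                                 # thought + chosen action
        step["observation"] = TOOLS[step["action"]](**step["args"])   # act, then observe
        context.append(step)                                          # result informs the next step
        # A real agent would stop here once the model emits a final answer.
    return context

trace = react_loop("Diagnose timeouts on service S5")
```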
Agentic AI shifts observability workflows from passive diagnostics to active response by predicting failure paths, initiating remediations, and executing tasks such as service restarts, configuration rollbacks, and state validation. However, current agentic systems lack a priori structural and causal models of the environment, which limits their ability to anticipate novel failure modes or explain observed behavior beyond surface-level associations. While these constraints remain, agentic AI represents a necessary step toward autonomous, tool-integrated systems capable of reasoning and acting within complex managed environments.
The ultimate promise of applying agentic AI to IT operations is autonomous service reliability. An ideal system continuously monitors telemetry, identifies potential failures, evaluates impact, and applies targeted interventions with limited human oversight. Integrated into the observability and operations stack, agentic AI should function as the control layer. It reasons over system state, coordinates diagnostics, and orchestrates remediations so that services operate reliably and in alignment with defined service level objectives (SLOs), which specify availability, latency, or other performance targets. Ultimately, autonomous service reliability reduces operational complexity, accelerates incident resolution, and improves service reliability across large-scale, dynamic environments.
On the Limitations of Modern AI and the Need for Causal Reasoning
Figure 1. Example Service Map where timeouts on service S5 are caused by connection exhaustion on resource R2.
Modern service architectures often rely on shared infrastructure and layered services, where dependencies are opaque and incident signals surface far from their origin. Imagine the simple service topology shown in Figure 1, which consists of two shared resources (R1 and R2) and a set of services (S1–S6) connected through overlapping dependencies. Now consider that we observe elevated latency and request timeouts at service S5. The underlying root cause is connection exhaustion on resource R2, which intermittently blocks new connections. This condition is not surfaced through direct telemetry because service S2, which depends on resource R2, reports only latency and timeouts without exposing the underlying resource-level failure. Tracing the issue upstream, service S3 shows increased request latency, and service S2 exhibits degraded performance. Meanwhile, service S1 also reports elevated CPU usage and latency, though its downstream service S6 remains unaffected.
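For readers who prefer to see the dependencies spelled out, a rough encoding of the call paths described above might look like the following. The edges are reconstructed from the narrative, so Figure 1 may contain links that are omitted here.

```python
# Dependency edges reconstructed from the scenario prose (illustrative only).
# Each key depends on the services/resources in its list.
DEPENDS_ON = {
    "S5": ["S3"],   # S5's timeouts surface along its call path through S3
    "S3": ["S2"],   # S3 inherits latency from the degraded S2
    "S2": ["R2"],   # S2 holds connections against shared resource R2 (the true root cause)
    "S4": ["R2"],   # S4 also shares R2, so latency there would be expected
    # R1, S1, and S6 are omitted because the text does not pin down their exact links.
}
```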
Figure 2. An example of how Agentic AI would diagnose the timeouts on S5.
An LLM-based agent begins with observable symptoms (see Figure 2). It queries telemetry, inspects logs, and follows trace spans starting at S5. Along the path through S3 and S2, it observes anomalies based on latency and request failures. It also notices performance degradation on S1 and considers it a potential contributor. The agent restarts S1 and S2 and observes that the latency and timeouts at S5 are mitigated, leading it to conclude the issue is resolved. However, the problem resurfaces once connection limits on resource R2 are hit again. This scenario illustrates two key challenges.
First, spurious signals can misdirect diagnosis by drawing attention to unrelated events. Second, some root causes cannot be directly observed through telemetry and must instead be inferred from incomplete or indirect symptoms. For instance, in this scenario, the connection exhaustion on R2 emits no direct signal. It must be inferred by reasoning over the observed symptoms across S2, S3, and S5 in combination with structural knowledge of the system. Therefore, resolving such incidents requires principled causal reasoning and structural causal knowledge.
Figure 3. Example of causal resources required for abductive reasoning.
Causal knowledge maps the causal relationships typically found in modern service architectures. Architectural knowledge provides a structural understanding of the resources, services, and data dependencies. Finally, causal graphs provide a probabilistic framework for inferring root causes from observed symptoms in the context of the architecture.
Causal knowledge represents the relationship between root causes and their observable symptoms. Such knowledge can be abstracted to support principled downstream reasoning over system behavior such as fault localization, impact analysis, and proactive mitigation. Causal graphs provide a formal structure for encoding this knowledge. Popularized by Judea Pearl, causal graphs are directed acyclic graphs that represent cause-effect relationships among variables. When applied to reliability engineering, causal graphs describe how specific failure conditions (e.g., memory exhaustion, resource saturation, lock contention, etc.) produce observable symptoms (e.g., latency, connection errors, service timeouts, etc.).
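As a small, purely illustrative example, the qualitative skeleton of such a graph can be written as a mapping from failure conditions to the symptoms they can produce; the probabilistic treatment discussed below attaches likelihoods to these edges. The entries here are generic examples, not a model of any particular system.

```python
# Illustrative cause -> effect edges of a causal graph (a DAG over failure
# conditions and observable symptoms).
CAUSAL_EDGES = {
    "memory_exhaustion":   ["gc_pauses", "elevated_latency", "oom_restarts"],
    "resource_saturation": ["connection_errors", "elevated_latency", "timeouts"],
    "lock_contention":     ["request_queueing", "service_timeouts"],
}
```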
Unlike telemetry signals, which capture only runtime observations, or dependency graphs, which represent observed service-call relationships, causal graphs provide a richer structural understanding of how faults propagate across services and resources. This knowledge supports inferential reasoning and helps identify root causes from partially observed symptoms, even when the underlying issue is not directly visible. This is accomplished through abductive causal reasoning.
Abductive causal reasoning provides a principled, logical, and inferential framework for identifying the most likely explanation for observed symptoms. Given a set of plausible candidate root causes and a graphical model of their symptoms, abduction selects the cause that best accounts for the observed evidence. This approach offers several advantages for modern applications: it supports inference under partial observability, selects explanations based on causal sufficiency, and provides formal guarantees about the inferred root cause. When paired with causal Bayesian graphs, which extend causal graphs with probabilistic reasoning, abductive inference becomes tractable in large systems. These graphs encode prior knowledge about potential root causes and their associated symptoms, allowing the system to compute the most probable root cause without requiring prior training. Furthermore, likelihoods can be updated over time using posterior observations, enabling continuous refinement in dynamic environments.
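The sketch below shows one way this scoring can be wired up, assuming a naive Bayes-style likelihood with a small "leak" term for observed symptoms that a candidate cause does not explain. The structure and constants are illustrative choices, not a prescribed implementation.

```python
# A minimal sketch of abductive inference over a causal Bayesian model.
# Each candidate root cause carries a prior and a table of P(symptom | cause).
# The score rewards expected symptoms that are observed, penalizes expected
# symptoms that are absent, and discounts observed symptoms the cause does not
# explain via a small "leak" probability. All numbers are illustrative.

from dataclasses import dataclass, field

LEAK = 0.05  # chance an observed symptom appears for reasons outside the model

@dataclass
class CausalModel:
    prior: float                                               # P(cause) before any evidence
    symptoms: dict[str, float] = field(default_factory=dict)   # P(symptom | cause)

def score(cause: CausalModel, observed: set[str]) -> float:
    """Unnormalized posterior: prior times the likelihood of the observation set."""
    likelihood = 1.0
    for symptom, p in cause.symptoms.items():
        likelihood *= p if symptom in observed else (1.0 - p)
    for symptom in observed - cause.symptoms.keys():
        likelihood *= LEAK                                     # spurious/unexplained observations
    return cause.prior * likelihood

def rank_causes(models: dict[str, CausalModel], observed: set[str]) -> list[tuple[str, float]]:
    """Return candidate causes with normalized scores, most probable first."""
    scores = {name: score(model, observed) for name, model in models.items()}
    total = sum(scores.values()) or 1.0
    return sorted(((name, s / total) for name, s in scores.items()),
                  key=lambda item: item[1], reverse=True)
```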
Figure 4. A simplified example of abductive reasoning where causal Bayesian networks are used to infer the most probable root cause from observed and unobserved symptoms. Prior probabilities capture the expected symptoms associated with a given root cause, and a likelihood is computed from the aggregate observations to estimate which root cause best explains the observed symptoms.
Let’s revisit our scenario with the abductive reasoning framework (Figure 4). The framework begins with a defined set of all possible root causes. Each root cause is represented by a causal graph that maps the cause to its expected symptoms with associated prior probabilities. These priors capture the likelihood of each symptom occurring if the root cause is true, derived from historical incidents and domain expertise. The abductive process identifies which root cause best explains the observed symptoms while also accounting for those that are unobserved. It computes a likelihood score for each candidate by updating the prior probabilities of its associated symptoms with observational data.
In our scenario, the observed symptoms include timeouts at service S5 and latency at services S2 and S3, which align with the causal graph for connection exhaustion on resource R2. The graph also expects latency on S4; although it is not observed, that absence is included in the likelihood estimation. Competing explanations, such as CPU starvation on S1 or network congestion on S3, receive lower likelihoods because they either fail to explain all observed symptoms or have too few expected symptoms confirmed observationally. The key distinction between abductive causal reasoning and deductive reasoning is that abduction evaluates all candidate root causes against both the observed symptoms and the expected symptoms defined by the causal models.
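Continuing the sketch above (reusing the `CausalModel` and `rank_causes` definitions), the scenario can be encoded roughly as follows. The priors and conditional probabilities are made-up values chosen only to illustrate the mechanics, not measurements from a real system.

```python
# Candidate root causes for the Figure 4 scenario, with illustrative numbers.
models = {
    "connection_exhaustion_R2": CausalModel(prior=0.2, symptoms={
        "s2_latency": 0.9, "s3_latency": 0.8, "s5_timeouts": 0.8, "s4_latency": 0.6}),
    "cpu_starvation_S1": CausalModel(prior=0.3, symptoms={
        "s1_cpu": 0.95, "s6_latency": 0.6}),
    "network_congestion_S3": CausalModel(prior=0.2, symptoms={
        "s3_latency": 0.9, "s5_timeouts": 0.7}),
}

# Observed evidence: S5 timeouts, S2/S3 latency, plus the spurious S1 CPU signal.
# The expected S4 latency is absent and therefore counts against R2, but only mildly.
observed = {"s5_timeouts", "s3_latency", "s2_latency", "s1_cpu"}

for cause, probability in rank_causes(models, observed):
    print(f"{cause}: {probability:.2f}")
# With these numbers, connection_exhaustion_R2 comes out on top (~0.87), well ahead
# of network_congestion_S3 (~0.12) and cpu_starvation_S1 (~0.01).
```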
The agent-based approach missed the actual root cause because it lacked structural causal knowledge and stopped at the first plausible symptom path. In contrast, the abductive process leverages causal models and Bayesian graphs to isolate the most coherent explanation even when observations are incomplete or include spurious symptoms.
Limitations of Causal Reasoning
While causal reasoning is a powerful approach, it does have limitations. Constructing causal models requires significant domain knowledge and ongoing effort. The graphs must accurately capture service and resource dependencies and need regular updates as architectures evolve. Coverage is another constraint. A reasoning engine can only work with root causes that are defined in the model. If a candidate cause is missing, the engine has no basis to infer it, which reduces its effectiveness in novel or poorly understood failure cases. There are also computational challenges. In large distributed environments, some root causes map to broad sets of symptoms. Computing conditional probabilities and running Bayesian inference across these sets can become costly, especially when multiple competing explanations must be evaluated in real time.
These limitations do not diminish the value of causal reasoning, but they do highlight that such methods are most effective under specific circumstances. In service architectures with linear and well-understood dependencies, the root cause of an incident is often self-apparent, and a reasoning engine may not be required. The challenge arises in complex distributed environments where cross-service dependencies, asynchronous communication, and distributed state make it difficult to trace symptoms back to their causes. Humans can resolve these incidents, but the process is slow, resource-intensive, and often results in longer periods of downtime and service unavailability. Meeting this challenge therefore requires solutions that integrate causal reasoning with AI, moving beyond current limitations and enabling progress toward autonomous service reliability.
Towards the Promise of Autonomous Service Reliability
Causal knowledge is essential for diagnosing incidents in modern service architectures. LLM-based solutions and current agentic AI often miss the forest for the trees. These systems operate over logs, traces, and metrics, but lack the structural context required to reason comprehensively about system-level behavior. While LLMs can provide useful insights by aggregating and parsing telemetry, they are ultimately bounded by the inputs they receive. With incomplete observations, LLMs tend toward speculative explanations that often result in hallucinations. Without a causal understanding of how failures propagate through services and infrastructure, modern AI solutions cannot move beyond summarization of observed events to reliably identify underlying causes. Causal knowledge and abductive causal reasoning provide the missing keys that, when combined with LLMs and agentic AI, unlock effective identification of likely root causes from incomplete and partial observations in complex, distributed environments.
Therefore, it is necessary to augment modern LLMs with a causal reasoning engine to achieve effective incident analysis and autonomous reliability. A causal reasoning engine combines three key components: causal models that encode known root causes and their associated symptoms, Bayesian causal graphs that apply probabilistic reasoning over service topologies, and an abductive inference engine that selects the most likely root cause given partial or noisy observations. This engine can operate as an external reasoning layer, providing structured causal context that modern LLMs lack.
Drawing from the principles of neuro-symbolic reasoning, the LLM serves as the flexible language interface, while the causal reasoning engine verifies hypotheses, refines candidate root causes, and performs advanced reasoning that goes beyond the LLM’s predictive text generation. By integrating this engine, LLM-based agents can transition from surface-level incident triage to precise root-cause identification and actionable remediation, constructing a path toward proactive, autonomous service reliability.
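As a rough illustration of this division of labor, the snippet below stubs out both sides. Every function name is hypothetical, and both the LLM and inference calls are reduced to canned returns; a real system would plug in an actual model and a causal engine such as the one sketched earlier.

```python
# Sketch of the neuro-symbolic split described above: the LLM handles the
# language-level work (parsing alerts, drafting the incident note), while a
# causal reasoning engine performs abductive inference over the causal model.
# All functions are stubs with hypothetical names.

from dataclasses import dataclass

@dataclass
class Diagnosis:
    root_cause: str
    confidence: float
    explanation: str

def extract_symptoms(alert_text: str) -> set[str]:
    """LLM step (stubbed): map free-form alert text to canonical symptom labels."""
    return {"s5_timeouts", "s3_latency", "s2_latency"}

def causal_engine(symptoms: set[str]) -> Diagnosis:
    """Causal step (stubbed): rank candidate root causes via abductive inference."""
    return Diagnosis("connection_exhaustion_R2", 0.87,
                     "explains the S2/S3 latency and S5 timeouts via the shared R2 dependency")

def summarize(diagnosis: Diagnosis) -> str:
    """LLM step (stubbed): turn the structured diagnosis into an incident note."""
    return (f"Likely root cause: {diagnosis.root_cause} "
            f"(confidence {diagnosis.confidence:.0%}); {diagnosis.explanation}.")

symptoms = extract_symptoms("Timeouts on checkout; upstream latency rising")
print(summarize(causal_engine(symptoms)))
```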
Causal agents provide a meaningful step in the pursuit of autonomous service reliability. By integrating causal knowledge and abductive reasoning, these systems move beyond reactive response and manual triage. They enable proactive incident prevention by identifying emerging risks, support targeted remediation through structural awareness, and drive self-healing by isolating and addressing root causes without human intervention. The result is a system that not only detects symptoms but understands their context. This shift reduces downtime, accelerates resolution, and aids teams in managing reliability at scale with minimal manual intervention.