Artificial intelligence (AI) systems have the potential to accelerate scientific progress across domains such as healthcare, education, and economics. As these systems are increasingly deployed in real-world settings, a core expectation is that they evolve beyond surface-level language fluency toward robust and reliable capabilities — including the ability to reason probabilistically about the world.
A crucial distinction emerges between two types of knowledge embedded in AI systems. Factual knowledge corresponds to definitive answers to well-established queries (e.g., “What is the capital of England?”). In contrast, probabilistic knowledge relates to uncertainties inherent in real-world phenomena (e.g., “What is the sex of a computer science graduate in the US?”).
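In notation (ours, purely for illustration), the first kind of query has a single correct value, whereas the second asks for a distribution over a population:

\[
\text{capital}(\text{England}) = \text{London}
\qquad \text{vs.} \qquad
P(\text{Sex} = s \mid \text{CS graduate, US}) \ \text{for each possible value } s.
\]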
While much of current large language model (LLM) evaluation emphasizes factual recall and correctness, less attention has been paid to whether these models encode statistical and causal patterns about the world — patterns that are critical for scientific inquiry, policy design, and societal decision-making.
To systematically describe what kinds of questions an intelligent system can answer about the world, we rely on a foundational framework known as Pearl’s Causal Hierarchy (PCH). The PCH organizes knowledge into three qualitatively distinct layers of reasoning:

- Layer 1 (Associational / Observational): knowledge obtained by passively seeing the world, e.g., how likely an outcome is among individuals with a given attribute.
- Layer 2 (Interventional): knowledge about the effects of actively doing something, e.g., how an outcome changes when a treatment is imposed on a population.
- Layer 3 (Counterfactual): knowledge obtained by imagining alternative worlds, e.g., whether an outcome would have occurred had a different action been taken, given what actually happened.
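As a concrete illustration (a standard, hypothetical drug-and-recovery example, not drawn from the benchmark), the same pair of variables gives rise to three qualitatively different queries:

\[
\underbrace{P(\text{recovery} \mid \text{drug})}_{\text{Layer 1: seeing}}
\qquad
\underbrace{P(\text{recovery} \mid do(\text{drug}))}_{\text{Layer 2: doing}}
\qquad
\underbrace{P(\text{recovery}_{\text{drug}} \mid \text{no drug},\, \text{no recovery})}_{\text{Layer 3: imagining}}
\]

The first asks how recovery and taking the drug co-occur in the observed population; the second asks what would happen to recovery if the drug were administered by intervention; the third asks whether a patient who did not take the drug and did not recover would have recovered had they taken it.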
These layers correspond not just to types of statistical queries, but to fundamentally different cognitive capabilities. They are not interchangeable: each layer strictly contains more information than the previous one. For example, observational data alone is insufficient to answer interventional or counterfactual questions unless supported by causal assumptions or models.
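To see why, consider a sketch with two toy structural causal models (written in plain Python purely for illustration; neither is part of the benchmark). Both induce the same observational joint over \( (X, Y) \), yet they give different answers to the interventional query \( P(Y = 1 \mid do(X = 1)) \):

```python
from collections import Counter
import random

def model_a(do_x=None):
    """SCM A: X causes Y directly.  X ~ Bernoulli(0.5); Y := X."""
    x = random.randint(0, 1) if do_x is None else do_x
    return x, x

def model_b(do_x=None):
    """SCM B: a hidden confounder U drives both X and Y; X has no effect on Y."""
    u = random.randint(0, 1)
    x = u if do_x is None else do_x
    return x, u

def joint(model, n=200_000):
    """Monte-Carlo estimate of the observational joint P(X, Y)."""
    counts = Counter(model() for _ in range(n))
    return {xy: round(counts[xy] / n, 3) for xy in sorted(counts)}

def p_y1_do_x1(model, n=200_000):
    """Monte-Carlo estimate of the interventional quantity P(Y=1 | do(X=1))."""
    return sum(model(do_x=1)[1] for _ in range(n)) / n

random.seed(0)
# Layer 1: both SCMs induce (approximately) the same observational joint,
# concentrated on (0, 0) and (1, 1) with probability ~0.5 each.
print("P(X,Y) under A:", joint(model_a))
print("P(X,Y) under B:", joint(model_b))
# Layer 2: the two SCMs disagree sharply once X is set by intervention.
print("P(Y=1 | do(X=1)) under A:", p_y1_do_x1(model_a))  # -> 1.0
print("P(Y=1 | do(X=1)) under B:", p_y1_do_x1(model_b))  # -> ~0.5
```

No amount of observational data distinguishes the two models; only the causal assumptions encoded in their equations do.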
In practice, many domains — such as medicine, policy-making, and economics — rely on assumptions about the underlying causal mechanisms that govern a system. Determining these mechanisms may be infeasible in complex, partially observable environments where human behavior is involved. Yet, we typically assume such mechanisms exist, even if we cannot fully identify or express them. The PCH allows us to classify reasoning tasks according to their demands on causal knowledge, and sets limits on what can be inferred given certain inputs.
In the LLM Observatory, we are interested in capabilities across all layers, in order to understand how and whether modern AI systems — particularly large language models — exhibit capacities relevant to different types of probabilistic reasoning. In particular, based on an important known result in the literature, we can say that studying the first layer of the hierarchy is the critical first step:
Let \( \mathcal{M} \) be a structural causal model (SCM), \( P(V) \) the observational distribution it induces, and \( \mathcal{A} \) a set of causal assumptions about \( \mathcal{M} \) (e.g., a causal graph or ignorability conditions). Then, in general, interventional (Layer 2) and counterfactual (Layer 3) quantities of \( \mathcal{M} \) cannot be computed from \( P(V) \) alone; they can only be inferred when \( P(V) \) is combined with the assumptions \( \mathcal{A} \).
This result is known as the Causal Hierarchy Theorem, and it highlights that both data and assumptions are necessary for causal inference.
If a model's distribution \(\tilde P(V)\) differs from the true \(P(V)\), no guarantees can be provided for the validity of the model's interventional or counterfactual inferences.
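To illustrate the point numerically, here is a minimal sketch (the binary variables and all numbers are invented for illustration). Even when the causal assumptions are correct (say, a covariate \( Z \) blocks every backdoor path from \( X \) to \( Y \), so that \( P(y \mid do(x)) = \sum_z P(y \mid x, z)\, P(z) \)), plugging a distorted observational distribution into that adjustment formula yields a biased interventional estimate:

```python
# True observational quantities for a toy example with binary Z, X, Y,
# assuming Z satisfies the backdoor criterion relative to (X, Y).
p_z     = {0: 0.5, 1: 0.5}                   # P(Z = z)
p_y1_xz = {(0, 0): 0.1, (1, 0): 0.5,         # P(Y = 1 | X = x, Z = z)
           (0, 1): 0.4, (1, 1): 0.9}

# A model's distorted belief about the same quantities (tilde P):
# it gets the conditionals right but the marginal of Z wrong.
p_z_tilde     = {0: 0.8, 1: 0.2}
p_y1_xz_tilde = dict(p_y1_xz)

def backdoor_p_y1_do_x(x, p_z, p_y1_xz):
    """Backdoor adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(z)."""
    return sum(p_y1_xz[(x, z)] * p_z[z] for z in p_z)

print(round(backdoor_p_y1_do_x(1, p_z, p_y1_xz), 2))              # 0.7  (correct)
print(round(backdoor_p_y1_do_x(1, p_z_tilde, p_y1_xz_tilde), 2))  # 0.58 (biased, despite correct assumptions)
```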
This observation is one of the key motivations of the Observatory. It shows that if a model does not have access to correct observational distributions (obtained by passively observing the world), then no guarantees can be provided for its interventional or counterfactual inferences. In other words, knowledge of observational distributions is a necessary, but not sufficient, condition for any causal inference. We next describe how we study this challenge.
The LLM Observatory is designed as a long-term effort to study the reasoning capabilities of AI systems across increasingly complex types of probabilistic and causal knowledge. A recently released benchmark evaluates observational (Layer 1) capabilities, which is a critical first step in understanding capabilities in the higher layers of the PCH. The Observatory is explicitly built to support future benchmarks that address interventional (Layer 2) and counterfactual (Layer 3) reasoning — capabilities required for decision-making, planning, and scientific discovery.
According to the Causal Hierarchy Theorem, reliable reasoning at higher layers is not possible without both empirical knowledge (Layer 1) and structured assumptions about the environment. Our initial results suggest that current language models are often misaligned with real-world observational data, posing fundamental challenges for advancing up the causal ladder. Yet, this also presents a clear and measurable target for model developers.
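To make the evaluation idea concrete, here is a minimal, hypothetical sketch of what scoring a Layer 1 probe can look like (variable names, numbers, and the metric are illustrative assumptions, not the Observatory's actual items or scoring): elicit the model's distribution over a categorical attribute and compare it with a reference distribution, for instance via total variation distance.

```python
def tv_distance(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions on the same support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

# Placeholder reference and model-elicited distributions over one categorical
# variable; the values are illustrative, not real statistics.
reference    = {"female": 0.30, "male": 0.70}
model_belief = {"female": 0.45, "male": 0.55}

print(f"Layer 1 gap (TV distance): {tv_distance(reference, model_belief):.2f}")  # -> 0.15
```

A distance of zero would mean the model's stated probabilities match the reference exactly; per the Causal Hierarchy Theorem, that alignment is a precondition, not a guarantee, for sound reasoning at Layers 2 and 3.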
The datasets below constitute the current Layer 1 benchmark (read more here) — a structured probe into whether language models encode the kinds of population-level associations needed for robust reasoning about the world: