Artificial intelligence (AI) systems have the potential to accelerate scientific progress across domains such as healthcare, education, and economics. As these systems are increasingly deployed in real-world settings, a core expectation is that they evolve beyond surface-level language fluency toward robust and reliable capabilities — including the ability to reason probabilistically about the world.
A crucial distinction emerges between two types of knowledge embedded in AI systems. Factual knowledge corresponds to definitive answers to well-established queries (e.g., “What is the capital of England?”). In contrast, probabilistic knowledge relates to uncertainties inherent in real-world phenomena (e.g., “What is the sex of a computer science graduate in the US?”).
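In notation (ours, purely for illustration), the first kind of query has a single correct value, whereas the second asks for a distribution over a population:

\[
\text{capital}(\text{England}) = \text{London}
\qquad \text{vs.} \qquad
P(\text{Sex} = s \mid \text{CS graduate, US}) \ \text{for each possible value } s.
\]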
While much of current large language model (LLM) evaluation emphasizes factual recall and correctness, less attention has been paid to whether these models encode statistical and causal patterns about the world — patterns that are critical for scientific inquiry, policy design, and societal decision-making.
To systematically describe what kinds of questions an intelligent system can answer about the world, we rely on a foundational framework known as Pearl’s Causal Hierarchy (PCH). The PCH organizes knowledge into three qualitatively distinct layers of reasoning:

- Layer 1 (Associational / Observational): knowledge obtained by passively seeing the world, e.g., how likely an outcome is among individuals with a given attribute.
- Layer 2 (Interventional): knowledge about the effects of actively doing something, e.g., how an outcome changes when a treatment is imposed on a population.
- Layer 3 (Counterfactual): knowledge obtained by imagining alternative worlds, e.g., whether an outcome would have occurred had a different action been taken, given what actually happened.
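As a concrete illustration (a standard, hypothetical drug-and-recovery example, not drawn from the benchmark), the same pair of variables gives rise to three qualitatively different queries:

\[
\underbrace{P(\text{recovery} \mid \text{drug})}_{\text{Layer 1: seeing}}
\qquad
\underbrace{P(\text{recovery} \mid do(\text{drug}))}_{\text{Layer 2: doing}}
\qquad
\underbrace{P(\text{recovery}_{\text{drug}} \mid \text{no drug},\, \text{no recovery})}_{\text{Layer 3: imagining}}
\]

The first asks how recovery and taking the drug co-occur in the observed population; the second asks what would happen to recovery if the drug were administered by intervention; the third asks whether a patient who did not take the drug and did not recover would have recovered had they taken it.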
These layers correspond not just to types of statistical queries, but to fundamentally different cognitive capabilities. They are not interchangeable: each layer strictly contains more information than the previous one. For example, observational data alone is insufficient to answer interventional or counterfactual questions unless supported by causal assumptions or models.
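To see why, consider a sketch with two toy structural causal models (written in plain Python purely for illustration; neither is part of the benchmark). Both induce the same observational joint over \( (X, Y) \), yet they give different answers to the interventional query \( P(Y = 1 \mid do(X = 1)) \):

```python
from collections import Counter
import random

def model_a(do_x=None):
    """SCM A: X causes Y directly.  X ~ Bernoulli(0.5); Y := X."""
    x = random.randint(0, 1) if do_x is None else do_x
    return x, x

def model_b(do_x=None):
    """SCM B: a hidden confounder U drives both X and Y; X has no effect on Y."""
    u = random.randint(0, 1)
    x = u if do_x is None else do_x
    return x, u

def joint(model, n=200_000):
    """Monte-Carlo estimate of the observational joint P(X, Y)."""
    counts = Counter(model() for _ in range(n))
    return {xy: round(counts[xy] / n, 3) for xy in sorted(counts)}

def p_y1_do_x1(model, n=200_000):
    """Monte-Carlo estimate of the interventional quantity P(Y=1 | do(X=1))."""
    return sum(model(do_x=1)[1] for _ in range(n)) / n

random.seed(0)
# Layer 1: both SCMs induce (approximately) the same observational joint,
# concentrated on (0, 0) and (1, 1) with probability ~0.5 each.
print("P(X,Y) under A:", joint(model_a))
print("P(X,Y) under B:", joint(model_b))
# Layer 2: the two SCMs disagree sharply once X is set by intervention.
print("P(Y=1 | do(X=1)) under A:", p_y1_do_x1(model_a))  # -> 1.0
print("P(Y=1 | do(X=1)) under B:", p_y1_do_x1(model_b))  # -> ~0.5
```

No amount of observational data distinguishes the two models; only the causal assumptions encoded in their equations do.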
In practice, many domains — such as medicine, policy-making, and economics — rely on assumptions about the underlying causal mechanisms that govern a system. Determining these mechanisms may be infeasible in complex, partially observable environments where human behavior is involved. Yet, we typically assume such mechanisms exist, even if we cannot fully identify or express them. The PCH allows us to classify reasoning tasks according to their demands on causal knowledge, and sets limits on what can be inferred given certain inputs.
In the LLM Observatory, we are interested in capabilities across all layers, in order to understand how and whether modern AI systems — particularly large language models — exhibit capacities relevant to different types of probabilistic reasoning. In particular, based on an important known result in the literature, we can say that studying the first layer of the hierarchy is the critical first step:
Let \( \mathcal{M} \) be a structural causal model (SCM), \( P(V) \) the observational distribution it induces, and \( \mathcal{A} \) a set of causal assumptions about \( \mathcal{M} \) (e.g., a causal graph or ignorability conditions). Then, in general, interventional (Layer 2) and counterfactual (Layer 3) quantities of \( \mathcal{M} \) cannot be computed from \( P(V) \) alone; they can only be inferred when \( P(V) \) is combined with the assumptions \( \mathcal{A} \).
This result is known as the Causal Hierarchy Theorem, and it highlights that both data and assumptions are necessary for causal inference.
If a model's distribution \(\tilde P(V)\) differs from the true \(P(V)\), no guarantees can be provided for the validity of the model's interventional or counterfactual inferences.
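To illustrate the point numerically, here is a minimal sketch (the binary variables and all numbers are invented for illustration). Even when the causal assumptions are correct (say, a covariate \( Z \) blocks every backdoor path from \( X \) to \( Y \), so that \( P(y \mid do(x)) = \sum_z P(y \mid x, z)\, P(z) \)), plugging a distorted observational distribution into that adjustment formula yields a biased interventional estimate:

```python
# True observational quantities for a toy example with binary Z, X, Y,
# assuming Z satisfies the backdoor criterion relative to (X, Y).
p_z     = {0: 0.5, 1: 0.5}                   # P(Z = z)
p_y1_xz = {(0, 0): 0.1, (1, 0): 0.5,         # P(Y = 1 | X = x, Z = z)
           (0, 1): 0.4, (1, 1): 0.9}

# A model's distorted belief about the same quantities (tilde P):
# it gets the conditionals right but the marginal of Z wrong.
p_z_tilde     = {0: 0.8, 1: 0.2}
p_y1_xz_tilde = dict(p_y1_xz)

def backdoor_p_y1_do_x(x, p_z, p_y1_xz):
    """Backdoor adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(z)."""
    return sum(p_y1_xz[(x, z)] * p_z[z] for z in p_z)

print(round(backdoor_p_y1_do_x(1, p_z, p_y1_xz), 2))              # 0.7  (correct)
print(round(backdoor_p_y1_do_x(1, p_z_tilde, p_y1_xz_tilde), 2))  # 0.58 (biased, despite correct assumptions)
```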
This observation is one of the key motivations of the Observatory. It shows that if a model does not have access to correct observational distributions (obtained by passively observing the world), then no guarantees can be provided for its interventional or counterfactual inferences. In other words, knowledge of observational distributions is a necessary, but not sufficient, condition for any causal inference. We next describe how we study this challenge.
The LLM Observatory is designed as a long-term effort to study the reasoning capabilities of AI systems across increasingly complex types of probabilistic and causal knowledge. A recently released benchmark evaluates observational (Layer 1) capabilities, which is a critical first step in understanding capabilities in the higher layers of the PCH. The Observatory is explicitly built to support future benchmarks that address interventional (Layer 2) and counterfactual (Layer 3) reasoning — capabilities required for decision-making, planning, and scientific discovery.
According to the Causal Hierarchy Theorem, reliable reasoning at higher layers is not possible without both empirical knowledge (Layer 1) and structured assumptions about the environment. Our initial results suggest that current language models are often misaligned with real-world observational data, posing fundamental challenges for advancing up the causal ladder. Yet, this also presents a clear and measurable target for model developers.
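To make the evaluation idea concrete, here is a minimal, hypothetical sketch of what scoring a Layer 1 probe can look like (variable names, numbers, and the metric are illustrative assumptions, not the Observatory's actual items or scoring): elicit the model's distribution over a categorical attribute and compare it with a reference distribution, for instance via total variation distance.

```python
def tv_distance(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions on the same support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

# Placeholder reference and model-elicited distributions over one categorical
# variable; the values are illustrative, not real statistics.
reference    = {"female": 0.30, "male": 0.70}
model_belief = {"female": 0.45, "male": 0.55}

print(f"Layer 1 gap (TV distance): {tv_distance(reference, model_belief):.2f}")  # -> 0.15
```

A distance of zero would mean the model's stated probabilities match the reference exactly; per the Causal Hierarchy Theorem, that alignment is a precondition, not a guarantee, for sound reasoning at Layers 2 and 3.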
The datasets below constitute the current Layer 1 benchmark (read more here) — a structured probe into whether language models encode the kinds of population-level associations needed for robust reasoning about the world: