A scientific initiative to observe the landscape of modern AI models
Q1: What does the benchmark measure?
Our Layer 1 benchmark measures probabilistic knowledge. Here, probabilistic knowledge is contrasted with factual knowledge. Answering questions with a single known correct answer (e.g., "What is the capital of England? - London") corresponds to factual knowledge. Probabilistic knowledge, in contrast, is knowledge of probabilities in the real world: it concerns questions that have no single right or wrong answer, where we are instead interested in the probabilities of the different possible answers. For instance, "What is the sex of a computer and information science graduate in the US?" has no single correct answer, but rather a probability distribution over the possible answers: female (27%, according to the US Department of Education) and male (73%).
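One way to formalize the example above (this notation is ours, not part of the benchmark's definitions; the percentages are the US Department of Education figures cited above, and "CIS" abbreviates computer and information science):

\[
P(\text{female} \mid \text{CIS graduate, US}) = 0.27,
\qquad
P(\text{male} \mid \text{CIS graduate, US}) = 0.73
\]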
Q2: Is probabilistic knowledge the same as probabilistic reasoning?
No, probabilistic knowledge and probabilistic reasoning are related but different concepts. Probabilistic reasoning refers to correctly applying the rules of probability (such as the law of total probability or Bayes' rule) to probability distributions. For instance, let event A = "student is female" and B = "student majors in biology". Given that P(A, B) = 0.015 and P(B) = 0.03, one can compute P(A | B) = 0.015 / 0.03 = 0.5, and such a computation falls under probabilistic reasoning. Probabilistic knowledge, on the other hand, refers to knowing the correct probability of an event, P(A), or of a conditional event, P(A | B); for instance, knowing that 27% of computer and information science graduates in the US are female, while 73% are male. Our benchmark tests LLMs on this latter ability.
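Written out as a single equation, the reasoning step in the example above is:

\[
P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{0.015}{0.03} = 0.5
\]

Probabilistic knowledge, by contrast, would be knowing the values P(A, B) and P(B) (or P(A | B) directly) in the first place.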
Q3: Why should I care about probabilistic knowledge in LLMs?
Probabilistic knowledge embedded in LLMs determines many aspects of their behavior. For instance, it determines how accurately an LLM describes the world when writing stories or when drawing conclusions from correlations. Furthermore, probabilistic knowledge is known to be a key ingredient for causal and counterfactual reasoning, so models with poor probabilistic knowledge almost certainly cannot perform reliable causal inference.
Q4: How do you measure probabilistic knowledge?
Using 10 large-scale datasets, we ask LLMs various types of questions and catalog the distribution they generate over the possible answers. We then compare this distribution to the corresponding real-world distribution. You can read more about this in the Methodology section. Our benchmark shows that the current generation of LLMs exhibits rather poor probabilistic knowledge.
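The exact elicitation protocol and comparison metric are described in the Methodology section. Purely as an illustration of the idea of comparing an elicited answer distribution to a real-world one, the sketch below uses total variation distance; the metric, the answer options, and the model's numbers here are assumptions for the example, not the benchmark's actual data.

```python
# Illustrative sketch only: compare a hypothetical LLM answer distribution
# to a real-world reference distribution. The total-variation metric and the
# numbers below are assumptions, not the benchmark's actual protocol or data.

def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two distributions over the same answer set."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)

# Real-world distribution for the example question (US Dept. of Education figures cited above).
real_world = {"female": 0.27, "male": 0.73}

# Hypothetical distribution elicited from an LLM over the same answer options.
llm_answers = {"female": 0.50, "male": 0.50}

print(f"TV distance: {total_variation(llm_answers, real_world):.3f}")  # 0.230
```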
Explore the knowledge of LLMs on observational distributions across different tasks and domains.

Compare LLM Answers vs. Real World: visualize models' answers in comparison with the real world, in an interactive panel.