[Animation: five stages of a game run: ① Briefing, ② Hypothesize, ③ Experiment, ④ Commit, ⑤ Verify (1,000 drones)]
c6e3ba50 (antenna_trap, win) and 290feba3 (its Simpson’s-paradox variant, loss): replay both on the Game page. Animated counterpart of Fig. 1 in the paper.
Why causal thinking matters for AI scientist agents. Observational correlations can be misleading due to hidden confounders and survivorship-censored data. A naive agent that treats correlation as causation arrives at a suboptimal solution, while a causal agent identifies the underlying mechanism through active experimentation, as shown above with two real benchmark runs on the antenna trap.
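To make the Simpson's-paradox failure mode concrete, here is a minimal sketch in Python. The counts, the `long`/`short` design names, and the `terrain` confounder are all made up for illustration; they are not taken from the benchmark or from the two runs above. The point is only that a design can look worse in pooled data while being better within every stratum of a hidden variable.

```python
from fractions import Fraction

# Hypothetical counts (survived, total); NOT benchmark data.
# Key: (design, terrain). Terrain plays the role of a hidden confounder
# that influences both which design gets deployed and how often it survives.
counts = {
    ("long", "easy"): (81, 87),
    ("long", "hard"): (192, 263),
    ("short", "easy"): (234, 270),
    ("short", "hard"): (55, 80),
}

def survival_rate(design, terrain=None):
    """Survival rate for a design, pooled over terrain unless one is given."""
    rows = [
        v for (d, t), v in counts.items()
        if d == design and (terrain is None or t == terrain)
    ]
    survived = sum(s for s, _ in rows)
    total = sum(n for _, n in rows)
    return Fraction(survived, total)

designs = ("long", "short")

# What a correlation-only agent sees: pooled rates, terrain ignored.
pooled_winner = max(designs, key=survival_rate)

# What stratifying on the confounder reveals.
stratum_winners = {
    t: max(designs, key=lambda d: survival_rate(d, t))
    for t in ("easy", "hard")
}

print("pooled winner:", pooled_winner)
print("winner per terrain:", stratum_winners)
```

With these illustrative counts the pooled comparison favors `short`, while `long` wins on both terrains. The reversal comes from `long` being deployed mostly on hard terrain.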
Abstract
Building AI Scientist agents with Large Language Models (LLMs) has recently received growing attention. Since scientific discovery fundamentally relies on uncovering causal relationships from observations, the capability of causal thinking, which distinguishes causation from correlation and hidden biases, is essential to LLM agents. Existing benchmarks for AI scientists, however, do not explicitly incorporate the challenges of hidden confounders, selection bias, and noisy measurements that are widespread in real-world scientific discovery.
To this end, we present CausalGame, a benchmark that evaluates the causal thinking capabilities of LLM agents through interactive games. Specifically, we ask LLM agents to actively design experimental protocols, collect observational data, and derive a final solution with an explanation report. To emulate realistic scientific discovery challenges, we design 14 game settings that incorporate selection bias, noisy measurements, and hidden confounders. Results with 29 frontier LLM agents show that they consistently fail to reason about and recover the underlying causal relationships required to solve the games. CausalGame provides a controlled testbed for evaluating the causal thinking of AI Scientist agents.
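The difference between passively collecting data and actively designing an experiment can be sketched with a toy simulation. Everything here is an assumption made for illustration: the `simulate` function, the probabilities, and the single hidden variable `u` are not CausalGame's game mechanics or API. In this toy world the intervention `x` truly raises survival by 0.2, but `u` both makes `x` more likely in observational data and lowers survival.

```python
import random

def simulate(n, randomize, seed=0):
    """Return the survival-rate gap (x=1 minus x=0) over n toy trials.

    randomize=False: x is chosen by a process that depends on the hidden u
                     (observational data).
    randomize=True:  x is assigned by a coin flip (a designed experiment).
    """
    rng = random.Random(seed)
    stats = {0: [0, 0], 1: [0, 0]}  # arm -> [survived, total]
    for _ in range(n):
        u = rng.random() < 0.5  # hidden confounder, never shown to the agent
        if randomize:
            x = int(rng.random() < 0.5)
        else:
            x = int(rng.random() < (0.9 if u else 0.1))
        # True mechanism (assumed): x helps by +0.2, u hurts by -0.4.
        p_survive = 0.5 + 0.2 * x - 0.4 * u
        survived = rng.random() < p_survive
        stats[x][0] += survived
        stats[x][1] += 1
    rate = lambda arm: stats[arm][0] / stats[arm][1]
    return rate(1) - rate(0)

observational_gap = simulate(20_000, randomize=False)
experimental_gap = simulate(20_000, randomize=True)

print(f"observational gap: {observational_gap:+.3f}")
print(f"experimental gap:  {experimental_gap:+.3f}")
```

Under these assumptions the observational gap comes out negative, suggesting `x` is harmful, while the randomized gap lands near the true +0.2. Randomization breaks the link between `u` and `x`, which is why an agent that designs its own experiments can recover the mechanism that the logged data hides.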
Key findings
1 · Frontier LLMs fail at hidden causal mechanisms
All 29 models remain significantly below the optimal survival rate (~80%) under selection bias, hidden confounders, and noisy measurements — and below every scenario's win threshold.
2 · More reasoning ≠ better causal thinking
Scaling reasoning compute yields no consistent benefit: GPT-5.5-XHigh drops below GPT-5.5 / GPT-5.5-High, and DeepSeek-V3.2-Think underperforms its non-thinking counterpart.
3 · Agentic frameworks help the strongest models
Among top-10 models, iterative tool-calling (ReAct) usually beats single-turn prompting; a coding-agent framework (OpenCode) adds +6.9% on average — yet the gap to optimal persists.
4 · Agents stop at surface statistics
Rubric analysis shows 68% of agent sessions never engage in causal reasoning, and only 3.5% show nascent causal reasoning; causal-reasoning rubric scores are near-zero across all failure patterns.
Leaderboard preview
Agentic (ReAct) — top 5
| Rank | Model | Score |
| --- | --- | --- |
| 1 | DeepSeek V4 Pro | 70.9% |
| 2 | Claude Opus 4.5 | 69.1% |
| 3 | Claude Opus 4.7 | 69.1% |
| 4 | GLM-5.1 | 68.9% |
| 5 | GPT-5.5 High | 68.5% |
Prompting — top 5
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Claude Opus 4.7 | 68.6% |
| 2 | Claude Opus 4.5 | 67.6% |
| 3 | Qwen3.7 Max | 67.4% |
| 4 | Grok 4.1 Fast | 66.3% |
| 5 | Kimi K2.6 | 65.5% |
See the full 29-model leaderboard →
Citation
@inproceedings{chen2026causalgame,
title = {CausalGame: Benchmarking Causal Thinking of LLM Agents in Games},
author = {Chen, Zhenhao and Chen, Yongqiang and Liu, Chenxi and Yu, Junchi
and Song, Xiangchen and Li, Zijian and Li, Jialin
and Torr, Philip and Han, Bo and Zhang, Kun},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year = {2026},
note = {Oral presentation}
}