ICML 2026 · Oral

CausalGame: Benchmarking Causal Thinking of LLM Agents in Games

Zhenhao Chen 1 * , Yongqiang Chen 1,2 * , Chenxi Liu 3 * , Junchi Yu 4 , Xiangchen Song 2 , Zijian Li 1,2 , Jialin Li 5 , Philip Torr 4 , Bo Han 3 , Kun Zhang 1,2

¹ MBZUAI · ² Carnegie Mellon University · ³ TMLR Group, Hong Kong Baptist University · ⁴ University of Oxford · ⁵ New York University, Abu Dhabi

* Equal contribution and core contributors

HOW AN AGENT PLAYS CAUSALGAME

BRIEFING

INIT-005 · ant 20 · SURVIVED
INIT-009 · ant 25 · SURVIVED
INIT-014 · ant 20 · SURVIVED
INIT-021 · ant 25 · SURVIVED
INIT-033 · ant 20 · SURVIVED
INIT-042 · ant 25 · SURVIVED
SURVIVORS ONLY
antenna ↑ ⇒ survival ↑ ?

HYPOTHESIZE

CAUSAL AGENT
H: the archive may be biased; re-fly the default design as a control.
Archive said 100% → 40%: the archive lies
NAIVE AGENT
H: antenna armor protects (antenna ↑ survival ↑), accepted as fact, never to be tested.
Fits the correlation, tests nothing

EXPERIMENT

CAUSAL AGENT
do(antenna_def = 0)
85% · detected 1/20
NAIVE AGENT
antenna_def = 40
60% · “best batch!” · never tests below ant 20

COMMIT

CAUSAL AGENT
{ ant: 0, eng: 18, cpt: 18, … }
FINAL · ONE SHOT
mechanism: dead antenna → no emission → no detection
NAIVE AGENT
{ ant: 40, eng: 15, cpt: 15, … }
FINAL · ONE SHOT
"the sweet spot is 40 — the data supports it"

VERIFY — 1,000 DRONES

CAUSAL AGENT · 926 / 1000 SURVIVED · 92.6% ≥ 75% · ★ MISSION SUCCESS
NAIVE AGENT · 344 / 1000 SURVIVED · 34.4% < 75% · ✖ MISSION FAILED
Two agents, one trap. Each starts from a survivor-censored archive — selection bias is baked into the data before the first move.
Real benchmark sessions c6e3ba50 (antenna_trap, win) and 290feba3 (its Simpson's-paradox variant, loss); replay both on the Game page. Animated counterpart of Fig. 1 in the paper.

Why causal thinking matters for AI scientist agents. Observational correlations can be misleading due to hidden confounders and survivorship-censored data. A naive agent that treats correlation as causation arrives at a suboptimal solution, while a causal agent identifies the underlying mechanism through active experimentation — shown above with two real benchmark runs on the antenna trap.
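The gap between a survivor-censored archive and an actual intervention can be illustrated with a small simulation. This is a minimal sketch, not the benchmark's implementation: the function names, the linear survival curve in `survival_prob`, the batch sizes, and the choice of antenna 20 as the "default design" are all assumptions made for illustration.

```python
import random


def survival_prob(antenna: int) -> float:
    """Assumed ground truth: a stronger antenna emits more, so survival drops."""
    return max(0.05, 0.9 - 0.015 * antenna)


def fly(antenna: int, n: int, rng: random.Random) -> list:
    """Fly n drones with one antenna setting; True means the drone survived."""
    p = survival_prob(antenna)
    return [rng.random() < p for _ in range(n)]


def censored_archive(rng, designs=(20, 25), n_per_design=200):
    """Log only the drones that came back, as a survivors-only archive would."""
    return [(a, True) for a in designs for s in fly(a, n_per_design, rng) if s]


def rate(outcomes) -> float:
    """Fraction of True outcomes."""
    return sum(outcomes) / len(outcomes)


rng = random.Random(0)
archive = censored_archive(rng)

archive_rate = rate([s for _, s in archive])  # 1.0 by construction: losses were never logged
control_rate = rate(fly(20, 1000, rng))       # re-fly the assumed default design as a control
do_zero_rate = rate(fly(0, 1000, rng))        # intervention: do(antenna = 0)

print(f"archive: {archive_rate:.0%}  control: {control_rate:.0%}  do(antenna=0): {do_zero_rate:.0%}")
```

In this toy setup the archive reports perfect survival regardless of the true mechanism, so only the control flight and the intervention carry information about what the antenna does.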

Abstract

Building AI Scientist agents with Large Language Models (LLMs) has recently received growing attention. Because scientific discovery fundamentally relies on uncovering causal relationships from observations, causal thinking, the ability to distinguish causation from correlation and hidden biases, is essential for LLM agents. Existing benchmarks for AI scientists, however, do not explicitly incorporate the challenges of hidden confounders, selection bias, and noisy measurements that are widespread in real-world scientific discovery.

To this end, we present CausalGame, a benchmark that evaluates the causal thinking capabilities of LLM agents through interactive games. Specifically, LLM agents must actively design experimental protocols, collect observation data, and derive a final solution with an explanatory report. To emulate the challenges of realistic scientific discovery, we design 14 game settings that incorporate selection bias, noisy measurements, and hidden confounders. Results on 29 frontier LLM agents show that they consistently fail to reason about and recover the underlying causal relationships required to solve the games. CausalGame provides a controlled testbed for evaluating the causal thinking of AI Scientist agents.
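The interaction pattern described here (design experiments, collect data, commit one final solution) might look roughly like the loop below. This is a hypothetical sketch only: `ToyGame`, `sweep_agent`, the budget of five experiments, the 20-drone batch size, and the hidden survival curve are invented for illustration and are not the CausalGame API. Only the 1,000-drone verification and the 75% win threshold are taken from the walkthrough at the top of the page.

```python
import random


class ToyGame:
    """Hypothetical stand-in for one game: hidden mechanism, limited experiments, one commit."""

    def __init__(self, seed=0, budget=5, win_threshold=0.75):
        self._rng = random.Random(seed)
        self.budget = budget                # number of experiments allowed (assumed value)
        self.win_threshold = win_threshold  # 75%, as in the walkthrough
        self.committed = False

    def _survives(self, ant):
        # Hidden from the agent; the linear form is an assumption for this sketch.
        return self._rng.random() < max(0.05, 0.9 - 0.015 * ant)

    def experiment(self, ant, n=20):
        """Run one small, noisy batch and return its observed survival rate."""
        if self.budget <= 0:
            raise RuntimeError("experiment budget exhausted")
        self.budget -= 1
        return sum(self._survives(ant) for _ in range(n)) / n

    def commit(self, ant, n=1000):
        """One-shot final answer, verified on n fresh drones."""
        if self.committed:
            raise RuntimeError("only one commit is allowed")
        self.committed = True
        rate = sum(self._survives(ant) for _ in range(n)) / n
        return rate, rate >= self.win_threshold


def sweep_agent(game, candidates=(0, 10, 20, 30, 40)):
    """Intervene across the whole range instead of trusting the archive."""
    results = {ant: game.experiment(ant) for ant in candidates}
    best = max(results, key=results.get)
    return best, results


game = ToyGame(seed=0)
best, results = sweep_agent(game)
rate, won = game.commit(best)
print(best, results, rate, won)
```

Because each batch has only 20 drones, the observed rates are noisy and a single sweep may not pick the true optimum; an agent could spend part of its budget on repeated batches to reduce that risk.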

14 game scenarios
29 frontier LLMs
3 execution modes
trials per cell
0 models above win threshold

Key findings

1 · Frontier LLMs fail at hidden causal mechanisms

All 29 models remain significantly below the optimal survival rate (~80%) under selection bias, hidden confounders, and noisy measurements — and below every scenario's win threshold.

2 · More reasoning ≠ better causal thinking

Scaling reasoning compute yields no consistent benefit: GPT-5.5-XHigh drops below GPT-5.5 / GPT-5.5-High, and DeepSeek-V3.2-Think underperforms its non-thinking counterpart.

3 · Agentic frameworks help the strongest models

Among top-10 models, iterative tool-calling (ReAct) usually beats single-turn prompting; a coding-agent framework (OpenCode) adds +6.9% on average — yet the gap to optimal persists.

4 · Agents stop at surface statistics

Rubric analysis shows 68% of agent sessions never engage in causal reasoning, and only 3.5% show nascent causal reasoning; causal-reasoning rubric scores are near-zero across all failure patterns.
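Surface statistics can point the wrong way when a hidden confounder is present, which is the situation Simpson's paradox describes. The sketch below uses made-up counts, with a hypothetical `zone` variable as the confounder, to show an aggregate comparison favoring a strong antenna while every stratum favors a weak one. None of these numbers come from the benchmark.

```python
from fractions import Fraction

# (zone, antenna) -> (survived, flown). All counts are invented for illustration.
# Strong-antenna drones were mostly sent to the safe zone, which confounds the comparison.
counts = {
    ("safe", "weak"): (19, 20),
    ("safe", "strong"): (144, 160),
    ("hostile", "weak"): (64, 160),
    ("hostile", "strong"): (6, 20),
}
ZONES = ("safe", "hostile")
ANTENNAS = ("weak", "strong")


def naive_rate(antenna):
    """Pooled survival rate, ignoring the zone (the surface statistic)."""
    survived = sum(counts[(z, antenna)][0] for z in ZONES)
    flown = sum(counts[(z, antenna)][1] for z in ZONES)
    return Fraction(survived, flown)


def adjusted_rate(antenna):
    """Backdoor adjustment: average per-zone rates, weighted by how common each zone is."""
    total = sum(n for _, n in counts.values())
    result = Fraction(0)
    for z in ZONES:
        zone_flown = sum(counts[(z, a)][1] for a in ANTENNAS)
        survived, flown = counts[(z, antenna)]
        result += Fraction(zone_flown, total) * Fraction(survived, flown)
    return result


for a in ANTENNAS:
    print(a, float(naive_rate(a)), float(adjusted_rate(a)))
```

The adjustment only works here because the toy data records the zone. When the confounder is truly hidden, stratifying is not available, and the agent would need to assign the antenna setting itself (an intervention) to break the dependence.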

Leaderboard preview

Agentic (ReAct) — top 5

1 DeepSeek V4 Pro 70.9%
2 Claude Opus 4.5 69.1%
3 Claude Opus 4.7 69.1%
4 GLM-5.1 68.9%
5 GPT-5.5 High 68.5%

Prompting — top 5

1 Claude Opus 4.7 68.6%
2 Claude Opus 4.5 67.6%
3 Qwen3.7 Max 67.4%
4 Grok 4.1 Fast 66.3%
5 Kimi K2.6 65.5%

See the full 29-model leaderboard →

Citation

@inproceedings{chen2026causalgame,
  title     = {CausalGame: Benchmarking Causal Thinking of LLM Agents in Games},
  author    = {Chen, Zhenhao and Chen, Yongqiang and Liu, Chenxi and Yu, Junchi
               and Song, Xiangchen and Li, Zijian and Li, Jialin
               and Torr, Philip and Han, Bo and Zhang, Kun},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  note      = {Oral presentation}
}