CausalGame — Benchmarking Causal Thinking of LLM Agents in Games

HOW AN AGENT PLAYS CAUSALGAME

① BRIEFING

INIT-005 · ant 20 · SURVIVED
INIT-009 · ant 25 · SURVIVED
INIT-014 · ant 20 · SURVIVED
INIT-021 · ant 25 · SURVIVED
INIT-033 · ant 20 · SURVIVED
INIT-042 · ant 25 · SURVIVED

SURVIVORS ONLY

antenna ↑ ⇒ survival ↑ ?

② HYPOTHESIZE

CAUSAL AGENT
H: the archive may be biased, so re-fly the default design as a control. Archive said ~~100%~~… 40%: the archive lies.

NAIVE AGENT
H: antenna armor protects (antenna ↑ survival ↑), accepted as fact, never to be tested. Fits the correlation, tests nothing.

③ EXPERIMENT

CAUSAL AGENT
do(antenna_def = 0): 85% · detected 1/20

NAIVE AGENT
antenna_def = 40: 60%, “best batch!” Never tests below ant 20.

④ COMMIT

CAUSAL AGENT
{ ant: 0, eng: 18, cpt: 18, … }
FINAL · ONE SHOT
mechanism: dead antenna → no emission → no detection

NAIVE AGENT
{ ant: 40, eng: 15, cpt: 15, … }
FINAL · ONE SHOT
"the sweet spot is 40: the data supports it"

⑤ VERIFY · 1,000 DRONES

CAUSAL AGENT
926 / 1000 SURVIVED · 92.6% ≥ 75% · ★ MISSION SUCCESS

NAIVE AGENT
344 / 1000 SURVIVED · 34.4% < 75% · ✖ MISSION FAILED
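The pass/fail decision in the verification step is plain arithmetic: the survival rate over the 1,000 drones is compared with the 75% win threshold. A minimal sketch of that check follows. The function name `verify` and its signature are hypothetical, chosen for illustration. Only the counts and the threshold come from the figure.

```python
def verify(survived: int, total: int = 1000, threshold: float = 0.75) -> tuple[float, bool]:
    """Return (survival rate, mission success) for one verification run.

    Hypothetical helper: the 1,000-drone batch and 75% threshold are taken
    from the figure; the name and signature are assumptions.
    """
    rate = survived / total
    return rate, rate >= threshold


# Counts shown in the figure for the two committed designs.
for label, survived in [("926 survivors", 926), ("344 survivors", 344)]:
    rate, success = verify(survived)
    outcome = "MISSION SUCCESS" if success else "MISSION FAILED"
    print(f"{label}: {rate:.1%} vs 75% -> {outcome}")
```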

Two agents, one trap. Each starts from a survivor-censored archive — selection bias is baked into the data before the first move.

Real benchmark sessions c6e3ba50 (antenna_trap, win) and 290feba3 (its Simpson’s-paradox variant, loss); replay both on the Game page. Animated counterpart of Fig. 1 in the paper.

Why causal thinking matters for AI scientist agents. Observational correlations can be misleading due to hidden confounders and survivorship-censored data. A naive agent that treats correlation as causation arrives at a suboptimal solution, while a causal agent identifies the underlying mechanism through active experimentation — shown above with two real benchmark runs on the antenna trap.
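The gap between a survivor-censored archive and a controlled experiment can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not the benchmark's simulator: `survival_probability`, its linear form, and every constant in it are hypothetical. It only shows that an archive logging survivors reports 100% survival regardless of the true rate, while re-flying the same design as a control and intervening on the antenna expose the real effect.

```python
import random


def survival_probability(antenna: int) -> float:
    """Hypothetical ground truth: more antenna means more emission and more detection."""
    return max(0.05, 0.85 - 0.02 * antenna)


def fly(antenna: int, n: int, rng: random.Random) -> list[bool]:
    """Fly n drones at one antenna setting; True means the drone survived."""
    p = survival_probability(antenna)
    return [rng.random() < p for _ in range(n)]


rng = random.Random(0)

# Observational archive: only survivors are ever logged (survivorship censoring).
flights = [(ant, alive) for ant in (20, 25) for alive in fly(ant, 500, rng)]
archive = [(ant, alive) for ant, alive in flights if alive]
archive_rate = sum(alive for _, alive in archive) / len(archive)

# Experiments: re-fly an archived design as a control, then intervene with antenna = 0.
control_rate = sum(fly(20, 500, rng)) / 500
dead_antenna_rate = sum(fly(0, 500, rng)) / 500

print(f"archive survival:  {archive_rate:.0%}")
print(f"control re-fly:    {control_rate:.0%}")
print(f"do(antenna = 0):   {dead_antenna_rate:.0%}")
```

In this toy model the archive rate is 100% by construction, so no amount of fitting to the archive can reveal that lowering the antenna helps. Only the two experiments carry that information.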

Abstract

Building AI Scientist agents with Large Language Models (LLMs) has recently received growing attention. Since scientific discovery fundamentally relies on uncovering causal relationships from observations, the capability of causal thinking, which distinguishes causation from correlation and hidden biases, is essential to LLM agents. Existing benchmarks for AI scientists, however, do not explicitly incorporate the challenges of hidden confounders, selection bias, and noisy measurements that are widespread in real-world scientific discovery.

To this end, we present CausalGame, a benchmark that evaluates the causal thinking capabilities of LLM agents through interactive games. More specifically, we ask LLM agents to actively design experimental protocols, collect observational data, and derive a final solution with an explanation report. To emulate realistic scientific discovery challenges, we design 14 game settings that incorporate selection bias, noisy measurements, and hidden confounders. Results with 29 frontier LLM agents show that they consistently fail to reason about and recover the underlying causal relationships required to solve the games. CausalGame provides a controlled testbed for evaluating the causal thinking of AI Scientist agents.

14 game scenarios

29 frontier LLMs

execution modes

3× trials per cell

models above win threshold

Key findings

1 · Frontier LLMs fail at hidden causal mechanisms

All 29 models remain significantly below the optimal survival rate (~80%) under selection bias, hidden confounders, and noisy measurements — and below every scenario's win threshold.

2 · More reasoning ≠ better causal thinking

Scaling reasoning compute yields no consistent benefit: GPT-5.5-XHigh drops below GPT-5.5 / GPT-5.5-High, and DeepSeek-V3.2-Think underperforms its non-thinking counterpart.

3 · Agentic frameworks help the strongest models

Among top-10 models, iterative tool-calling (ReAct) usually beats single-turn prompting; a coding-agent framework (OpenCode) adds +6.9% on average — yet the gap to optimal persists.

4 · Agents stop at surface statistics

Rubric analysis shows 68% of agent sessions never engage in causal reasoning, and only 3.5% show nascent causal reasoning; causal-reasoning rubric scores are near-zero across all failure patterns.

Leaderboard preview

Agentic (ReAct) — top 5

1	DeepSeek V4 Pro	70.9%
2	Claude Opus 4.5	69.1%
3	Claude Opus 4.7	69.1%
4	GLM-5.1	68.9%
5	GPT-5.5 High	68.5%

Prompting — top 5

1	Claude Opus 4.7	68.6%
2	Claude Opus 4.5	67.6%
3	Qwen3.7 Max	67.4%
4	Grok 4.1 Fast	66.3%
5	Kimi K2.6	65.5%

See the full 29-model leaderboard →

Citation

@inproceedings{chen2026causalgame,
  title     = {CausalGame: Benchmarking Causal Thinking of LLM Agents in Games},
  author    = {Chen, Zhenhao and Chen, Yongqiang and Liu, Chenxi and Yu, Junchi
               and Song, Xiangchen and Li, Zijian and Li, Jialin
               and Torr, Philip and Han, Bo and Zhang, Kun},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  note      = {Oral presentation}
}