Documentation

What is CausalGame?

CausalGame evaluates the causal reasoning of LLM agents through interactive games. The agent acts as a drone designer who must uncover the hidden causal mechanism behind drone survival with a limited budget of experiments, working from observations that are censored by survivorship, confounded by hidden variables, and corrupted by noise.

The agent allocates defense values (DEF) across seven components (engine, wing, body, cockpit, antenna, camera, gun) and observes damage only on the drones that survive. Every scenario is governed by an underlying structural causal model (SCM): the agent can win only by recovering the true causal mechanism rather than fitting surface-level correlations.
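To see why survivor-only observations mislead, here is a minimal toy calculation. It is not the benchmark's SCM: the hit and survival probabilities are hypothetical numbers chosen purely for illustration, and only two of the seven components are modeled.

```python
from fractions import Fraction

# Hypothetical toy numbers (not taken from CausalGame):
# each drone takes exactly one hit, on the engine or on the wing.
HIT_PROB = {"engine": Fraction(1, 2), "wing": Fraction(1, 2)}
# Engine hits are usually fatal, wing hits usually are not.
SURVIVE_GIVEN_HIT = {"engine": Fraction(1, 5), "wing": Fraction(4, 5)}


def damage_share_among_survivors(component: str) -> Fraction:
    """P(hit on component | drone survived): the share visible to the designer."""
    survived = sum(HIT_PROB[c] * SURVIVE_GIVEN_HIT[c] for c in HIT_PROB)
    return HIT_PROB[component] * SURVIVE_GIVEN_HIT[component] / survived


if __name__ == "__main__":
    for part in HIT_PROB:
        print(part, "all hits:", HIT_PROB[part],
              "seen on survivors:", damage_share_among_survivors(part))
```

In this toy setup the engine receives half of all hits but accounts for only one fifth of the damage seen on survivors, so a designer who reinforces whatever looks most damaged would under-protect the component that matters most.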

Game protocol

Stage 1 — Exploration	Budget of 200 drones and up to 10 deployment calls. Each call deploys a batch with a chosen design and returns survival outcomes plus partial observations. Historical observations are optionally available at the start.
Stage 2 — Evaluation	The agent submits a single final design, evaluated on a fleet of 1,000 drones. A win means exceeding the scenario-specific threshold, set ~5–8% below the theoretical optimum.
Report	The agent also submits a natural-language report explaining its design based on collected evidence; the report is scored against the true SCM with a rubric.
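The Stage 1 limits and the Stage 2 win condition can be tracked client-side with a few lines of bookkeeping. The sketch below is an assumption about how an agent might do this, not part of the CausalGame API: the class `ExplorationBudget` and the function `is_win` are hypothetical names, and "exceeding the threshold" is read as a strict inequality.

```python
from dataclasses import dataclass, field


@dataclass
class ExplorationBudget:
    """Client-side tracker for Stage 1: 200 drones, at most 10 deployment calls."""
    max_drones: int = 200
    max_calls: int = 10
    batches: list = field(default_factory=list)  # size of each deployed batch

    @property
    def drones_used(self) -> int:
        return sum(self.batches)

    @property
    def calls_used(self) -> int:
        return len(self.batches)

    def can_deploy(self, batch_size: int) -> bool:
        """True if a batch of this size fits both remaining limits."""
        return (
            batch_size > 0
            and self.calls_used < self.max_calls
            and self.drones_used + batch_size <= self.max_drones
        )

    def deploy(self, batch_size: int) -> None:
        """Record a batch, refusing it if either limit would be exceeded."""
        if not self.can_deploy(batch_size):
            raise ValueError(f"batch of {batch_size} exceeds the Stage 1 budget")
        self.batches.append(batch_size)


def is_win(survivors: int, fleet_size: int, threshold: float) -> bool:
    """Stage 2 check: the survival rate must exceed the scenario threshold."""
    return survivors / fleet_size > threshold
```

Spreading the budget evenly gives ten batches of 20 drones; an agent that wants larger batches for a tighter estimate of one design has to give up calls elsewhere.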

The 14 scenarios

Scenario	Family	Selection bias	Hidden confounder	Threshold
`antenna_trap`	Antenna Trap	✓	—	75%
`antenna_trap_high_def`	Antenna Trap	✓	✓	75%
`antenna_trap_local_optima`	Antenna Trap	✓	—	75%
`antenna_trap_no_history`	Antenna Trap	✓	—	75%
`antenna_trap_no_selection_bias`	Antenna Trap	—	—	75%
`antenna_trap_simpsons_paradox`	Antenna Trap	✓	✓	75%
`deployment_zone_trap_categorical`	Deployment Zone Trap	✓	—	75%
`deployment_zone_trap_categorical_high_def`	Deployment Zone Trap	✓	✓	75%
`deployment_zone_trap_categorical_local_optima`	Deployment Zone Trap	✓	—	75%
`deployment_zone_trap_categorical_no_history`	Deployment Zone Trap	✓	—	75%
`deployment_zone_trap_categorical_no_selection_bias`	Deployment Zone Trap	—	—	75%
`deployment_zone_trap_categorical_simpsons_paradox`	Deployment Zone Trap	✓	✓	75%
`deployment_zone_trap_env_shift`	Deployment Zone Trap	✓	—	75%
`weather_noise`	Weather	✓	—	55%

Antenna Trap family

A latent weather pattern affects both antenna damage and detection risk. Historical data shows drones with higher antenna DEF survive better — a spurious correlation. The true mechanism: a functional antenna emits signals that increase enemy detection, leading to combat and destruction. The optimal strategy is antenna_def = 0: storms destroy the antenna early, activating "stealth mode" and dramatically reducing detection (~82% survival). Variants add DEF-allocation pressure, local optima, no history, a no-selection-bias control, and Simpson's paradox.

Deployment Zone Trap family

Inspired by Farr's Cholera Paradox: altitude appeared protective against cholera when the true cause was water contamination at low elevations. Here an unobserved mission zone jointly determines a visible altitude band and a hidden EMI level that actually drives communication failure. Agents who chase the altitude–survival correlation upgrade engines in vain; the optimal strategy is shield_def = 25 plus the signal_filter enhancement module (~80% survival). The categorical variant requires choosing 1 of 5 modules — only signal_filter provides EMI protection.
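The confounding pattern described above can be reproduced with a two-zone toy model. This is a sketch under stated assumptions, not the scenario's actual SCM: the zone names, the probabilities, and the helper functions are all hypothetical, and the shield allocation is left out for brevity.

```python
from fractions import Fraction

# Hypothetical toy numbers (not taken from CausalGame).
# zone: (prior probability, visible altitude band, hidden EMI level)
ZONES = {
    "valley": (Fraction(1, 2), "low", "high_emi"),
    "ridge": (Fraction(1, 2), "high", "low_emi"),
}
# Survival depends on EMI and on the signal filter, never on altitude itself.
P_SURVIVE = {
    ("high_emi", False): Fraction(2, 5),
    ("high_emi", True): Fraction(4, 5),
    ("low_emi", False): Fraction(4, 5),
    ("low_emi", True): Fraction(4, 5),
}


def observed_survival(altitude: str, has_filter: bool) -> Fraction:
    """P(survive | altitude band), as it appears in passively collected data."""
    num = den = Fraction(0)
    for prior, band, emi in ZONES.values():
        if band == altitude:
            num += prior * P_SURVIVE[(emi, has_filter)]
            den += prior
    return num / den


def survival_under_forced_altitude(has_filter: bool) -> Fraction:
    """P(survive | do(altitude)): forcing the band leaves the zone mix,
    and therefore the EMI exposure, unchanged."""
    return sum(prior * P_SURVIVE[(emi, has_filter)]
               for prior, _, emi in ZONES.values())
```

Without the filter, the observational data in this toy model show 4/5 survival at high altitude against 2/5 at low altitude, yet forcing either altitude yields 3/5. Installing the filter raises survival to 4/5 regardless of altitude, which is the intervention that actually targets the cause.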

Weather family

weather_noise adds weather-dependent observation noise (20% in rain, 5% in clear conditions) and an environment shift between exploration and evaluation: the Stage-1 optimum reverses under the Stage-2 weather distribution. Agents must separate genuine causal structure from measurement artifacts (~78% optimal, 55% threshold).
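One way to separate signal from measurement noise is to correct an observed rate for a known error rate. The sketch below assumes the noise acts as a symmetric flip of a binary observation, which the documentation does not state; under that assumption an observed rate q relates to the true rate p by q = p(1 - e) + (1 - p)e, so p = (q - e) / (1 - 2e). The function names are hypothetical.

```python
def observed_rate(true_rate: float, flip_prob: float) -> float:
    """Rate seen after each binary observation is flipped with prob. flip_prob."""
    return true_rate * (1 - flip_prob) + (1 - true_rate) * flip_prob


def corrected_rate(observed: float, flip_prob: float) -> float:
    """Invert the symmetric-flip model; the result is clipped to [0, 1]."""
    if not 0 <= flip_prob < 0.5:
        raise ValueError("flip_prob must be in [0, 0.5) for the inverse to exist")
    estimate = (observed - flip_prob) / (1 - 2 * flip_prob)
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Noise levels from the scenario description: 20% in rain, 5% in clear.
    for label, noise in (("rain", 0.20), ("clear", 0.05)):
        seen = observed_rate(0.70, noise)
        print(label, round(seen, 3), round(corrected_rate(seen, noise), 3))
```

Under this model the same true rate looks different in rain and in clear weather, so comparing raw rates across weather conditions can suggest an effect that is only a measurement artifact.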

Execution modes

Aspect	Agentic (ReAct)	Prompting
API operations	Structured tool calling	Python code with `client.xxx()`
Data analysis	Sandboxed Python execution	Full code execution
Reasoning	Mandatory ReAct pattern (thought → action → observation)	Optional
Exploration guard	Must deploy before submitting	None
Tool limit	Max 5 tools per turn	Unlimited API calls
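The two Agentic-mode restrictions in the table can be pictured as a small guard around each turn. This is an illustrative sketch only: the class `AgenticGuard` and the tool names `"deploy"` and `"submit"` are hypothetical, and "max 5 tools per turn" is read here as at most five tool calls in one turn, which the harness may define differently.

```python
class AgenticGuard:
    """Toy check for the per-turn tool limit and the exploration guard."""

    MAX_TOOLS_PER_TURN = 5

    def __init__(self) -> None:
        self.has_deployed = False  # set once any deployment has happened

    def check_turn(self, tool_calls: list) -> None:
        """Validate one turn's tool calls in order, raising on a violation."""
        if len(tool_calls) > self.MAX_TOOLS_PER_TURN:
            raise ValueError("too many tool calls in one turn")
        for name in tool_calls:
            if name == "submit" and not self.has_deployed:
                raise ValueError("must deploy at least once before submitting")
            if name == "deploy":
                self.has_deployed = True
```

A turn containing `["deploy", "submit"]` passes this check, while a first turn containing only `["submit"]` is rejected.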

A third mode, OpenCode, runs a full coding-agent framework (shell access, autonomous code execution) against the same environment — it has its own tab on the leaderboard.

Rubric-based evaluation

Beyond survival rate, every session is scored by an LLM judge panel along four dimensions (16 points total):

Dimension	Points	What it measures
Causal reasoning	11	Identifies the true mechanism, avoids traps, articulates explicit causal chains
Experimental design	2	Backs conclusions with concrete experimental evidence
Reflection quality	2	Identifies concrete errors, blind spots, unverified assumptions
Data usage	1	Links specific observed data to each claim

Three independent judge models score each session; inter-rater agreement is high (mean ICC(2,3) = 0.75).
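The rubric arithmetic can be sketched as follows. The per-dimension maxima come from the table above; the dictionary keys, the function names, and the choice to combine the three judges by averaging their totals are assumptions made for illustration, since the aggregation rule is not specified here.

```python
from statistics import mean

# Maximum points per dimension, from the rubric table (16 in total).
RUBRIC_MAX = {
    "causal_reasoning": 11,
    "experimental_design": 2,
    "reflection_quality": 2,
    "data_usage": 1,
}


def session_total(scores: dict) -> int:
    """Sum one judge's scores after checking dimensions and ranges."""
    if set(scores) != set(RUBRIC_MAX):
        raise ValueError("scores must cover exactly the four rubric dimensions")
    for dim, value in scores.items():
        if not 0 <= value <= RUBRIC_MAX[dim]:
            raise ValueError(f"{dim} score {value} is out of range")
    return sum(scores.values())


def panel_mean(judge_scores: list) -> float:
    """Assumed aggregation: average the judges' session totals."""
    return mean(session_total(s) for s in judge_scores)
```

Because causal reasoning carries 11 of the 16 points, a session that wins on survival rate but misidentifies the mechanism can still score poorly on the rubric.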

Getting started

The benchmark code (simulation engine, agent harness, all 14 scenario configs) lives at https://github.com/viewsetting/CausalGame.

git clone https://github.com/viewsetting/CausalGame.git
cd CausalGame
pip install -r requirements.txt

# start the simulation backend
uvicorn api.app:app --port 8000

# run one agent on one scenario
python run_agent.py --model <model-name> --experiment antenna_trap

You will need API keys for the model providers you want to evaluate (see the repository README for configuration details).
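To evaluate one model on every scenario, the single-run command above can be wrapped in a loop. The following is a minimal sketch that uses only the `--model` and `--experiment` flags shown above and the scenario names from the table; the helper names and the `dry_run` switch are hypothetical, and it assumes the backend is already running and the script is started from the repository root.

```python
import subprocess
import sys

# The 14 scenario names, copied from the scenario table.
SCENARIOS = [
    "antenna_trap",
    "antenna_trap_high_def",
    "antenna_trap_local_optima",
    "antenna_trap_no_history",
    "antenna_trap_no_selection_bias",
    "antenna_trap_simpsons_paradox",
    "deployment_zone_trap_categorical",
    "deployment_zone_trap_categorical_high_def",
    "deployment_zone_trap_categorical_local_optima",
    "deployment_zone_trap_categorical_no_history",
    "deployment_zone_trap_categorical_no_selection_bias",
    "deployment_zone_trap_categorical_simpsons_paradox",
    "deployment_zone_trap_env_shift",
    "weather_noise",
]


def build_command(model: str, scenario: str) -> list:
    """Build the documented run_agent.py invocation for one scenario."""
    return [sys.executable, "run_agent.py",
            "--model", model, "--experiment", scenario]


def run_all(model: str, dry_run: bool = True) -> list:
    """Return every command; execute them one by one unless dry_run is set."""
    commands = [build_command(model, s) for s in SCENARIOS]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing run
    return commands


if __name__ == "__main__":
    for cmd in run_all("<model-name>"):
        print(" ".join(cmd))
```

With `dry_run=True` the script only prints the commands, which is a cheap way to confirm the scenario list before spending API budget.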

Citing CausalGame

See the citation block on the home page for BibTeX.