Documentation

What is CausalGame?

CausalGame evaluates the causal reasoning of LLM agents through interactive games. The agent acts as a drone designer who must uncover the hidden causal mechanism behind drone survival within a limited budget of experiments, working from observations that are censored by survivorship, confounded by hidden variables, and corrupted by noise.

The agent allocates defense values (DEF) across seven components (engine, wing, body, cockpit, antenna, camera, gun) and can observe only the damage on surviving drones. Every scenario is governed by an underlying structural causal model (SCM), so the agent can win only by recovering the true causal mechanism rather than fitting surface-level correlations.

Game protocol

Stage 1 (Exploration): A budget of 200 drones and up to 10 deployment calls. Each call deploys a batch with a chosen design and returns survival outcomes plus partial observations. Historical observations are optionally available at the start.
Stage 2 (Evaluation): The agent submits a single final design, which is evaluated on a fleet of 1,000 drones. A win means exceeding the scenario-specific threshold, set ~5–8% below the theoretical optimum.
Report: The agent also submits a natural-language report explaining its design based on the evidence it collected; the report is scored against the true SCM with a rubric.
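The protocol's limits can be tracked with a few lines of bookkeeping. The sketch below is illustrative only: `ExplorationBudget` and `is_win` are hypothetical helpers, not part of the CausalGame codebase, and the only values taken from the text above are the 200-drone budget, the 10-call limit, the 1,000-drone evaluation fleet, and the rule that a win requires exceeding the threshold.

```python
from dataclasses import dataclass

MAX_DRONES = 200   # Stage 1 drone budget (from the protocol)
MAX_CALLS = 10     # Stage 1 deployment-call limit (from the protocol)
FLEET_SIZE = 1000  # Stage 2 evaluation fleet (from the protocol)


@dataclass
class ExplorationBudget:
    """Hypothetical helper that tracks what is left of the Stage 1 budget."""
    drones_left: int = MAX_DRONES
    calls_left: int = MAX_CALLS

    def can_deploy(self, batch_size: int) -> bool:
        # A batch needs at least one drone, one remaining call, and enough drones.
        return 0 < batch_size <= self.drones_left and self.calls_left > 0

    def deploy(self, batch_size: int) -> None:
        if not self.can_deploy(batch_size):
            raise ValueError(f"cannot deploy {batch_size} drones: {self}")
        self.drones_left -= batch_size
        self.calls_left -= 1


def is_win(survivors: int, threshold: float, fleet_size: int = FLEET_SIZE) -> bool:
    """Stage 2 check: the survival rate must strictly exceed the threshold."""
    return survivors / fleet_size > threshold


budget = ExplorationBudget()
for _ in range(MAX_CALLS):
    budget.deploy(20)  # ten equal batches use up both limits exactly

print(budget)
print(is_win(survivors=760, threshold=0.75))
```

Splitting the budget into equal batches is only one option; an agent may prefer a few large batches for precise estimates or many small ones to compare more designs.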

The 14 scenarios

Scenario | Family | Selection bias | Hidden confounder | Threshold
antenna_trap | Antenna Trap | | | 75%
antenna_trap_high_def | Antenna Trap | | | 75%
antenna_trap_local_optima | Antenna Trap | | | 75%
antenna_trap_no_history | Antenna Trap | | | 75%
antenna_trap_no_selection_bias | Antenna Trap | | | 75%
antenna_trap_simpsons_paradox | Antenna Trap | | | 75%
deployment_zone_trap_categorical | Deployment Zone Trap | | | 75%
deployment_zone_trap_categorical_high_def | Deployment Zone Trap | | | 75%
deployment_zone_trap_categorical_local_optima | Deployment Zone Trap | | | 75%
deployment_zone_trap_categorical_no_history | Deployment Zone Trap | | | 75%
deployment_zone_trap_categorical_no_selection_bias | Deployment Zone Trap | | | 75%
deployment_zone_trap_categorical_simpsons_paradox | Deployment Zone Trap | | | 75%
deployment_zone_trap_env_shift | Deployment Zone Trap | | | 75%
weather_noise | Weather | | | 55%

Antenna Trap family

A latent weather pattern affects both antenna damage and detection risk. Historical data shows drones with higher antenna DEF survive better — a spurious correlation. The true mechanism: a functional antenna emits signals that increase enemy detection, leading to combat and destruction. The optimal strategy is antenna_def = 0: storms destroy the antenna early, activating "stealth mode" and dramatically reducing detection (~82% survival). Variants add DEF-allocation pressure, local optima, no history, a no-selection-bias control, and Simpson's paradox.
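The interventional logic of this family can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not the benchmark's SCM: the storm rate, the antenna-destruction probabilities, and the detection probabilities are made-up numbers, and the direct effect of weather on detection risk is left out for brevity. It only shows why forcing antenna DEF to 0 can raise survival when a working antenna increases detection.

```python
import random


def simulate_survival(antenna_def: int, n: int = 20000, seed: int = 0) -> float:
    """Toy model (all probabilities are illustrative assumptions).

    storm -> antenna destroyed (unless protected by DEF)
    working antenna -> higher detection -> lower survival
    """
    rng = random.Random(seed)
    survived = 0
    for _ in range(n):
        storm = rng.random() < 0.7  # latent weather, assumed rate
        # Assumed: each DEF point lowers the chance a storm destroys the antenna.
        p_destroy = max(0.0, 0.9 - 0.03 * antenna_def) if storm else 0.0
        antenna_working = rng.random() >= p_destroy
        # Assumed: a working antenna emits signals and is detected far more often.
        p_detect = 0.6 if antenna_working else 0.1
        detected = rng.random() < p_detect
        survived += not detected
    return survived / n


low = simulate_survival(antenna_def=0)
high = simulate_survival(antenna_def=30)
print(f"antenna_def=0:  {low:.3f}")
print(f"antenna_def=30: {high:.3f}")
```

Because `antenna_def` is set by the experimenter here, the comparison is an intervention rather than a correlation read off historical data, which is the distinction the scenario is built to test.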

Deployment Zone Trap family

Inspired by Farr's Cholera Paradox: altitude appeared protective against cholera when the true cause was water contamination at low elevations. Here an unobserved mission zone jointly determines a visible altitude band and a hidden EMI level that actually drives communication failure. Agents who chase the altitude–survival correlation upgrade engines in vain; the optimal strategy is shield_def = 25 plus the signal_filter enhancement module (~80% survival). The categorical variant requires choosing 1 of 5 modules — only signal_filter provides EMI protection.

Weather family

weather_noise adds weather-dependent observation noise (20% in rain, 5% in clear conditions) and an environment shift between exploration and evaluation: the Stage-1 optimum reverses under the Stage-2 weather distribution. Agents must separate genuine causal structure from measurement artifacts. The optimal design reaches ~78% survival, against a 55% threshold.
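One way to reason about weather-dependent noise is to correct observed rates for it before comparing designs. The sketch below assumes the noise is a symmetric flip of a binary observation (for example, a damage flag) at the stated rate; the documentation does not specify the noise model, so this is an assumption, and `debias_rate` is a hypothetical helper. Under that assumption the observed rate is p_obs = p(1 - e) + (1 - p)e, which inverts to p = (p_obs - e) / (1 - 2e).

```python
# Noise rates from the scenario description; the flip model itself is assumed.
NOISE = {"rain": 0.20, "clear": 0.05}


def observed_rate(true_rate: float, flip_rate: float) -> float:
    """Rate seen after each binary observation is flipped with prob. flip_rate."""
    return true_rate * (1.0 - flip_rate) + (1.0 - true_rate) * flip_rate


def debias_rate(obs_rate: float, flip_rate: float) -> float:
    """Invert the symmetric-flip model; valid only for flip_rate < 0.5."""
    if not 0.0 <= flip_rate < 0.5:
        raise ValueError("flip_rate must be in [0, 0.5)")
    estimate = (obs_rate - flip_rate) / (1.0 - 2.0 * flip_rate)
    return min(1.0, max(0.0, estimate))  # clip sampling artifacts into [0, 1]


true_rate = 0.7
for weather, flip in NOISE.items():
    seen = observed_rate(true_rate, flip)
    print(weather, round(seen, 3), round(debias_rate(seen, flip), 3))
```

The same underlying rate looks different in rain and in clear conditions, so comparing raw rates across weather conditions can suggest an effect that is only a measurement artifact.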

Execution modes

Aspect | Agentic (ReAct) | Prompting
API operations | Structured tool calling | Python code with client.xxx()
Data analysis | Sandboxed Python execution | Full code execution
Reasoning | Mandatory ReAct pattern (thought → action → observation) | Optional
Exploration guard | Must deploy before submitting | None
Tool limit | Max 5 tools per turn | Unlimited API calls

A third mode, OpenCode, runs a full coding-agent framework (shell access, autonomous code execution) against the same environment; it has its own tab on the leaderboard.

Rubric-based evaluation

Beyond survival rate, every session is scored by an LLM judge panel along four dimensions (16 points total):

Dimension | Points | What it measures
Causal reasoning | 11 | Identifies the true mechanism, avoids traps, articulates explicit causal chains
Experimental design | 2 | Backs conclusions with concrete experimental evidence
Reflection quality | 2 | Identifies concrete errors, blind spots, unverified assumptions
Data usage | 1 | Links specific observed data to each claim

Three independent judge models score each session; inter-rater agreement is high (mean ICC(2,3) = 0.75).
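The rubric arithmetic can be sketched as follows. The per-dimension maxima come from the table above; the dictionary keys, the `session_score` helper, and the choice to average the three judges' totals are assumptions made for illustration, since the documentation does not state how the panel's scores are combined.

```python
from statistics import mean

# Maximum points per dimension, taken from the rubric (16 in total).
RUBRIC_MAX = {
    "causal_reasoning": 11,
    "experimental_design": 2,
    "reflection_quality": 2,
    "data_usage": 1,
}


def session_score(judge_scores: list[dict[str, float]]) -> float:
    """Average of per-judge totals (averaging is an assumed aggregation)."""
    totals = []
    for scores in judge_scores:
        for dim, cap in RUBRIC_MAX.items():
            if not 0 <= scores[dim] <= cap:
                raise ValueError(f"{dim} must be between 0 and {cap}")
        totals.append(sum(scores[dim] for dim in RUBRIC_MAX))
    return mean(totals)


panel = [
    {"causal_reasoning": 9, "experimental_design": 1, "reflection_quality": 1, "data_usage": 1},
    {"causal_reasoning": 10, "experimental_design": 2, "reflection_quality": 1, "data_usage": 1},
    {"causal_reasoning": 9, "experimental_design": 2, "reflection_quality": 1, "data_usage": 1},
]
print(sum(RUBRIC_MAX.values()), session_score(panel))
```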

Getting started

The benchmark code (simulation engine, agent harness, all 14 scenario configs) lives at https://github.com/viewsetting/CausalGame.

git clone https://github.com/viewsetting/CausalGame.git
cd CausalGame
pip install -r requirements.txt

# start the simulation backend
uvicorn api.app:app --port 8000

# run one agent on one scenario
python run_agent.py --model <model-name> --experiment antenna_trap

You will need API keys for the model providers you want to evaluate (see the repository README for configuration details).
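To evaluate one model on every scenario, the single-scenario command above can be generated in a loop. This sketch assumes that each scenario name from the scenario table is a valid value for --experiment (only antenna_trap is confirmed by the example command) and that the backend is already running; `build_commands` and `run_all` are hypothetical helpers, not scripts shipped with the repository.

```python
import shlex
import subprocess

# Scenario names as listed in the scenario table.
SCENARIOS = [
    "antenna_trap",
    "antenna_trap_high_def",
    "antenna_trap_local_optima",
    "antenna_trap_no_history",
    "antenna_trap_no_selection_bias",
    "antenna_trap_simpsons_paradox",
    "deployment_zone_trap_categorical",
    "deployment_zone_trap_categorical_high_def",
    "deployment_zone_trap_categorical_local_optima",
    "deployment_zone_trap_categorical_no_history",
    "deployment_zone_trap_categorical_no_selection_bias",
    "deployment_zone_trap_categorical_simpsons_paradox",
    "deployment_zone_trap_env_shift",
    "weather_noise",
]


def build_commands(model: str) -> list[list[str]]:
    """One run_agent.py invocation per scenario, using the documented flags."""
    return [
        ["python", "run_agent.py", "--model", model, "--experiment", name]
        for name in SCENARIOS
    ]


def run_all(model: str) -> None:
    """Run the scenarios one after another; stops at the first failing run."""
    for cmd in build_commands(model):
        subprocess.run(cmd, check=True)


# Dry run: print the commands instead of executing them.
for cmd in build_commands("<model-name>"):
    print(shlex.join(cmd))
```

Call `run_all` from the repository root with a real model name once the dry-run output looks right.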

Citing CausalGame

See the citation block on the home page for BibTeX.