Documentation
What is CausalGame?
CausalGame evaluates the causal thinking of LLM agents through interactive games. The agent acts as a drone designer who must figure out the hidden causal mechanism behind drone survival through a limited budget of experiments — under observations that are censored by survivorship, confounded by hidden variables, and corrupted by noise.
The agent allocates defense values (DEF) across seven components (engine,
wing, body, cockpit, antenna, camera, and gun) and can observe only the
damage states of drones that survive. Every scenario is governed by an
underlying structural causal model (SCM), so the agent can win only by
recovering the true causal mechanism rather than fitting surface-level correlations.
Game protocol
| Step | What happens |
|---|---|
| Stage 1: Exploration | Budget of 200 drones and up to 10 deployment calls. Each call deploys a batch with a chosen design and returns survival outcomes plus partial observations. Historical observations are optionally available at the start. |
| Stage 2: Evaluation | The agent submits a single final design, evaluated on a fleet of 1,000 drones. A win means exceeding the scenario-specific threshold, set ~5–8% below the theoretical optimum. |
| Report | The agent also submits a natural-language report explaining its design based on collected evidence; the report is scored against the true SCM with a rubric. |
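The Stage 1 limits and the Stage 2 win condition can be sketched as a small bookkeeping helper. This is a minimal illustration, not the benchmark's implementation: the class `ExplorationBudget` and the function `is_win` are hypothetical names, and only the numbers (200 drones, 10 calls, a fleet of 1,000) come from the table above.

```python
from dataclasses import dataclass


@dataclass
class ExplorationBudget:
    """Hypothetical tracker for the Stage 1 limits described above."""
    max_drones: int = 200
    max_calls: int = 10
    drones_used: int = 0
    calls_used: int = 0

    def can_deploy(self, batch_size: int) -> bool:
        # A deployment is allowed only if both limits still hold afterwards.
        return (
            batch_size > 0
            and self.calls_used < self.max_calls
            and self.drones_used + batch_size <= self.max_drones
        )

    def deploy(self, batch_size: int) -> None:
        if not self.can_deploy(batch_size):
            raise ValueError("deployment would exceed the Stage 1 budget")
        self.drones_used += batch_size
        self.calls_used += 1


def is_win(survivors: int, threshold: float, fleet_size: int = 1000) -> bool:
    """Stage 2: the survival rate must strictly exceed the threshold."""
    return survivors / fleet_size > threshold


if __name__ == "__main__":
    budget = ExplorationBudget()
    for _ in range(10):
        budget.deploy(20)          # ten batches of 20 use the whole budget
    print(budget.can_deploy(1))    # no calls or drones left
    print(is_win(760, 0.75), is_win(750, 0.75))
```

Under this reading, a design that keeps exactly 750 of 1,000 drones alive does not clear a 75% threshold, because a win requires exceeding it.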
The 14 scenarios
| Scenario | Family | Selection bias | Hidden confounder | Threshold |
|---|---|---|---|---|
| antenna_trap | Antenna Trap | ✓ | — | 75% |
| antenna_trap_high_def | Antenna Trap | ✓ | ✓ | 75% |
| antenna_trap_local_optima | Antenna Trap | ✓ | — | 75% |
| antenna_trap_no_history | Antenna Trap | ✓ | — | 75% |
| antenna_trap_no_selection_bias | Antenna Trap | — | — | 75% |
| antenna_trap_simpsons_paradox | Antenna Trap | ✓ | ✓ | 75% |
| deployment_zone_trap_categorical | Deployment Zone Trap | ✓ | — | 75% |
| deployment_zone_trap_categorical_high_def | Deployment Zone Trap | ✓ | ✓ | 75% |
| deployment_zone_trap_categorical_local_optima | Deployment Zone Trap | ✓ | — | 75% |
| deployment_zone_trap_categorical_no_history | Deployment Zone Trap | ✓ | — | 75% |
| deployment_zone_trap_categorical_no_selection_bias | Deployment Zone Trap | — | — | 75% |
| deployment_zone_trap_categorical_simpsons_paradox | Deployment Zone Trap | ✓ | ✓ | 75% |
| deployment_zone_trap_env_shift | Deployment Zone Trap | ✓ | — | 75% |
| weather_noise | Weather | ✓ | — | 55% |
Antenna Trap family
A latent weather pattern affects both antenna damage and detection risk. Historical
data shows drones with higher antenna DEF survive better — a spurious correlation.
The true mechanism: a functional antenna emits signals that increase enemy
detection, leading to combat and destruction. The optimal strategy is
antenna_def = 0: storms destroy the antenna early, activating "stealth
mode" and dramatically reducing detection (~82% survival). Variants add DEF-allocation
pressure, local optima, no history, a no-selection-bias control, and Simpson's
paradox.
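To see how a latent variable can make higher antenna DEF look protective in historical data while being harmful under intervention, consider the toy calculation below. Every probability in it is invented for illustration and the two-variable structure is a simplification; it is not the benchmark's SCM. It compares the observational quantity P(survive | design) with the interventional quantity P(survive | do(design)).

```python
# Toy numbers, invented for illustration only; NOT the benchmark's SCM.
WEATHERS = ("calm", "storm")
P_CALM = 0.5

# In this toy history, high antenna DEF was mostly fielded in calm weather.
P_HIGH_DEF_GIVEN_WEATHER = {"calm": 0.9, "storm": 0.1}

# Within each weather stratum, high antenna DEF lowers survival by 0.1.
P_SURVIVE = {
    ("calm", "high"): 0.7, ("calm", "low"): 0.8,
    ("storm", "high"): 0.3, ("storm", "low"): 0.4,
}


def p_weather(weather: str) -> float:
    return P_CALM if weather == "calm" else 1.0 - P_CALM


def p_design(design: str, weather: str) -> float:
    p_high = P_HIGH_DEF_GIVEN_WEATHER[weather]
    return p_high if design == "high" else 1.0 - p_high


def observational(design: str) -> float:
    """P(survive | design): what a table of historical records shows."""
    num = sum(p_weather(w) * p_design(design, w) * P_SURVIVE[(w, design)]
              for w in WEATHERS)
    den = sum(p_weather(w) * p_design(design, w) for w in WEATHERS)
    return num / den


def interventional(design: str) -> float:
    """P(survive | do(design)): what a deployment experiment measures."""
    return sum(p_weather(w) * P_SURVIVE[(w, design)] for w in WEATHERS)


if __name__ == "__main__":
    for d in ("high", "low"):
        print(d, round(observational(d), 2), round(interventional(d), 2))
```

In this toy setup the historical records favor high DEF (0.66 versus 0.44), while setting the design by intervention favors low DEF (0.60 versus 0.50). The reversal is the kind of gap between history and experiment that the Antenna Trap scenarios are built around.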
Deployment Zone Trap family
Inspired by Farr's Cholera Paradox: altitude appeared protective against
cholera when the true cause was water contamination at low elevations. Here an
unobserved mission zone jointly determines a visible altitude band and a hidden EMI
level that actually drives communication failure. Agents who chase the
altitude–survival correlation upgrade engines in vain; the optimal strategy is
shield_def = 25 plus the signal_filter enhancement module
(~80% survival). The categorical variant requires choosing 1 of 5 modules — only
signal_filter provides EMI protection.
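With five candidate modules and a Stage 1 budget of 200 drones, an even split leaves 40 drones per module (for example, two deployment calls of 20 each, which also fits within the 10-call limit). The sketch below estimates how precise a survival estimate from such a batch can be. It assumes each drone's outcome is an independent Bernoulli draw, which is a simplifying assumption rather than a documented property of the simulator, and the even split is only one possible allocation.

```python
import math


def survival_standard_error(p: float, n: int) -> float:
    """Standard error of a survival-rate estimate from n independent drones."""
    return math.sqrt(p * (1.0 - p) / n)


def difference_standard_error(p_a: float, p_b: float, n_per_arm: int) -> float:
    """Standard error of the difference between two independent estimates."""
    return math.sqrt(
        p_a * (1.0 - p_a) / n_per_arm + p_b * (1.0 - p_b) / n_per_arm
    )


if __name__ == "__main__":
    drones_per_module = 200 // 5   # even split of the Stage 1 budget
    print(drones_per_module)
    print(survival_standard_error(0.8, drones_per_module))
    print(difference_standard_error(0.8, 0.7, drones_per_module))
```

At a true survival rate of 0.8, a batch of 40 drones gives a standard error of roughly 0.06, so small differences between modules are hard to distinguish from sampling noise under an even split.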
Weather family
weather_noise adds weather-dependent observation noise (20% in rain, 5%
in clear conditions) and an environment shift between exploration and evaluation:
the Stage-1 optimum reverses under the Stage-2 weather distribution. Agents must
separate genuine causal structure from measurement artifacts (~78% optimal, 55%
threshold).
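One way to reason about weather-dependent noise is to model it as a symmetric flip of the recorded outcome. That model is an assumption made here for illustration; the documentation above gives only the noise rates (20% in rain, 5% in clear conditions), not the noise mechanism. Under this assumption, a true survival rate p is observed as p(1 - e) + (1 - p)e, which can be inverted when e is known.

```python
# Noise rates from the scenario description; the symmetric-flip model is assumed.
NOISE_RATE = {"rain": 0.20, "clear": 0.05}


def observed_rate(true_rate: float, weather: str) -> float:
    """Expected recorded survival rate if each outcome flips with probability e."""
    e = NOISE_RATE[weather]
    return true_rate * (1.0 - e) + (1.0 - true_rate) * e


def corrected_rate(observed: float, weather: str) -> float:
    """Invert the flip model to recover the underlying survival rate."""
    e = NOISE_RATE[weather]
    return (observed - e) / (1.0 - 2.0 * e)


if __name__ == "__main__":
    for weather in ("rain", "clear"):
        obs = observed_rate(0.7, weather)
        print(weather, round(obs, 2), round(corrected_rate(obs, weather), 2))
```

Under this model the same true rate of 0.70 would be recorded as about 0.62 in rain and 0.68 in clear conditions, so comparing raw rates across weather conditions can suggest a design effect that is only a measurement artifact.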
Execution modes
| Aspect | Agentic (ReAct) | Prompting |
|---|---|---|
| API operations | Structured tool calling | Python code with client.xxx() |
| Data analysis | Sandboxed Python execution | Full code execution |
| Reasoning | Mandatory ReAct pattern (thought → action → observation) | Optional |
| Exploration guard | Must deploy before submitting | None |
| Tool limit | Max 5 tools per turn | Unlimited API calls |
A third mode, OpenCode, runs a full coding-agent framework (shell access, autonomous code execution) against the same environment — it has its own tab on the leaderboard.
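The two Agentic-mode constraints in the table (at most 5 tools per turn, and a deployment before any submission) can be pictured as a small validator. This is a hypothetical sketch: the class `AgenticGuard` and the tool names `"deploy"` and `"submit"` are placeholders chosen for illustration and are not taken from the benchmark's harness.

```python
class AgenticGuard:
    """Hypothetical check for the Agentic-mode rules listed in the table."""

    MAX_TOOLS_PER_TURN = 5

    def __init__(self) -> None:
        self.has_deployed = False

    def check_turn(self, tool_calls: list[str]) -> None:
        """Raise ValueError if a turn breaks either rule; otherwise record it."""
        if len(tool_calls) > self.MAX_TOOLS_PER_TURN:
            raise ValueError("more than 5 tool calls in one turn")
        for name in tool_calls:
            # "deploy" and "submit" are placeholder tool names.
            if name == "submit" and not self.has_deployed:
                raise ValueError("must deploy before submitting")
            if name == "deploy":
                self.has_deployed = True


if __name__ == "__main__":
    guard = AgenticGuard()
    guard.check_turn(["deploy", "submit"])   # allowed: deploy comes first
    print(guard.has_deployed)
```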
Rubric-based evaluation
Beyond survival rate, every session is scored by an LLM judge panel along four dimensions (16 points total):
| Dimension | Points | What it measures |
|---|---|---|
| Causal reasoning | 11 | Identifies the true mechanism, avoids traps, articulates explicit causal chains |
| Experimental design | 2 | Backs conclusions with concrete experimental evidence |
| Reflection quality | 2 | Identifies concrete errors, blind spots, unverified assumptions |
| Data usage | 1 | Links specific observed data to each claim |
Three independent judge models score each session; inter-rater agreement is high (mean ICC(2,3) = 0.75).
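The rubric arithmetic can be written down as a short helper. The per-dimension maxima come from the table above; the dictionary keys, the function names, and the choice to combine the three judges with a simple mean are assumptions made for illustration, since the aggregation rule is not stated here.

```python
from statistics import mean

# Maximum points per dimension, taken from the rubric table.
RUBRIC_MAX = {
    "causal_reasoning": 11,
    "experimental_design": 2,
    "reflection_quality": 2,
    "data_usage": 1,
}


def total_score(scores: dict[str, float]) -> float:
    """Sum one judge's scores after checking each against its maximum."""
    for dimension, max_points in RUBRIC_MAX.items():
        value = scores[dimension]
        if not 0 <= value <= max_points:
            raise ValueError(f"{dimension} must be between 0 and {max_points}")
    return sum(scores[dimension] for dimension in RUBRIC_MAX)


def panel_score(judge_scores: list[dict[str, float]]) -> float:
    """Assumed aggregation: the mean of the judges' totals."""
    return mean(total_score(s) for s in judge_scores)


if __name__ == "__main__":
    print(sum(RUBRIC_MAX.values()))  # the four maxima add up to the 16-point total
```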
Getting started
The benchmark code (simulation engine, agent harness, all 14 scenario configs) lives at https://github.com/viewsetting/CausalGame.
git clone https://github.com/viewsetting/CausalGame.git
cd CausalGame
pip install -r requirements.txt
# start the simulation backend
uvicorn api.app:app --port 8000
# run one agent on one scenario
python run_agent.py --model <model-name> --experiment antenna_trap
You will need API keys for the model providers you want to evaluate (see the repository README for configuration details).
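To run the same agent on several scenarios, the single-scenario command can be wrapped in a short driver script. This is a sketch that relies only on the `--model` and `--experiment` flags shown above; the helper names and the placeholder model name `my-model` are hypothetical. It defaults to a dry run that prints the commands, and it assumes the simulation backend is already running and that it is executed from the repository root.

```python
import shlex
import subprocess
import sys


def build_command(model: str, experiment: str) -> list[str]:
    """Build the run_agent.py invocation using the two documented flags."""
    return [sys.executable, "run_agent.py",
            "--model", model, "--experiment", experiment]


def run_all(model: str, experiments: list[str], dry_run: bool = True) -> None:
    for experiment in experiments:
        cmd = build_command(model, experiment)
        if dry_run:
            print(shlex.join(cmd))           # show the command without running it
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing run


if __name__ == "__main__":
    # "my-model" is a placeholder; scenario names come from the table above.
    run_all("my-model", ["antenna_trap", "weather_noise"])
```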
Citing CausalGame
See the citation block on the home page for BibTeX.