The Game

You are an advanced drone designer. Command has tasked you with optimizing drones for survival in a hostile canyon patrolled by enemy radar and anti-air defenses. The simulation is a black box: nobody hands you the rules — you must discover them through observation and experimentation, on a budget, before committing to one final design for the whole fleet.

Selection bias: survivors in the archive have broken antennas — because a live antenna is a radar beacon. The optimal design sacrifices it.

The CausalGame world. Weather, enemy detection, and component damage are governed by a hidden structural causal model — and the data you get to see is censored, confounded, and noisy.

Your budget

historical flights (briefing)

200 drones for exploration

≤10 deployment calls

final submission (irreversible)

1,000 drones in the evaluation fleet

Budgets shown for antenna_trap (from game.json); variants adjust them — e.g. no_history scenarios start with zero historical flights.
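As a rough illustration of how the two exploration limits interact, here is a minimal Python sketch of a local budget tracker. The class name `ExplorationBudget` and its `spend` method are hypothetical and not part of the CausalGame client; the sketch also assumes each deployment call sends one batch of drones and that the server enforces both limits.

```python
from dataclasses import dataclass


@dataclass
class ExplorationBudget:
    """Hypothetical tracker for the antenna_trap exploration limits."""
    drones_left: int = 200   # drones for exploration
    calls_left: int = 10     # deployment calls

    def spend(self, n_drones: int) -> None:
        """Charge one deployment call that sends n_drones drones."""
        if n_drones <= 0:
            raise ValueError("a deployment must send at least one drone")
        if self.calls_left < 1:
            raise RuntimeError("no deployment calls left")
        if n_drones > self.drones_left:
            raise RuntimeError("not enough drones left")
        self.calls_left -= 1
        self.drones_left -= n_drones


budget = ExplorationBudget()
for _ in range(10):      # ten equal batches use up both limits together
    budget.spend(20)
print(budget)
```

Splitting the drones unevenly (for example, small probing batches first and larger confirmation batches later) is a design choice the tracker leaves open.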

How a game unfolds

Historical data

The mission starts with logs of up to 50 past flights — but only drones that came back are in the logs. The selection bias begins before your first move.

50 flights · survivors only

The exploration loop is where causal thinking happens: deploy → observe → analyze → hypothesize → test. Agents that simply chase correlations in the briefing data walk straight into the traps; agents that intervene deliberately — e.g. sending a batch with antenna_def = 0 just to see what happens — can uncover the real mechanism before committing.
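The deliberate intervention described above can be sketched as a two-batch comparison. Everything in this block is a toy: `toy_deploy` is a hypothetical stand-in for a real deployment call, and its probabilities are invented for illustration rather than taken from the antenna_trap equations. It also assumes the agent knows how many drones it sent, so a survival rate can still be computed when failed drones are censored.

```python
import random


def toy_deploy(antenna_def: int, n_drones: int, rng: random.Random) -> list[dict]:
    """Hypothetical stand-in for one deployment call; returns survivors only."""
    survivors = []
    for _ in range(n_drones):
        # Toy assumption: more antenna armor keeps the antenna alive more often.
        antenna_alive = rng.random() < antenna_def / 50
        # Toy assumption: a live antenna makes the drone easier to detect.
        p_survive = 0.45 if antenna_alive else 0.85
        if rng.random() < p_survive:
            survivors.append(
                {"antenna_def": antenna_def, "antenna_alive": antenna_alive}
            )
    return survivors


def survival_rate(survivors: list[dict], n_sent: int) -> float:
    """Failed drones are censored, so count returns against drones sent."""
    return len(survivors) / n_sent


rng = random.Random(0)
for antenna_def in (0, 40):
    returned = toy_deploy(antenna_def, 60, rng)
    print(antenna_def, round(survival_rate(returned, 60), 2))
```

Because the agent sets `antenna_def` itself, the difference between the two rates reflects the effect of the design rather than a correlation among survivors.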

Watch a session

This is what an actual game looks like from mission control — an agent reasoning its way out of the antenna trap, with real action payloads:

ANTENNA TRAP // claude-sonnet-4-5 · session c6e3ba50 · real run

DRONES 200/200 · DEPLOY CALLS 10/10 · STAGE 1: EXPLORATION

SYSTEM: Session c6e3ba50, claude-sonnet-4-5 · antenna_trap · AGENTIC (ReAct). Real benchmark run, condensed.

Condensed from real benchmark sessions (IDs shown per session); auxiliary tool calls omitted, thoughts lightly edited.

Try the trap yourself

The briefing data tells you survivors have well-armored antennas. Trust it — or experiment. Drag the sliders and deploy:

[Interactive widget: Try the Antenna Trap, a simplified live SCM. Sliders: antenna_def (shown at 40) and other components (avg DEF, shown at 20). Output: fleet survival. Win threshold 75%, optimal ~82%.]

This widget runs a simplified version of the real antenna_trap structural equations in your browser. The full simulator adds per-component combat, hidden weather variables, agility trade-offs, and survivor-censored feedback.

The drone

A design is an allocation of DEF (armor) across seven components, each value in [0, 50] — e.g. {"engine_def": 20, "antenna_def": 0, …}. HP is fixed and hidden. Destroying any critical component destroys the drone; non-critical components change behavior in subtler ways — a live antenna emits a signal, a lost camera hurts evasion. Some scenarios add an equipment slot (e.g. a categorical choice among five enhancement modules).

Component	HP (hidden)	Default DEF	Critical	Notes
`engine`	100	20	Yes	Power core
`cockpit`	100	20	Yes	Pilot safety
`wing`	80	15	Yes	Flight surfaces
`body`	80	15	Yes	Structural integrity
`antenna`	50	10	No	Communications — may emit signal…
`camera`	20	5	No	Visual recon (evasion bonus)
`gun`	30	5	No	Offensive capability
`shield*`	30	0	No	EMI protection (Deployment Zone scenarios only)
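A design payload built from the table above might be assembled and range-checked as follows. This is a sketch under assumptions: only `engine_def` and `antenna_def` appear as key names in the text, so the other `<component>_def` keys are inferred from that pattern, and `make_design` is a hypothetical helper. Any total-DEF budget a scenario may impose is not modeled here.

```python
DEF_MIN, DEF_MAX = 0, 50

# Default DEF per component, copied from the table above. The shield is
# omitted because it only exists in Deployment Zone scenarios.
# Key names other than engine_def and antenna_def are assumed.
DEFAULT_DEF = {
    "engine_def": 20,
    "cockpit_def": 20,
    "wing_def": 15,
    "body_def": 15,
    "antenna_def": 10,
    "camera_def": 5,
    "gun_def": 5,
}


def make_design(**overrides: int) -> dict[str, int]:
    """Return the default design with overrides applied and range-checked."""
    unknown = set(overrides) - set(DEFAULT_DEF)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    design = {**DEFAULT_DEF, **overrides}
    for name, value in design.items():
        if not DEF_MIN <= value <= DEF_MAX:
            raise ValueError(f"{name}={value} is outside [{DEF_MIN}, {DEF_MAX}]")
    return design


print(make_design(antenna_def=0))
```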

What you see — and what you don't

Signal	Agent access	Why it matters
DEF allocation	visible & writable	Your only intervention lever
Survival, hit counts, component damage	visible	…but only for surviving drones (`hide_failed_drones`)
Component HP, agility	hidden	Internal state, admin-only
The SCM itself	hidden	The thing you are trying to discover

This censoring is not incidental — it is the benchmark. Observing only survivors manufactures selection bias (most survivors of the antenna trap have damaged antennas, so "protect the antenna" looks like the fix); unobserved variables act as hidden confounders (the mission zone drives both visible altitude and hidden EMI); and weather-dependent measurement noise corrupts what little you do see.
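To make the selection effect concrete, the sketch below censors a toy fleet the way the visibility filter is described and compares the share of damaged antennas before and after. The function names, the `hp` and `agility` key names, and all probabilities are assumptions made for illustration; only the `hide_failed_drones` flag name comes from the table above.

```python
import random

HIDDEN_FIELDS = {"hp", "agility"}  # hypothetical key names for hidden internals


def visibility_filter(fleet: list[dict], hide_failed_drones: bool = True) -> list[dict]:
    """Drop failed drones (if configured) and strip hidden fields."""
    visible = []
    for drone in fleet:
        if hide_failed_drones and not drone["survived"]:
            continue
        visible.append({k: v for k, v in drone.items() if k not in HIDDEN_FIELDS})
    return visible


def toy_fleet(n: int, rng: random.Random) -> list[dict]:
    """Toy world: a damaged antenna stops emitting, so the drone survives more often."""
    fleet = []
    for _ in range(n):
        damaged = rng.random() < 0.5
        survived = rng.random() < (0.85 if damaged else 0.45)
        fleet.append(
            {"antenna_damaged": damaged, "survived": survived, "hp": 50, "agility": 1.0}
        )
    return fleet


def damaged_share(drones: list[dict]) -> float:
    """Fraction of drones whose antenna is damaged."""
    return sum(d["antenna_damaged"] for d in drones) / len(drones)


rng = random.Random(0)
fleet = toy_fleet(10_000, rng)
seen = visibility_filter(fleet)
print(round(damaged_share(fleet), 2), round(damaged_share(seen), 2))
```

In this toy world the censored view over-represents damaged antennas relative to the full fleet, which is the pattern that tempts an observer to armor the antenna.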

Inside the simulator

CausalGame is a client–server system: agents talk to a FastAPI simulation engine over REST, so any agent framework — single-prompt, ReAct tool-calling, or a full coding agent — plays the exact same game under the same budgets.

Agents

Prompting: one-turn code execution against client.*()

Agentic (ReAct): multi-turn tool calls + analysis sandbox

OpenCode: autonomous coding agent with shell, connected via an MCP game server

↓ REST API (deploy, history, status, final submission) ↓

API layer

Sessions: concurrent isolated runs (X-Session-ID)

Action space: validates designs against scenario schema

Visibility filter: strips HP/agility, censors failed drones

↓

Sim core

SCM engine: per-scenario structural causal model (@register_scm + game.json)

DroneSheet: single source of truth for all drone state

Combat & judge: detection, damage, survival adjudication

Each deployment call runs the full causal pipeline:

1. The agent's DEF design is validated and written to the DroneSheet.
2. The scenario's SCM samples an environment (weather, zones, enemy state) and applies its structural equations to the sheet.
3. Detection and combat are simulated (a live antenna raises detection; EMI breaks comms; storms batter components).
4. The judge adjudicates survival per drone.
5. The visibility filter censors the result (hidden internals removed, failed drones dropped) and returns it to the agent.
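The order of these stages can be sketched as a single function. This is a toy under stated assumptions, not the simulator's code: `run_deployment`, the per-drone dictionaries standing in for the DroneSheet, the single storm flag, and every probability are invented for illustration.

```python
import random


def run_deployment(design: dict, n_drones: int, rng: random.Random) -> list[dict]:
    """Toy version of one deployment call, following the five stages above."""
    # 1. Validate the design and write it to one sheet row per drone.
    if not all(0 <= v <= 50 for v in design.values()):
        raise ValueError("DEF values must lie in [0, 50]")
    sheet = [{"design": dict(design), "hp": 50} for _ in range(n_drones)]

    # 2. Sample an environment once per call (toy: a single storm flag).
    storm = rng.random() < 0.3

    for row in sheet:
        # 3. Toy detection and combat: a live antenna raises the hit chance,
        #    and a storm adds a little more.
        antenna_alive = rng.random() < row["design"]["antenna_def"] / 50
        p_hit = 0.2 + (0.4 if antenna_alive else 0.0) + (0.1 if storm else 0.0)
        row["antenna_alive"] = antenna_alive
        row["hit"] = rng.random() < p_hit
        # 4. Judge survival per drone.
        row["survived"] = not row["hit"]

    # 5. Censor: drop failed drones and strip the hidden hp field.
    return [
        {k: v for k, v in row.items() if k != "hp"}
        for row in sheet
        if row["survived"]
    ]


result = run_deployment({"engine_def": 20, "antenna_def": 0}, 20, random.Random(0))
print(len(result), "of 20 drones returned")
```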

The system is deliberately pluggable on both ends. On the environment side, a scenario is just an SCM class plus a game.json (budgets, visibility, SCM parameters, prompts), so new causal challenges can be added, and the leaderboard extended, without touching the game engine (see "adding a new scenario"). On the agent side, a bundled MCP game server exposes the same actions (get_mission_status, get_flight_history, deploy_drone, submit_final_design) as Model Context Protocol tools, so any MCP-capable agent (OpenCode in our experiments, but also Claude Code, Cursor, and others) can play the game natively, with every action auto-logged for trajectory analysis.
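One common way to get this kind of environment-side pluggability is a decorator-based registry. The sketch below shows the general pattern only: the text names `@register_scm` but not its signature, so the signature used here, the `SCM_REGISTRY` dictionary, the `ToyTrapSCM` class, and the `storm_prob` parameter are all assumptions.

```python
from typing import Callable

SCM_REGISTRY: dict[str, type] = {}  # hypothetical: scenario name -> SCM class


def register_scm(name: str) -> Callable[[type], type]:
    """Assumed signature: register an SCM class under a scenario name."""
    def decorator(cls: type) -> type:
        if name in SCM_REGISTRY:
            raise KeyError(f"scenario {name!r} is already registered")
        SCM_REGISTRY[name] = cls
        return cls
    return decorator


@register_scm("toy_trap")
class ToyTrapSCM:
    """Placeholder scenario; real structural equations would live here."""

    def __init__(self, params: dict):
        # In the described setup these values would come from game.json.
        self.params = params


# The engine can look a scenario up by name without importing it directly.
scm = SCM_REGISTRY["toy_trap"]({"storm_prob": 0.3})
print(type(scm).__name__)
```

With a registry like this, adding a scenario means adding a new decorated class and its configuration file, while the lookup code stays unchanged.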

Go deeper

The 14 scenarios →

The leaderboard →

Run it yourself →