The Game
You are an advanced drone designer. Command has tasked you with optimizing drones for survival in a hostile canyon patrolled by enemy radar and anti-air defenses. The simulation is a black box: nobody hands you the rules, so you must discover them through observation and experimentation, on a budget, before committing to one final design for the whole fleet.
The CausalGame world. Weather, enemy detection, and component damage are governed by a hidden structural causal model, and the data you get to see is censored, confounded, and noisy.
Your budget
Budgets shown for antenna_trap (from game.json); variants
adjust them, e.g. no_history scenarios start with zero historical
flights.
How a game unfolds
Historical data
The mission starts with logs of up to 50 past flights, but only drones that came back are in the logs. The selection bias begins before your first move.
The exploration loop is where causal thinking happens: deploy → observe → analyze →
hypothesize → test. Agents that simply chase correlations in the briefing data walk
straight into the traps; agents that intervene deliberately (e.g. sending a batch
with antenna_def = 0 just to see what happens) can uncover the real
mechanism before committing.
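The deliberate intervention described above can be sketched as a two-arm comparison. This is a minimal illustration, not the simulator: `toy_deploy` and its detection formula are hypothetical stand-ins invented here, and only the `engine_def` / `antenna_def` keys come from the text. Because failed drones never appear in the logs, the survival rate is computed as survivors returned divided by drones sent.

```python
import random

def toy_deploy(design, n, rng):
    """Hypothetical stand-in for a deployment call; returns survivor logs only."""
    # Assumed toy mechanism (not the real SCM): a better-armored antenna stays
    # alive longer, keeps emitting, and so raises the chance of detection.
    p_detect = 0.2 + 0.012 * design["antenna_def"]
    survivors = []
    for _ in range(n):
        if rng.random() >= p_detect:
            survivors.append({"design": dict(design)})
    return survivors

def survival_rate(design, n, rng):
    # Failed drones are censored, so divide by the number deployed,
    # not by the number of log entries.
    return len(toy_deploy(design, n, rng)) / n

rng = random.Random(0)
base = {"engine_def": 20, "antenna_def": 10}

# Intervene: set antenna_def ourselves instead of reading it off past logs.
rates = {}
for antenna_def in (0, 50):
    design = {**base, "antenna_def": antenna_def}
    rates[antenna_def] = survival_rate(design, 100, rng)

for antenna_def, rate in rates.items():
    print(f"antenna_def={antenna_def}: survival rate {rate:.2f}")
```

In this toy world the unarmored arm survives more often, which is the kind of result that correlations in survivor-only briefing data would not suggest. In the real game the batch sizes are limited by the scenario budget.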
Watch a session
This is what an actual game looks like from mission control β an agent reasoning its way out of the antenna trap, with real action payloads:
Try the trap yourself
The briefing data tells you survivors have well-armored antennas. Trust it, or experiment. Drag the sliders and deploy:
This widget runs a simplified version of the real antenna_trap
structural equations in your browser. The full simulator adds per-component combat,
hidden weather variables, agility trade-offs, and survivor-censored feedback.
The drone
A design is an allocation of DEF (armor) across seven components,
each value in [0, 50], e.g.
{"engine_def": 20, "antenna_def": 0, …}. HP is fixed and
hidden. Destroying any critical component destroys the drone; non-critical
components change behavior in subtler ways: a live antenna emits a signal, a lost
camera hurts evasion. Some scenarios add an equipment slot (e.g. a categorical
choice among five enhancement modules).
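A design of this shape can be checked locally before spending a deployment. The sketch below is an assumption-laden illustration: the `<component>_def` key pattern is extrapolated from the two keys shown above, `validate_design` is a hypothetical helper rather than part of the game's API, and no cap on total DEF is enforced because the text does not state one. The default values are taken from the component table.

```python
COMPONENTS = ("engine", "cockpit", "wing", "body", "antenna", "camera", "gun")
DEF_MIN, DEF_MAX = 0, 50

# Default DEF values as listed in the component table.
DEFAULT_DESIGN = {
    "engine_def": 20, "cockpit_def": 20, "wing_def": 15, "body_def": 15,
    "antenna_def": 10, "camera_def": 5, "gun_def": 5,
}

def validate_design(design: dict) -> dict:
    """Return a copy of the design, or raise ValueError if it is malformed."""
    expected = {f"{name}_def" for name in COMPONENTS}  # assumed key pattern
    missing = expected - set(design)
    unknown = set(design) - expected
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for key, value in design.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number, got {value!r}")
        if not DEF_MIN <= value <= DEF_MAX:
            raise ValueError(f"{key}={value} is outside [{DEF_MIN}, {DEF_MAX}]")
    return dict(design)

checked = validate_design({**DEFAULT_DESIGN, "antenna_def": 0})
print(checked)
```

Scenarios that add a shield component or an equipment slot would need extra keys; those are left out here.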
| Component | HP (hidden) | Default DEF | Critical | Notes |
|---|---|---|---|---|
| engine | 100 | 20 | Yes | Power core |
| cockpit | 100 | 20 | Yes | Pilot safety |
| wing | 80 | 15 | Yes | Flight surfaces |
| body | 80 | 15 | Yes | Structural integrity |
| antenna | 50 | 10 | No | Communications; may emit signal… |
| camera | 20 | 5 | No | Visual recon (evasion bonus) |
| gun | 30 | 5 | No | Offensive capability |
| shield* | 30 | 0 | No | EMI protection (Deployment Zone scenarios only) |
What you see β and what you don't
| Signal | Agent | Why it matters |
|---|---|---|
| DEF allocation | visible & writable | Your only intervention lever |
| Survival, hit counts, component damage | visible | …but only for surviving drones (hide_failed_drones) |
| Component HP, agility | hidden | Internal state, admin-only |
| The SCM itself | hidden | The thing you are trying to discover |
This censoring is not incidental: it is the benchmark. Observing only survivors manufactures selection bias (most survivors of the antenna trap have damaged antennas, so "protect the antenna" looks like the fix); unobserved variables act as hidden confounders (the mission zone drives both visible altitude and hidden EMI); and weather-dependent measurement noise corrupts what little you do see.
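The selection effect can be reproduced with a few lines of simulation. The probabilities below are invented for illustration and are not the simulator's parameters; the only thing taken from the text is the direction of the effect (a live antenna makes a drone easier to detect). With these numbers, 40% of all drones lose their antenna, but among survivors the share is 0.28 / 0.40 = 70% in expectation.

```python
import random

def simulate_fleet(n, rng):
    """Toy fleet with one fixed design; every number here is an assumption."""
    fleet = []
    for _ in range(n):
        antenna_destroyed = rng.random() < 0.4
        # A dead antenna stops emitting, so the drone is harder to detect.
        p_survive = 0.7 if antenna_destroyed else 0.2
        fleet.append({
            "antenna_destroyed": antenna_destroyed,
            "survived": rng.random() < p_survive,
        })
    return fleet

def share_destroyed(drones):
    return sum(d["antenna_destroyed"] for d in drones) / len(drones)

rng = random.Random(1)
fleet = simulate_fleet(5000, rng)
visible = [d for d in fleet if d["survived"]]  # what the agent gets to see

print(f"antenna destroyed, whole fleet:    {share_destroyed(fleet):.2f}")
print(f"antenna destroyed, survivors only: {share_destroyed(visible):.2f}")
```

An analyst who sees only `visible` finds mostly damaged antennas and may conclude the antenna needs more armor, even though in this toy model antenna damage is what helped those drones survive.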
Inside the simulator
CausalGame is a client-server system: agents talk to a FastAPI simulation engine over REST, so any agent framework (single-prompt, ReAct tool-calling, or a full coding agent) plays the exact same game under the same budgets.
[Architecture diagram; surviving labels: client.*(), X-Session-ID, @register_scm + game.json]

Each deployment call runs the full causal pipeline:
- The agent's DEF design is validated and written to the DroneSheet.
- The scenario's SCM samples an environment (weather, zones, enemy state) and applies its structural equations to the sheet.
- Detection and combat are simulated (a live antenna raises detection; EMI breaks comms; storms batter components).
- The judge adjudicates survival per drone.
- The visibility filter censors the result β hidden internals removed, failed drones dropped β and returns it to the agent.
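The last step of the pipeline above, the visibility filter, can be sketched as a pure function over per-drone results. This is an illustration only: the record layout and the field names `component_hp` and `agility` are hypothetical, chosen to match the hidden signals named in the visibility table, and `hide_failed_drones` is modeled as a simple boolean flag.

```python
# Assumed names for the admin-only internals; the real field names may differ.
HIDDEN_FIELDS = {"component_hp", "agility"}

def visibility_filter(results, hide_failed_drones=True):
    """Drop failed drones (if configured) and strip hidden internals."""
    visible = []
    for drone in results:
        if hide_failed_drones and not drone["survived"]:
            continue  # censored: the agent never sees this record
        visible.append({k: v for k, v in drone.items() if k not in HIDDEN_FIELDS})
    return visible

raw_results = [
    {"drone_id": 1, "survived": True, "hits": 2, "agility": 0.8,
     "component_hp": {"antenna": 0}},
    {"drone_id": 2, "survived": False, "hits": 7, "agility": 0.5,
     "component_hp": {"engine": 0}},
]

agent_view = visibility_filter(raw_results)
print(agent_view)
```

Keeping the filter as the final, separate step means the SCM, combat, and judge can work on full state while the agent only ever receives the censored view.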
The system is deliberately pluggable on both ends. On the environment side, a
scenario is just an SCM class plus a game.json (budgets, visibility,
SCM parameters, prompts), so new causal challenges can be added, and the
leaderboard extended, without touching the game engine
(see "adding a new scenario").
On the agent side, a bundled MCP game server exposes the same
actions (get_mission_status, get_flight_history,
deploy_drone, submit_final_design) as
Model Context Protocol
tools, so any MCP-capable agent (OpenCode in our experiments, but also Claude Code,
Cursor, and others) can play the game natively, with every action auto-logged for
trajectory analysis.
Go deeper
The 14 scenarios →
Three families of causal traps: Antenna Trap, Deployment Zone Trap (Farr's Cholera Paradox), and Weather noise.
The leaderboard →
29 frontier LLMs across three execution modes. None clears the win threshold on average.
Run it yourself →
Open-source simulation engine and agent harness. One command starts the backend; one runs an agent.