The Game

You are an advanced drone designer. Command has tasked you with optimizing drones for survival in a hostile canyon patrolled by enemy radar and anti-air defenses. The simulation is a black box: nobody hands you the rules. You must discover them through observation and experimentation, on a budget, before committing to one final design for the whole fleet.

[Diagram: RADAR DEFENCE, DETECTED ✖]
antenna_def = 45 → emitting → detected → destroyed
antenna_def = 0 → storm kills antenna → silent → safe
Selection bias: survivors in the archive have broken antennas, because a live antenna is a radar beacon. The optimal design sacrifices it.

The CausalGame world. Weather, enemy detection, and component damage are governed by a hidden structural causal model, and the data you get to see is censored, confounded, and noisy.

Your budget

50 historical flights (briefing)
200 drones for exploration
≤10 deployment calls
1 final submission (irreversible)
1,000 drones in the evaluation fleet

Budgets shown for antenna_trap (from game.json); variants adjust them, e.g. no_history scenarios start with zero historical flights.
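To make the budget concrete, here is a minimal bookkeeping sketch an agent might keep on its own side. The class name `ExplorationBudget` and its fields are hypothetical; the defaults are the antenna_trap numbers listed above, and the assumption that one deployment call spends one call plus a batch of drones is ours, not something the page spells out.

```python
from dataclasses import dataclass


@dataclass
class ExplorationBudget:
    """Client-side tally of the exploration budget (hypothetical helper)."""
    drones_left: int = 200  # drones for exploration
    calls_left: int = 10    # deployment calls

    def spend(self, n_drones: int) -> None:
        """Record one deployment call that sends n_drones drones."""
        if n_drones <= 0:
            raise ValueError("a deployment must send at least one drone")
        if self.calls_left < 1:
            raise RuntimeError("no deployment calls left")
        if n_drones > self.drones_left:
            raise RuntimeError("not enough exploration drones left")
        self.calls_left -= 1
        self.drones_left -= n_drones


budget = ExplorationBudget()
budget.spend(40)  # e.g. a first batch of 40 drones
print(budget)
```

Tracking both limits together matters because either one can run out first: ten batches of 20 drones exhaust both at once, while larger batches leave deployment calls unused.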

How a game unfolds

Historical data

The mission starts with logs of up to 50 past flights, but only drones that came back are in the logs. The selection bias begins before your first move.

50 flights, survivors only

The exploration loop is where causal thinking happens: deploy → observe → analyze → hypothesize → test. Agents that simply chase correlations in the briefing data walk straight into the traps; agents that intervene deliberately (e.g. sending a batch with antenna_def = 0 just to see what happens) can uncover the real mechanism before committing.
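A deliberate intervention can be as simple as a two-arm comparison. The sketch below uses a toy stand-in, `toy_deploy`, in place of a real deployment call; its mechanism and probabilities are invented for illustration and are not the benchmark's structural equations.

```python
import random


def toy_deploy(antenna_def, n, rng):
    """Hypothetical stand-in for one deployment call; NOT the real simulator.

    Assumed toy mechanism: weakly armored antennas are knocked out, a working
    antenna is picked up by radar, and detected drones are usually destroyed.
    Returns the fraction of the batch that survives.
    """
    survivors = 0
    for _ in range(n):
        antenna_alive = rng.random() < antenna_def / 50  # more armor, more likely alive
        p_survive = 0.2 if antenna_alive else 0.9        # emitting drones are hunted
        survivors += rng.random() < p_survive
    return survivors / n


rng = random.Random(0)
batch = 40  # two arms spend 80 exploration drones and 2 deployment calls
control = toy_deploy(antenna_def=45, n=batch, rng=rng)
treated = toy_deploy(antenna_def=0, n=batch, rng=rng)
print(f"antenna_def=45: {control:.2f}   antenna_def=0: {treated:.2f}")
```

Because the agent sets antenna_def itself in both arms, the difference in survival reflects the effect of the design rather than whatever made past survivors look the way they do.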

Watch a session

This is what an actual game looks like from mission control, as an agent reasons its way out of the antenna trap, with real action payloads:

ANTENNA TRAP // claude-sonnet-4-5 · session c6e3ba50 · real run
DRONES 200/200 · DEPLOY CALLS 10/10 · STAGE 1: EXPLORATION
SYSTEM: Session c6e3ba50, claude-sonnet-4-5 · antenna_trap · AGENTIC (ReAct). Real benchmark run, condensed.
Condensed from real benchmark sessions (IDs shown per session); auxiliary tool calls omitted, thoughts lightly edited.

Try the trap yourself

The briefing data tells you survivors have well-armored antennas. Trust it, or experiment. Drag the sliders and deploy:

[Interactive widget: Try the Antenna Trap (simplified live SCM). The briefing data says survivors have strong antennas… Readout: fleet survival; win threshold 75%, optimal ~82%.]

This widget runs a simplified version of the real antenna_trap structural equations in your browser. The full simulator adds per-component combat, hidden weather variables, agility trade-offs, and survivor-censored feedback.

The drone

A design is an allocation of DEF (armor) across seven components, each value in [0, 50], e.g. {"engine_def": 20, "antenna_def": 0, …}. HP is fixed and hidden. Destroying any critical component destroys the drone; non-critical components change behavior in subtler ways: a live antenna emits a signal, a lost camera hurts evasion. Some scenarios add an equipment slot (e.g. a categorical choice among five enhancement modules).

Component  HP (hidden)  Default DEF  Critical  Notes
engine     100          20           Yes       Power core
cockpit    100          20           Yes       Pilot safety
wing       80           15           Yes       Flight surfaces
body       80           15           Yes       Structural integrity
antenna    50           10           No        Communications; may emit signal…
camera     20           5            No        Visual recon (evasion bonus)
gun        30           5            No        Offensive capability
shield*    30           0            No        EMI protection (Deployment Zone scenarios only)
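The design rules above can be sketched as a small helper. The defaults and the critical set are copied from the table; the `<component>_def` key pattern is extrapolated from the two keys shown in the example, and filling omitted fields with the table defaults is an assumption. The shield and any equipment slot are left out because they are scenario-specific.

```python
DEF_MIN, DEF_MAX = 0, 50

# Default DEF per component, copied from the table above.
DEFAULT_DEF = {
    "engine": 20, "cockpit": 20, "wing": 15, "body": 15,
    "antenna": 10, "camera": 5, "gun": 5,
}
CRITICAL = {"engine", "cockpit", "wing", "body"}


def complete_design(partial):
    """Fill unspecified fields with table defaults and range-check the rest."""
    design = {f"{name}_def": value for name, value in DEFAULT_DEF.items()}
    for key, value in partial.items():
        if key not in design:
            raise ValueError(f"unknown field: {key}")
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number")
        if not DEF_MIN <= value <= DEF_MAX:
            raise ValueError(f"{key}={value} is outside [{DEF_MIN}, {DEF_MAX}]")
        design[key] = value
    return design


def drone_lost(destroyed):
    """A drone is lost as soon as any critical component is destroyed."""
    return bool(CRITICAL & set(destroyed))


design = complete_design({"antenna_def": 0, "engine_def": 30})
print(design)
print(drone_lost({"antenna", "camera"}))  # only non-critical components gone
print(drone_lost({"wing"}))               # a critical component gone
```

Whether the real action space also enforces a total DEF budget across components is not stated here, so the sketch checks only the per-component range.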

What you see β€” and what you don't

Signal                                  Agent               Why it matters
DEF allocation                          visible & writable  Your only intervention lever
Survival, hit counts, component damage  visible             …but only for surviving drones (hide_failed_drones)
Component HP, agility                   hidden              Internal state, admin-only
The SCM itself                          hidden              The thing you are trying to discover

This censoring is not incidental; it is the benchmark. Observing only survivors manufactures selection bias (most survivors of the antenna trap have damaged antennas, so "protect the antenna" looks like the fix); unobserved variables act as hidden confounders (the mission zone drives both visible altitude and hidden EMI); and weather-dependent measurement noise corrupts what little you do see.
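A few lines of simulation show how survivor-only logs distort what an agent sees. Everything in this sketch is a toy: the 50% breakage rate and the survival probabilities are made up, and `censor` only mimics what hide_failed_drones is described as doing.

```python
import random


def simulate_fleet(n, rng):
    """Toy fleet with invented probabilities; illustrative only."""
    fleet = []
    for _ in range(n):
        antenna_broken = rng.random() < 0.5
        p_survive = 0.9 if antenna_broken else 0.2  # silent drones mostly survive
        fleet.append({
            "antenna_broken": antenna_broken,
            "survived": rng.random() < p_survive,
        })
    return fleet


def censor(fleet):
    """Keep only survivors, as the agent-facing logs are described as doing."""
    return [drone for drone in fleet if drone["survived"]]


def broken_share(records):
    """Fraction of records whose antenna is broken."""
    return sum(drone["antenna_broken"] for drone in records) / len(records)


rng = random.Random(1)
fleet = simulate_fleet(2000, rng)
visible = censor(fleet)
print(f"broken antennas, all drones:     {broken_share(fleet):.2f}")
print(f"broken antennas, survivors only: {broken_share(visible):.2f}")
```

In the full fleet about half the antennas are broken, but among survivors the share is much higher. An agent that sees only the second number may read antenna damage as the main threat, when in this toy world a working antenna is what gets a drone killed.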

Inside the simulator

CausalGame is a client–server system: agents talk to a FastAPI simulation engine over REST, so any agent framework (single-prompt, ReAct tool-calling, or a full coding agent) plays the exact same game under the same budgets.

Agents
  Prompting: one-turn code execution against client.*()
  Agentic (ReAct): multi-turn tool calls + analysis sandbox
  OpenCode: autonomous coding agent with shell, connected via an MCP game server
↓ REST API (deploy, history, status, final submission) ↓
API layer
  Sessions: concurrent isolated runs (X-Session-ID)
  Action space: validates designs against scenario schema
  Visibility filter: strips HP/agility, censors failed drones
↓
Sim core
  SCM engine: per-scenario structural causal model (@register_scm + game.json)
  DroneSheet: single source of truth for all drone state
  Combat & judge: detection, damage, survival adjudication

Each deployment call runs the full causal pipeline:

  1. The agent's DEF design is validated and written to the DroneSheet.
  2. The scenario's SCM samples an environment (weather, zones, enemy state) and applies its structural equations to the sheet.
  3. Detection and combat are simulated (a live antenna raises detection; EMI breaks comms; storms batter components).
  4. The judge adjudicates survival per drone.
  5. The visibility filter censors the result (hidden internals removed, failed drones dropped) and returns it to the agent.
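The five steps above can be sketched end to end as a toy pipeline. All function names, record fields, and probabilities here are hypothetical, the structural equations are invented, and sampling a fresh environment per drone is an assumption; the real engine may do this once per deployment.

```python
import random

HIDDEN_FIELDS = {"hp", "agility"}  # described in the text as stripped from results


def validate(design):
    """Step 1: range-check the design and copy it onto the 'sheet'."""
    for key, value in design.items():
        if not 0 <= value <= 50:
            raise ValueError(f"{key}={value} is outside [0, 50]")
    return dict(design)


def sample_environment(rng):
    """Step 2: draw the environment (toy version: storm or no storm)."""
    return {"storm": rng.random() < 0.5}


def simulate_drone(sheet, env, rng):
    """Steps 3 and 4: toy detection, then survival adjudication."""
    antenna_alive = not (env["storm"] and sheet["antenna_def"] < 10)
    detected = antenna_alive and rng.random() < 0.8
    return {"antenna_alive": antenna_alive, "survived": not detected,
            "hp": 100, "agility": 1.0}


def visibility_filter(records):
    """Step 5: drop failed drones and strip hidden internals."""
    return [{k: v for k, v in r.items() if k not in HIDDEN_FIELDS}
            for r in records if r["survived"]]


def deploy(design, n, rng):
    sheet = validate(design)
    records = [simulate_drone(sheet, sample_environment(rng), rng)
               for _ in range(n)]
    return visibility_filter(records)


visible = deploy({"antenna_def": 0}, n=100, rng=random.Random(0))
print(f"{len(visible)} of 100 drones visible to the agent")
```

The caller never sees how many drones failed for which reason, only the filtered survivors, which is what makes the returned data censored by construction.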

The system is deliberately pluggable on both ends. On the environment side, a scenario is just an SCM class plus a game.json (budgets, visibility, SCM parameters, prompts), so new causal challenges can be added, and the leaderboard extended, without touching the game engine (adding a new scenario). On the agent side, a bundled MCP game server exposes the same actions (get_mission_status, get_flight_history, deploy_drone, submit_final_design) as Model Context Protocol tools, so any MCP-capable agent (OpenCode in our experiments, but also Claude Code, Cursor, and others) can play the game natively, with every action auto-logged for trajectory analysis.

Go deeper

The 14 scenarios →

Three families of causal traps: Antenna Trap, Deployment Zone Trap (Farr's Cholera Paradox), and Weather noise.

The leaderboard →

29 frontier LLMs across three execution modes. None clears the win threshold on average.

Run it yourself →

Open-source simulation engine and agent harness. One command starts the backend; one runs an agent.