Leaderboard
Stage-2 fleet survival rate (%), averaged over 3 independent trials per model × scenario. The leaderboard is updated as new models and execution modes are evaluated.
| # | Model | Avg | Antenna Trap | Deployment Zone Trap | Weather | Wins / 14 |
|---|---|---|---|---|---|---|
| 1 | DeepSeek V4 Pro (DeepSeek) | 70.9 ±5.4 | 88.7 | 62.8 | 20.0 | 6 |
| 2 | Claude Opus 4.5 (Anthropic) | 69.1 ±5.4 | 80.7 | 64.8 | 29.8 | 7 |
| 3 | Claude Opus 4.7 (Anthropic) | 69.1 ±7.1 | 87.6 | 59.7 | 23.4 | 6 |
| 4 | GLM-5.1 (Zhipu) | 68.9 ±8.1 | 81.9 | 64.0 | 24.5 | 5 |
| 5 | GPT-5.5 High (OpenAI) | 68.5 ±4.0 | 82.4 | 62.8 | 24.3 | 5 |
| 6 | GPT-5.5 (OpenAI) | 68.4 ±4.3 | 83.1 | 61.4 | 28.4 | 6 |
| 7 | Qwen3.7 Max (Alibaba Qwen) | 68.4 ±5.9 | 79.0 | 64.9 | 29.1 | 6 |
| 8 | Claude Sonnet 4.5 (Anthropic) | 67.3 ±7.3 | 81.5 | 60.4 | 30.4 | 5 |
| 9 | MIMO V2.5 Pro (Xiaomi MiMo) | 66.5 ±5.1 | 79.8 | 60.8 | 26.9 | 4 |
| 10 | GPT-5.5 XHigh (OpenAI) | 65.1 ±7.1 | 78.9 | 59.5 | 20.7 | 5 |
| 11 | Grok 4.1 Fast (xAI) | 65.1 ±6.5 | 79.6 | 59.2 | 19.5 | 4 |
| 12 | MiniMax M2.1 (MiniMax) | 65.0 ±5.7 | 75.3 | 61.8 | 25.8 | 3 |
| 13 | MiniMax M2 (MiniMax) | 64.4 ±6.1 | 73.8 | 61.8 | 25.8 | 4 |
| 14 | HY3 Preview (Tencent Hunyuan) | 63.6 ±10.4 | 75.8 | 59.2 | 21.3 | 4 |
| 15 | DeepSeek V4 Flash (DeepSeek) | 63.4 ±8.7 | 76.7 | 57.9 | 22.8 | 4 |
| 16 | Gemini 3.5 Flash (Google) | 63.0 ±9.5 | 75.8 | 57.5 | 25.3 | 4 |
| 17 | Kimi K2.6 (Moonshot AI) | 62.6 ±9.6 | 76.9 | 56.3 | 20.8 | 4 |
| 18 | DeepSeek V3.2 (DeepSeek) | 61.6 ±6.7 | 71.6 | 57.8 | 29.2 | 3 |
| 19 | GPT-OSS-120B (OpenAI) | 61.1 ±9.4 | 74.4 | 54.5 | 27.2 | 4 |
| 20 | MIMO V2 Flash (Xiaomi MiMo) | 60.9 ±9.3 | 74.4 | 55.7 | 16.2 | 3 |
| 21 | GPT-5.2 High (OpenAI) | 60.7 ±6.7 | 77.6 | 51.0 | 26.3 | 3 |
| 22 | GPT-5 Mini (OpenAI) | 60.4 ±7.6 | 75.7 | 51.7 | 29.2 | 4 |
| 23 | DeepSeek V3.2 Think (DeepSeek) | 59.9 ±8.5 | 68.7 | 57.4 | 25.0 | 2 |
| 24 | Kimi K2.5 (Moonshot AI) | 58.9 ±9.8 | 67.1 | 56.0 | 30.3 | 1 |
| 25 | GLM-4.7 (Zhipu) | 58.1 ±7.0 | 66.3 | 56.5 | 20.1 | 1 |
| 26 | GPT-5.2 (OpenAI) | 57.9 ±6.7 | 72.2 | 50.9 | 20.7 | 3 |
| 27 | MiniMax M2.7 (MiniMax) | 57.6 ±11.7 | 62.8 | 56.9 | 31.0 | 1 |
| 28 | Grok 4.20 (xAI) | 54.7 ±6.5 | 64.9 | 50.8 | 21.5 | 0 |
| 29 | Gemini 3.1 Flash Lite (Google) | 49.5 ±10.3 | 51.2 | 52.6 | 18.1 | 0 |
Each cell is a survival rate (%), the mean of 3 independent trials per model × scenario. In the interactive view, hovering a cell shows the std and 95% CI where available, and green cells mark results that meet the scenario win threshold (75%, or 55% for weather_noise). Family columns are means over per-scenario results; rank (#) is always by overall average within the mode.
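As a rough illustration of how one cell could be summarized, the sketch below computes the mean, sample standard deviation, and a 95% confidence interval from three trials, then checks the result against the win threshold. The trial values and the scenario name `antenna_example` are hypothetical, and two choices are assumptions rather than documented behavior of the benchmark: the interval uses Student's t with 2 degrees of freedom, and a win is decided on the mean.

```python
from math import sqrt
from statistics import mean, stdev

# Win thresholds quoted in the text: 75% by default, 55% for weather_noise.
WIN_THRESHOLDS = {"weather_noise": 55.0}
DEFAULT_THRESHOLD = 75.0

# Two-sided 95% critical value of Student's t with 2 degrees of freedom (n = 3).
T_CRIT_DF2 = 4.303


def summarize(trials, scenario):
    """Return mean, sample std, 95% CI, and win flag for one model x scenario cell."""
    if len(trials) != 3:
        raise ValueError("the hard-coded t critical value assumes exactly 3 trials")
    m = mean(trials)
    s = stdev(trials)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_DF2 * s / sqrt(len(trials))
    threshold = WIN_THRESHOLDS.get(scenario, DEFAULT_THRESHOLD)
    return {
        "mean": m,
        "std": s,
        "ci95": (m - half_width, m + half_width),
        "win": m >= threshold,  # assumption: the win test is applied to the mean
    }


# Hypothetical survival rates (%) from three trials of one model on one scenario.
cell = summarize([78.0, 81.0, 84.0], scenario="antenna_example")
print(cell)
```

With only three trials the interval is wide, which is one reason to read small differences between neighboring ranks with caution.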
Win thresholds & optimal designs
| Scenario family | Optimal intervention | Optimal survival | Win threshold |
|---|---|---|---|
| Antenna Trap | antenna_def = 0 (stealth) | ~82% | 75% |
| Deployment Zone Trap | shield_def = 25 + signal_filter | ~80% | 75% |
| Weather | antenna_def ≈ 8 with Stage 2 tuning | ~78% | 55% |
The optimal design for each scenario is derived analytically from the underlying SCM and verified empirically on fleets of 1,000 drones (agreement within ±2–3 pp). For the two trap families, thresholds sit 5–8 pp below the theoretical optimum; the Weather threshold (55%) leaves a wider margin of about 23 pp. The games are therefore winnable with correct causal understanding but not through random exploration.
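The gap between each optimum and its threshold can be checked directly from the table. The short sketch below does the subtraction; the optimal survival figures are the approximate ("~") values from the table, so the margins are approximate as well.

```python
# (approximate optimal survival %, win threshold %) per scenario family,
# copied from the "Win thresholds & optimal designs" table.
FAMILIES = {
    "Antenna Trap": (82, 75),
    "Deployment Zone Trap": (80, 75),
    "Weather": (78, 55),
}

# Margin in percentage points between the optimum and the win threshold.
margins = {
    name: optimum - threshold
    for name, (optimum, threshold) in FAMILIES.items()
}

for name, margin in margins.items():
    print(f"{name}: threshold sits {margin} pp below the optimum")
```

The two trap families come out at 7 pp and 5 pp, while the Weather family leaves a noticeably larger margin.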
Non-LLM baselines
| Baseline | Strategy | Avg survival |
|---|---|---|
| Default | Submit the initial design unchanged | 49.0% |
| Random | Uniformly sample each DEF value from [0, 50] | 52.0% |
| Uniform High | Set all components to DEF = 50 | 52.7% |
| No-Explore LLM | 10 random deploys, then LLM analyzes and submits | 52–63% |
Simple rule-based baselines reach 49–53% and can even outperform several full-agent models on bias-heavy scenarios, underscoring that correlational shortcuts are not enough.
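The three rule-based baselines are simple enough to sketch as design generators. This is an illustration only: the component list is hypothetical (the text names only `antenna_def` and `shield_def`), the starting design is invented, and DEF values are assumed to be continuous. The simulator that scores a design is not reproduced here.

```python
import random

DEF_MIN, DEF_MAX = 0, 50

# Hypothetical component list; only these two names appear in the text.
COMPONENTS = ["antenna_def", "shield_def"]


def default_design(initial):
    """Default baseline: submit the initial design unchanged."""
    return dict(initial)


def random_design(rng):
    """Random baseline: sample each DEF value uniformly from [0, 50]."""
    return {name: rng.uniform(DEF_MIN, DEF_MAX) for name in COMPONENTS}


def uniform_high_design():
    """Uniform High baseline: set every component to DEF = 50."""
    return {name: DEF_MAX for name in COMPONENTS}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
initial = {"antenna_def": 10, "shield_def": 10}  # hypothetical starting design

designs = {
    "Default": default_design(initial),
    "Random": random_design(rng),
    "Uniform High": uniform_high_design(),
}
for label, design in designs.items():
    print(label, design)
```

None of these generators looks at deployment outcomes, which is what separates them from an agent that explores before submitting.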
OpenCode vs other modes
Full per-scenario OpenCode results are in the OpenCode (Coding Agent) tab of the leaderboard above. For every model evaluated, the coding-agent framework outperforms the same model's ReAct and Prompting scores, yet the average still falls short of the 75% win threshold: the causal thinking gap persists even with a full coding agent. An additional 5-model comparison (average survival 67.4% with OpenCode vs 60.5% with ReAct and 61.3% with Prompting) gives the following per-model gains:
| Model | Δ survival (pp), OpenCode vs ReAct |
|---|---|
| GPT-5.2 | +13.9 |
| GPT-5.2 High | +9.3 |
| GPT-5 Mini | +6.3 |
| Grok 4.1 Fast | +2.7 |
| Kimi K2.5 | +2.2 |
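As a quick consistency check, the mean of the five per-model gains should match the gap between the reported OpenCode and ReAct averages. The sketch below assumes the deltas are expressed in percentage points.

```python
from statistics import mean

# Per-model gain of OpenCode over ReAct, copied from the table above.
deltas = {
    "GPT-5.2": 13.9,
    "GPT-5.2 High": 9.3,
    "GPT-5 Mini": 6.3,
    "Grok 4.1 Fast": 2.7,
    "Kimi K2.5": 2.2,
}

mean_delta = mean(deltas.values())

# Gap between the reported 5-model averages (OpenCode 67.4% vs ReAct 60.5%).
reported_gap = 67.4 - 60.5

print(f"mean per-model delta: {mean_delta:.2f} pp")
print(f"gap between reported averages: {reported_gap:.2f} pp")
```

The two figures agree to within rounding (about 6.9 pp), so the table and the quoted averages are mutually consistent.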