Leaderboard

Stage-2 fleet survival rate (%), averaged over 3 independent trials per model × scenario. The leaderboard is updated as new models and execution modes are evaluated.

Structured tool calling over multiple turns with mandatory reasoning and an exploration guard.
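| # | Model | Provider | Avg | Antenna Trap | Deployment Zone Trap | Weather | Wins / 14 |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek V4 Pro | DeepSeek | 70.9 ±5.4 | 88.7 | 62.8 | 20.0 | 6 |
| 2 | Claude Opus 4.5 | Anthropic | 69.1 ±5.4 | 80.7 | 64.8 | 29.8 | 7 |
| 3 | Claude Opus 4.7 | Anthropic | 69.1 ±7.1 | 87.6 | 59.7 | 23.4 | 6 |
| 4 | GLM-5.1 | Zhipu | 68.9 ±8.1 | 81.9 | 64.0 | 24.5 | 5 |
| 5 | GPT-5.5 High | OpenAI | 68.5 ±4.0 | 82.4 | 62.8 | 24.3 | 5 |
| 6 | GPT-5.5 | OpenAI | 68.4 ±4.3 | 83.1 | 61.4 | 28.4 | 6 |
| 7 | Qwen3.7 Max | Alibaba Qwen | 68.4 ±5.9 | 79.0 | 64.9 | 29.1 | 6 |
| 8 | Claude Sonnet 4.5 | Anthropic | 67.3 ±7.3 | 81.5 | 60.4 | 30.4 | 5 |
| 9 | MIMO V2.5 Pro | Xiaomi MiMo | 66.5 ±5.1 | 79.8 | 60.8 | 26.9 | 4 |
| 10 | GPT-5.5 XHigh | OpenAI | 65.1 ±7.1 | 78.9 | 59.5 | 20.7 | 5 |
| 11 | Grok 4.1 Fast | xAI | 65.1 ±6.5 | 79.6 | 59.2 | 19.5 | 4 |
| 12 | MiniMax M2.1 | MiniMax | 65.0 ±5.7 | 75.3 | 61.8 | 25.8 | 3 |
| 13 | MiniMax M2 | MiniMax | 64.4 ±6.1 | 73.8 | 61.8 | 25.8 | 4 |
| 14 | HY3 Preview | Tencent Hunyuan | 63.6 ±10.4 | 75.8 | 59.2 | 21.3 | 4 |
| 15 | DeepSeek V4 Flash | DeepSeek | 63.4 ±8.7 | 76.7 | 57.9 | 22.8 | 4 |
| 16 | Gemini 3.5 Flash | Google | 63.0 ±9.5 | 75.8 | 57.5 | 25.3 | 4 |
| 17 | Kimi K2.6 | Moonshot AI | 62.6 ±9.6 | 76.9 | 56.3 | 20.8 | 4 |
| 18 | DeepSeek V3.2 | DeepSeek | 61.6 ±6.7 | 71.6 | 57.8 | 29.2 | 3 |
| 19 | GPT-OSS-120B | OpenAI | 61.1 ±9.4 | 74.4 | 54.5 | 27.2 | 4 |
| 20 | MIMO V2 Flash | Xiaomi MiMo | 60.9 ±9.3 | 74.4 | 55.7 | 16.2 | 3 |
| 21 | GPT-5.2 High | OpenAI | 60.7 ±6.7 | 77.6 | 51.0 | 26.3 | 3 |
| 22 | GPT-5 Mini | OpenAI | 60.4 ±7.6 | 75.7 | 51.7 | 29.2 | 4 |
| 23 | DeepSeek V3.2 Think | DeepSeek | 59.9 ±8.5 | 68.7 | 57.4 | 25.0 | 2 |
| 24 | Kimi K2.5 | Moonshot AI | 58.9 ±9.8 | 67.1 | 56.0 | 30.3 | 1 |
| 25 | GLM-4.7 | Zhipu | 58.1 ±7.0 | 66.3 | 56.5 | 20.1 | 1 |
| 26 | GPT-5.2 | OpenAI | 57.9 ±6.7 | 72.2 | 50.9 | 20.7 | 3 |
| 27 | MiniMax M2.7 | MiniMax | 57.6 ±11.7 | 62.8 | 56.9 | 31.0 | 1 |
| 28 | Grok 4.20 | xAI | 54.7 ±6.5 | 64.9 | 50.8 | 21.5 | 0 |
| 29 | Gemini 3.1 Flash Lite | Google | 49.5 ±10.3 | 51.2 | 52.6 | 18.1 | 0 |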

A scenario counts as a win when its survival rate meets the scenario win threshold (75%, or 55% for weather_noise). Family columns are means over per-scenario results; rank (#) is always by overall average within the mode.
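The aggregation described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own scoring code: it assumes a win is judged on the 3-trial mean with a greater-or-equal comparison, and that the overall average is the mean of per-scenario means. The scenario name `antenna_trap_1`, the model names, and all numbers in `demo` are hypothetical; only `weather_noise` and the two thresholds come from the text.

```python
from statistics import mean

# Win thresholds in percent, taken from the leaderboard notes.
DEFAULT_THRESHOLD = 75.0
WEATHER_THRESHOLD = 55.0


def win_threshold(scenario: str) -> float:
    """Threshold for one scenario; only weather_noise uses the lower bar."""
    return WEATHER_THRESHOLD if scenario == "weather_noise" else DEFAULT_THRESHOLD


def summarize(trials_by_scenario: dict[str, list[float]]) -> dict:
    """Average the trials of each scenario, then count wins and the overall mean."""
    per_scenario = {name: mean(vals) for name, vals in trials_by_scenario.items()}
    # Assumption: a win is judged on the trial mean, using >=.
    wins = sum(avg >= win_threshold(name) for name, avg in per_scenario.items())
    return {
        "per_scenario": per_scenario,
        "avg": mean(list(per_scenario.values())),
        "wins": wins,
    }


def rank(models: dict[str, dict[str, list[float]]]) -> list[tuple[int, str, float, int]]:
    """Rank models by overall average, highest first."""
    rows = [(name, summarize(trials)) for name, trials in models.items()]
    rows.sort(key=lambda row: row[1]["avg"], reverse=True)
    return [(i + 1, name, s["avg"], s["wins"]) for i, (name, s) in enumerate(rows)]


# Illustrative numbers only, not benchmark results.
demo = {
    "model_a": {"antenna_trap_1": [80.0, 76.0, 78.0], "weather_noise": [50.0, 58.0, 60.0]},
    "model_b": {"antenna_trap_1": [70.0, 72.0, 74.0], "weather_noise": [40.0, 50.0, 48.0]},
}
for row in rank(demo):
    print(row)
```

Ranking by overall average rather than by win count matches the note that rank (#) follows the overall average, which is why a model with fewer wins can sit above one with more.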

Win thresholds & optimal designs

| Scenario family | Optimal intervention | Optimal survival | Win threshold |
|---|---|---|---|
| Antenna Trap | antenna_def = 0 (stealth) | ~82% | 75% |
| Deployment Zone Trap | shield_def = 25 + signal_filter | ~80% | 75% |
| Weather | antenna_def ≈ 8 with Stage 2 tuning | ~78% | 55% |

The optimal design for each scenario is derived analytically from the underlying SCM and verified empirically on fleets of 1,000 drones (agreement within ±2–3 pp). For the two trap families, thresholds sit 5–8 pp below the theoretical optimum; the Weather threshold is set lower, about 23 pp below its optimum. The games are therefore winnable with correct causal understanding but not through random exploration.
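As a quick check on the headroom each family leaves, the gap between the approximate optimum and the win threshold can be computed directly from the table. The sketch below uses the rounded "~" values as printed, so the margins are approximate; the names `FAMILIES` and `margin_pp` are illustrative.

```python
# (approximate optimal survival %, win threshold %) per scenario family,
# copied from the table; the optima are rounded values.
FAMILIES = {
    "Antenna Trap": (82.0, 75.0),
    "Deployment Zone Trap": (80.0, 75.0),
    "Weather": (78.0, 55.0),
}


def margin_pp(optimal: float, threshold: float) -> float:
    """Headroom between optimum and threshold, in percentage points."""
    return optimal - threshold


margins = {name: margin_pp(opt, thr) for name, (opt, thr) in FAMILIES.items()}
for name, gap in margins.items():
    print(f"{name}: {gap:.0f} pp below the optimum")
```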

Non-LLM baselines

| Baseline | Strategy | Avg survival |
|---|---|---|
| Default | Submit the initial design unchanged | 49.0% |
| Random | Uniformly sample each DEF value from [0, 50] | 52.0% |
| Uniform High | Set all components to DEF = 50 | 52.7% |
| No-Explore LLM | 10 random deploys, then LLM analyzes and submits | 52–63% |

Simple rule-based baselines reach 49–53% and can even outperform several full-agent models on bias-heavy scenarios, underscoring that correlational shortcuts are not enough.
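The three rule-based baselines are simple enough to sketch as design generators. The following is an illustration under stated assumptions, not the benchmark's implementation: the design is modeled as a plain dict, the component list is hypothetical (only `antenna_def` and `shield_def` are named in the text), and DEF values are treated as continuous, which the source does not specify.

```python
import random

DEF_MIN, DEF_MAX = 0.0, 50.0

# Hypothetical component list; the real design may have more components.
COMPONENTS = ["antenna_def", "shield_def"]


def default_design(initial: dict[str, float]) -> dict[str, float]:
    """Default baseline: submit the initial design unchanged (as a copy)."""
    return dict(initial)


def random_design(rng: random.Random) -> dict[str, float]:
    """Random baseline: sample each DEF value uniformly from [0, 50]."""
    return {name: rng.uniform(DEF_MIN, DEF_MAX) for name in COMPONENTS}


def uniform_high_design() -> dict[str, float]:
    """Uniform High baseline: set every component to DEF = 50."""
    return {name: DEF_MAX for name in COMPONENTS}


rng = random.Random(0)  # seeded so the sketch is reproducible
initial = {"antenna_def": 10.0, "shield_def": 10.0}  # illustrative values
print(default_design(initial))
print(random_design(rng))
print(uniform_high_design())
```

None of these generators looks at deployment feedback, which is what separates them from the agent modes on the leaderboard.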

OpenCode vs other modes

Full per-scenario OpenCode results are in the OpenCode (Coding Agent) tab of the leaderboard above. Across every model evaluated, the coding-agent framework outperforms the model's own ReAct and Prompting scores, yet the average still falls short of the 75% win threshold, so the causal thinking gap persists even with a full coding agent. An additional 5-model comparison (average 67.4% OpenCode vs 60.5% ReAct and 61.3% Prompting) gives the following per-model gains:

| Model | Δ survival, OpenCode vs ReAct (pp) |
|---|---|
| GPT-5.2 | +13.9 |
| GPT-5.2 High | +9.3 |
| GPT-5 Mini | +6.3 |
| Grok 4.1 Fast | +2.7 |
| Kimi K2.5 | +2.2 |
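The per-model gains can be cross-checked against the reported 5-model averages. The sketch below assumes the deltas are in percentage points and that the averages are unweighted means over the same five models; the names `DELTAS` and `mean_delta` are illustrative.

```python
from statistics import mean

# OpenCode minus ReAct survival, copied from the table.
DELTAS = {
    "GPT-5.2": 13.9,
    "GPT-5.2 High": 9.3,
    "GPT-5 Mini": 6.3,
    "Grok 4.1 Fast": 2.7,
    "Kimi K2.5": 2.2,
}


def mean_delta(deltas: dict[str, float]) -> float:
    """Unweighted mean gain across models."""
    return mean(list(deltas.values()))


reported_gap = 67.4 - 60.5  # reported OpenCode average minus ReAct average
print(f"mean of per-model deltas: {mean_delta(DELTAS):.2f}")
print(f"gap between reported averages: {reported_gap:.2f}")
```

The two figures agree to within rounding of the one-decimal values, and the gain is concentrated in the GPT-5.2 variants.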