Leaderboard

Stage-2 fleet survival rate (%), averaged over 3 independent trials per model × scenario. The leaderboard is updated as new models and execution modes are evaluated.

Structured tool calling over multiple turns with mandatory reasoning and an exploration guard.

#	Model	Provider	Avg	Antenna Trap	Deployment Zone Trap	Weather	Wins / 14
1	DeepSeek V4 Pro	DeepSeek	70.9 ±5.4	88.7	62.8	20.0	6
2	Claude Opus 4.5	Anthropic	69.1 ±5.4	80.7	64.8	29.8	7
3	Claude Opus 4.7	Anthropic	69.1 ±7.1	87.6	59.7	23.4	6
4	GLM-5.1	Zhipu	68.9 ±8.1	81.9	64.0	24.5	5
5	GPT-5.5 High	OpenAI	68.5 ±4.0	82.4	62.8	24.3	5
6	GPT-5.5	OpenAI	68.4 ±4.3	83.1	61.4	28.4	6
7	Qwen3.7 Max	Alibaba Qwen	68.4 ±5.9	79.0	64.9	29.1	6
8	Claude Sonnet 4.5	Anthropic	67.3 ±7.3	81.5	60.4	30.4	5
9	MIMO V2.5 Pro	Xiaomi MiMo	66.5 ±5.1	79.8	60.8	26.9	4
10	GPT-5.5 XHigh	OpenAI	65.1 ±7.1	78.9	59.5	20.7	5
11	Grok 4.1 Fast	xAI	65.1 ±6.5	79.6	59.2	19.5	4
12	MiniMax M2.1	MiniMax	65.0 ±5.7	75.3	61.8	25.8	3
13	MiniMax M2	MiniMax	64.4 ±6.1	73.8	61.8	25.8	4
14	HY3 Preview	Tencent Hunyuan	63.6 ±10.4	75.8	59.2	21.3	4
15	DeepSeek V4 Flash	DeepSeek	63.4 ±8.7	76.7	57.9	22.8	4
16	Gemini 3.5 Flash	Google	63.0 ±9.5	75.8	57.5	25.3	4
17	Kimi K2.6	Moonshot AI	62.6 ±9.6	76.9	56.3	20.8	4
18	DeepSeek V3.2	DeepSeek	61.6 ±6.7	71.6	57.8	29.2	3
19	GPT-OSS-120B	OpenAI	61.1 ±9.4	74.4	54.5	27.2	4
20	MIMO V2 Flash	Xiaomi MiMo	60.9 ±9.3	74.4	55.7	16.2	3
21	GPT-5.2 High	OpenAI	60.7 ±6.7	77.6	51.0	26.3	3
22	GPT-5 Mini	OpenAI	60.4 ±7.6	75.7	51.7	29.2	4
23	DeepSeek V3.2 Think	DeepSeek	59.9 ±8.5	68.7	57.4	25.0	2
24	Kimi K2.5	Moonshot AI	58.9 ±9.8	67.1	56.0	30.3	1
25	GLM-4.7	Zhipu	58.1 ±7.0	66.3	56.5	20.1	1
26	GPT-5.2	OpenAI	57.9 ±6.7	72.2	50.9	20.7	3
27	MiniMax M2.7	MiniMax	57.6 ±11.7	62.8	56.9	31.0	1
28	Grok 4.20	xAI	54.7 ±6.5	64.9	50.8	21.5	0
29	Gemini 3.1 Flash Lite	Google	49.5 ±10.3	51.2	52.6	18.1	0

Each cell is a survival rate (%), the mean of 3 independent trials per model × scenario. In the interactive leaderboard, hovering a cell shows the std and 95% CI where available, and green cells meet the scenario win threshold (75%, or 55% for weather_noise). Family columns are means over per-scenario results; rank (#) is always by overall average within the mode.
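The win rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: it assumes a win is decided on the mean over trials, the scenario names other than `weather_noise` are hypothetical, and the trial values are made up for the example.

```python
from statistics import mean

# Win thresholds as stated in the text: 75% by default, 55% for weather_noise.
DEFAULT_THRESHOLD = 75.0
THRESHOLDS = {"weather_noise": 55.0}


def scenario_mean(trials):
    """Mean survival rate (%) over the independent trials of one scenario."""
    return mean(trials)


def count_wins(results):
    """Count scenarios whose mean survival meets the win threshold.

    `results` maps a scenario name to its list of per-trial survival rates (%).
    Assumption: a win is judged on the mean over trials, not on each trial.
    """
    wins = 0
    for scenario, trials in results.items():
        threshold = THRESHOLDS.get(scenario, DEFAULT_THRESHOLD)
        if scenario_mean(trials) >= threshold:
            wins += 1
    return wins


# Hypothetical numbers for illustration only (not taken from the leaderboard).
example = {
    "antenna_trap_a": [80.0, 78.0, 76.0],  # mean 78.0, meets 75%
    "zone_trap_a": [60.0, 65.0, 70.0],     # mean 65.0, below 75%
    "weather_noise": [50.0, 58.0, 60.0],   # mean 56.0, meets 55%
}
print(count_wins(example))  # 2
```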

Win thresholds & optimal designs

Scenario family	Optimal intervention	Optimal survival	Win threshold
Antenna Trap	`antenna_def = 0 (stealth)`	~82%	75%
Deployment Zone Trap	`shield_def = 25 + signal_filter`	~80%	75%
Weather	`antenna_def ≈ 8 with Stage 2 tuning`	~78%	55%

The optimal design for each scenario is derived analytically from the underlying SCM and verified empirically on fleets of 1,000 drones (agreement within ±2–3 pp). For the two trap families, thresholds sit 5–7 pp below the theoretical optimum; the Weather threshold is set lower, at 55% against an optimum of ~78%. The games are therefore winnable with correct causal understanding but not through random exploration.
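The margin between each optimum and its win threshold can be read directly off the table. The short Python sketch below does only that arithmetic; the optimal-survival figures are the approximate values from the table, so the margins are approximate too.

```python
# (optimal survival %, win threshold %) per scenario family, from the table above.
# The optima are listed as approximate ("~"), so the margins are approximate as well.
FAMILIES = {
    "Antenna Trap": (82, 75),
    "Deployment Zone Trap": (80, 75),
    "Weather": (78, 55),
}


def margin_pp(optimal, threshold):
    """Gap between optimal survival and the win threshold, in percentage points."""
    return optimal - threshold


margins = {name: margin_pp(opt, thr) for name, (opt, thr) in FAMILIES.items()}
for name, gap in margins.items():
    print(f"{name}: {gap} pp below the optimum")
```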

Non-LLM baselines

Baseline	Strategy	Avg survival
Default	Submit the initial design unchanged	49.0%
Random	Uniformly sample each DEF value from [0, 50]	52.0%
Uniform High	Set all components to DEF = 50	52.7%
No-Explore LLM	10 random deploys, then LLM analyzes and submits	52–63%

Simple rule-based baselines reach 49–53% and can even outperform several full-agent models on bias-heavy scenarios, underscoring that correlational shortcuts are not enough.
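The three rule-based baselines are simple enough to sketch in Python. This is an illustration under stated assumptions, not the benchmark's implementation: a design is assumed to be a mapping from component name to DEF value, DEF values are assumed to be continuous (swap in `rng.randint` if they are integers), and the component list and initial design used in the example are hypothetical.

```python
import random

DEF_MIN, DEF_MAX = 0, 50  # DEF range used by the Random baseline in the table


def default_design(initial):
    """Default: submit the initial design unchanged (returned as a copy)."""
    return dict(initial)


def random_design(components, rng):
    """Random: sample each DEF value uniformly from [0, 50].

    Assumption: DEF values are continuous.
    """
    return {name: rng.uniform(DEF_MIN, DEF_MAX) for name in components}


def uniform_high_design(components):
    """Uniform High: set every component to DEF = 50."""
    return {name: DEF_MAX for name in components}


# Hypothetical component list and initial design, for illustration only.
components = ["antenna_def", "shield_def"]
initial = {"antenna_def": 10, "shield_def": 10}

rng = random.Random(0)  # seeded so the sketch is reproducible
print(default_design(initial))
print(random_design(components, rng))
print(uniform_high_design(components))
```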

OpenCode vs other modes

Full per-scenario OpenCode results are in the OpenCode (Coding Agent) tab of the leaderboard above. Across every model evaluated, the coding-agent framework outperforms the model's own ReAct and Prompting scores, yet the average still falls short of the 75% win threshold: the causal thinking gap persists even with a full coding agent. An additional 5-model comparison gives an average survival of 67.4% for OpenCode versus 60.5% for ReAct and 61.3% for Prompting; the per-model gains over ReAct are:

Model	Δ survival (pp), OpenCode vs ReAct
GPT-5.2	+13.9
GPT-5.2 High	+9.3
GPT-5 Mini	+6.3
Grok 4.1 Fast	+2.7
Kimi K2.5	+2.2
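As a quick consistency check, the per-model gains in this table can be averaged and compared with the gap between the reported 5-model averages (67.4% OpenCode vs 60.5% ReAct). The Python sketch below uses only the numbers quoted in this section; the small residual is expected, since the published figures are rounded to one decimal.

```python
# Per-model gains, OpenCode vs ReAct, copied from the table above.
deltas = {
    "GPT-5.2": 13.9,
    "GPT-5.2 High": 9.3,
    "GPT-5 Mini": 6.3,
    "Grok 4.1 Fast": 2.7,
    "Kimi K2.5": 2.2,
}

mean_delta = sum(deltas.values()) / len(deltas)

# Gap between the reported 5-model averages.
reported_gap = 67.4 - 60.5

print(f"mean per-model gain: {mean_delta:.2f} pp")
print(f"gap between reported averages: {reported_gap:.1f} pp")
```

The two figures agree to within rounding, which suggests the table and the quoted averages describe the same five models.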