Static LLM benchmark dashboard

Agentic model rankings with explicit coverage and quality gates.

Real local-model eval results for Hermes-style agent and tool-calling work. To lead overall, a model needs eligible data in every category, and that requirement continues to hold as the benchmark set grows.

Current Leaders

Incomplete models can win category cards, but not the overall leaderboard.
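The gating rule above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual implementation: the `ModelResult` class, the function names, and the mean used to aggregate an overall score are all assumptions made for the example. Only the category names and the two Agentic Tool Use scores come from this page.

```python
from dataclasses import dataclass, field

# Category names as shown on the dashboard.
CATEGORIES = ["Agentic Tool Use", "Agentic Coding", "Long-Term Tasks", "Speed"]


@dataclass
class ModelResult:
    """Hypothetical record: one model and its eligible per-category scores."""
    name: str
    # Maps category name -> eligible score; a missing key means "No data".
    scores: dict[str, float] = field(default_factory=dict)


def overall_status(model: ModelResult) -> str:
    """Return 'Eligible' only when every category has an eligible score."""
    if all(cat in model.scores for cat in CATEGORIES):
        return "Eligible"
    return "Incomplete"


def category_leader(models, category):
    """Return (name, score) for the best model in a category, or None."""
    scored = [(m.scores[category], m.name) for m in models if category in m.scores]
    if not scored:
        return None  # rendered as "No data" on a category card
    score, name = max(scored)
    return name, score


def overall_leader(models):
    """Only fully covered models compete for the overall leaderboard."""
    eligible = [m for m in models if overall_status(m) == "Eligible"]
    if not eligible:
        return None
    # Assumption: the overall score is a plain mean across categories.
    best = max(eligible, key=lambda m: sum(m.scores.values()) / len(CATEGORIES))
    return best.name


models = [
    ModelResult("GPT-5.5", {"Agentic Tool Use": 72.0}),
    ModelResult("Qwen3.6 35B A3B Q4 MLX", {"Agentic Tool Use": 86.0}),
]

print(category_leader(models, "Agentic Tool Use"))
print(category_leader(models, "Agentic Coding"))
print([overall_status(m) for m in models])
print(overall_leader(models))
```

With the two models listed on this page, the sketch picks Qwen3.6 35B A3B Q4 MLX as the Agentic Tool Use leader, returns no leader for the other categories, and marks both models as Incomplete overall, so no overall leader is returned.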

Best Agentic Tool Use: Eligible
Qwen3.6 35B A3B Q4 MLX
OMLX local / Qwen
Leader score: 86.0
1/1 benchmark covered
Latest run: Jul 4, 2026

Best Agentic Coding: No data
No eligible model has results for this category.

Best Long-Term Tasks: No data
No eligible model has results for this category.

Best Speed: No data
No eligible model has results for this category.

Model Comparison

Switch categories and views without hiding dates, harnesses, coverage, or score status.

| Rank | Model | Provider | Overall | Status / coverage | Score rail | Latest run | Benchmark / harness | Pass rate | Latency / TPS | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 | GPT-5.5 (Version gpt-5.5) | OpenAI Codex / GPT | No data | Incomplete, 1/1 benchmarks | Incomplete | Jul 4, 2026 | Hermes Tool Contract (Hermes Agent Evals hermes_tool_contract_v0) | 72.0% | 5.47s / - | - |
| 02 | Qwen3.6 35B A3B Q4 MLX (Version Qwen3.6-35B-A3B-UD-Q4_K_XL-mlx) | OMLX local / Qwen | No data | Incomplete, 1/1 benchmarks | Incomplete | Jul 4, 2026 | Hermes Tool Contract (Hermes Agent Evals hermes_tool_contract_v0) | 86.0% | 9.67s / - | $0.00 |

Chart: No data. No eligible rows are available for this chart.
Not charted: GPT-5.5 (Incomplete), Qwen3.6 35B A3B Q4 MLX (Incomplete)

| Rank | Model | Provider | Agentic Tool Use | Status / coverage | Score rail | Latest run | Benchmark / harness | Pass rate | Latency / TPS | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 | Qwen3.6 35B A3B Q4 MLX (Version Qwen3.6-35B-A3B-UD-Q4_K_XL-mlx) | OMLX local / Qwen | 86.0 | Eligible, 1/1 benchmarks | Eligible, 86.0 | Jul 4, 2026 | Hermes Tool Contract (Hermes Agent Evals hermes_tool_contract_v0) | 86.0% | 9.67s / - | $0.00 |
| 02 | GPT-5.5 (Version gpt-5.5) | OpenAI Codex / GPT | 72.0 | Eligible, 1/1 benchmarks | Eligible, 72.0 | Jul 4, 2026 | Hermes Tool Contract (Hermes Agent Evals hermes_tool_contract_v0) | 72.0% | 5.47s / - | - |

| Rank | Model | Provider | Agentic Coding | Status / coverage | Score rail | Latest run | Benchmark / harness | Pass rate | Latency / TPS | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 | GPT-5.5 (Version gpt-5.5) | OpenAI Codex / GPT | No data | No data, 0/0 benchmarks | No data | - | No data | - | - | - |
| 02 | Qwen3.6 35B A3B Q4 MLX (Version Qwen3.6-35B-A3B-UD-Q4_K_XL-mlx) | OMLX local / Qwen | No data | No data, 0/0 benchmarks | No data | - | No data | - | - | - |

Chart: No data. No eligible rows are available for this chart.
Not charted: GPT-5.5 (No data), Qwen3.6 35B A3B Q4 MLX (No data)

| Rank | Model | Provider | Long-Term Tasks | Status / coverage | Score rail | Latest run | Benchmark / harness | Pass rate | Latency / TPS | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 | GPT-5.5 (Version gpt-5.5) | OpenAI Codex / GPT | No data | No data, 0/0 benchmarks | No data | - | No data | - | - | - |
| 02 | Qwen3.6 35B A3B Q4 MLX (Version Qwen3.6-35B-A3B-UD-Q4_K_XL-mlx) | OMLX local / Qwen | No data | No data, 0/0 benchmarks | No data | - | No data | - | - | - |

Chart: No data. No eligible rows are available for this chart.
Not charted: GPT-5.5 (No data), Qwen3.6 35B A3B Q4 MLX (No data)

| Rank | Model | Provider | Speed | Status / coverage | Score rail | Latest run | Benchmark / harness | Pass rate | Completion / TPS | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| 01 | GPT-5.5 (Version gpt-5.5) | OpenAI Codex / GPT | No data | No data, 0/0 benchmarks | No data | - | No data | - | - | - |
| 02 | Qwen3.6 35B A3B Q4 MLX (Version Qwen3.6-35B-A3B-UD-Q4_K_XL-mlx) | OMLX local / Qwen | No data | No data, 0/0 benchmarks | No data | - | No data | - | - | - |

Chart: No data. No eligible rows are available for this chart.
Not charted: GPT-5.5 (No data), Qwen3.6 35B A3B Q4 MLX (No data)