Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. tool_state=76.0%, strict=76.0%, final_text=92.0%, schema_valid=100.0%, required_tool=92.0%, hallucinated_tool=0%, forbidden_tool=0%, recovery=100.0%. avg_tool_calls=3.44, avg_turns=3.96, output_tps=5.7, input_tokens=135960, output_tokens=20353. non_text_failures: final_state_wrong=4, max_tool_calls=4, missing_required_tool=4, too_many_tool_calls=12. strict_failure_breakdown: final_answer_missing_expected_text=4, final_state_wrong=4, max_tool_calls=4, missing_required_tool=4, too_many_tool_calls=12. Rerun after replacing broken Qwen3.6-27B-UD-Q4_K_XL-mlx artifact with Qwen3.6-27B-OptiQ-4bit.
Artifact/log: -
OMLX local
Qwen3.6 27B OptiQ 4bit MLX
Local OpenAI-compatible OMLX provider; OptiQ 4-bit MLX quantization. Replaces the broken Qwen3.6-27B-UD-Q4_K_XL-mlx artifact that failed to load.
Score Summary
Overall eligibility requires eligible data in all four categories.
Overall
Incomplete for overall leaderboards
Agentic Tool Use
2/2 benchmarks covered
Agentic Coding
0/0 benchmarks covered
Long-Term Tasks
0/0 benchmarks covered
Speed
0/0 benchmarks covered
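The eligibility rule above can be sketched in a few lines of Python. This is only an illustration: the `Coverage` class, the `eligible` property, and the rule that a category with zero benchmarks does not count as eligible are assumptions, not the leaderboard's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    covered: int  # benchmarks with results for this model
    total: int    # benchmarks counted in the category

    @property
    def eligible(self) -> bool:
        # Assumption: a category is eligible only if it has at least one
        # benchmark and every benchmark in it is covered.
        return self.total > 0 and self.covered == self.total


def overall_eligible(categories: dict) -> bool:
    # Overall eligibility requires eligible data in all categories.
    return all(c.eligible for c in categories.values())


# Coverage figures as listed in the score summary.
categories = {
    "Agentic Tool Use": Coverage(2, 2),
    "Agentic Coding": Coverage(0, 0),
    "Long-Term Tasks": Coverage(0, 0),
    "Speed": Coverage(0, 0),
}

missing = [name for name, c in categories.items() if not c.eligible]
print(overall_eligible(categories))
print(missing)
```

Under these assumptions only Agentic Tool Use qualifies, which is consistent with the model being marked incomplete for overall leaderboards.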
Latest Results
2 eval runs
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. tool_state=89.0%, strict=80.0%, final_text=83.0%, schema_valid=100.0%, required_tool=96.0%, hallucinated_tool=1.0%, forbidden_tool=0%, recovery=83.33%. avg_tool_calls=1.45, avg_turns=2.31, output_tps=7.1, input_tokens=109071, output_tokens=25429. non_text_failures: final_state_wrong=3, hallucinated_tool=1, max_tool_calls=8, missing_required_tool=4. strict_failure_breakdown: final_answer_missing_expected_text=17, final_state_wrong=3, hallucinated_tool=1, max_tool_calls=8, missing_required_tool=4. Rerun after replacing broken Qwen3.6-27B-UD-Q4_K_XL-mlx artifact with Qwen3.6-27B-OptiQ-4bit.
Artifact/log: -
Historical Runs
2 eval runs
Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. tool_state=76.0%, strict=76.0%, final_text=92.0%, schema_valid=100.0%, required_tool=92.0%, hallucinated_tool=0%, forbidden_tool=0%, recovery=100.0%. avg_tool_calls=3.44, avg_turns=3.96, output_tps=5.7, input_tokens=135960, output_tokens=20353. non_text_failures: final_state_wrong=4, max_tool_calls=4, missing_required_tool=4, too_many_tool_calls=12. strict_failure_breakdown: final_answer_missing_expected_text=4, final_state_wrong=4, max_tool_calls=4, missing_required_tool=4, too_many_tool_calls=12. Rerun after replacing broken Qwen3.6-27B-UD-Q4_K_XL-mlx artifact with Qwen3.6-27B-OptiQ-4bit.
Artifact/log: -
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. tool_state=76.0%, strict=80.0%, final_text=83.0%, schema_valid=100.0%, required_tool=96.0%, hallucinated_tool=1.0%, forbidden_tool=0%, recovery=83.33%. avg_tool_calls=1.45, avg_turns=2.31, output_tps=7.1, input_tokens=109071, output_tokens=25429. non_text_failures: final_state_wrong=3, hallucinated_tool=1, max_tool_calls=8, missing_required_tool=4. strict_failure_breakdown: final_answer_missing_expected_text=17, final_state_wrong=3, hallucinated_tool=1, max_tool_calls=8, missing_required_tool=4. Rerun after replacing broken Qwen3.6-27B-UD-Q4_K_XL-mlx artifact with Qwen3.6-27B-OptiQ-4bit.
Artifact/log: -
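The run summaries above pack their headline metrics into `key=value` pairs, so pass rates can be turned back into case counts with a little parsing. The sketch below is a hypothetical helper (`parse_summary` and `passed_cases` are not part of any Hermes tooling) and it only reads the headline portion of a summary, because the two failure breakdowns reuse the same keys.

```python
import re


def parse_summary(line: str) -> dict:
    """Collect key=value pairs as floats, dropping any trailing '%'."""
    pairs = re.findall(r"(\w+)=(\d+(?:\.\d+)?)%?", line)
    return {key: float(value) for key, value in pairs}


def passed_cases(line: str, metric: str) -> int:
    """Convert a percentage metric into a number of passing cases."""
    cases = int(re.search(r"(\d+) cases", line).group(1))
    return round(cases * parse_summary(line)[metric] / 100)


# Headline portion of the "hard v1" summary, copied from the run above.
summary = (
    "Hermes tool-contract hard v1: 50 cases. "
    "Primary score is tool/state pass rate. "
    "tool_state=76.0%, strict=76.0%, final_text=92.0%, "
    "schema_valid=100.0%, required_tool=92.0%"
)

stats = parse_summary(summary)
print(stats["tool_state"])                  # 76.0
print(passed_cases(summary, "tool_state"))  # 38
print(passed_cases(summary, "final_text"))  # 46
```

Read this way, a 76.0% tool/state pass rate on 50 cases corresponds to 38 passing cases and 12 failing ones. The failure tags listed in a summary can add up to more than the number of failing cases, which suggests that a single case may carry several tags.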