Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. Local OMLX Q8 run.
Pass rates: tool_state=82%, strict=82%, final_text=94%, schema_valid=100%, required_tool=98%, hallucinated_tool=0%, forbidden_tool=0%, recovery=90.0%.
Usage: avg_tool_calls=3.38, output_tps=25.4, input_tokens=126116, output_tokens=20599.
non_text_failures: max_tool_calls=4, missing_required_tool=1, too_many_tool_calls=8.
strict_failure_breakdown: final_answer_missing_expected_text=3, max_tool_calls=4, missing_required_tool=1, too_many_tool_calls=8.
Artifact/log: -
OMLX local
Qwen3.6 35B A3B Q8 MLX
Local OpenAI-compatible OMLX provider; Q8_K_XL MLX quantization.
Score Summary
Overall eligibility requires eligible data in all four categories.
Overall: Incomplete for overall leaderboards
Agentic Tool Use: 2/2 benchmarks covered
Agentic Coding: 0/0 benchmarks covered
Long-Term Tasks: 0/0 benchmarks covered
Speed: 0/0 benchmarks covered
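The eligibility rule above can be sketched in a few lines of Python. This is only an illustration: the page does not define what makes a category "eligible", so the sketch assumes a category is eligible once at least one benchmark in it is covered. The names `coverage`, `category_eligible`, and `overall_eligible` are hypothetical.

```python
# Coverage per category as (covered, total), copied from the Score Summary.
coverage = {
    "Agentic Tool Use": (2, 2),
    "Agentic Coding": (0, 0),
    "Long-Term Tasks": (0, 0),
    "Speed": (0, 0),
}


def category_eligible(covered: int, total: int) -> bool:
    # Assumption (not stated on the page): a category is eligible
    # once at least one of its benchmarks has been covered.
    return covered > 0


def overall_eligible(cov: dict) -> bool:
    # Overall eligibility requires every category to be eligible.
    return all(category_eligible(c, t) for c, t in cov.values())


# Categories that currently block overall eligibility.
missing = [name for name, (c, t) in coverage.items() if not category_eligible(c, t)]

print(overall_eligible(coverage))
print(missing)
```

Under that assumption, the three categories with 0/0 coverage are what keep this model marked as incomplete for the overall leaderboards.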
Latest Results
2 eval runs
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. Local OMLX Q8 run.
Pass rates: tool_state=86%, strict=82%, final_text=85%, schema_valid=100%, required_tool=95%, hallucinated_tool=0%, forbidden_tool=1%, recovery=50.0%.
Usage: avg_tool_calls=1.46, output_tps=28.8, input_tokens=107163, output_tokens=28249.
non_text_failures: final_state_wrong=3, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5.
strict_failure_breakdown: final_answer_missing_expected_text=15, final_state_wrong=3, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5.
Artifact/log: -
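The run summaries on this page pack their metrics into a flat `key=value` list. For comparing runs programmatically, a minimal parsing sketch might look like the following. It assumes the format is exactly as shown here (percentages carry a trailing `%`, everything else is a plain number). `parse_metrics` is a hypothetical helper, not part of any benchmark tooling.

```python
import re

# Metric portion of the "hard v1" run summary, copied from this page.
summary = (
    "tool_state=82%, strict=82%, final_text=94%, schema_valid=100%, "
    "required_tool=98%, hallucinated_tool=0%, forbidden_tool=0%, recovery=90.0%. "
    "avg_tool_calls=3.38, output_tps=25.4, input_tokens=126116, output_tokens=20599."
)


def parse_metrics(text: str) -> dict:
    """Parse key=value pairs; percentages become fractions in [0, 1]."""
    metrics = {}
    for key, value, pct in re.findall(r"(\w+)=(\d+(?:\.\d+)?)(%?)", text):
        number = float(value)
        metrics[key] = number / 100 if pct else number
    return metrics


metrics = parse_metrics(summary)
print(metrics["tool_state"], metrics["avg_tool_calls"])
```

The failure breakdowns reuse keys such as `max_tool_calls` across `non_text_failures` and `strict_failure_breakdown`, so those two lists would need to be parsed separately to avoid one overwriting the other.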