Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. tool_state=68%, strict=68%, final_text=100%, schema_valid=100%, required_tool=92%, hallucinated_tool=0%, forbidden_tool=0%, recovery=80.0%. avg_tool_calls=3.52, output_tps=18.7, input_tokens=146562, output_tokens=19980. non_text_failures: final_state_wrong=5, max_tool_calls=2, missing_required_tool=4, too_many_tool_calls=12. strict_failure_breakdown: final_state_wrong=5, max_tool_calls=2, missing_required_tool=4, too_many_tool_calls=12. Local OMLX Qwen3.5 9B 4-bit run.
Artifact/log: -
OMLX local
Qwen3.5 9B MLX 4bit
Local OpenAI-compatible OMLX provider; 4-bit MLX quantization.
Score Summary
Overall eligibility requires eligible data in all four categories.
Overall
Incomplete for overall leaderboards
Agentic Tool Use
2/2 benchmarks covered
Agentic Coding
0/0 benchmarks covered
Long-Term Tasks
0/0 benchmarks covered
Speed
0/0 benchmarks covered
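The eligibility rule stated in the summary above can be sketched in a few lines. This is only an illustration under assumptions: the leaderboard's real logic is not shown on this page, the helper names are hypothetical, the "Eligible" label is invented for the example, and a category is assumed to count as having eligible data when at least one benchmark is covered.

```python
# Coverage per category as listed in the Score Summary: (covered, total).
coverage = {
    "Agentic Tool Use": (2, 2),
    "Agentic Coding": (0, 0),
    "Long-Term Tasks": (0, 0),
    "Speed": (0, 0),
}


def category_has_data(covered: int, total: int) -> bool:
    # Assumption: "eligible data" means at least one covered benchmark.
    return covered > 0


def overall_status(cov: dict) -> tuple:
    # Overall eligibility requires eligible data in all four categories.
    missing = [name for name, (c, t) in cov.items() if not category_has_data(c, t)]
    # "Eligible" is a hypothetical label; only the incomplete label appears on the page.
    label = "Eligible" if not missing else "Incomplete for overall leaderboards"
    return label, missing


status, missing = overall_status(coverage)
print(status)
print("Categories without data:", missing)
```

Under these assumptions the three categories showing 0/0 coverage are what keep the model out of the overall leaderboards, even though Agentic Tool Use is fully covered.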
Latest Results
2 eval runs
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. tool_state=85%, strict=80%, final_text=88%, schema_valid=100%, required_tool=95%, hallucinated_tool=0%, forbidden_tool=1%, recovery=85.7%. avg_tool_calls=1.52, output_tps=22.6, input_tokens=114739, output_tokens=24411. non_text_failures: final_state_wrong=5, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5. strict_failure_breakdown: final_answer_missing_expected_text=12, final_state_wrong=5, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5. Local OMLX Qwen3.5 9B 4-bit run.
Artifact/log: -
Historical Runs
2 eval runs
Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. tool_state=68%, strict=68%, final_text=100%, schema_valid=100%, required_tool=92%, hallucinated_tool=0%, forbidden_tool=0%, recovery=80.0%. avg_tool_calls=3.52, output_tps=18.7, input_tokens=146562, output_tokens=19980. non_text_failures: final_state_wrong=5, max_tool_calls=2, missing_required_tool=4, too_many_tool_calls=12. strict_failure_breakdown: final_state_wrong=5, max_tool_calls=2, missing_required_tool=4, too_many_tool_calls=12. Local OMLX Qwen3.5 9B 4-bit run.
Artifact/log: -
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. tool_state=85%, strict=80%, final_text=88%, schema_valid=100%, required_tool=95%, hallucinated_tool=0%, forbidden_tool=1%, recovery=85.7%. avg_tool_calls=1.52, output_tps=22.6, input_tokens=114739, output_tokens=24411. non_text_failures: final_state_wrong=5, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5. strict_failure_breakdown: final_answer_missing_expected_text=12, final_state_wrong=5, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5. Local OMLX Qwen3.5 9B 4-bit run.
Artifact/log: -
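For readers who want to turn the run summaries into counts, the following is a rough sketch using the hard v1 figures quoted above. It assumes that tool_state is a per-case pass rate over the 50 cases and that each failure-reason count is a tally of occurrences; neither is confirmed by this page, and the function name is hypothetical. The recovery rate is left unconverted because its denominator is not stated.

```python
import re

# Rate portion of the hard v1 summary line, copied from the run description.
summary = (
    "tool_state=68%, strict=68%, final_text=100%, schema_valid=100%, "
    "required_tool=92%, hallucinated_tool=0%, forbidden_tool=0%, recovery=80.0%"
)
cases = 50


def parse_rates(text: str) -> dict:
    # Collect every "name=NN%" pair into {name: percentage as float}.
    return {key: float(value) for key, value in re.findall(r"(\w+)=([\d.]+)%", text)}


rates = parse_rates(summary)

# Assumption: tool_state is the share of the 50 cases that passed.
tool_state_passes = round(cases * rates["tool_state"] / 100)
tool_state_failures = cases - tool_state_passes

# Failure reasons as listed under non_text_failures for the same run.
non_text_failures = {
    "final_state_wrong": 5,
    "max_tool_calls": 2,
    "missing_required_tool": 4,
    "too_many_tool_calls": 12,
}
reason_total = sum(non_text_failures.values())

print(tool_state_passes, tool_state_failures, reason_total)
```

If the assumptions hold, the reason tally (23) exceeds the number of failing cases (16), which would mean some cases were flagged for more than one reason.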