Hermes tool-contract v0: 100 cases. Primary score is the tool/state pass rate, which excludes exact final-text phrasing.
tool_state=91%, strict=86%, final_text=89%, schema_valid=100%, required_tool=97%, hallucinated_tool=0%, forbidden_tool=1%, recovery=66.7%.
output_tps=29.9, input_tokens=107708, output_tokens=28803.
non_text_failures: final_state_wrong=4, forbidden_tool=1, max_tool_calls=5, missing_required_tool=3.
strict_failure_breakdown: final_answer_missing_expected_text=11, final_state_wrong=4, forbidden_tool=1, max_tool_calls=5, missing_required_tool=3.
Local OMLX run.
Artifact/log: -
OMLX local
Qwen3.6 35B A3B Q4 MLX
Local OpenAI-compatible OMLX provider; Q4_K_XL MLX quantization. First real Hermes tool-contract run.
Score Summary
Overall eligibility requires eligible data in all four categories.
Overall
Incomplete for overall leaderboards
Agentic Tool Use
2/2 benchmarks covered
Agentic Coding
0/0 benchmarks covered
Long-Term Tasks
0/0 benchmarks covered
Speed
0/0 benchmarks covered
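The eligibility rule in the score summary can be sketched as a small check. This is only an illustration: the `Coverage` type and `overall_eligible` function are hypothetical names, and treating a category as having "eligible data" once at least one benchmark is covered is an assumption, since the page does not define the term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    """Benchmarks covered out of benchmarks defined for one category."""
    covered: int
    total: int

    @property
    def has_eligible_data(self) -> bool:
        # Assumption: one covered benchmark is enough to count as eligible.
        return self.covered > 0


def overall_eligible(categories: dict[str, Coverage]) -> bool:
    # Overall eligibility requires eligible data in every category.
    return all(c.has_eligible_data for c in categories.values())


# Coverage figures as listed in the score summary.
coverage = {
    "Agentic Tool Use": Coverage(2, 2),
    "Agentic Coding": Coverage(0, 0),
    "Long-Term Tasks": Coverage(0, 0),
    "Speed": Coverage(0, 0),
}

missing = [name for name, c in coverage.items() if not c.has_eligible_data]
print("eligible:", overall_eligible(coverage))
print("missing categories:", missing)
```

Under that assumption, only Agentic Tool Use has data, which is consistent with the "Incomplete for overall leaderboards" status shown above.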
Latest Results
2 eval runs
Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate.
tool_state=90%, strict=90%, final_text=94%, schema_valid=100%, required_tool=100%, hallucinated_tool=0%, forbidden_tool=0%, recovery=90.0%.
avg_tool_calls=3.32, output_tps=26.2, input_tokens=126414, output_tokens=21073.
non_text_failures: max_tool_calls=4, too_many_tool_calls=5.
strict_failure_breakdown: final_answer_missing_expected_text=3, max_tool_calls=4, too_many_tool_calls=5.
Local OMLX run.
Artifact/log: -
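The run summaries above report pass rates as percentages and failures as label counts, so it can help to convert both to case counts before comparing them. The following is a minimal sketch using the hard v1 figures. `parse_summary` and `failed_cases` are hypothetical helpers, not part of any Hermes tooling, and rounding a percentage to whole cases is an assumption about how the rates were computed.

```python
import re

# Hard v1 summary text, copied from the run listing above.
SUMMARY = (
    "Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. "
    "tool_state=90%, strict=90%, final_text=94%, schema_valid=100%, required_tool=100%, "
    "hallucinated_tool=0%, forbidden_tool=0%, recovery=90.0%. "
    "avg_tool_calls=3.32, output_tps=26.2, input_tokens=126414, output_tokens=21073. "
    "non_text_failures: max_tool_calls=4, too_many_tool_calls=5. "
    "strict_failure_breakdown: final_answer_missing_expected_text=3, max_tool_calls=4, "
    "too_many_tool_calls=5. Local OMLX run."
)

PAIR = re.compile(r"(\w+)=(\d+(?:\.\d+)?)(%?)")


def parse_summary(text: str) -> dict[str, dict[str, float]]:
    """Group key=value pairs by section; percentages become fractions."""
    sections: dict[str, dict[str, float]] = {"metrics": {}}
    for sentence in text.split(". "):
        # A sentence such as "non_text_failures: ..." opens a named section.
        head = re.match(r"(\w+): ", sentence)
        bucket = sections.setdefault(head.group(1) if head else "metrics", {})
        for key, value, pct in PAIR.findall(sentence):
            bucket[key] = float(value) / 100 if pct else float(value)
    return sections


def failed_cases(total: int, pass_rate: float) -> int:
    # Assumption: rates are whole-case fractions, so rounding recovers the count.
    return round(total * (1 - pass_rate))


total = int(re.search(r"(\d+) cases", SUMMARY).group(1))
parsed = parse_summary(SUMMARY)
tool_state_failed = failed_cases(total, parsed["metrics"]["tool_state"])
label_count = int(sum(parsed["non_text_failures"].values()))
print(f"cases={total} tool_state_failed={tool_state_failed} non_text_labels={label_count}")
```

For this run the sketch gives 5 failed cases under tool_state against 9 non-text failure labels. One plausible reading is that a single case can carry more than one label, although the page does not say so.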