Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. tool_state=0%, strict=0%, final_text=60%, schema_valid=100%, required_tool=0%, hallucinated_tool=0%, forbidden_tool=0%, recovery=n/a. avg_tool_calls=0.00, output_tps=n/a, input_tokens=null, output_tokens=null. non_text_failures: final_state_wrong=35, missing_required_tool=50, model_error=50. strict_failure_breakdown: final_answer_missing_expected_text=20, final_state_wrong=35, missing_required_tool=50, model_error=50. Endpoint returned HTTP 500 model errors for every case; retained as failed availability result.
Artifact/log: -OMLX local
Qwen3.6 27B Q4 MLX
Local OpenAI-compatible OMLX provider; Q4_K_XL MLX quantization. This run returned endpoint model errors for all cases.
Score Summary
Overall eligibility requires eligible data in all four categories.
Overall
Incomplete for overall leaderboards
Agentic Tool Use
2/2 benchmarks covered
Agentic Coding
0/0 benchmarks covered
Long-Term Tasks
0/0 benchmarks covered
Speed
0/0 benchmarks covered
Latest Results
2 eval runs
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. tool_state=0%, strict=0%, final_text=0%, schema_valid=100%, required_tool=16%, hallucinated_tool=0%, forbidden_tool=0%, recovery=n/a. avg_tool_calls=0.00, output_tps=n/a, input_tokens=null, output_tokens=null. non_text_failures: final_state_wrong=21, missing_required_tool=84, model_error=100. strict_failure_breakdown: final_answer_missing_expected_text=100, final_state_wrong=21, missing_required_tool=84, model_error=100. Endpoint returned HTTP 500 model errors for every case; retained as failed availability result.
Artifact/log: -Historical Runs
2 eval runs
Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. tool_state=0%, strict=0%, final_text=60%, schema_valid=100%, required_tool=0%, hallucinated_tool=0%, forbidden_tool=0%, recovery=n/a. avg_tool_calls=0.00, output_tps=n/a, input_tokens=null, output_tokens=null. non_text_failures: final_state_wrong=35, missing_required_tool=50, model_error=50. strict_failure_breakdown: final_answer_missing_expected_text=20, final_state_wrong=35, missing_required_tool=50, model_error=50. Endpoint returned HTTP 500 model errors for every case; retained as failed availability result.
Artifact/log: -Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. tool_state=0%, strict=0%, final_text=0%, schema_valid=100%, required_tool=16%, hallucinated_tool=0%, forbidden_tool=0%, recovery=n/a. avg_tool_calls=0.00, output_tps=n/a, input_tokens=null, output_tokens=null. non_text_failures: final_state_wrong=21, missing_required_tool=84, model_error=100. strict_failure_breakdown: final_answer_missing_expected_text=100, final_state_wrong=21, missing_required_tool=84, model_error=100. Endpoint returned HTTP 500 model errors for every case; retained as failed availability result.
Artifact/log: -