OMLX local
Ornith 1.0 35B 5bit MLX
Local OpenAI-compatible OMLX provider; 5-bit MLX quantization.
Score Summary
Overall eligibility requires eligible data in all four categories.
Overall
Incomplete for overall leaderboards
Agentic Tool Use
2/2 benchmarks covered
Agentic Coding
0/0 benchmarks covered
Long-Term Tasks
0/0 benchmarks covered
Speed
0/0 benchmarks covered
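The eligibility rule in the Score Summary can be sketched in a few lines of Python. This is only an illustration: the page does not say what counts as "eligible data", so the sketch assumes a category is eligible when at least one benchmark is covered. The names `parse_coverage`, `category_eligible`, and `overall_eligible` are hypothetical and not part of any leaderboard API.

```python
import re

# Coverage strings copied from the Score Summary above.
coverage = {
    "Agentic Tool Use": "2/2 benchmarks covered",
    "Agentic Coding": "0/0 benchmarks covered",
    "Long-Term Tasks": "0/0 benchmarks covered",
    "Speed": "0/0 benchmarks covered",
}

def parse_coverage(text):
    """Return (covered, total) from a string such as '2/2 benchmarks covered'."""
    match = re.match(r"(\d+)/(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized coverage string: {text!r}")
    return int(match.group(1)), int(match.group(2))

def category_eligible(text):
    # ASSUMPTION: "eligible data" means at least one covered benchmark.
    covered, _total = parse_coverage(text)
    return covered > 0

def overall_eligible(categories):
    # Overall eligibility requires every category to be eligible.
    return all(category_eligible(text) for text in categories.values())

# Categories that currently block overall eligibility.
missing = [name for name, text in coverage.items() if not category_eligible(text)]

print("overall eligible:", overall_eligible(coverage))
print("categories without eligible data:", missing)
```

Under that assumption, only Agentic Tool Use has eligible data, which is consistent with the "Incomplete for overall leaderboards" status shown above.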
Latest Results
2 eval runs
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. tool_state=80%, strict=76%, final_text=84%, schema_valid=100%, required_tool=94%, hallucinated_tool=2%, forbidden_tool=2%, recovery=83.3%. avg_tool_calls=1.58, output_tps=24.0, input_tokens=108310, output_tokens=19479. non_text_failures: final_state_wrong=5, forbidden_tool=2, hallucinated_tool=2, max_tool_calls=12, missing_required_tool=6. strict_failure_breakdown: final_answer_missing_expected_text=16, final_state_wrong=5, forbidden_tool=2, hallucinated_tool=2, max_tool_calls=12, missing_required_tool=6. Local OMLX Ornith 5-bit run.
Artifact/log: -
Historical Runs
2 eval runs
Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. tool_state=72%, strict=72%, final_text=94%, schema_valid=100%, required_tool=96%, hallucinated_tool=0%, forbidden_tool=0%, recovery=100.0%. avg_tool_calls=3.42, output_tps=21.5, input_tokens=127846, output_tokens=15856. non_text_failures: final_state_wrong=1, max_tool_calls=3, missing_required_tool=2, too_many_tool_calls=12. strict_failure_breakdown: final_answer_missing_expected_text=3, final_state_wrong=1, max_tool_calls=3, missing_required_tool=2, too_many_tool_calls=12. Local OMLX Ornith 5-bit run.
Artifact/log: -
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. tool_state=80%, strict=76%, final_text=84%, schema_valid=100%, required_tool=94%, hallucinated_tool=2%, forbidden_tool=2%, recovery=83.3%. avg_tool_calls=1.58, output_tps=24.0, input_tokens=108310, output_tokens=19479. non_text_failures: final_state_wrong=5, forbidden_tool=2, hallucinated_tool=2, max_tool_calls=12, missing_required_tool=6. strict_failure_breakdown: final_answer_missing_expected_text=16, final_state_wrong=5, forbidden_tool=2, hallucinated_tool=2, max_tool_calls=12, missing_required_tool=6. Local OMLX Ornith 5-bit run.
Artifact/log: -
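The run summaries pack many `key=value` fields into one line, so a small parser makes them easier to compare. The sketch below reads the hard v1 summary and derives the implied number of passing cases. It assumes the percentages are computed over all cases, which the page does not state explicitly, and `parse_summary` is a hypothetical helper rather than part of the benchmark harness.

```python
import re

# Summary line copied from the hard v1 run above.
LINE = (
    "Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate. "
    "tool_state=72%, strict=72%, final_text=94%, schema_valid=100%, required_tool=96%, "
    "hallucinated_tool=0%, forbidden_tool=0%, recovery=100.0%. avg_tool_calls=3.42, "
    "output_tps=21.5, input_tokens=127846, output_tokens=15856. "
    "non_text_failures: final_state_wrong=1, max_tool_calls=3, missing_required_tool=2, "
    "too_many_tool_calls=12. strict_failure_breakdown: "
    "final_answer_missing_expected_text=3, final_state_wrong=1, max_tool_calls=3, "
    "missing_required_tool=2, too_many_tool_calls=12. Local OMLX Ornith 5-bit run."
)

# key=value pairs; the number may have a fractional part but no trailing dot.
PAIR = re.compile(r"(\w+)=(\d+(?:\.\d+)?)")

def parse_summary(line):
    """Split a summary line into case count, metrics, and the two failure breakdowns."""
    cases = int(re.search(r"(\d+) cases", line).group(1))
    # Separate the breakdowns first so their keys do not overwrite the metrics.
    head, _, rest = line.partition("non_text_failures:")
    non_text, _, strict = rest.partition("strict_failure_breakdown:")
    return {
        "cases": cases,
        "metrics": {k: float(v) for k, v in PAIR.findall(head)},
        "non_text_failures": {k: int(v) for k, v in PAIR.findall(non_text)},
        "strict_failures": {k: int(v) for k, v in PAIR.findall(strict)},
    }

run = parse_summary(LINE)

# ASSUMPTION: tool_state is a percentage of all cases.
passed = round(run["cases"] * run["metrics"]["tool_state"] / 100)
failed = run["cases"] - passed
tag_total = sum(run["non_text_failures"].values())

print("passed:", passed, "failed:", failed, "failure tags:", tag_total)
```

For this run the failure tags sum to more than the implied number of failed cases, which suggests that a single case can carry more than one tag. That is an inference from the numbers, not something the page documents.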