OpenAI Codex
GPT-5.5
OpenAI Codex OAuth provider. Reference run for local Hermes tool-contract model comparisons.
Score Summary
Overall eligibility requires eligible data in all four categories.
Overall: Incomplete for overall leaderboards
Agentic Tool Use: 2/2 benchmarks covered
Agentic Coding: 0/0 benchmarks covered
Long-Term Tasks: 0/0 benchmarks covered
Speed: 0/0 benchmarks covered
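The eligibility rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the reading that a category has "eligible data" only when at least one of its benchmarks is covered is an assumption, not the leaderboard's documented logic.

```python
from typing import Dict, Tuple

# (covered, total) benchmark counts per category, copied from the summary above.
COVERAGE: Dict[str, Tuple[int, int]] = {
    "Agentic Tool Use": (2, 2),
    "Agentic Coding": (0, 0),
    "Long-Term Tasks": (0, 0),
    "Speed": (0, 0),
}

def category_eligible(covered: int, total: int) -> bool:
    # Assumption: a category counts as eligible once it has any covered benchmark.
    return covered > 0

def overall_eligible(coverage: Dict[str, Tuple[int, int]]) -> bool:
    # Overall eligibility requires eligible data in every category.
    return all(category_eligible(c, t) for c, t in coverage.values())

print(overall_eligible(COVERAGE))  # False
```

Under that assumption, three categories at 0/0 are enough to leave the model incomplete for overall leaderboards, even though Agentic Tool Use is fully covered.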
Latest Results
2 eval runs
Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate.
tool_state=76%, strict=76%, final_text=94%, schema_valid=100%, required_tool=92%, hallucinated_tool=0%, forbidden_tool=0%, recovery=80.0%.
avg_tool_calls=3.58, output_tps=13.5, input_tokens=62360, output_tokens=7630.
non_text_failures: final_state_wrong=7, max_tool_calls=5, missing_required_tool=4, too_many_tool_calls=7.
strict_failure_breakdown: final_answer_missing_expected_text=3, final_state_wrong=7, max_tool_calls=5, missing_required_tool=4, too_many_tool_calls=7.
Reference run via Hermes openai-codex OAuth adapter.
Artifact/log: -
Historical Runs
2 eval runs
Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate excluding exact final-text phrasing.
tool_state=91%, strict=72%, final_text=72%, schema_valid=100%, required_tool=97%, hallucinated_tool=0%, forbidden_tool=0%, recovery=57.1%.
output_tps=13.9, input_tokens=42431, output_tokens=7688.
non_text_failures: final_state_wrong=3, max_tool_calls=4, missing_required_tool=3.
strict_failure_breakdown: final_answer_missing_expected_text=28, final_state_wrong=3, max_tool_calls=4, missing_required_tool=3.
Reference run via Hermes openai-codex OAuth adapter.
Artifact/log: -
Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate.
tool_state=76%, strict=76%, final_text=94%, schema_valid=100%, required_tool=92%, hallucinated_tool=0%, forbidden_tool=0%, recovery=80.0%.
avg_tool_calls=3.58, output_tps=13.5, input_tokens=62360, output_tokens=7630.
non_text_failures: final_state_wrong=7, max_tool_calls=5, missing_required_tool=4, too_many_tool_calls=7.
strict_failure_breakdown: final_answer_missing_expected_text=3, final_state_wrong=7, max_tool_calls=5, missing_required_tool=4, too_many_tool_calls=7.
Reference run via Hermes openai-codex OAuth adapter.
Artifact/log: -
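The difference between the tool_state and strict scores reported in the runs above can be illustrated with a small Python sketch. It assumes each case carries a set of failure labels, that a case passes strict only with no labels at all, and that it passes tool_state when its only label (if any) is final_answer_missing_expected_text. The function name, the per-case data layout, and the four-case sample are hypothetical; the actual Hermes harness may score differently.

```python
from typing import Dict, List, Set

# Failure label that affects only the strict score (name taken from the run summaries).
TEXT_ONLY = "final_answer_missing_expected_text"

def pass_rates(case_failures: List[Set[str]]) -> Dict[str, float]:
    """Return tool_state and strict pass rates (percent) from per-case failure labels."""
    n = len(case_failures)
    # Strict: the case must have no failure labels of any kind.
    strict_pass = sum(1 for labels in case_failures if not labels)
    # Tool/state: ignore the text-only label, then require no remaining failures.
    tool_state_pass = sum(1 for labels in case_failures if not (labels - {TEXT_ONLY}))
    return {
        "tool_state": 100 * tool_state_pass / n,
        "strict": 100 * strict_pass / n,
    }

# Hypothetical four-case run, not data from the leaderboard.
cases = [
    set(),                                    # passes both scores
    {TEXT_ONLY},                              # passes tool_state, fails strict
    {"final_state_wrong", "max_tool_calls"},  # fails both, with two labels
    set(),                                    # passes both scores
]
print(pass_rates(cases))  # {'tool_state': 75.0, 'strict': 50.0}
```

Because one case can carry several labels, label counts can sum to more than the number of failed cases. That would be consistent with the v0 run, where the non-text labels total 10 while tool_state=91% over 100 cases implies 9 failing cases.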