OpenAI Codex

GPT-5.5

OpenAI Codex OAuth provider. Reference run for local Hermes tool-contract model comparisons.

Family: GPT
Version: gpt-5.5
Context: -
Latest run: Jul 4, 2026

Score Summary

Overall eligibility requires eligible data in all four categories.

Overall

Incomplete

Incomplete for overall leaderboards

Agentic Tool Use

Eligible, score 81.3

2/2 benchmarks covered

Agentic Coding

No data

0/0 benchmarks covered

Long-Term Tasks

No data

0/0 benchmarks covered

Speed

No data

0/0 benchmarks covered
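
The eligibility rule stated at the top of this summary can be sketched in a few lines. This is a minimal illustration, assuming the rule is a plain "all four categories must be Eligible" check; the function name `overall_status` and the status strings are hypothetical and are not taken from the dashboard's own code.

```python
# Category statuses as shown in the Score Summary.
CATEGORY_STATUS = {
    "Agentic Tool Use": "Eligible",
    "Agentic Coding": "No data",
    "Long-Term Tasks": "No data",
    "Speed": "No data",
}


def overall_status(statuses: dict[str, str]) -> str:
    """Return "Eligible" only when every category is Eligible (assumed rule)."""
    if all(status == "Eligible" for status in statuses.values()):
        return "Eligible"
    return "Incomplete"


print(overall_status(CATEGORY_STATUS))  # Incomplete
```

With only one of the four categories eligible, the sketch reports Incomplete, which matches the Overall card.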

Latest Results

2 eval runs

Hermes Tool Contract
Agentic Tool Use / Hermes Agent Evals (hermes_tool_contract_v0)
Eligible, Jul 4, 2026
Raw score: 91.0
Normalized: 91.0
Pass rate: 91.0%
Latency: 5.54s
Completion / TPS: 554s / 13.9 tok/s
Cost: -

Hermes tool-contract v0: 100 cases. Primary score is the tool/state pass rate, which excludes exact final-text phrasing.

Rates: tool_state=91%, strict=72%, final_text=72%, schema_valid=100%, required_tool=97%, hallucinated_tool=0%, forbidden_tool=0%, recovery=57.1%.
Throughput: output_tps=13.9, input_tokens=42431, output_tokens=7688.
non_text_failures: final_state_wrong=3, max_tool_calls=4, missing_required_tool=3.
strict_failure_breakdown: final_answer_missing_expected_text=28, final_state_wrong=3, max_tool_calls=4, missing_required_tool=3.

Reference run via Hermes openai-codex OAuth adapter.

Artifact/log: -
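
The timing figures for the v0 run appear to be related by simple arithmetic. The sketch below assumes that latency is the total completion time divided by the number of cases, and that TPS is output tokens divided by total completion time. The page does not state either definition, so both are assumptions, and the helper names are hypothetical.

```python
def mean_latency_s(completion_s: float, cases: int) -> float:
    """Average wall-clock seconds per case (assumed definition of Latency)."""
    return completion_s / cases


def output_tps(output_tokens: int, completion_s: float) -> float:
    """Output tokens per second over the whole run (assumed definition of TPS)."""
    return output_tokens / completion_s


# Figures reported for hermes_tool_contract_v0.
latency = mean_latency_s(completion_s=554, cases=100)
tps = output_tps(output_tokens=7688, completion_s=554)

print(f"{latency:.2f}s, {tps:.1f} tok/s")  # 5.54s, 13.9 tok/s
```

Under these assumptions the computed values agree with the reported Latency (5.54s) and TPS (13.9 tok/s).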
Hermes Tool Contract Hard v1
Agentic Tool Use / Hermes Agent Evals (hermes_tool_contract_hard_v1)
Eligible, Jul 4, 2026
Raw score: 76.0
Normalized: 76.0
Pass rate: 76.0%
Latency: 11.3s
Completion / TPS: 564s / 13.5 tok/s
Cost: -

Hermes tool-contract hard v1: 50 cases. Primary score is the tool/state pass rate.

Rates: tool_state=76%, strict=76%, final_text=94%, schema_valid=100%, required_tool=92%, hallucinated_tool=0%, forbidden_tool=0%, recovery=80.0%.
Throughput: avg_tool_calls=3.58, output_tps=13.5, input_tokens=62360, output_tokens=7630.
non_text_failures: final_state_wrong=7, max_tool_calls=5, missing_required_tool=4, too_many_tool_calls=7.
strict_failure_breakdown: final_answer_missing_expected_text=3, final_state_wrong=7, max_tool_calls=5, missing_required_tool=4, too_many_tool_calls=7.

Reference run via Hermes openai-codex OAuth adapter.

Artifact/log: -
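
The failure counts for the hard v1 run add up to more than the number of failing cases, which suggests that one case can carry several failure tags. The sketch below checks this, assuming tool_state is the share of passing cases out of the 50 reported; that reading is an assumption, and the variable names are hypothetical.

```python
CASES = 50
TOOL_STATE_RATE = 0.76  # reported tool_state for hermes_tool_contract_hard_v1

# non_text_failures as reported for the hard v1 run.
NON_TEXT_FAILURES = {
    "final_state_wrong": 7,
    "max_tool_calls": 5,
    "missing_required_tool": 4,
    "too_many_tool_calls": 7,
}

passing_cases = round(CASES * TOOL_STATE_RATE)
failing_cases = CASES - passing_cases
total_tags = sum(NON_TEXT_FAILURES.values())

print(f"failing cases: {failing_cases}, failure tags: {total_tags}")
# failing cases: 12, failure tags: 23
```

Because 23 tags are spread over 12 failing cases, the per-tag counts should not be summed to estimate how many cases failed.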

Historical Runs

2 eval runs, both listed under Latest Results above.