GPT-5.5 | Benchmaxxing model results

Score Summary

Overall eligibility requires eligible data in all four categories.

Overall

Incomplete

Incomplete for overall leaderboards

Agentic Tool Use

Eligible, score 81.3

2/2 benchmarks covered

Agentic Coding

No data

0/0 benchmarks covered

Long-Term Tasks

No data

0/0 benchmarks covered

Speed

No data

0/0 benchmarks covered
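The eligibility rule stated at the top of the summary can be sketched as a small check. This is a minimal illustration, not the leaderboard's actual implementation: the `CategoryStatus` type, its field names, and the `overall_status` function are hypothetical, and only the category names, statuses, and coverage counts are taken from the page.

```python
from dataclasses import dataclass
from typing import List, Optional

# The four categories named in the score summary.
REQUIRED_CATEGORIES = (
    "Agentic Tool Use",
    "Agentic Coding",
    "Long-Term Tasks",
    "Speed",
)


@dataclass
class CategoryStatus:
    """Hypothetical record for one category row of the summary."""
    name: str
    eligible: bool
    score: Optional[float]  # None when the page shows "No data"
    covered: int            # benchmarks covered
    total: int              # benchmarks available


def overall_status(categories: List[CategoryStatus]) -> str:
    """Return "Eligible" only if every required category is eligible."""
    by_name = {c.name: c for c in categories}
    complete = all(
        name in by_name and by_name[name].eligible
        for name in REQUIRED_CATEGORIES
    )
    return "Eligible" if complete else "Incomplete"


# Values as shown on the page for this model.
summary = [
    CategoryStatus("Agentic Tool Use", True, 81.3, 2, 2),
    CategoryStatus("Agentic Coding", False, None, 0, 0),
    CategoryStatus("Long-Term Tasks", False, None, 0, 0),
    CategoryStatus("Speed", False, None, 0, 0),
]

print(overall_status(summary))  # Incomplete
```

With only one of four categories eligible, the check yields the same "Incomplete" status the page reports.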

Latest Results

2 eval runs

Hermes Tool Contract
Agentic Tool Use / Hermes Agent Evals hermes_tool_contract_v0

Status: Eligible
Date: Jul 4, 2026
Raw score: 91.0
Normalized: 91.0
Pass rate: 91.0%
Latency: 5.54s
Completion / TPS: 554s / 13.9 tok/s
Cost: -

Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate excluding exact final-text phrasing.

Rates: tool_state=91%, strict=72%, final_text=72%, schema_valid=100%, required_tool=97%, hallucinated_tool=0%, forbidden_tool=0%, recovery=57.1%.
Throughput: output_tps=13.9, input_tokens=42431, output_tokens=7688.
non_text_failures: final_state_wrong=3, max_tool_calls=4, missing_required_tool=3.
strict_failure_breakdown: final_answer_missing_expected_text=28, final_state_wrong=3, max_tool_calls=4, missing_required_tool=3.

Reference run via Hermes openai-codex OAuth adapter.
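The throughput and latency figures for this run can be cross-checked with simple arithmetic. The sketch below assumes that TPS means output tokens divided by total completion time and that Latency is the mean completion time per case. The page does not state either definition, but the reported numbers are consistent with both.

```python
# Figures reported for hermes_tool_contract_v0.
cases = 100
completion_s = 554.0   # total completion time in seconds
output_tokens = 7688

# Assumed definitions (not confirmed by the page):
#   output_tps    = output tokens / total completion time
#   avg_latency_s = total completion time / number of cases
output_tps = output_tokens / completion_s
avg_latency_s = completion_s / cases

print(f"{output_tps:.1f} tok/s")         # 13.9 tok/s
print(f"{avg_latency_s:.2f} s per case")  # 5.54 s per case
```

Both derived values match the reported 13.9 tok/s and 5.54s.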

Artifact/log: -

Hermes Tool Contract Hard v1
Agentic Tool Use / Hermes Agent Evals hermes_tool_contract_hard_v1

Status: Eligible
Date: Jul 4, 2026
Raw score: 76.0
Normalized: 76.0
Pass rate: 76.0%
Latency: 11.3s
Completion / TPS: 564s / 13.5 tok/s
Cost: -

Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate.

Rates: tool_state=76%, strict=76%, final_text=94%, schema_valid=100%, required_tool=92%, hallucinated_tool=0%, forbidden_tool=0%, recovery=80.0%.
Throughput: avg_tool_calls=3.58, output_tps=13.5, input_tokens=62360, output_tokens=7630.
non_text_failures: final_state_wrong=7, max_tool_calls=5, missing_required_tool=4, too_many_tool_calls=7.
strict_failure_breakdown: final_answer_missing_expected_text=3, final_state_wrong=7, max_tool_calls=5, missing_required_tool=4, too_many_tool_calls=7.

Reference run via Hermes openai-codex OAuth adapter.
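The failure counts for this run add up to more than the number of failing cases, which a short calculation makes visible. The sketch below parses the `non_text_failures` list and compares the total against the failures implied by the 76% tool/state pass rate. The `parse_counts` helper is hypothetical, and the reading that one case can carry several failure tags is an assumption, since the page does not explain how tags are assigned.

```python
def parse_counts(text: str) -> dict:
    """Parse a string like 'a=1, b=2' into {'a': 1, 'b': 2}."""
    counts = {}
    for item in text.split(","):
        key, _, value = item.strip().partition("=")
        counts[key] = int(value)
    return counts


# Figures reported for hermes_tool_contract_hard_v1.
cases = 50
tool_state_pass_pct = 76
non_text_failures = parse_counts(
    "final_state_wrong=7, max_tool_calls=5, "
    "missing_required_tool=4, too_many_tool_calls=7"
)

# Integer arithmetic avoids floating-point rounding surprises.
passing_cases = cases * tool_state_pass_pct // 100
failing_cases = cases - passing_cases
total_tags = sum(non_text_failures.values())

print(failing_cases)  # 12
print(total_tags)     # 23
```

With 23 failure tags spread over 12 failing cases, the counts are best read as tag frequencies rather than a partition of the failures, under the assumption stated above.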

Artifact/log: -

Historical Runs

2 eval runs

The historical runs are the same two runs listed under Latest Results: hermes_tool_contract_v0 and hermes_tool_contract_hard_v1, both dated Jul 4, 2026.