Qwen3.5 9B MLX 4bit | Benchmaxxing model results

Score Summary

Overall eligibility requires eligible data in all four categories.

Overall

Incomplete

Incomplete for overall leaderboards

Agentic Tool Use

Eligible: 74.0

2/2 benchmarks covered
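The page does not say how the 74.0 category score is derived from the two covered benchmarks. As an arithmetic check only: it is not the plain mean of the two normalized scores (68.0 and 85.0), but it does coincide with the mean of the two `strict` rates reported in the run notes (68% and 80%). Whether the leaderboard actually aggregates this way is an assumption; the variable names below are illustrative.

```python
from statistics import mean

# Values copied from the two eval runs listed on this page.
normalized = {
    "hermes_tool_contract_hard_v1": 68.0,
    "hermes_tool_contract_v0": 85.0,
}
strict = {
    "hermes_tool_contract_hard_v1": 68.0,
    "hermes_tool_contract_v0": 80.0,
}

# Two candidate aggregations; neither is confirmed by the source.
mean_normalized = mean(normalized.values())
mean_strict = mean(strict.values())

print(f"mean of normalized scores: {mean_normalized:.1f}")
print(f"mean of strict rates:      {mean_strict:.1f}")
```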

Agentic Coding

No data

0/0 benchmarks covered

Long-Term Tasks

No data

0/0 benchmarks covered

Speed

No data

0/0 benchmarks covered

Latest Results

2 eval runs

Hermes Tool Contract Hard v1

Agentic Tool Use / Hermes Agent Evals / hermes_tool_contract_hard_v1

Status: Eligible
Date: Jul 5, 2026
Raw score: 68.0
Normalized: 68.0
Pass rate: 68.0%
Latency: 21.4 s
Completion / TPS: 1069 s / 18.7 tok/s
Cost: $0.00

Hermes tool-contract hard v1: 50 cases. Primary score is tool/state pass rate.
Rates: tool_state=68%, strict=68%, final_text=100%, schema_valid=100%, required_tool=92%, hallucinated_tool=0%, forbidden_tool=0%, recovery=80.0%.
Usage: avg_tool_calls=3.52, output_tps=18.7, input_tokens=146562, output_tokens=19980.
non_text_failures: final_state_wrong=5, max_tool_calls=2, missing_required_tool=4, too_many_tool_calls=12.
strict_failure_breakdown: final_state_wrong=5, max_tool_calls=2, missing_required_tool=4, too_many_tool_calls=12.
Local OMLX Qwen3.5 9B 4-bit run.
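The timing figures for this run appear mutually consistent, which the short check below illustrates. It assumes that throughput is output tokens divided by completion time and that latency is completion time divided by the number of cases; the page does not define either metric, so treat both as assumptions. It also shows that the failure tags add up to more than the number of failed cases, which suggests a single case can carry more than one tag.

```python
# Figures copied from the Hard v1 run above.
cases = 50
pass_rate_pct = 68
completion_s = 1069
output_tokens = 19980
failure_tags = {
    "final_state_wrong": 5,
    "max_tool_calls": 2,
    "missing_required_tool": 4,
    "too_many_tool_calls": 12,
}

# Assumed definitions (not stated on the page).
tokens_per_second = output_tokens / completion_s
latency_per_case_s = completion_s / cases

passed_cases = pass_rate_pct * cases // 100
failed_cases = cases - passed_cases
total_tags = sum(failure_tags.values())

print(f"throughput: {tokens_per_second:.1f} tok/s")
print(f"latency:    {latency_per_case_s:.1f} s per case")
print(f"failed cases: {failed_cases}, failure tags: {total_tags}")
```

Under these assumptions the derived values round to the reported 18.7 tok/s and 21.4 s.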

Artifact/log: -

Hermes Tool Contract

Agentic Tool Use / Hermes Agent Evals / hermes_tool_contract_v0

Status: Eligible
Date: Jul 5, 2026
Raw score: 85.0
Normalized: 85.0
Pass rate: 85.0%
Latency: 10.8 s
Completion / TPS: 1079 s / 22.6 tok/s
Cost: $0.00

Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate.
Rates: tool_state=85%, strict=80%, final_text=88%, schema_valid=100%, required_tool=95%, hallucinated_tool=0%, forbidden_tool=1%, recovery=85.7%.
Usage: avg_tool_calls=1.52, output_tps=22.6, input_tokens=114739, output_tokens=24411.
non_text_failures: final_state_wrong=5, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5.
strict_failure_breakdown: final_answer_missing_expected_text=12, final_state_wrong=5, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5.
Local OMLX Qwen3.5 9B 4-bit run.
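Because the run note packs its metrics into `key=value` pairs, it can be split into structured data for comparison across runs. The sketch below is one possible parser, not the eval harness's own tooling; `parse_pairs` and `parse_note` are hypothetical helper names. The note is split into three sections first because `forbidden_tool` appears both as a rate and as a failure count.

```python
import re

NOTE = (
    "Hermes tool-contract v0: 100 cases. Primary score is tool/state pass rate. "
    "tool_state=85%, strict=80%, final_text=88%, schema_valid=100%, "
    "required_tool=95%, hallucinated_tool=0%, forbidden_tool=1%, recovery=85.7%. "
    "avg_tool_calls=1.52, output_tps=22.6, input_tokens=114739, output_tokens=24411. "
    "non_text_failures: final_state_wrong=5, forbidden_tool=1, max_tool_calls=10, "
    "missing_required_tool=5. "
    "strict_failure_breakdown: final_answer_missing_expected_text=12, "
    "final_state_wrong=5, forbidden_tool=1, max_tool_calls=10, missing_required_tool=5. "
    "Local OMLX Qwen3.5 9B 4-bit run."
)

# key=number with an optional trailing percent sign, e.g. "recovery=85.7%".
PAIR = re.compile(r"(\w+)=(\d+(?:\.\d+)?)(%?)")


def parse_pairs(text):
    """Return {key: float} for every key=value pair; percent signs are dropped."""
    return {key: float(value) for key, value, _pct in PAIR.findall(text)}


def parse_note(note):
    """Split a run note into headline metrics and the two failure breakdowns."""
    head, _, rest = note.partition("non_text_failures:")
    non_text, _, strict = rest.partition("strict_failure_breakdown:")
    return {
        "metrics": parse_pairs(head),
        "non_text_failures": parse_pairs(non_text),
        "strict_failure_breakdown": parse_pairs(strict),
    }


parsed = parse_note(NOTE)
strict_tag_total = sum(parsed["strict_failure_breakdown"].values())
print(parsed["metrics"]["tool_state"], parsed["metrics"]["strict"])
print(strict_tag_total)
```

With 100 cases, a `strict` rate of 80% means 20 failed cases, while the strict failure tags sum to 33, so the tags evidently are not mutually exclusive.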

Artifact/log: -
