Industry benchmark (Q2 2025)
Standardised evaluation: MMLU (pro), HumanEval, GPQA, latency & pricing.
| Model (latest SOTA) | MMLU (5-shot) | HumanEval (pass@1) | Latency (ms/token) | Input price ($/1M tokens) |
* Latency measured on H100 SXM (p99; 1024 input tokens, 256 output tokens)
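As a rough illustration of what a p99 latency figure means, the sketch below computes a percentile with the nearest-rank method. The dashboard does not say which percentile definition it uses, so the method, the helper name `percentile_nearest_rank`, and the sample values are all assumptions made for the example.

```python
def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct, 1..100) by nearest rank.

    Assumption: nearest-rank is only one of several common percentile
    definitions; the benchmark may use a different one.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 gives the 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical per-token latencies in milliseconds (1.0 .. 100.0).
latencies_ms = [float(i) for i in range(1, 101)]
p99_ms = percentile_nearest_rank(latencies_ms, 99)
print(f"p99 latency: {p99_ms} ms/token")
```

A p99 value is dominated by the slowest 1% of samples, so it is usually well above the mean and is a stricter figure to compare than an average.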
MMLU score comparison
HumanEval (code generation)
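For readers unfamiliar with the pass@1 metric reported for HumanEval, here is a minimal sketch of the unbiased pass@k estimator introduced with that benchmark: with n generated samples per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name and the sample counts are illustrative; how many samples this dashboard draws per problem is not stated.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: evaluation budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples for a problem, 3 of them pass.
score = pass_at_k(n=10, c=3, k=1)
print(f"pass@1 = {score:.2f}")
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass. The benchmark score is the mean of this value over all problems.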
Composite intelligence ranking
Normalized scores: reasoning, coding, multilingual, instruction following (0-100)
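The page does not state how the four axes are normalized or combined. One plausible reading, sketched below purely as an assumption, is min-max scaling of each axis to 0-100 across the compared models, followed by an unweighted mean. The model names and raw scores are made up for illustration.

```python
AXES = ("reasoning", "coding", "multilingual", "instruction_following")


def normalize_axis(values):
    """Min-max scale a list of raw scores to the 0-100 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this axis; assign the same mid-scale score.
        return [50.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def composite_scores(raw):
    """Return {model: mean of its normalized axis scores}.

    Assumption: equal weights per axis; the real ranking may weight them.
    """
    names = list(raw)
    totals = {name: 0.0 for name in names}
    for axis in AXES:
        scaled = normalize_axis([raw[name][axis] for name in names])
        for name, value in zip(names, scaled):
            totals[name] += value
    return {name: totals[name] / len(AXES) for name in names}


# Hypothetical raw scores, not taken from any leaderboard.
raw_scores = {
    "model_a": {"reasoning": 90, "coding": 80, "multilingual": 70, "instruction_following": 60},
    "model_b": {"reasoning": 70, "coding": 60, "multilingual": 50, "instruction_following": 40},
    "model_c": {"reasoning": 50, "coding": 40, "multilingual": 30, "instruction_following": 20},
}
ranking = composite_scores(raw_scores)
for name, value in sorted(ranking.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(value, 1))
```

Min-max scaling makes scores relative to the models being compared, so adding or removing a model can shift every other model's composite value.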
Top performers (overall)
Throughput (tok/sec): efficiency champion
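Throughput in tok/sec and the per-token latency in ms/token are related for a single request stream: tokens per second is 1000 divided by milliseconds per token. The sketch below shows that conversion only. It assumes one sequential stream, and the 20 ms figure is hypothetical. Batched serving can reach much higher aggregate throughput than this number suggests.

```python
def tokens_per_second(ms_per_token):
    """Convert single-stream latency (ms/token) to throughput (tok/sec)."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


# Hypothetical latency of 20 ms per generated token.
single_stream_tps = tokens_per_second(20.0)
print(f"{single_stream_tps} tok/sec")
```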
Benchmark pool refreshed daily. Data based on open leaderboard (max 2 quarters old)
Benchmarks: MMLU, HumanEval, MATH, MT-Bench, GPQA (diamond set)
Hardware: NVIDIA H100 SXM / vLLM 0.6.1