⨀ AI BENCHMARK HUB

live leaderboard · independent LLM evaluation

📊 Industry benchmark (Q2 2025)
Standardised evaluation: MMLU (pro), HumanEval, GPQA, latency & pricing.
Model (latest SOTA) | MMLU (5-shot) | HumanEval (pass@1) | Latency* (ms/token) | Input price ($/1M tokens)
* Latency measured on H100 SXM (p99; 1024 input / 256 output tokens)

📈 MMLU score comparison

💻 HumanEval (code generation)

🧠 Composite intelligence ranking
Normalized scores: reasoning, coding, multilingual, instruction following (0-100)
🏆 Top performers (overall)
⚡ Throughput (tok/sec): efficiency champion
📌 Benchmark pool refreshed daily; data drawn from the open leaderboard (at most 2 quarters old)
Benchmarks: MMLU, HumanEval, MATH, MT-Bench, GPQA (diamond set)
⚙️ Hardware: NVIDIA H100 SXM / vLLM 0.6.1