If the Benchmark Model Is Different, the Benchmark Is Lying
Meta's flashy Llama 4 Maverick leaderboard run used an experimental chat variant, which is a cute way of saying the public score came with stage makeup.