Evaluating an LLM's capabilities requires diverse benchmarks across multiple dimensions:
| Metric Type | Measures | Examples |
|---|---|---|
| Intrinsic | Next-token prediction accuracy | Perplexity, loss |
| Academic | Standard NLP task performance | GLUE, SuperGLUE |
| Knowledge | Factual recall and accuracy | TruthfulQA, MMLU |
| Reasoning | Logical and mathematical abilities | GSM8K, MATH, BBH |
| Human-like | Natural language generation quality | Human evaluations, Turing tests |
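To make the intrinsic metrics above concrete, here is a minimal sketch of computing perplexity as the exponential of the average next-token cross-entropy loss, assuming a Hugging Face-style causal LM (the checkpoint name and example sentence are only placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is just a stand-in; any causal LM checkpoint works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean next-token cross-entropy loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity is the exponential of the average cross-entropy loss.
perplexity = torch.exp(outputs.loss).item()
print(f"loss = {outputs.loss.item():.3f}, perplexity = {perplexity:.1f}")
```

Lower perplexity means the model assigns higher probability to the actual next tokens, which is why it serves as the standard intrinsic measure of language modeling quality.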
Some of the most widely used benchmarks include:

- **MMLU**: tests knowledge across 57 subjects including STEM, humanities, law, and more
- **GSM8K**: roughly 8,500 grade school math word problems requiring multi-step reasoning
- **HumanEval**: 164 programming problems testing code generation, typically scored with pass@k (sketched below)
- **TruthfulQA**: measures a model's tendency to reproduce falsehoods commonly believed by humans
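Code benchmarks such as HumanEval report pass@k, the probability that at least one of k sampled completions passes all unit tests. Below is a small sketch of the standard unbiased estimator; the sample counts are made-up numbers for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn without replacement from n samples of which c are correct, passes."""
    if n - c < k:
        return 1.0  # every possible draw of k completions contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 200 completions per problem, 43 of them pass the tests.
n, c = 200, 43
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")
```

Estimating pass@k this way, rather than literally sampling k completions, reduces variance because it uses all n generated samples per problem.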
Benchmark results also reveal consistent patterns and challenges:

- **Scaling behavior**: model performance increases with scale, but with diminishing returns in some areas
- **Benchmark saturation**: top models approach perfect scores on older benchmarks, forcing the creation of harder tests
- **Data contamination**: models may have seen benchmark data during pre-training, invalidating test results (a simple overlap check is sketched below)
- **Prompt sensitivity**: performance can vary dramatically depending on how questions are phrased
- **Subjectivity**: many qualities (helpfulness, creativity) require subjective human judgment
- **Emergent abilities**: some capabilities appear only past certain scale thresholds rather than improving gradually
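One coarse but common way to probe for data contamination is to look for long n-gram overlaps between benchmark items and training documents. The sketch below flags any shared word window; the window size, whitespace tokenization, and the toy strings are all illustrative assumptions rather than any particular lab's procedure:

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_doc: str, n: int = 13) -> bool:
    """Flag a benchmark item that shares any n-word window with a training document."""
    return bool(ngrams(benchmark_item.split(), n) & ngrams(training_doc.split(), n))

# Toy strings standing in for a benchmark question and a crawled training document.
question = "What is the capital of France and which river flows through it on its way north"
document = "the capital of France and which river flows through it on its way north to the sea"
print(is_contaminated(question, document, n=10))  # True: they share a 10-word window
```

Real contamination audits work over tokenized corpora at scale, but the underlying idea of flagging long exact overlaps is the same.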
The ultimate measure of an LLM is not its benchmark scores but its ability to solve real human problems and add value in practical applications.