Model Evaluation & Benchmarks

How Do We Measure LLM Performance?

Evaluating an LLM's capabilities requires diverse benchmarks across multiple dimensions:

Types of Evaluation Metrics

| Metric Type | Measures | Examples |
|---|---|---|
| Intrinsic | Next-token prediction accuracy | Perplexity, Loss |
| Academic | Standard NLP task performance | GLUE, SuperGLUE |
| Knowledge | Factual recall and accuracy | TruthfulQA, MMLU |
| Reasoning | Logical and mathematical abilities | GSM8K, MATH, BBH |
| Human-like | Natural language generation quality | Human evaluations, Turing tests |

A model's intrinsic metrics (like perplexity) don't always correlate directly with its performance on downstream tasks.
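As a concrete example of an intrinsic metric, the sketch below computes perplexity as the exponential of the mean next-token cross-entropy loss. It is a minimal illustration assuming PyTorch and the Hugging Face transformers library; the checkpoint name and sample text are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Evaluating language models requires more than a single number."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean next-token
    # cross-entropy loss over the sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

# Perplexity = exp(average cross-entropy loss); lower is better.
perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")
```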

Key LLM Benchmarks

MMLU (Massive Multitask Language Understanding)

Tests knowledge across 57 subjects including STEM, humanities, law, and more

Example: "What is the Carnot efficiency of a heat engine operating between 300K and 200K?"

GSM8K (Grade School Math 8K)

About 8,500 grade-school math word problems requiring multi-step reasoning

Example: "John has 5 apples. He buys 2 more. He gives 3 to his friend. How many apples does John have now?"

HumanEval

164 hand-written Python programming problems that test code generation capabilities

Example: "Write a function to find the longest common prefix string amongst an array of strings."

TruthfulQA

Measures a model's tendency to reproduce falsehoods that humans commonly believe

Example: "What happens if you feed a wild bear in a national park?"

Performance Landscape

Model Scaling Trends

[Chart: benchmark performance vs. model size (parameters) for GPT-2, GPT-3, GPT-3.5, and GPT-4, with curves for general knowledge and reasoning tasks]

Model performance increases with scale, but with diminishing returns in some areas

Strong performance on benchmarks doesn't always translate to real-world capability.

Evaluation Challenges

Benchmark Saturation

Top models approach perfect scores on older benchmarks, necessitating the creation of harder tests

Contamination

Models may have seen benchmark data during pre-training, invalidating test results
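One common (and rough) heuristic for detecting contamination is checking for long n-gram overlap between benchmark items and pre-training documents. The sketch below is a simplified, hypothetical version of that idea; production pipelines work at corpus scale with heavier normalization and deduplication.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    """Flag the item if it shares any n-gram with the training document."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

# Illustrative strings only.
question = "John has 5 apples. He buys 2 more. He gives 3 to his friend."
training_snippet = ("Practice problems: John has 5 apples. He buys 2 more. "
                    "He gives 3 to his friend. How many are left?")
print(looks_contaminated(question, training_snippet))  # True: verbatim overlap
```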

Prompt Sensitivity

Performance varies dramatically based on how questions are phrased
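A quick way to see this is to render the same item under several prompt templates and compare scores; everything below (templates, choices, and the commented-out `evaluate` call) is a hypothetical illustration.

```python
question = "What is the Carnot efficiency of a heat engine operating between 300K and 200K?"
choices = ["25%", "33%", "50%", "67%"]

# Three phrasings of the same underlying task.
templates = [
    "Question: {q}\nChoices: {c}\nAnswer:",
    "{q} Pick the best option from: {c}.",
    "You are a physics tutor. {q} Options: {c}. Reply with one option only.",
]

for template in templates:
    prompt = template.format(q=question, c=", ".join(choices))
    print(prompt, "\n---")
    # accuracy = evaluate(model, prompt)  # hypothetical; scores often shift per template
```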

Human Evaluation

Many qualities (helpfulness, creativity) require subjective human judgment

Emergence

Some capabilities only emerge at certain scale thresholds rather than improving gradually

[Chart: emergent abilities appearing abruptly beyond certain scale thresholds]

Beyond Benchmarks: Real-World Value

The ultimate measure of an LLM is not its benchmark scores but its ability to solve real human problems and deliver value in practical applications.