Evaluating LLMs

The Evaluation Challenge

Evaluating LLMs is complex due to their general-purpose nature, open-ended outputs, and the subjective nature of many tasks.

Dimensions of Evaluation

  • Capability

    Can the model perform specific tasks successfully?

  • Reliability

    Does the model produce consistent, factual, and accurate outputs?

  • Safety

    Does the model avoid harmful, biased, or unethical outputs?

  • Efficiency

    How performant is the model in terms of speed, cost, and resource usage?

Evaluation Methods

Benchmarks

Standardized test datasets with defined metrics

Examples: MMLU, HELM, BIG-Bench

Human Evaluation

Direct assessment by human evaluators

Methods: A/B testing, Likert scales, preference ranking

Model-based Evaluation

Using other models to evaluate model outputs

Examples: GPT-4 as a judge, reward models

Red-Teaming

Deliberately trying to find model weaknesses

Focuses on: Adversarial attacks, prompt injections, jailbreaks

Popular Benchmarks

MMLU (Massive Multitask Language Understanding)

Multiple-choice questions across 57 subjects, from elementary to professional level

Subjects include:

Mathematics
Medicine
Law
Ethics
Physics
Psychology
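
Scoring a multiple-choice benchmark like MMLU usually reduces to prompting for a letter and computing accuracy. A minimal sketch of that pattern follows; the question format and the query_model callable are illustrative assumptions, not a specific benchmark harness.

    # Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
    # query_model is a placeholder for whatever API or local model is used.

    def format_prompt(question, choices):
        letters = "ABCD"
        options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
        return f"{question}\n{options}\nAnswer with a single letter (A, B, C, or D)."

    def evaluate_multiple_choice(items, query_model):
        correct = 0
        for item in items:
            reply = query_model(format_prompt(item["question"], item["choices"]))
            predicted = reply.strip().upper()[:1]      # first letter of the reply
            correct += predicted == item["answer"]     # gold answer is a letter, e.g. "B"
        return correct / len(items)

    # Stubbed example: a "model" that always answers B scores 100% on this one item.
    items = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
    print(evaluate_multiple_choice(items, query_model=lambda prompt: "B"))  # 1.0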

HumanEval & MBPP

Code generation and problem-solving benchmarks

HumanEval

164 Python programming problems with unit tests

MBPP

974 Python programming problems with test cases
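
Code benchmarks are typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal version of the unbiased estimator used for HumanEval-style reporting, where n samples are drawn per problem and c of them pass:

    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased pass@k estimate: n samples drawn, c of them pass the unit tests."""
        if n - c < k:                      # every size-k subset contains a passing sample
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 200 samples per problem, 37 of them passing:
    print(round(pass_at_k(200, 37, 1), 3))    # 0.185 (same as 37/200)
    print(round(pass_at_k(200, 37, 10), 3))   # well above pass@1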

TruthfulQA

Measures a model's tendency to reproduce falsehoods that are commonly believed by humans

Questions are crafted around common misconceptions, so models that imitate human-written text tend to reproduce the falsehoods

GSM8K & MATH

Mathematical reasoning benchmarks

GSM8K

8,500 grade school math word problems

MATH

12,500 challenging competition math problems
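
Math benchmarks are usually scored by exact match on the final answer. A simplified sketch of GSM8K-style grading; the number-extraction heuristic is an illustrative assumption, and real harnesses are stricter about answer formats.

    import re

    def extract_final_number(text):
        """Take the last number in a string as the final answer (simplified heuristic)."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return numbers[-1] if numbers else None

    def exact_match(prediction, reference):
        return extract_final_number(prediction) == extract_final_number(reference)

    # GSM8K reference solutions end with "#### <answer>":
    print(exact_match("She has 6 + 12 = 18 apples in total.", "#### 18"))  # True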

Evaluation Metrics & Advanced Approaches

Traditional NLP Metrics

Accuracy & F1 Score

For classification-based tasks

Used for: Multiple choice, yes/no questions
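
Both metrics are simple to compute from predictions and gold labels. A minimal sketch for a binary yes/no task, with "yes" as the positive class chosen purely for illustration:

    def accuracy(preds, labels):
        return sum(p == l for p, l in zip(preds, labels)) / len(labels)

    def f1_score(preds, labels, positive="yes"):
        tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
        fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
        fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    preds  = ["yes", "no", "yes", "yes"]
    labels = ["yes", "no", "no",  "yes"]
    print(accuracy(preds, labels), f1_score(preds, labels))   # accuracy 0.75, F1 0.8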

BLEU, ROUGE, METEOR

Text overlap metrics for generation tasks

Used for: Summarization, translation
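
These metrics are normally computed with dedicated libraries, but the core idea is n-gram overlap between a candidate and a reference. A simplified unigram (ROUGE-1-style) F1 sketch:

    from collections import Counter

    def rouge1_f1(candidate, reference):
        """Simplified ROUGE-1: unigram-overlap F1 between candidate and reference."""
        cand, ref = candidate.lower().split(), reference.lower().split()
        overlap = sum((Counter(cand) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(cand), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    print(rouge1_f1("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67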

Perplexity

Measures how well a model predicts text

Lower values indicate better prediction
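
Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch with made-up token probabilities:

    import math

    def perplexity(token_probs):
        """Exponential of the average negative log-probability assigned to each token."""
        avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(avg_nll)

    # A model that assigns higher probability to the observed tokens scores lower:
    print(perplexity([0.5, 0.4, 0.6]))    # ~2.0
    print(perplexity([0.1, 0.05, 0.2]))   # 10.0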

Traditional metrics often fail to capture the nuanced quality of LLM outputs, especially when multiple valid answers exist.

Emerging Evaluation Approaches

LLM-as-Judge

Using powerful LLMs to evaluate outputs of other models

Models like GPT-4 can provide nuanced evaluations of model outputs based on multiple criteria

Examples: Anthropic's Constitutional AI, LLM-as-Judge frameworks
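
A minimal sketch of the pattern: wrap the candidate answer in a grading prompt and ask a stronger model for a score. The judge_model callable, rubric wording, and 1-5 scale are illustrative assumptions rather than any particular framework's API.

    def judge_answer(question, answer, judge_model):
        """Ask a (stronger) judge model for a 1-5 quality score."""
        prompt = (
            "You are grading an AI assistant's answer.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Rate the answer from 1 (poor) to 5 (excellent) on factuality, "
            "relevance, and helpfulness. Reply with only the integer score."
        )
        reply = judge_model(prompt)                   # placeholder for an API call
        digits = [ch for ch in reply if ch.isdigit()]
        return int(digits[0]) if digits else None     # None if the reply is unparsable

    # Stubbed judge for illustration:
    print(judge_answer("What is the capital of France?", "Paris.",
                       judge_model=lambda prompt: "Score: 5"))   # 5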

Multi-dimensional Rubrics

Evaluating outputs on multiple specific dimensions

Factuality
Coherence
Relevance
Helpfulness
Conciseness
Safety

Real-world Task Completion

Measuring success on end-to-end practical tasks

Examples: Web navigation, API utilization, multi-step problem solving

Case Study: Evaluating Chatbots

Capability

Benchmarks:

  • MMLU for knowledge
  • GSM8K for reasoning
  • HumanEval for coding

Human Assessment:

  • Task completion success
  • Quality of solutions

Safety & Alignment

Adversarial Testing:

  • Red team challenges
  • Jailbreak resistance
  • Bias assessments

Value Alignment:

  • Helpfulness without harm
  • Refusal of unethical requests

User Experience

Quality Metrics:

  • Response relevance
  • Coherence & clarity
  • Conciseness

User Feedback:

  • Satisfaction surveys
  • User preference ratings
  • Retention metrics

Leaderboard Example: Chatbot Arena
Rank  Model            Elo Rating  Win Rate  Notable Strengths
1     Claude 3 Opus    1225        65%       Reasoning, instruction following
2     GPT-4            1220        63%       Knowledge, versatility
3     Claude 3 Sonnet  1175        55%       Balanced, efficient
4     Llama 3 70B      1155        52%       Open-source leadership
5     GPT-3.5 Turbo    1105        45%       Speed, efficiency
Note: Ratings are illustrative examples; actual ratings change regularly
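
Arena-style leaderboards aggregate pairwise human votes into ratings with an Elo-style (Bradley-Terry) model. A minimal sketch of the classic Elo update after one head-to-head vote; K = 32 is an illustrative constant, not Chatbot Arena's exact aggregation method.

    def elo_update(rating_a, rating_b, a_won, k=32):
        """One Elo update after a single head-to-head vote (a_won: True if A was preferred)."""
        expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
        score_a = 1.0 if a_won else 0.0
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b - k * (score_a - expected_a)
        return new_a, new_b

    # An upset win by the lower-rated model shifts both ratings noticeably:
    print(elo_update(1105, 1225, a_won=True))   # roughly (1126, 1204)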