Evaluating LLMs

The Evaluation Challenge

Evaluating LLMs is complex due to their general-purpose nature, open-ended outputs, and the subjective nature of many tasks.

Dimensions of Evaluation

  • Capability

    Can the model perform specific tasks successfully?

  • Reliability

    Does the model produce consistent, factual, and accurate outputs?

  • Safety

    Does the model avoid harmful, biased, or unethical outputs?

  • Efficiency

    How performant is the model in terms of speed, cost, and resource usage?

Evaluation Methods

Benchmarks

Standardized test datasets with defined metrics

Examples: MMLU, HELM, BIG-Bench

Human Evaluation

Direct assessment by human evaluators

Methods: A/B testing, Likert scales, preference ranking

Model-based Evaluation

Using other models to evaluate model outputs

Examples: GPT-4 as a judge, reward models

Red-Teaming

Deliberately trying to find model weaknesses

Focuses on: Adversarial attacks, prompt injections, jailbreaks

Popular Benchmarks

MMLU (Massive Multitask Language Understanding)

Multiple-choice questions across 57 subjects, from elementary to professional level

Subjects include:

Mathematics
Medicine
Law
Ethics
Physics
Psychology
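
Scoring a multiple-choice benchmark like MMLU usually reduces to prompting for a letter and computing accuracy. A minimal sketch of that pattern follows; the question format and the query_model callable are illustrative assumptions, not a specific benchmark harness.

    # Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
    # query_model is a placeholder for whatever API or local model is used.

    def format_prompt(question, choices):
        letters = "ABCD"
        options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
        return f"{question}\n{options}\nAnswer with a single letter (A, B, C, or D)."

    def evaluate_multiple_choice(items, query_model):
        correct = 0
        for item in items:
            reply = query_model(format_prompt(item["question"], item["choices"]))
            predicted = reply.strip().upper()[:1]      # first letter of the reply
            correct += predicted == item["answer"]     # gold answer is a letter, e.g. "B"
        return correct / len(items)

    # Stubbed example: a "model" that always answers B scores 100% on this one item.
    items = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
    print(evaluate_multiple_choice(items, query_model=lambda prompt: "B"))  # 1.0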

HumanEval & MBPP

Code generation and problem-solving benchmarks

HumanEval

164 Python programming problems with unit tests

MBPP

974 Python programming problems with test cases
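
Code benchmarks are typically reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal version of the unbiased estimator used for HumanEval-style reporting, where n samples are drawn per problem and c of them pass:

    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased pass@k estimate: n samples drawn, c of them pass the unit tests."""
        if n - c < k:                      # every size-k subset contains a passing sample
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 200 samples per problem, 37 of them passing:
    print(round(pass_at_k(200, 37, 1), 3))    # 0.185 (same as 37/200)
    print(round(pass_at_k(200, 37, 10), 3))   # well above pass@1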

TruthfulQA

Measures a model's tendency to reproduce falsehoods that are commonly believed by humans

Questions are crafted around common misconceptions, so models that imitate human-written text tend to reproduce the falsehoods

GSM8K & MATH

Mathematical reasoning benchmarks

GSM8K

8,500 grade school math word problems

MATH

12,500 challenging competition math problems
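
Math benchmarks are usually scored by exact match on the final answer. A simplified sketch of GSM8K-style grading; the number-extraction heuristic is an illustrative assumption, and real harnesses are stricter about answer formats.

    import re

    def extract_final_number(text):
        """Take the last number in a string as the final answer (simplified heuristic)."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return numbers[-1] if numbers else None

    def exact_match(prediction, reference):
        return extract_final_number(prediction) == extract_final_number(reference)

    # GSM8K reference solutions end with "#### <answer>":
    print(exact_match("She has 6 + 12 = 18 apples in total.", "#### 18"))  # True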

Evaluation Metrics & Advanced Approaches

Traditional NLP Metrics

Accuracy & F1 Score

For classification-based tasks

Used for: Multiple choice, yes/no questions
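
Both metrics are simple to compute from predictions and gold labels. A minimal sketch for a binary yes/no task, with "yes" as the positive class chosen purely for illustration:

    def accuracy(preds, labels):
        return sum(p == l for p, l in zip(preds, labels)) / len(labels)

    def f1_score(preds, labels, positive="yes"):
        tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
        fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
        fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    preds  = ["yes", "no", "yes", "yes"]
    labels = ["yes", "no", "no",  "yes"]
    print(accuracy(preds, labels), f1_score(preds, labels))   # accuracy 0.75, F1 0.8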

BLEU, ROUGE, METEOR

Text overlap metrics for generation tasks

Used for: Summarization, translation
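
These metrics are normally computed with dedicated libraries, but the core idea is n-gram overlap between a candidate and a reference. A simplified unigram (ROUGE-1-style) F1 sketch:

    from collections import Counter

    def rouge1_f1(candidate, reference):
        """Simplified ROUGE-1: unigram-overlap F1 between candidate and reference."""
        cand, ref = candidate.lower().split(), reference.lower().split()
        overlap = sum((Counter(cand) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(cand), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    print(rouge1_f1("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67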

Perplexity

Measures how well a model predicts text

Lower values indicate better prediction
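
Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch with made-up token probabilities:

    import math

    def perplexity(token_probs):
        """Exponential of the average negative log-probability assigned to each token."""
        avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(avg_nll)

    # A model that assigns higher probability to the observed tokens scores lower:
    print(perplexity([0.5, 0.4, 0.6]))    # ~2.0
    print(perplexity([0.1, 0.05, 0.2]))   # 10.0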

Traditional metrics often fail to capture the nuanced quality of LLM outputs, especially when multiple valid answers exist.

Emerging Evaluation Approaches

LLM-as-Judge

Using powerful LLMs to evaluate outputs of other models

Models like GPT-4 can provide nuanced evaluations of model outputs based on multiple criteria

Examples: Anthropic's Constitutional AI, LLM-as-Judge frameworks
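
A minimal sketch of the pattern: wrap the candidate answer in a grading prompt and ask a stronger model for a score. The judge_model callable, rubric wording, and 1-5 scale are illustrative assumptions rather than any particular framework's API.

    def judge_answer(question, answer, judge_model):
        """Ask a (stronger) judge model for a 1-5 quality score."""
        prompt = (
            "You are grading an AI assistant's answer.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Rate the answer from 1 (poor) to 5 (excellent) on factuality, "
            "relevance, and helpfulness. Reply with only the integer score."
        )
        reply = judge_model(prompt)                   # placeholder for an API call
        digits = [ch for ch in reply if ch.isdigit()]
        return int(digits[0]) if digits else None     # None if the reply is unparsable

    # Stubbed judge for illustration:
    print(judge_answer("What is the capital of France?", "Paris.",
                       judge_model=lambda prompt: "Score: 5"))   # 5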

Multi-dimensional Rubrics

Evaluating outputs on multiple specific dimensions

Factuality
Coherence
Relevance
Helpfulness
Conciseness
Safety

Real-world Task Completion

Measuring success on end-to-end practical tasks

Examples: Web navigation, API utilization, multi-step problem solving

Case Study: Evaluating Chatbots

Capability

Benchmarks:

  • MMLU for knowledge
  • GSM8K for reasoning
  • HumanEval for coding

Human Assessment:

  • Task completion success
  • Quality of solutions

Safety & Alignment

Adversarial Testing:

  • Red team challenges
  • Jailbreak resistance
  • Bias assessments

Value Alignment:

  • Helpfulness without harm
  • Refusal of unethical requests

User Experience

Quality Metrics:

  • Response relevance
  • Coherence & clarity
  • Conciseness

User Feedback:

  • Satisfaction surveys
  • User preference ratings
  • Retention metrics

Leaderboard Example: Chatbot Arena
Rank  Model            Elo Rating  Win Rate  Notable Strengths
1     Claude 3 Opus    1225        65%       Reasoning, instruction following
2     GPT-4            1220        63%       Knowledge, versatility
3     Claude 3 Sonnet  1175        55%       Balanced, efficient
4     Llama 3 70B      1155        52%       Open-source leadership
5     GPT-3.5 Turbo    1105        45%       Speed, efficiency
Note: Ratings are illustrative examples; actual ratings change regularly
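
Arena-style leaderboards aggregate pairwise human votes into ratings with an Elo-style (Bradley-Terry) model. A minimal sketch of the classic Elo update after one head-to-head vote; K = 32 is an illustrative constant, not Chatbot Arena's exact aggregation method.

    def elo_update(rating_a, rating_b, a_won, k=32):
        """One Elo update after a single head-to-head vote (a_won: True if A was preferred)."""
        expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
        score_a = 1.0 if a_won else 0.0
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b - k * (score_a - expected_a)
        return new_a, new_b

    # An upset win by the lower-rated model shifts both ratings noticeably:
    print(elo_update(1105, 1225, a_won=True))   # roughly (1126, 1204)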