Inference Optimization

The Inference Challenge

Deploying LLMs efficiently requires optimizing computational resources without sacrificing quality.

Key Inference Metrics

  • Latency

    Time to first token and time between tokens

  • Throughput

    Number of tokens generated per second

  • Memory Utilization

    VRAM required to run model inference

  • Cost Efficiency

    Compute resources per token/request
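
Latency and throughput can be measured directly from a token stream. The sketch below is framework-agnostic; stream_tokens is a hypothetical stand-in for whatever streaming generation API your serving stack exposes.

```python
import time

def measure_latency_and_throughput(stream_tokens, prompt):
    """Measure time to first token (TTFT) and decode throughput.

    stream_tokens: any callable that yields generated tokens one at a
    time for the given prompt (hypothetical placeholder).
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _token in stream_tokens(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start      # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    tokens_per_second = n_tokens / total if total > 0 else 0.0
    return ttft, tokens_per_second
```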

Inference vs. Training

Aspect             Training              Inference
Process            Batch-parallel        Auto-regressive
Computation        Forward + Backward    Forward only
Memory usage       Very high             Moderate
Time constraints   Less critical         User-facing

Model Size Reduction

Quantization

Reducing numerical precision of model weights and activations:

FP32    32-bit float     Base precision
FP16    16-bit float     2x smaller
INT8    8-bit integer    4x smaller
INT4    4-bit integer    8x smaller

Popular quantization methods:

  • GPTQ: Post-training quantization optimized for transformers
  • AWQ: Activation-aware weight quantization
  • QLoRA: Quantized weights with trainable adapters
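
Methods such as GPTQ and AWQ add calibration data and group-wise scaling, but the basic mechanics can be seen in a naive symmetric per-tensor INT8 scheme. A minimal NumPy sketch (not a production quantizer):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization of a weight matrix.
    Returns the int8 weights plus the scale needed to dequantize."""
    scale = np.abs(w).max() / 127.0                # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes / q.nbytes)                         # 4x smaller (FP32 -> INT8)
print(np.abs(w - dequantize(q, scale)).mean())     # small reconstruction error
```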

Pruning & Distillation

Model Pruning

Removing less important weights or entire attention heads

Speed: ★★★☆☆ Quality loss: ★★☆☆☆

Knowledge Distillation

Training smaller "student" models to mimic larger "teacher" models (a typical loss is sketched after these techniques)

Speed: ★★★★☆ Quality loss: ★★★☆☆

Structured Pruning

Removing entire layers or components for hardware efficiency

Speed: ★★★★★ Quality loss: ★★★★☆
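
As an illustration of the distillation idea, one common formulation blends a temperature-softened KL term (soft teacher targets) with the usual cross-entropy on hard labels. A minimal PyTorch sketch; the temperature T and mixing weight alpha are illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: KL between temperature-softened teacher
    and student distributions, mixed with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                    # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```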

Inference Optimizations

Algorithmic Optimizations

Key-Value Caching

Storing previously computed key-value pairs to avoid redundant computations

Reduces per-token attention cost from O(n²) (recomputing the full prefix) to O(n) in the sequence length
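
A minimal single-head, NumPy-only sketch of one decode step with a KV cache (weights and shapes are illustrative):

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, cache):
    """One autoregressive step with a KV cache. Only the new token's
    K and V are computed; earlier ones are reused from `cache`."""
    q = x_t @ W_q
    cache["K"].append(x_t @ W_k)        # O(1) new projection work per step
    cache["V"].append(x_t @ W_v)        # instead of reprojecting the whole prefix
    K = np.stack(cache["K"])            # (t, d) cached keys
    V = np.stack(cache["V"])            # (t, d) cached values
    scores = K @ q / np.sqrt(len(q))    # attention over t cached keys: O(t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # context vector for the new token

d = 64
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
cache = {"K": [], "V": []}
for x_t in np.random.randn(10, d):      # 10 decode steps
    out = decode_step(x_t, W_q, W_k, W_v, cache)
```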

Flash Attention

Memory-efficient attention algorithm

Tiles the computation so the full n×n attention matrix is never materialized in GPU memory; up to 3x faster
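
The real FlashAttention is a fused GPU kernel, but its core trick, an online softmax over key/value tiles that never materializes the full score vector, can be shown for a single query in NumPy:

```python
import numpy as np

def tiled_attention(q, K, V, block=128):
    """Attention for one query, processing K/V in blocks with an online
    softmax. Mathematically equal to softmax(q @ K.T / sqrt(d)) @ V."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros_like(V[0], dtype=np.float64)   # running weighted sum of values
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)              # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # rescale old terms to new max
        p = np.exp(s - m_new)
        acc = acc * scale + p @ v_blk
        l = l * scale + p.sum()
        m = m_new
    return acc / l

d = 64
q, K, V = np.random.randn(d), np.random.randn(1000, d), np.random.randn(1000, d)
s = K @ q / np.sqrt(d)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(tiled_attention(q, K, V), ref)   # matches full attention
```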

Continuous Batching

Dynamic batching of requests as they arrive

Maximizes GPU utilization with variable-length requests
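
A toy scheduler illustrates the idea: requests join the running batch as soon as a slot frees up, rather than waiting for the whole batch to drain. Here the remaining-token counts stand in for real sequences:

```python
import collections
import itertools

def run_continuous_batching(request_lengths, max_batch=8):
    """Toy simulation: each request is just the number of tokens it still
    needs. Finished requests free their slot immediately for new arrivals."""
    waiting = collections.deque(request_lengths)
    active = {}                          # request id -> tokens remaining
    next_id = itertools.count()
    steps = 0
    while waiting or active:
        while waiting and len(active) < max_batch:   # admit into free slots
            active[next(next_id)] = waiting.popleft()
        for rid in list(active):         # one decode step: each request emits a token
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]          # slot frees on the very next step
        steps += 1
    return steps

print(run_continuous_batching([30, 5, 5, 5, 40, 10], max_batch=2))
```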

Hardware Acceleration

Tensor Cores & TPUs

Specialized hardware for matrix operations

Up to 5x speedup for compatible operations
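
In PyTorch, running matrix multiplies under autocast is the usual way to let compatible ops hit half-precision tensor-core kernels. A sketch, assuming a CUDA-capable GPU:

```python
import torch

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

# Under autocast, eligible ops (matmul, linear, conv) run in FP16 and can be
# dispatched to tensor-core kernels; precision-sensitive ops stay in FP32.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = x @ w

print(y.dtype)   # torch.float16
```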

Multi-GPU Inference

Distributing model layers across multiple GPUs

Enables running models too large for a single GPU

CPU Offloading

Moving parts of the model to CPU when not in active use

Reduces VRAM requirements with some latency cost
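
With Hugging Face transformers and accelerate, device_map="auto" covers both cases: layers are sharded across available GPUs and any overflow is offloaded to CPU memory. A sketch; the checkpoint name is illustrative and exact arguments may vary by library version:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"   # illustrative checkpoint

# device_map="auto" shards layers across visible GPUs and offloads the
# remainder to CPU memory when VRAM runs out.
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float16,   # half-precision weights to cut memory
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(name)

inputs = tokenizer("KV caching is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```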

Popular Inference Frameworks

  • vLLM
  • TensorRT-LLM
  • DeepSpeed Inference
  • GGML / llama.cpp
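
For example, vLLM's offline API (which implements continuous batching and paged KV caching) looks roughly like this; the model name is illustrative and the exact interface may differ between versions:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")     # illustrative checkpoint
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```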

Decoding Strategies

Greedy Decoding

Always select the most likely next token

Pros:

  • Fast, deterministic

Cons:

  • Repetitive, lacks creativity

Beam Search

Maintain top-k probable sequences

Pros:

  • Better quality than greedy

Cons:

  • Higher memory, still rigid

Sampling with Temperature

Sample from probability distribution

Pros:

  • Creative, diverse outputs

Cons:

  • Can produce errors/inconsistencies

Other popular techniques include nucleus (top-p) sampling, repetition penalties, and length normalization.
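
These strategies differ only in how the next token is chosen from the model's output logits. A self-contained NumPy sketch of greedy decoding, temperature sampling, and one common top-p (nucleus) variant:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def greedy(logits):
    return int(np.argmax(logits))                 # always the single most likely token

def sample_temperature(logits, T=0.8):
    probs = softmax(logits / T)                   # T < 1 sharpens, T > 1 flattens
    return int(rng.choice(len(logits), p=probs))

def sample_top_p(logits, p=0.9):
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]               # tokens from most to least likely
    keep = np.cumsum(probs[order]) <= p
    keep[0] = True                                # always keep at least the top token
    nucleus = order[keep]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renorm))

logits = rng.normal(size=32000)                   # stand-in for vocabulary logits
print(greedy(logits), sample_temperature(logits), sample_top_p(logits))
```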