Deploying LLMs efficiently requires optimizing computational resources without sacrificing quality. The key metrics to track are:

- **Latency**: time to first token and time between tokens
- **Throughput**: number of tokens generated per second
- **Memory footprint**: VRAM required to run model inference
- **Cost**: compute resources consumed per token/request
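As a rough illustration, the first two metrics can be measured directly from a token stream. This is a minimal sketch assuming a hypothetical `generate_stream` callable that yields tokens as they are produced:

```python
import time

def measure_generation(generate_stream, prompt):
    """Measure time-to-first-token (latency) and tokens/sec (throughput).

    `generate_stream` is a hypothetical callable that yields tokens one at a time.
    """
    start = time.perf_counter()
    time_to_first_token = None
    n_tokens = 0
    for _token in generate_stream(prompt):
        if time_to_first_token is None:
            time_to_first_token = time.perf_counter() - start  # TTFT
        n_tokens += 1
    total = time.perf_counter() - start
    return time_to_first_token, n_tokens / total               # latency, throughput
```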
Training and inference place different demands on the system:

| Aspect | Training | Inference |
|---|---|---|
| Process | Batch-parallel | Auto-regressive |
| Computation | Forward + backward passes | Forward pass only |
| Memory usage | Very high (gradients, optimizer states) | Moderate (weights + activations/KV cache) |
| Time constraints | Less critical (offline) | Critical (user-facing latency) |
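The right-hand column corresponds to a loop like the following sketch, which assumes a hypothetical PyTorch `model` that maps token ids to next-token logits of shape (batch, seq, vocab):

```python
import torch

@torch.no_grad()          # inference needs no gradients, so memory stays moderate
def generate(model, input_ids, max_new_tokens=32):
    """Auto-regressive decoding: one forward pass per new token, no backward pass."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # forward only
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick for simplicity
        input_ids = torch.cat([input_ids, next_id], dim=-1)       # feed the output back in
    return input_ids
```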
Quantization reduces the numerical precision of model weights and activations:

| Format | Representation | Memory vs. FP32 |
|---|---|---|
| FP32 | 32-bit float | Baseline |
| FP16 | 16-bit float | 2x smaller |
| INT8 | 8-bit integer | 4x smaller |
| INT4 | 4-bit integer | 8x smaller |
Popular quantization methods include GPTQ, AWQ, and bitsandbytes.
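To make the table concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization; the function names are illustrative, not from any particular library:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization (assumes a non-zero tensor)."""
    scale = np.abs(weights).max() / 127.0            # one FP32 scale for the whole tensor
    q = np.round(weights / scale).astype(np.int8)    # values land in [-127, 127]
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes / q.nbytes)                           # ~4.0: the "4x smaller" row above
```

Production methods improve on this naive scheme with calibration data and finer-grained (per-channel or per-group) scales to limit accuracy loss.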
Other model compression techniques:

- **Model Pruning**: removing less important weights or entire attention heads
- **Knowledge Distillation**: training smaller "student" models to mimic larger "teacher" models (see the loss sketch after this list)
- **Structured Pruning**: removing entire layers or components for hardware efficiency
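A sketch of a typical distillation objective, assuming PyTorch; the blend of a softened teacher target with the usual hard-label loss is a standard recipe, but the exact loss used in any given project may differ:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a softened-teacher target with the usual hard-label loss.

    T (temperature) softens both distributions; alpha balances the two terms.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # standard rescaling for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```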
Beyond compressing the model, several runtime optimizations speed up inference itself.

**KV Caching** stores previously computed key-value pairs so that attention over earlier tokens is not recomputed, cutting the per-token attention cost from O(n²) to O(n) in sequence length.
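A toy decode loop showing the idea, with the learned projections omitted (the cache simply grows by one row per generated token):

```python
import torch

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = q @ K.T / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V

d = 64
K_cache, V_cache = torch.empty(0, d), torch.empty(0, d)   # grows by one row per step

for step in range(16):                       # toy decode loop
    x = torch.randn(d)                       # stand-in for the current token's hidden state
    k, v, q = x, x, x                        # real models apply learned K/V/Q projections here
    K_cache = torch.cat([K_cache, k[None]])  # append instead of recomputing all past K/V
    V_cache = torch.cat([V_cache, v[None]])
    out = attend(q, K_cache, V_cache)        # O(n) work per step instead of O(n^2)
```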
**FlashAttention** is a memory-efficient attention algorithm that computes attention in tiles, avoiding materialization of the full attention matrix; it reduces memory usage and can be up to 3x faster.
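FlashAttention is available through dedicated libraries, and recent PyTorch versions can dispatch to a fused kernel of this style via `scaled_dot_product_attention` on supported GPUs; a minimal sketch:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Shapes: (batch, heads, sequence length, head dimension)
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k, v = torch.randn_like(q), torch.randn_like(q)

# On supported GPUs, PyTorch 2.x can route this call to a fused FlashAttention-style kernel
# that processes attention in tiles instead of materializing the full 1024x1024 score matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```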
**Continuous batching** dynamically batches requests as they arrive, maximizing GPU utilization even when requests have very different lengths.
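A toy scheduler illustrating only the scheduling idea (no model calls); the request format and per-request step counts are invented for the example:

```python
from collections import deque
import random

def continuous_batching(waiting: deque, max_batch: int = 8):
    """Toy scheduler: each decode step, admit new requests and retire finished ones."""
    active = []                                       # requests currently being decoded
    while waiting or active:
        # Admit new requests as soon as a slot frees up, instead of waiting for the batch to drain.
        while waiting and len(active) < max_batch:
            active.append({"id": waiting.popleft(), "remaining": random.randint(1, 5)})
        # One decode step for the whole batch (stand-in for a model forward pass).
        for req in active:
            req["remaining"] -= 1
        finished = [r["id"] for r in active if r["remaining"] == 0]
        active = [r for r in active if r["remaining"] > 0]
        if finished:
            print("finished:", finished)

continuous_batching(deque(range(20)))
```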
On the hardware side:

- **Tensor Cores & TPUs**: specialized hardware for matrix operations
- **Multi-GPU Inference**: distributing model layers across multiple GPUs
- **CPU Offloading**: moving parts of the model to CPU when not in active use (see the sketch after this list)
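Both multi-GPU placement and CPU offloading are commonly driven by `device_map="auto"` in Hugging Face Transformers (backed by Accelerate); a sketch, with `"your-model-id"` as a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-model-id" is a placeholder. With Accelerate installed, device_map="auto" splits the
# layers across available GPUs and offloads whatever does not fit to CPU (or disk).
model = AutoModelForCausalLM.from_pretrained(
    "your-model-id",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-model-id")
```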
Popular Inference Frameworks

- **vLLM**: high-throughput serving with PagedAttention and continuous batching
- **TensorRT-LLM**: NVIDIA's optimized runtime for its GPUs
- **DeepSpeed Inference**: Microsoft's library with tensor and pipeline parallelism
- **GGML / llama.cpp**: quantized inference on CPUs and consumer hardware
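For example, a minimal offline-inference script with vLLM looks roughly like this (the model id is a placeholder and API details can shift between versions):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="your-model-id")                    # placeholder model id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```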
Decoding Strategies

**Greedy Decoding**: always select the most likely next token.
- Pros: fast, deterministic, simple to implement
- Cons: often repetitive or generic; a locally optimal token can lead to a globally worse sequence
**Beam Search**: maintain the top-k most probable sequences at each step.
- Pros: explores more candidates; usually stronger on tasks with a well-defined target (translation, summarization)
- Cons: slower and more memory-intensive; tends toward safe, repetitive text in open-ended generation
**Sampling** (temperature, top-k, top-p): sample the next token from the probability distribution.
- Pros: diverse, creative output; behavior is tunable via temperature, top-k, and top-p
- Cons: non-deterministic; quality degrades if the temperature is set too high
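A sketch contrasting greedy selection with temperature/top-k sampling, operating on a single step's logits (beam search is omitted since it needs to track whole candidate sequences):

```python
import torch

def greedy(logits: torch.Tensor) -> int:
    """Greedy decoding: always take the single most likely token."""
    return int(torch.argmax(logits))

def sample(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Temperature + top-k sampling: draw from a truncated, re-scaled distribution."""
    logits = logits / temperature                           # <1 sharpens, >1 flattens
    topk = torch.topk(logits, top_k)
    masked = torch.full_like(logits, float("-inf"))
    masked = masked.scatter(0, topk.indices, topk.values)   # keep only the k best tokens
    probs = torch.softmax(masked, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

step_logits = torch.randn(32_000)    # stand-in for one decoding step over a 32k vocabulary
print(greedy(step_logits), sample(step_logits))
```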