Deploying LLMs efficiently requires optimizing computational resources without sacrificing quality. The key metrics to track are:

- **Latency**: time to first token and time between tokens
- **Throughput**: number of tokens generated per second
- **Memory footprint**: VRAM required to run model inference
- **Cost**: compute resources consumed per token/request
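As a rough illustration, the first two metrics can be measured directly from a token stream. This is a minimal sketch assuming a hypothetical `generate_stream` callable that yields tokens as they are produced:

```python
import time

def measure_generation(generate_stream, prompt):
    """Measure time-to-first-token (latency) and tokens/sec (throughput).

    `generate_stream` is a hypothetical callable that yields tokens one at a time.
    """
    start = time.perf_counter()
    time_to_first_token = None
    n_tokens = 0
    for _token in generate_stream(prompt):
        if time_to_first_token is None:
            time_to_first_token = time.perf_counter() - start  # TTFT
        n_tokens += 1
    total = time.perf_counter() - start
    return time_to_first_token, n_tokens / total               # latency, throughput
```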
Training and inference place different demands on the system:

| Aspect | Training | Inference |
|---|---|---|
| Process | Batch-parallel | Auto-regressive |
| Computation | Forward + backward passes | Forward pass only |
| Memory usage | Very high (gradients, optimizer states) | Moderate (weights + activations/KV cache) |
| Time constraints | Less critical (offline) | Critical (user-facing latency) |
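The right-hand column corresponds to a loop like the following sketch, which assumes a hypothetical PyTorch `model` that maps token ids to next-token logits of shape (batch, seq, vocab):

```python
import torch

@torch.no_grad()          # inference needs no gradients, so memory stays moderate
def generate(model, input_ids, max_new_tokens=32):
    """Auto-regressive decoding: one forward pass per new token, no backward pass."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # forward only
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick for simplicity
        input_ids = torch.cat([input_ids, next_id], dim=-1)       # feed the output back in
    return input_ids
```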
Quantization reduces the numerical precision of model weights and activations:

| Format | Representation | Memory vs. FP32 |
|---|---|---|
| FP32 | 32-bit float | Baseline |
| FP16 | 16-bit float | 2x smaller |
| INT8 | 8-bit integer | 4x smaller |
| INT4 | 4-bit integer | 8x smaller |
Popular quantization methods include GPTQ, AWQ, and bitsandbytes.
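To make the table concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization; the function names are illustrative, not from any particular library:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization (assumes a non-zero tensor)."""
    scale = np.abs(weights).max() / 127.0            # one FP32 scale for the whole tensor
    q = np.round(weights / scale).astype(np.int8)    # values land in [-127, 127]
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes / q.nbytes)                           # ~4.0: the "4x smaller" row above
```

Production methods improve on this naive scheme with calibration data and finer-grained (per-channel or per-group) scales to limit accuracy loss.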
Other model compression techniques:

- **Model Pruning**: removing less important weights or entire attention heads
- **Knowledge Distillation**: training smaller "student" models to mimic larger "teacher" models (see the loss sketch after this list)
- **Structured Pruning**: removing entire layers or components for hardware efficiency
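A sketch of a typical distillation objective, assuming PyTorch; the blend of a softened teacher target with the usual hard-label loss is a standard recipe, but the exact loss used in any given project may differ:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a softened-teacher target with the usual hard-label loss.

    T (temperature) softens both distributions; alpha balances the two terms.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # standard rescaling for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```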
Beyond compressing the model, several runtime optimizations speed up inference itself.

**KV Caching** stores previously computed key-value pairs so that attention over earlier tokens is not recomputed, cutting the per-token attention cost from O(n²) to O(n) in sequence length.
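A toy decode loop showing the idea, with the learned projections omitted (the cache simply grows by one row per generated token):

```python
import torch

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = q @ K.T / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V

d = 64
K_cache, V_cache = torch.empty(0, d), torch.empty(0, d)   # grows by one row per step

for step in range(16):                       # toy decode loop
    x = torch.randn(d)                       # stand-in for the current token's hidden state
    k, v, q = x, x, x                        # real models apply learned K/V/Q projections here
    K_cache = torch.cat([K_cache, k[None]])  # append instead of recomputing all past K/V
    V_cache = torch.cat([V_cache, v[None]])
    out = attend(q, K_cache, V_cache)        # O(n) work per step instead of O(n^2)
```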
**FlashAttention** is a memory-efficient attention algorithm that computes attention in tiles, avoiding materialization of the full attention matrix; it reduces memory usage and can be up to 3x faster.
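FlashAttention is available through dedicated libraries, and recent PyTorch versions can dispatch to a fused kernel of this style via `scaled_dot_product_attention` on supported GPUs; a minimal sketch:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Shapes: (batch, heads, sequence length, head dimension)
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k, v = torch.randn_like(q), torch.randn_like(q)

# On supported GPUs, PyTorch 2.x can route this call to a fused FlashAttention-style kernel
# that processes attention in tiles instead of materializing the full 1024x1024 score matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```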
**Continuous batching** dynamically batches requests as they arrive, maximizing GPU utilization even when requests have very different lengths.
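A toy scheduler illustrating only the scheduling idea (no model calls); the request format and per-request step counts are invented for the example:

```python
from collections import deque
import random

def continuous_batching(waiting: deque, max_batch: int = 8):
    """Toy scheduler: each decode step, admit new requests and retire finished ones."""
    active = []                                       # requests currently being decoded
    while waiting or active:
        # Admit new requests as soon as a slot frees up, instead of waiting for the batch to drain.
        while waiting and len(active) < max_batch:
            active.append({"id": waiting.popleft(), "remaining": random.randint(1, 5)})
        # One decode step for the whole batch (stand-in for a model forward pass).
        for req in active:
            req["remaining"] -= 1
        finished = [r["id"] for r in active if r["remaining"] == 0]
        active = [r for r in active if r["remaining"] > 0]
        if finished:
            print("finished:", finished)

continuous_batching(deque(range(20)))
```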
On the hardware side:

- **Tensor Cores & TPUs**: specialized hardware for matrix operations
- **Multi-GPU Inference**: distributing model layers across multiple GPUs
- **CPU Offloading**: moving parts of the model to CPU when not in active use (see the sketch after this list)
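Both multi-GPU placement and CPU offloading are commonly driven by `device_map="auto"` in Hugging Face Transformers (backed by Accelerate); a sketch, with `"your-model-id"` as a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "your-model-id" is a placeholder. With Accelerate installed, device_map="auto" splits the
# layers across available GPUs and offloads whatever does not fit to CPU (or disk).
model = AutoModelForCausalLM.from_pretrained(
    "your-model-id",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-model-id")
```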
Popular Inference Frameworks

- **vLLM**: high-throughput serving with PagedAttention and continuous batching
- **TensorRT-LLM**: NVIDIA's optimized runtime for its GPUs
- **DeepSpeed Inference**: Microsoft's library with tensor and pipeline parallelism
- **GGML / llama.cpp**: quantized inference on CPUs and consumer hardware
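For example, a minimal offline-inference script with vLLM looks roughly like this (the model id is a placeholder and API details can shift between versions):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="your-model-id")                    # placeholder model id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```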
Decoding Strategies

**Greedy Decoding**: always select the most likely next token.
- Pros: fast, deterministic, simple to implement
- Cons: often repetitive or generic; a locally optimal token can lead to a globally worse sequence
**Beam Search**: maintain the top-k most probable sequences at each step.
- Pros: explores more candidates; usually stronger on tasks with a well-defined target (translation, summarization)
- Cons: slower and more memory-intensive; tends toward safe, repetitive text in open-ended generation
**Sampling** (temperature, top-k, top-p): sample the next token from the probability distribution.
- Pros: diverse, creative output; behavior is tunable via temperature, top-k, and top-p
- Cons: non-deterministic; quality degrades if the temperature is set too high
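A sketch contrasting greedy selection with temperature/top-k sampling, operating on a single step's logits (beam search is omitted since it needs to track whole candidate sequences):

```python
import torch

def greedy(logits: torch.Tensor) -> int:
    """Greedy decoding: always take the single most likely token."""
    return int(torch.argmax(logits))

def sample(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Temperature + top-k sampling: draw from a truncated, re-scaled distribution."""
    logits = logits / temperature                           # <1 sharpens, >1 flattens
    topk = torch.topk(logits, top_k)
    masked = torch.full_like(logits, float("-inf"))
    masked = masked.scatter(0, topk.indices, topk.values)   # keep only the k best tokens
    probs = torch.softmax(masked, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

step_logits = torch.randn(32_000)    # stand-in for one decoding step over a 32k vocabulary
print(greedy(step_logits), sample(step_logits))
```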