LLM Learning Portal

Parameter-Efficient Fine-tuning


The Challenge of Full Fine-tuning

As LLMs grow larger, full fine-tuning becomes increasingly impractical:

Full Fine-tuning Limitations

  • Memory Requirements

    Updating all parameters means storing gradients and optimizer states (e.g. Adam's two moment estimates) for every parameter, roughly 3-4x the model's own memory footprint (a rough estimate follows this list)

  • Compute Costs

    Training a 7B parameter model can cost $1,000+ for a single run

  • Storage Overhead

    Each fine-tuned model is a complete copy of the base model (7B-175B parameters)

  • Deployment Complexity

    Managing multiple large model versions becomes unwieldy
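
To see where the "3-4x" figure comes from, here is a rough back-of-the-envelope sketch, assuming fp16 weights and gradients with Adam's fp32 moment estimates, and ignoring activations and any fp32 master copy of the weights:

```python
# Rough memory estimate for full fine-tuning of a 7B-parameter model with Adam.
# Assumes fp16 weights/gradients and fp32 optimizer states; activations are ignored.
params = 7e9

weights_gb = params * 2 / 1e9   # fp16 weights:           ~14 GB
grads_gb   = params * 2 / 1e9   # fp16 gradients:         ~14 GB
adam_m_gb  = params * 4 / 1e9   # fp32 first moment (m):  ~28 GB
adam_v_gb  = params * 4 / 1e9   # fp32 second moment (v): ~28 GB

# The optimizer states alone are ~4x the fp16 model size; the total is larger still.
total_gb = weights_gb + grads_gb + adam_m_gb + adam_v_gb
print(f"~{total_gb:.0f} GB before activations")   # ~84 GB
```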

The PEFT Solution

Parameter-Efficient Fine-Tuning (PEFT) methods train only a small subset of parameters while keeping most of the pre-trained model frozen.

Efficiently Updating Parameters

Often only 1% or less of a model's parameters need to be updated to achieve performance comparable to full fine-tuning.
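
A minimal sketch of the core recipe, assuming a generic PyTorch model (the layers below are toy stand-ins, not a real LLM): freeze every pre-trained parameter, add a small trainable module, and check the trainable ratio.

```python
import torch.nn as nn

# Toy "pre-trained" model standing in for a real LLM.
base_model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))

# 1. Freeze every pre-trained parameter.
for p in base_model.parameters():
    p.requires_grad = False

# 2. Add a small trainable module (e.g. an adapter or LoRA matrices).
trainable_extra = nn.Linear(512, 8)   # stands in for the PEFT parameters

trainable = sum(p.numel() for p in trainable_extra.parameters())
total = sum(p.numel() for p in base_model.parameters()) + trainable
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```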

PEFT Methods Overview

LoRA (Low-Rank Adaptation)

Adds small trainable low-rank matrices alongside frozen pre-trained weights

Key idea: Approximate weight updates with low-rank matrices (r << d)

Adapters

Insert small trainable modules between layers of the pre-trained model

Key idea: Down-project, apply non-linearity, then up-project back to original dimension
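
A minimal sketch of a bottleneck adapter, assuming the standard down-project / non-linearity / up-project layout with a residual connection (the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a frozen transformer sublayer."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # down-project
        self.up = nn.Linear(bottleneck, d_model)     # up-project back
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pre-trained representation intact.
        return x + self.up(self.act(self.down(x)))

# Example: adapt a 768-dimensional hidden state.
adapter = Adapter(d_model=768)
hidden = torch.randn(2, 16, 768)   # (batch, sequence, d_model)
out = adapter(hidden)              # same shape as the input
```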

Prompt Tuning

Add trainable continuous embedding vectors to the input sequence

Key idea: Learn soft prompts that shape model behavior without changing model parameters
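
A minimal sketch of soft prompts, assuming the trainable vectors are simply prepended to the token embeddings before the frozen transformer runs (shapes are illustrative):

```python
import torch
import torch.nn as nn

d_model, num_virtual_tokens = 768, 20

# Trainable "soft prompt": continuous vectors, not real vocabulary tokens.
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)

def prepend_soft_prompt(token_embeddings: torch.Tensor) -> torch.Tensor:
    """token_embeddings: (batch, seq_len, d_model) from the frozen embedding layer."""
    batch = token_embeddings.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([prompt, token_embeddings], dim=1)   # (batch, 20 + seq_len, d_model)

embeddings = torch.randn(4, 32, d_model)     # stand-in for real token embeddings
extended = prepend_soft_prompt(embeddings)   # fed to the frozen transformer as usual
```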

Prefix Tuning

Add trainable prefix parameters to each transformer layer

Key idea: Task-specific activations at each layer steer the model's behavior
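
One common formulation prepends trainable key/value vectors to every attention layer. A minimal sketch, with illustrative shapes and a hypothetical helper that a frozen attention layer would call:

```python
import torch
import torch.nn as nn

num_layers, num_heads, head_dim, prefix_len = 12, 12, 64, 10

# One trainable prefix of keys and values per transformer layer.
prefix_keys = nn.ParameterList(
    [nn.Parameter(torch.randn(prefix_len, num_heads, head_dim) * 0.02) for _ in range(num_layers)]
)
prefix_values = nn.ParameterList(
    [nn.Parameter(torch.randn(prefix_len, num_heads, head_dim) * 0.02) for _ in range(num_layers)]
)

def extend_kv(layer_idx, keys, values):
    """keys/values: (batch, seq_len, num_heads, head_dim) inside a frozen attention layer."""
    batch = keys.size(0)
    pk = prefix_keys[layer_idx].unsqueeze(0).expand(batch, -1, -1, -1)
    pv = prefix_values[layer_idx].unsqueeze(0).expand(batch, -1, -1, -1)
    # Attention now also attends over the learned prefix positions.
    return torch.cat([pk, keys], dim=1), torch.cat([pv, values], dim=1)
```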

QLoRA

Combines quantization with LoRA for even greater efficiency

Key idea: 4-bit quantization of base model with LoRA adapter training
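
A hedged sketch of the usual recipe with the Hugging Face transformers, bitsandbytes, and peft libraries (the model id and argument values are illustrative, and exact argument names can vary across library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit (NF4) quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # example model id
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; only these small full-precision matrices are trained.
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```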

LoRA: A Deeper Look

How LoRA Works

LoRA represents weight updates using low-rank decomposition:

W = W₀ + ΔW

ΔW = BA

where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k)

Key Parameters:

  • Rank (r): Usually 4-256 (smaller = more efficient)
  • Alpha (α): Scaling factor applied to the low-rank update (the update is scaled by α/r)
  • Target modules: Which weight matrices to adapt
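
A minimal sketch of a LoRA-augmented linear layer, assuming the formulation above with B initialized to zero (so training starts from the pre-trained behavior) and the update scaled by α/r:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W0 plus a trainable low-rank update scaled by alpha/r."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d_out x r, zero-init => ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W0ᵀ + scale * x (BA)ᵀ
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(d_in=768, d_out=768, r=8)
y = layer(torch.randn(4, 768))
```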

LoRA Benefits

  • No inference latency (the low-rank update can be merged into the original weights; see the merge sketch after this list)
  • Dramatically reduced memory usage (10-100x less)
  • Small adapter size (10-100 MB vs. several GB)
  • Comparable performance to full fine-tuning
  • Adapters can be swapped without reloading base model
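
The "no inference latency" point holds because the low-rank update can be folded back into the frozen weight once training is done. A minimal, self-contained sketch with illustrative values:

```python
import torch

d_out, d_in, r, alpha = 768, 768, 8, 16.0

W0 = torch.randn(d_out, d_in)      # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01    # trained LoRA factors
B = torch.randn(d_out, r) * 0.01

# Merge once after training: the deployed weight is a single dense matrix again,
# so inference is exactly as fast as the original model.
W_merged = W0 + (alpha / r) * (B @ A)

x = torch.randn(4, d_in)
y_adapter = x @ W0.t() + (alpha / r) * (x @ A.t() @ B.t())
y_merged = x @ W_merged.t()
assert torch.allclose(y_adapter, y_merged, atol=1e-4)
```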

LoRA in Practice

[Diagram] The input passes through the frozen pre-trained weights W₀ and, in parallel, through the trainable low-rank matrices A and B (rank r = 8); the two outputs are summed. Only ~0.1-1% of parameters are trained.

Common Target Modules

  • Query/Key/Value matrices: Most commonly adapted in attention layers
  • Feed-forward projections: Often targeted in MLP layers
  • Output projections: Attention output and layer connections

With LoRA, you can fine-tune a 7B parameter model on a single consumer GPU with 16GB of VRAM (a configuration sketch follows below).
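
A hedged configuration sketch with the peft library; module names such as q_proj, v_proj, and up_proj follow LLaMA-style models and differ by architecture, and the model id is only an example:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model id

config = LoraConfig(
    r=8,                  # rank of the low-rank update
    lora_alpha=16,        # scaling factor
    lora_dropout=0.05,
    # Attention projections plus MLP projections; adjust to the target architecture.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # typically well under 1% of all parameters
```

In practice, fitting a 7B model into 16GB usually also relies on quantizing the frozen base weights (as in the QLoRA sketch earlier) or other memory-saving tricks such as gradient checkpointing.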

Comparing PEFT Methods

Method          Parameter Efficiency   Memory Usage     Training Speed   Performance
LoRA            Very High              Very Low         Fast             Excellent
Adapters        High                   Low              Medium           Good
Prompt Tuning   Extremely High         Extremely Low    Very Fast        Variable
QLoRA           Very High              Extremely Low    Medium           Excellent