Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant information from external knowledge sources and incorporating it into the generation process.
- Gives LLMs access to information beyond their training cutoff date
- Reduces hallucinations by grounding responses in retrieved, verifiable sources
- Tailors general-purpose LLMs to specific domains without fine-tuning
- Enables citation and verification of information sources
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Information updates | Dynamic & immediate | Requires retraining |
| Implementation | Simpler architecture | Complex training process |
| Resource requirements | Lower compute needs | High compute & data needs |
| Data transparency | Clear provenance | Black-box knowledge |
| Best for | Knowledge-intensive tasks | Style & capability adaptation |
1. User Query
2. Retrieval System: finds relevant documents
3. Context Augmentation: enhances the prompt with retrieved information
4. LLM Generation: creates a response using the augmented context
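A minimal, framework-free sketch of this flow is shown below; `retrieve` and `generate` are hypothetical placeholders for a vector-store search and an LLM completion call, not a specific library's API.

```python
# Minimal RAG flow mirroring the four steps above.
# `retrieve(query, k)` is assumed to return a list of text passages;
# `generate(prompt)` is assumed to return the LLM's completion as a string.
def rag_answer(query: str, retrieve, generate, k: int = 4) -> str:
    passages = retrieve(query, k=k)                     # 2. Retrieval System
    context = "\n\n".join(passages)                     # 3. Context Augmentation
    prompt = (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)                             # 4. LLM Generation
```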
Indexing: processes and organizes knowledge sources for efficient retrieval
Retrieval strategies: methods for finding the most relevant context
Dense Retrieval
Uses semantic similarity between query and document embeddings
Sparse Retrieval
Keyword-based methods like BM25, TF-IDF
Hybrid Retrieval
Combines dense and sparse signals for better recall and precision (see the sketch after this list)
Re-ranking
Re-scores the initial candidate set with a stronger model to surface the most relevant passages
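A minimal sketch of hybrid retrieval under some explicit assumptions: `sparse_score` (e.g. BM25) and `dense_score` (e.g. embedding cosine similarity) are hypothetical helpers, and min-max normalization with an `alpha` blend is one common weighting scheme among several.

```python
from typing import Callable, Dict, List

def hybrid_scores(
    query: str,
    doc_ids: List[str],
    sparse_score: Callable[[str, str], float],  # assumed keyword scorer, e.g. BM25
    dense_score: Callable[[str, str], float],   # assumed semantic scorer, e.g. cosine similarity
    alpha: float = 0.5,                         # weight between sparse and dense signals
) -> Dict[str, float]:
    """Blend normalized sparse and dense scores into a single ranking score per document."""
    sparse = {d: sparse_score(query, d) for d in doc_ids}
    dense = {d: dense_score(query, d) for d in doc_ids}

    def minmax(scores: Dict[str, float]) -> Dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    sparse_n, dense_n = minmax(sparse), minmax(dense)
    return {d: alpha * sparse_n[d] + (1 - alpha) * dense_n[d] for d in doc_ids}
```

Documents are then ranked by the blended score, and the top results can be passed to a re-ranker for a final ordering.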
Context augmentation: how retrieved information is used in generation
Prompt Engineering
Structuring retrieved content effectively in the prompt (a prompt-assembly sketch follows this list)
Contextual Relevance
Ensuring retrieved information actually addresses the query
Citation & Attribution
Tracking sources through the generation process
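An illustrative prompt-assembly helper that numbers each retrieved passage so the model can cite sources as [1], [2], and so on; the passage dictionary shape and the instruction wording are assumptions for the sketch.

```python
# Build a prompt that keeps source attribution visible to the model.
# Each passage is assumed to be a dict like {"source": "...", "text": "..."}.
def build_prompt(question: str, passages: list) -> str:
    context_lines = [
        f"[{i}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    ]
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the passages you relied on as [n].\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
```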
Query transformation: improving query effectiveness before retrieval
Query Expansion
Adding related terms to improve recall
Hypothetical Document Embeddings (HyDE)
Using an LLM to generate a hypothetical answer, then embedding that answer for retrieval (sketched after this list)
Query Decomposition
Breaking complex queries into simpler sub-queries
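A compact HyDE sketch; `llm`, `embed`, and `vector_search` are assumed placeholders for an LLM call, an embedding function, and a nearest-neighbor search over document embeddings.

```python
# HyDE: embed a hypothetical answer instead of the raw query, since a plausible
# answer usually lies closer to relevant documents in embedding space.
def hyde_retrieve(query: str, llm, embed, vector_search, k: int = 4):
    hypothetical = llm(f"Write a short passage that plausibly answers: {query}")
    return vector_search(embed(hypothetical), k=k)
```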
Iterative retrieval: approaches that refine context over multiple retrieval rounds (a generic loop is sketched after this list)
Self-RAG
The LLM critiques its own retrievals and generations and decides when to retrieve again
FLARE (Forward-Looking Active REtrieval)
Dynamically retrieves information during generation
ReAct
Interleaving reasoning and retrieval actions
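These methods differ in their details, but they share a retrieve-reason loop. The sketch below shows only that shared control flow under stated assumptions (`llm` and `retrieve` are placeholder callables, and the "SEARCH:" convention is invented for illustration); it is not a faithful reimplementation of Self-RAG, FLARE, or ReAct.

```python
# Generic iterative retrieval: retrieve, draft, and let the model request more
# context until it is satisfied or a round limit is reached.
def iterative_answer(query: str, llm, retrieve, max_rounds: int = 3) -> str:
    context, search_query = [], query
    for _ in range(max_rounds):
        context.extend(retrieve(search_query))
        draft = llm(
            "Context:\n" + "\n".join(context)
            + f"\n\nQuestion: {query}\n"
            "If the context is sufficient, answer directly. Otherwise reply "
            "exactly 'SEARCH: <better search query>'."
        )
        if not draft.startswith("SEARCH:"):
            return draft
        search_query = draft[len("SEARCH:"):].strip()
    # Fall back to answering with whatever context was gathered.
    return llm("Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:")
```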
Chunking strategies: better document segmentation for more precise retrieval (a sliding-window sketch follows this list)
Fixed-size Chunks
Simple but may split related content
Semantic Chunking
Based on content meaning & structure
Hierarchical Chunking
Multiple levels of granularity
Sliding Window
Overlapping chunks to preserve context
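A minimal sliding-window chunker; the character-based sizes are illustrative, and production systems typically count tokens and respect sentence or section boundaries.

```python
# Fixed-size chunks with overlap, so content that straddles a boundary still
# appears intact in at least one chunk.
def sliding_window_chunks(text: str, chunk_size: int = 800, overlap: int = 200) -> list:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```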
Fusion strategies: better ways to incorporate retrieved information into the final answer
Fusion-in-Decoder
Encodes each retrieved passage (with the query) independently, then lets the decoder attend over all of them jointly
Context Compression
Summarizing or distilling retrieved documents before use (sketched after this list)
Weighted Fusion
Prioritizing more relevant contexts in the final response
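A small context-compression sketch, assuming an `llm` text-completion callable; the instruction wording and the "IRRELEVANT" sentinel are illustrative choices rather than an established API.

```python
# Distill each retrieved passage with respect to the query before prompt assembly,
# so more sources fit into the model's context window.
def compressed_context(query: str, passages: list, llm, max_sentences: int = 2) -> str:
    summaries = [
        llm(
            f"In at most {max_sentences} sentences, extract only the parts of this "
            f"passage that help answer '{query}'. If nothing is relevant, reply 'IRRELEVANT'.\n\n"
            f"Passage:\n{p}"
        )
        for p in passages
    ]
    return "\n\n".join(s for s in summaries if s.strip() != "IRRELEVANT")
```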
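A minimal RAG chain built with LangChain's runnable (LCEL) syntax is shown below; it assumes `documents` already holds loaded and chunked documents and that an OpenAI API key is configured in the environment.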
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough

# 1. Create vector store from pre-loaded, chunked documents
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# 2. Create retriever
retriever = vectorstore.as_retriever()

# 3. Define RAG prompt template
template = """Answer based on the following context:
{context}

Question: {question}
Answer: """

# Helper: join retrieved documents into a single context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# 4. Create chain that combines retrieval and generation
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | PromptTemplate.from_template(template)
    | OpenAI()
)

# 5. Ask a question
answer = rag_chain.invoke("What does RAG add over a plain LLM?")
Enterprise Search
Connecting LLMs to internal documents, wikis, and knowledge bases
Examples: Perplexity AI, GitHub Copilot for Business
Legal Contract Analysis
RAG systems that connect to case law and precedent databases
Examples: Harvey AI, Casetext CoCounsel
Medical Decision Support
Systems that retrieve from medical literature and patient records
Examples: Mayo Clinic AI, Nabla Copilot