Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant information from external knowledge sources and incorporating it into the generation process.
- Gives LLMs access to information beyond their training cutoff date
- Reduces hallucinations by grounding responses in retrieved, verifiable sources
- Tailors general-purpose LLMs to specific domains without fine-tuning
- Enables citation and verification of information sources
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Information updates | Dynamic & immediate | Requires retraining |
| Implementation | Simpler architecture | Complex training process |
| Resource requirements | Lower compute needs | High compute & data needs |
| Data transparency | Clear provenance | Black-box knowledge |
| Best for | Knowledge-intensive tasks | Style & capability adaptation |
1. User Query
2. Retrieval System: finds relevant documents
3. Context Augmentation: enhances the prompt with retrieved information
4. LLM Generation: creates a response using the augmented context
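A minimal, framework-free sketch of this flow is shown below; `retrieve` and `generate` are hypothetical placeholders for a vector-store search and an LLM completion call, not a specific library's API.

```python
# Minimal RAG flow mirroring the four steps above.
# `retrieve(query, k)` is assumed to return a list of text passages;
# `generate(prompt)` is assumed to return the LLM's completion as a string.
def rag_answer(query: str, retrieve, generate, k: int = 4) -> str:
    passages = retrieve(query, k=k)                     # 2. Retrieval System
    context = "\n\n".join(passages)                     # 3. Context Augmentation
    prompt = (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)                             # 4. LLM Generation
```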
Indexing: processes and organizes knowledge sources for efficient retrieval
Retrieval strategies: methods for finding the most relevant context
Dense Retrieval
Uses semantic similarity between query and document embeddings
Sparse Retrieval
Keyword-based methods like BM25, TF-IDF
Hybrid Retrieval
Combines dense and sparse signals for better recall and precision (see the sketch after this list)
Re-ranking
Re-scores the initial candidate set with a stronger model to surface the most relevant passages
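A minimal sketch of hybrid retrieval under some explicit assumptions: `sparse_score` (e.g. BM25) and `dense_score` (e.g. embedding cosine similarity) are hypothetical helpers, and min-max normalization with an `alpha` blend is one common weighting scheme among several.

```python
from typing import Callable, Dict, List

def hybrid_scores(
    query: str,
    doc_ids: List[str],
    sparse_score: Callable[[str, str], float],  # assumed keyword scorer, e.g. BM25
    dense_score: Callable[[str, str], float],   # assumed semantic scorer, e.g. cosine similarity
    alpha: float = 0.5,                         # weight between sparse and dense signals
) -> Dict[str, float]:
    """Blend normalized sparse and dense scores into a single ranking score per document."""
    sparse = {d: sparse_score(query, d) for d in doc_ids}
    dense = {d: dense_score(query, d) for d in doc_ids}

    def minmax(scores: Dict[str, float]) -> Dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    sparse_n, dense_n = minmax(sparse), minmax(dense)
    return {d: alpha * sparse_n[d] + (1 - alpha) * dense_n[d] for d in doc_ids}
```

Documents are then ranked by the blended score, and the top results can be passed to a re-ranker for a final ordering.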
Context augmentation: how retrieved information is used in generation
Prompt Engineering
Structuring retrieved content effectively in the prompt (a prompt-assembly sketch follows this list)
Contextual Relevance
Ensuring retrieved information actually addresses the query
Citation & Attribution
Tracking sources through the generation process
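An illustrative prompt-assembly helper that numbers each retrieved passage so the model can cite sources as [1], [2], and so on; the passage dictionary shape and the instruction wording are assumptions for the sketch.

```python
# Build a prompt that keeps source attribution visible to the model.
# Each passage is assumed to be a dict like {"source": "...", "text": "..."}.
def build_prompt(question: str, passages: list) -> str:
    context_lines = [
        f"[{i}] ({p['source']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    ]
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the passages you relied on as [n].\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}\nAnswer:"
    )
```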
Query transformation: improving query effectiveness before retrieval
Query Expansion
Adding related terms to improve recall
Hypothetical Document Embeddings (HyDE)
Using an LLM to generate a hypothetical answer, then embedding that answer for retrieval (sketched after this list)
Query Decomposition
Breaking complex queries into simpler sub-queries
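A compact HyDE sketch; `llm`, `embed`, and `vector_search` are assumed placeholders for an LLM call, an embedding function, and a nearest-neighbor search over document embeddings.

```python
# HyDE: embed a hypothetical answer instead of the raw query, since a plausible
# answer usually lies closer to relevant documents in embedding space.
def hyde_retrieve(query: str, llm, embed, vector_search, k: int = 4):
    hypothetical = llm(f"Write a short passage that plausibly answers: {query}")
    return vector_search(embed(hypothetical), k=k)
```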
Iterative retrieval: approaches that refine context over multiple retrieval rounds (a generic loop is sketched after this list)
Self-RAG
The LLM critiques its own retrievals and generations and decides when to retrieve again
FLARE (Forward-Looking Active REtrieval)
Dynamically retrieves information during generation
ReAct
Interleaving reasoning and retrieval actions
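These methods differ in their details, but they share a retrieve-reason loop. The sketch below shows only that shared control flow under stated assumptions (`llm` and `retrieve` are placeholder callables, and the "SEARCH:" convention is invented for illustration); it is not a faithful reimplementation of Self-RAG, FLARE, or ReAct.

```python
# Generic iterative retrieval: retrieve, draft, and let the model request more
# context until it is satisfied or a round limit is reached.
def iterative_answer(query: str, llm, retrieve, max_rounds: int = 3) -> str:
    context, search_query = [], query
    for _ in range(max_rounds):
        context.extend(retrieve(search_query))
        draft = llm(
            "Context:\n" + "\n".join(context)
            + f"\n\nQuestion: {query}\n"
            "If the context is sufficient, answer directly. Otherwise reply "
            "exactly 'SEARCH: <better search query>'."
        )
        if not draft.startswith("SEARCH:"):
            return draft
        search_query = draft[len("SEARCH:"):].strip()
    # Fall back to answering with whatever context was gathered.
    return llm("Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:")
```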
Chunking strategies: better document segmentation for more precise retrieval (a sliding-window sketch follows this list)
Fixed-size Chunks
Simple but may split related content
Semantic Chunking
Based on content meaning & structure
Hierarchical Chunking
Multiple levels of granularity
Sliding Window
Overlapping chunks to preserve context
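A minimal sliding-window chunker; the character-based sizes are illustrative, and production systems typically count tokens and respect sentence or section boundaries.

```python
# Fixed-size chunks with overlap, so content that straddles a boundary still
# appears intact in at least one chunk.
def sliding_window_chunks(text: str, chunk_size: int = 800, overlap: int = 200) -> list:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```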
Fusion strategies: better ways to incorporate retrieved information into the final answer
Fusion-in-Decoder
Encodes each retrieved passage (with the query) independently, then lets the decoder attend over all of them jointly
Context Compression
Summarizing or distilling retrieved documents before use (sketched after this list)
Weighted Fusion
Prioritizing more relevant contexts in the final response
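A small context-compression sketch, assuming an `llm` text-completion callable; the instruction wording and the "IRRELEVANT" sentinel are illustrative choices rather than an established API.

```python
# Distill each retrieved passage with respect to the query before prompt assembly,
# so more sources fit into the model's context window.
def compressed_context(query: str, passages: list, llm, max_sentences: int = 2) -> str:
    summaries = [
        llm(
            f"In at most {max_sentences} sentences, extract only the parts of this "
            f"passage that help answer '{query}'. If nothing is relevant, reply 'IRRELEVANT'.\n\n"
            f"Passage:\n{p}"
        )
        for p in passages
    ]
    return "\n\n".join(s for s in summaries if s.strip() != "IRRELEVANT")
```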
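A minimal RAG chain built with LangChain's runnable (LCEL) syntax is shown below; it assumes `documents` already holds loaded and chunked documents and that an OpenAI API key is configured in the environment.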
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough

# 1. Create vector store from pre-loaded, chunked documents
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)

# 2. Create retriever
retriever = vectorstore.as_retriever()

# 3. Define RAG prompt template
template = """Answer based on the following context:
{context}

Question: {question}
Answer: """

# Helper: join retrieved documents into a single context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# 4. Create chain that combines retrieval and generation
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | PromptTemplate.from_template(template)
    | OpenAI()
)

# 5. Ask a question
answer = rag_chain.invoke("What does RAG add over a plain LLM?")
Enterprise Search
Connecting LLMs to internal documents, wikis, and knowledge bases
Examples: Perplexity AI, GitHub Copilot for Business
Legal Contract Analysis
RAG systems that connect to case law and precedent databases
Examples: Harvey AI, Casetext CoCounsel
Medical Decision Support
Systems that retrieve from medical literature and patient records
Examples: Mayo Clinic AI, Nabla Copilot