Skip to content

Latest commit

Β 

History

History
399 lines (300 loc) Β· 8.82 KB

File metadata and controls

399 lines (300 loc) Β· 8.82 KB

The Complete RAG Implementation Guide πŸ“–

From Zero to Production in 7 Steps


🎯 What You'll Build

A production-ready Retrieval-Augmented Generation (RAG) system that:

  • Costs 10-20x less than pure LLM context
  • Handles millions of documents
  • Responds in <2 seconds
  • Scales to thousands of users

Step 1: Document Processing & Chunking

The Chunking Problem

Bad: Split every 512 tokens β†’ Loses context Good: Semantic chunking β†’ Keeps meaning intact

Optimal Chunking Strategy

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smart chunking with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,          # Optimal for most use cases
    chunk_overlap=200,        # Preserve context between chunks
    separators=["\n\n", "\n", ". ", " ", ""]  # Semantic boundaries
)

chunks = splitter.split_documents(documents)

Chunk Size Guide

Document Type Chunk Size Overlap
Technical docs 800-1200 200
Code files 1500-2000 300
Chat messages 300-500 50
Books/Articles 1000-1500 250

Step 2: Embedding Model Selection

Top Models (December 2024)

1. OpenAI text-embedding-3-large

  • Dimensions: 3072
  • Cost: $0.13/1M tokens
  • Best for: General purpose, high accuracy

2. Cohere embed-english-v3

  • Dimensions: 1024
  • Cost: $0.10/1M tokens
  • Best for: English content, speed

3. Sentence Transformers (Free!)

  • Model: all-MiniLM-L6-v2
  • Dimensions: 384
  • Best for: Self-hosted, budget projects

Implementation

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    dimensions=1536  # Can reduce for speed
)

# Batch process for efficiency
vectors = embeddings.embed_documents(chunks)

Step 3: Vector Database Setup

Database Comparison

Database Best For Pricing Max Vectors
Pinecone Production, managed $70/mo Billions
Weaviate Self-hosted, flexible Free Millions
ChromaDB Local dev, prototyping Free Thousands
Qdrant High performance Free tier Millions

Pinecone Setup (Recommended)

import pinecone

# Initialize
pinecone.init(
    api_key="your-key",
    environment="us-west1-gcp"
)

# Create index
index = pinecone.Index("my-rag-index")

# Upsert vectors
index.upsert(vectors=[
    ("id1", embedding1, {"text": chunk1, "metadata": {...}}),
    ("id2", embedding2, {"text": chunk2, "metadata": {...}})
])

Step 4: Retrieval Strategy

Hybrid Search (Best Results)

def hybrid_search(query, alpha=0.7):
    # 1. Dense vector search
    vector_results = vector_db.search(
        query_embedding=embed(query),
        top_k=20
    )

    # 2. Keyword search (BM25)
    keyword_results = bm25_search(query, top_k=20)

    # 3. Combine with weight
    final_results = combine_results(
        vector_results,
        keyword_results,
        alpha=alpha  # 0.7 = 70% vector, 30% keyword
    )

    return final_results[:5]  # Top 5

Re-ranking (10-15% accuracy boost)

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Re-rank retrieved chunks
scores = reranker.predict([
    (query, chunk.text) for chunk in retrieved_chunks
])

reranked_chunks = sorted(
    zip(chunks, scores),
    key=lambda x: x[1],
    reverse=True
)[:3]  # Top 3 after re-ranking

Step 5: LLM Integration

Prompt Template

from langchain.prompts import PromptTemplate

template = """
You are a helpful AI assistant. Use the context below to answer the question.
If you don't know, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer: Let me help you with that.
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

Full RAG Chain

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="gpt-4-turbo"),
    retriever=vector_db.as_retriever(search_kwargs={"k": 5}),
    chain_type="stuff",  # Or "map_reduce" for long contexts
    return_source_documents=True
)

result = qa_chain({"query": "How do I implement authentication?"})
print(result['result'])
print(result['source_documents'])  # Show sources

Step 6: Optimization & Caching

Response Caching

from functools import lru_cache
import hashlib

@lru_cache(maxsize=1000)
def cached_rag(query_hash):
    # Expensive RAG operation
    return rag_chain(query)

# Use it
query_hash = hashlib.md5(query.encode()).hexdigest()
response = cached_rag(query_hash)

Embedding Caching

# Save embeddings to avoid re-computing
import pickle

# Save
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

# Load
with open('embeddings.pkl', 'rb') as f:
    cached_embeddings = pickle.load(f)

Step 7: Monitoring & Evaluation

Key Metrics

def evaluate_rag_quality(test_set):
    metrics = {
        'retrieval_accuracy': 0,  # Are right docs retrieved?
        'answer_relevance': 0,     # Is answer on-topic?
        'faithfulness': 0,         # Does answer match context?
        'latency': 0               # Response time
    }

    for question, expected in test_set:
        # Measure retrieval
        retrieved = retriever.get_relevant_documents(question)
        metrics['retrieval_accuracy'] += precision_at_k(retrieved, expected)

        # Measure answer quality
        answer = rag_chain(question)
        metrics['answer_relevance'] += cosine_similarity(answer, question)

    return {k: v/len(test_set) for k, v in metrics.items()}

Production Monitoring

import logging

logger = logging.getLogger('rag_system')

def monitored_rag(query):
    start = time.time()

    try:
        result = rag_chain(query)

        # Log metrics
        logger.info({
            'query': query,
            'latency': time.time() - start,
            'chunks_retrieved': len(result['source_documents']),
            'status': 'success'
        })

        return result
    except Exception as e:
        logger.error(f"RAG failed: {e}")
        return fallback_response(query)

πŸš€ Quick Start Template

# Complete RAG in 50 lines

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import pinecone

# 1. Load docs
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 3. Embed
embeddings = OpenAIEmbeddings()

# 4. Store
pinecone.init(api_key="your-key", environment="us-west1-gcp")
vectordb = Pinecone.from_documents(chunks, embeddings, index_name="my-index")

# 5. Create chain
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(model="gpt-4-turbo"),
    retriever=vectordb.as_retriever(),
    return_source_documents=True
)

# 6. Query
result = qa({"query": "Your question here"})
print(result['result'])

πŸ’° Cost Optimization

Reduce Costs by 10x

1. Smaller Embeddings

# 3072 dims β†’ 1536 dims = 50% cost reduction
embeddings = OpenAIEmbeddings(dimensions=1536)

2. Batch Processing

# Process in batches to reduce API calls
batch_size = 100
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i+batch_size]
    embeddings = embed_batch(batch)

3. Smart Caching

  • Cache embeddings: 90% reduction
  • Cache responses: 80% reduction
  • Use smaller models for simple queries

πŸ”§ Troubleshooting

Problem: Poor Retrieval

Solution:

  1. Increase chunk overlap (200 β†’ 300)
  2. Try hybrid search
  3. Add re-ranking

Problem: Slow Responses

Solution:

  1. Reduce top_k (10 β†’ 5)
  2. Use smaller embeddings
  3. Enable caching

Problem: Hallucinations

Solution:

  1. Add "stay faithful to context" in prompt
  2. Lower LLM temperature (0.7 β†’ 0.3)
  3. Return source documents for verification

πŸ“š Next Steps

  1. Advanced RAG: Multi-hop reasoning, parent-child chunking
  2. Agents: Combine RAG with tool-calling agents
  3. Fine-tuning: Custom embeddings for your domain

🎯 Your RAG System Checklist

  • Documents chunked semantically
  • Embeddings optimized for use case
  • Vector DB indexed and queryable
  • Retrieval tested with eval set
  • LLM integration complete
  • Caching enabled
  • Monitoring in place
  • Cost optimized

Built a RAG system? Share your experience!

Questions? Issues? Let's solve them together.