From Zero to Production in 7 Steps
A production-ready Retrieval-Augmented Generation (RAG) system that:
- Costs 10-20x less than pure LLM context
- Handles millions of documents
- Responds in <2 seconds
- Scales to thousands of users
Bad: Split every 512 tokens β Loses context Good: Semantic chunking β Keeps meaning intact
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Smart chunking with overlap
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Optimal for most use cases
chunk_overlap=200, # Preserve context between chunks
separators=["\n\n", "\n", ". ", " ", ""] # Semantic boundaries
)
chunks = splitter.split_documents(documents)| Document Type | Chunk Size | Overlap |
|---|---|---|
| Technical docs | 800-1200 | 200 |
| Code files | 1500-2000 | 300 |
| Chat messages | 300-500 | 50 |
| Books/Articles | 1000-1500 | 250 |
1. OpenAI text-embedding-3-large
- Dimensions: 3072
- Cost: $0.13/1M tokens
- Best for: General purpose, high accuracy
2. Cohere embed-english-v3
- Dimensions: 1024
- Cost: $0.10/1M tokens
- Best for: English content, speed
3. Sentence Transformers (Free!)
- Model:
all-MiniLM-L6-v2 - Dimensions: 384
- Best for: Self-hosted, budget projects
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(
model="text-embedding-3-large",
dimensions=1536 # Can reduce for speed
)
# Batch process for efficiency
vectors = embeddings.embed_documents(chunks)| Database | Best For | Pricing | Max Vectors |
|---|---|---|---|
| Pinecone | Production, managed | $70/mo | Billions |
| Weaviate | Self-hosted, flexible | Free | Millions |
| ChromaDB | Local dev, prototyping | Free | Thousands |
| Qdrant | High performance | Free tier | Millions |
import pinecone
# Initialize
pinecone.init(
api_key="your-key",
environment="us-west1-gcp"
)
# Create index
index = pinecone.Index("my-rag-index")
# Upsert vectors
index.upsert(vectors=[
("id1", embedding1, {"text": chunk1, "metadata": {...}}),
("id2", embedding2, {"text": chunk2, "metadata": {...}})
])def hybrid_search(query, alpha=0.7):
# 1. Dense vector search
vector_results = vector_db.search(
query_embedding=embed(query),
top_k=20
)
# 2. Keyword search (BM25)
keyword_results = bm25_search(query, top_k=20)
# 3. Combine with weight
final_results = combine_results(
vector_results,
keyword_results,
alpha=alpha # 0.7 = 70% vector, 30% keyword
)
return final_results[:5] # Top 5from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Re-rank retrieved chunks
scores = reranker.predict([
(query, chunk.text) for chunk in retrieved_chunks
])
reranked_chunks = sorted(
zip(chunks, scores),
key=lambda x: x[1],
reverse=True
)[:3] # Top 3 after re-rankingfrom langchain.prompts import PromptTemplate
template = """
You are a helpful AI assistant. Use the context below to answer the question.
If you don't know, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer: Let me help you with that.
"""
prompt = PromptTemplate(
template=template,
input_variables=["context", "question"]
)from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
qa_chain = RetrievalQA.from_chain_type(
llm=OpenAI(model="gpt-4-turbo"),
retriever=vector_db.as_retriever(search_kwargs={"k": 5}),
chain_type="stuff", # Or "map_reduce" for long contexts
return_source_documents=True
)
result = qa_chain({"query": "How do I implement authentication?"})
print(result['result'])
print(result['source_documents']) # Show sourcesfrom functools import lru_cache
import hashlib
@lru_cache(maxsize=1000)
def cached_rag(query_hash):
# Expensive RAG operation
return rag_chain(query)
# Use it
query_hash = hashlib.md5(query.encode()).hexdigest()
response = cached_rag(query_hash)# Save embeddings to avoid re-computing
import pickle
# Save
with open('embeddings.pkl', 'wb') as f:
pickle.dump(embeddings, f)
# Load
with open('embeddings.pkl', 'rb') as f:
cached_embeddings = pickle.load(f)def evaluate_rag_quality(test_set):
metrics = {
'retrieval_accuracy': 0, # Are right docs retrieved?
'answer_relevance': 0, # Is answer on-topic?
'faithfulness': 0, # Does answer match context?
'latency': 0 # Response time
}
for question, expected in test_set:
# Measure retrieval
retrieved = retriever.get_relevant_documents(question)
metrics['retrieval_accuracy'] += precision_at_k(retrieved, expected)
# Measure answer quality
answer = rag_chain(question)
metrics['answer_relevance'] += cosine_similarity(answer, question)
return {k: v/len(test_set) for k, v in metrics.items()}import logging
logger = logging.getLogger('rag_system')
def monitored_rag(query):
start = time.time()
try:
result = rag_chain(query)
# Log metrics
logger.info({
'query': query,
'latency': time.time() - start,
'chunks_retrieved': len(result['source_documents']),
'status': 'success'
})
return result
except Exception as e:
logger.error(f"RAG failed: {e}")
return fallback_response(query)# Complete RAG in 50 lines
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import pinecone
# 1. Load docs
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()
# 2. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
# 3. Embed
embeddings = OpenAIEmbeddings()
# 4. Store
pinecone.init(api_key="your-key", environment="us-west1-gcp")
vectordb = Pinecone.from_documents(chunks, embeddings, index_name="my-index")
# 5. Create chain
qa = RetrievalQA.from_chain_type(
llm=OpenAI(model="gpt-4-turbo"),
retriever=vectordb.as_retriever(),
return_source_documents=True
)
# 6. Query
result = qa({"query": "Your question here"})
print(result['result'])1. Smaller Embeddings
# 3072 dims β 1536 dims = 50% cost reduction
embeddings = OpenAIEmbeddings(dimensions=1536)2. Batch Processing
# Process in batches to reduce API calls
batch_size = 100
for i in range(0, len(chunks), batch_size):
batch = chunks[i:i+batch_size]
embeddings = embed_batch(batch)3. Smart Caching
- Cache embeddings: 90% reduction
- Cache responses: 80% reduction
- Use smaller models for simple queries
Solution:
- Increase chunk overlap (200 β 300)
- Try hybrid search
- Add re-ranking
Solution:
- Reduce top_k (10 β 5)
- Use smaller embeddings
- Enable caching
Solution:
- Add "stay faithful to context" in prompt
- Lower LLM temperature (0.7 β 0.3)
- Return source documents for verification
- Advanced RAG: Multi-hop reasoning, parent-child chunking
- Agents: Combine RAG with tool-calling agents
- Fine-tuning: Custom embeddings for your domain
π― Your RAG System Checklist
- Documents chunked semantically
- Embeddings optimized for use case
- Vector DB indexed and queryable
- Retrieval tested with eval set
- LLM integration complete
- Caching enabled
- Monitoring in place
- Cost optimized
Built a RAG system? Share your experience!
Questions? Issues? Let's solve them together.