# Building Production-Ready RAG Applications with Vector Databases
Most RAG prototypes look impressive in a notebook. Then they hit production and fall apart.
Latency spikes. Retrieval returns irrelevant chunks. Costs balloon when query volume scales. The gap between a working demo and a system you’d trust with real users is wider than most engineering teams expect—and it’s almost never the language model’s fault.
I’ve shipped a few of these systems now, and the pattern is consistent: the LLM is the easy part. What follows covers what that gap actually looks like and how to close it—from architecture decisions to vector database selection, chunking strategy, retrieval tuning, and the monitoring you need to keep things from quietly degrading over time.
## What “Production-Ready” Actually Means for RAG
Before writing a line of code, it helps to be precise about what you’re building toward. A production-ready RAG application isn’t just one that works—it’s one that works predictably, degrades gracefully, and can be reasoned about when something goes wrong.
In practice, that means four things:
- Retrieval quality is measurable. You can run an evaluation suite and know whether a change improved or hurt your system.
- Latency is bounded. You have p95 and p99 numbers, and you’ve designed around them.
- Costs are predictable. You know roughly what a query costs and can model what happens at 10× volume.
- Failures are observable. Bad retrievals, hallucinations, and timeouts surface in your monitoring, not in user complaints.
Most teams skip one or more of these. In my experience, the evaluation piece is the one that bites hardest—teams that skip it often discover months later that their system has been silently degrading as their document corpus evolved. By then, the damage is already done.
## Choosing Your Vector Database
The vector database is the heart of any RAG system, and the choice matters more than most teams assume. It affects not just retrieval performance but operational complexity, cost at scale, and what querying capabilities you get out of the box.
### Managed vs. Self-Hosted
The first decision is whether to run managed infrastructure or self-host. Managed options like Pinecone, Weaviate Cloud, and Zilliz (managed Milvus) remove operational burden in exchange for higher cost and less flexibility. Self-hosting Qdrant, Weaviate, or pgvector gives you more control but puts index management, backups, and scaling on your team.
A reasonable heuristic: if your team doesn’t have dedicated infrastructure engineers and you’re handling fewer than 50 million vectors, start managed. You can migrate later—the API surface for most vector DBs is small enough that the switch isn’t painful.
### Metadata Filtering Is Non-Negotiable
Pure semantic search is almost never sufficient for production RAG. Users ask questions that implicitly require filtering: “What changed in our refund policy last quarter?” or “Find documentation for version 3.2.”
You need a vector database that supports filtered approximate nearest neighbor (ANN) search—not post-filtering, which wastes retrieval budget on irrelevant results. Qdrant, Weaviate, and Pinecone all handle pre-filtering well. pgvector with the halfvec type works at moderate scale and is worth considering if you’re already on Postgres.
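To see concretely why pre-filtering beats post-filtering, here's a deliberately naive brute-force sketch (exact cosine search rather than ANN, and `doc_filter` is a hypothetical predicate — real engines apply the filter inside the index traversal):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def prefilter_search(query, docs, doc_filter, top_k):
    # Restrict the candidate set BEFORE scoring: the full top_k budget
    # is spent on documents that can actually be returned.
    candidates = [d for d in docs if doc_filter(d)]
    candidates.sort(key=lambda d: cosine(query, d["vector"]), reverse=True)
    return candidates[:top_k]

def postfilter_search(query, docs, doc_filter, top_k):
    # Score everything, take top_k, THEN filter: filtered-out hits waste
    # slots, so fewer than top_k relevant results may survive -- sometimes none.
    ranked = sorted(docs, key=lambda d: cosine(query, d["vector"]), reverse=True)
    return [d for d in ranked[:top_k] if doc_filter(d)]
```

With a selective filter (say, `doc_type == "docs"` over a corpus dominated by blog posts), the post-filtered variant can return an empty result set while the pre-filtered one still fills its quota — which is exactly the failure mode you see in production when the filter is applied after retrieval.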
### A Quick Comparison
| Database | Managed Option | Filtered ANN | Hybrid Search | Best For |
|---|---|---|---|---|
| Qdrant | Yes (Cloud) | Yes | Yes | Performance-sensitive, self-hosted |
| Pinecone | Yes (only) | Yes | Yes (sparse+dense) | Fast time-to-production |
| Weaviate | Yes (Cloud) | Yes | Yes | GraphQL access, multi-modal |
| pgvector | No | Partial | No | Existing Postgres stack |
| Milvus/Zilliz | Yes (Zilliz) | Yes | Yes | Very large corpora |
## Designing Your Ingestion Pipeline
The retrieval quality ceiling is set at ingestion time. A language model cannot compensate for poorly chunked, poorly embedded documents—and this is where I see teams cut corners most often.
### Chunking Strategy
Fixed-size chunking (e.g., 512 tokens, 50-token overlap) is a reasonable baseline that works better than people give it credit for—but it breaks down on structured documents like API references, legal contracts, or tables.
For production systems, I’ve found a tiered approach works well:
```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

def chunk_document(text: str, doc_type: str) -> list[str]:
    # Note: chunk_size is measured in characters by default; pass a
    # token-based length_function if you want token-accurate chunks.
    if doc_type == "markdown":
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=600,
            chunk_overlap=80,
            separators=["\n## ", "\n### ", "\n\n", "\n", " "],
        )
    elif doc_type == "code":
        # from_language expects the Language enum, not a bare string.
        splitter = RecursiveCharacterTextSplitter.from_language(
            language=Language.PYTHON,
            chunk_size=400,
            chunk_overlap=40,
        )
    else:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=512,
            chunk_overlap=64,
        )
    return splitter.split_text(text)
```
The key insight: your separator hierarchy should reflect the logical structure of your documents, not arbitrary character counts.
### Metadata Enrichment
Every chunk should carry metadata that enables filtering and improves answer generation. At minimum: source URL or document ID, creation/update timestamp, section title, and document type. Don’t skip the timestamp—it’s what lets you handle “latest” queries correctly and prune stale content.
```python
def build_document(chunk: str, source_meta: dict) -> dict:
    return {
        "text": chunk,
        "metadata": {
            "source_id": source_meta["id"],
            "url": source_meta["url"],
            "section": source_meta.get("section", ""),
            "doc_type": source_meta["type"],
            "updated_at": source_meta["updated_at"],  # Unix timestamp
        },
    }
```
### Embedding Model Selection
text-embedding-3-small (OpenAI) and embed-english-v3.0 (Cohere) are both solid general-purpose choices. If you have domain-specific text—legal, medical, or code-heavy—fine-tuned embeddings are worth evaluating. The tradeoff is operational complexity: you now own the embedding model deployment.
One thing that bites teams hard: embedding model version lock-in. If you update your embedding model, you need to re-embed your entire corpus. Build this into your ingestion pipeline from day one, not as an afterthought. (I’ve seen teams treat this as a quick weekend job and end up with a multi-week re-indexing project.)
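One lightweight way to build that in: stamp every stored chunk with the embedding model version used at ingestion, then select stale chunks when the version changes. A minimal sketch (the `embed_version` field and version string format are assumptions, not a standard):

```python
# Bump this identifier whenever you swap or upgrade the embedding model.
EMBED_MODEL_VERSION = "text-embedding-3-small@2024-01"

def chunks_needing_reembed(stored_chunks: list[dict], current_version: str) -> list[dict]:
    """Select chunks whose stored vectors came from a different embedding model.

    Each stored chunk is assumed to carry an 'embed_version' field written at
    ingestion time. Anything that doesn't match must be re-embedded, because
    vectors produced by different models live in incompatible spaces.
    """
    return [c for c in stored_chunks if c.get("embed_version") != current_version]
```

Run this as a batch job after a version bump and you get incremental re-embedding for free, instead of a panicked full re-index later.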
## Building the Retrieval Layer
Retrieval is where most of the engineering work lives. Semantic similarity alone—firing a single nearest-neighbor query and hoping for the best—fails in predictable ways. Honestly, this is the part where most RAG systems I’ve reviewed fall down, and it’s almost always fixable once you know where to look.
### Hybrid Search
Combining dense vector search with sparse BM25-style keyword search consistently outperforms either alone, especially for exact-match queries (product names, error codes, proper nouns). Pinecone’s sparse-dense hybrid, Weaviate’s BM25+vector fusion, and Qdrant’s sparse vectors all support this pattern.
A simple fusion approach using Reciprocal Rank Fusion (RRF):
```python
def reciprocal_rank_fusion(
    dense_results: list[dict],
    sparse_results: list[dict],
    k: int = 60,
) -> list[dict]:
    scores: dict[str, float] = {}
    doc_map: dict[str, dict] = {}
    for results in (dense_results, sparse_results):
        for rank, doc in enumerate(results):
            doc_id = doc["id"]
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
            doc_map[doc_id] = doc
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_map[doc_id] for doc_id, _ in ranked]
```
### Query Expansion and Rewriting
User queries are often underspecified. “How do I cancel?” could mean cancel a subscription, cancel an order, or cancel a payment. A lightweight query rewriting step—using a fast model like claude-haiku-4-5 or gpt-4o-mini—can generate multiple query variants and retrieve against all of them before fusion.
```python
async def expand_query(query: str, llm_client) -> list[str]:
    prompt = f"""Generate 3 alternative phrasings of this query for document retrieval.
Return only the alternatives, one per line, no numbering.
Query: {query}"""
    response = await llm_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    alternatives = response.content[0].text.strip().split("\n")
    return [query] + [alt.strip() for alt in alternatives if alt.strip()]
```
The latency cost here is real. Run query expansion and vector retrieval concurrently, not sequentially.
### Reranking
After retrieving your top-k candidates (typically 20–50), a cross-encoder reranker substantially improves the final context quality. Cohere’s Rerank API and cross-encoder models via sentence-transformers both work well. Pass only the top 5–8 reranked results to your generation model.
This two-stage retrieval pattern—broad recall, then precise ranking—is standard in search systems and translates directly to RAG.
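The second stage is simple to wire up once you treat the reranker as a pluggable scoring function. A sketch of the pattern (the `score_fn` callable is an assumption — in practice it would wrap Cohere's Rerank API or a sentence-transformers cross-encoder):

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    candidates: list[dict],                  # output of the broad ANN/hybrid recall stage
    score_fn: Callable[[str, str], float],   # cross-encoder: (query, chunk text) -> relevance
    final_k: int = 8,
) -> list[dict]:
    # Stage 2: rescore each candidate jointly with the query. A cross-encoder
    # reads query and chunk together, so it's far more precise than bi-encoder
    # similarity -- and far too slow to run over the whole corpus, which is
    # why it only ever sees the 20-50 candidates from stage 1.
    rescored = sorted(
        candidates,
        key=lambda d: score_fn(query, d["text"]),
        reverse=True,
    )
    return rescored[:final_k]
```

Because the reranker is just a callable, you can swap implementations (hosted API, local model) without touching the retrieval pipeline around it.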
## Latency, Cost, and Reliability at Scale
Getting from “works in testing” to “works under load” requires treating the system as distributed infrastructure, not just an ML pipeline.
### Caching Strategy
Not all queries are unique. Semantic caching—using embeddings to identify near-duplicate queries and serving cached responses—can cut both latency and cost dramatically for FAQ-style workloads. GPTCache and Redis with vector search both support this pattern.
Be conservative with cache TTLs. A cached answer about your pricing that’s 24 hours stale will cause support tickets.
```python
import time
from typing import Optional

class SemanticCache:
    def __init__(self, vector_db, threshold: float = 0.95, ttl_seconds: float = 3600.0):
        self.vector_db = vector_db
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds

    async def get(self, query_embedding: list[float]) -> Optional[str]:
        results = await self.vector_db.search(
            collection="query_cache",
            vector=query_embedding,
            limit=1,
        )
        if not results or results[0].score < self.threshold:
            return None
        payload = results[0].payload
        # Treat expired entries as misses so stale answers never ship.
        if time.time() - payload["cached_at"] > self.ttl_seconds:
            return None
        return payload["cached_response"]

    async def set(
        self,
        query_embedding: list[float],
        response: str,
        query_hash: str,
    ) -> None:
        await self.vector_db.upsert(
            collection="query_cache",
            points=[{
                "id": query_hash,
                "vector": query_embedding,
                "payload": {
                    "cached_response": response,
                    "cached_at": time.time(),
                },
            }],
        )
```
### Async and Parallel Execution
The retrieval pipeline has multiple independent operations: embedding the query, running dense search, running sparse search, and (optionally) expanding the query. These should run concurrently.
```python
import asyncio

async def retrieve(query: str, filters: dict) -> list[dict]:
    # Embed the query and generate expansions concurrently.
    query_embedding, expanded_queries = await asyncio.gather(
        embed(query),
        expand_query(query, llm_client),
    )
    # Dense and sparse searches are independent -- run them in parallel too.
    # (Expanded queries would fan out additional searches here the same way.)
    dense_results, sparse_results = await asyncio.gather(
        dense_search(query_embedding, filters, top_k=20),
        sparse_search(query, filters, top_k=20),
    )
    fused = reciprocal_rank_fusion(dense_results, sparse_results)
    reranked = await rerank(query, fused[:50])
    return reranked[:8]
```
A well-parallelized retrieval pipeline can hit p95 latencies under 400ms even with expansion and reranking. A sequential one doing the same operations can easily run 2–3 seconds. That difference is the gap between a system users tolerate and one they quietly abandon.
### Circuit Breakers and Fallbacks
When your vector database is slow or unavailable, your entire application shouldn’t fail. Implement circuit breakers around retrieval with a graceful fallback—either a full-text search fallback or a response that acknowledges the context gap rather than hallucinating.
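A minimal circuit-breaker sketch, assuming a consecutive-failure policy (the thresholds and the half-open probe behavior are illustrative defaults, not a prescription):

```python
import time

class CircuitBreaker:
    """Trip after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker tripped

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Half-open: let one request through to probe recovery.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

In the request path: check `allow()` before hitting the vector database; on `False`, route to the full-text fallback instead of waiting out another timeout.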
## Evaluation and Monitoring
You cannot improve what you cannot measure. This is where most early-stage RAG projects stall: teams optimize by feel rather than by data.
### Offline Evaluation
Before you ship changes, run them against a held-out evaluation set. You need three things: a set of test questions, ground-truth answers or reference documents, and metrics.
Useful metrics for RAG evaluation:
- Context precision: Of the retrieved chunks, what fraction were actually relevant?
- Context recall: Were all relevant documents retrieved?
- Answer faithfulness: Does the generated answer stay grounded in the retrieved context?
- Answer relevance: Does the answer address the question?
RAGAS automates most of this. It’s not perfect—I’ve had to supplement it with manual review on edge cases—but it gives you a consistent signal you can trend over time.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=eval_dataset,  # HuggingFace Dataset with question/answer/contexts
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)
```
Run this in CI. A retrieval change that tanks context precision should fail the pipeline, not get deployed.
### Production Monitoring
Offline evaluation covers known failure modes. Production monitoring catches the ones you didn’t anticipate.
At minimum, log: query text (sanitized), retrieved chunk IDs and scores, final answer, latency breakdown by stage, and any explicit user feedback signals (thumbs up/down, follow-up questions).
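One way to structure that per-query record (every field name here is illustrative; shape it to match whatever tracing backend you use):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryTrace:
    query: str                    # sanitized before logging
    retrieved_ids: list[str]      # chunk IDs, in rank order
    retrieval_scores: list[float] # parallel to retrieved_ids
    answer: str
    latency_ms: dict[str, float]  # per stage: embed, search, rerank, generate
    feedback: Optional[str] = None  # "up" / "down" / None

    def total_latency_ms(self) -> float:
        return sum(self.latency_ms.values())
```

Keeping the latency breakdown per stage (rather than one end-to-end number) is what makes the p95 regressions from the previous section diagnosable.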
Feed explicit negative feedback back into your evaluation set. The questions that confuse your system in production are far more valuable than synthetic test cases.
Tools worth evaluating: Langfuse and Arize Phoenix both offer RAG-specific tracing that shows retrieval quality alongside generation metrics—substantially more useful than generic LLM observability.
## Common Pitfalls and How to Avoid Them
A few issues come up repeatedly when teams move from prototype to scale:
**Corpus drift.** Your documents change, but your embeddings don’t. Build incremental re-indexing into your ingestion pipeline. Track document checksums and re-embed on content change.
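The checksum check is a few lines (the storage of `stored_checksum` alongside chunk metadata is an assumption about your pipeline):

```python
import hashlib
from typing import Optional, Tuple

def needs_reembed(doc_text: str, stored_checksum: Optional[str]) -> Tuple[bool, str]:
    """Compare the current content hash against the one stored at last ingestion.

    Returns (changed, new_checksum). Persist the checksum with the chunk
    metadata so unchanged documents are skipped on the next ingestion run.
    """
    checksum = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()
    return checksum != stored_checksum, checksum
```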
**Context window mismanagement.** Stuffing 20 retrieved chunks into a 4k-token context window wastes money and hurts answer quality. Be deliberate about how many chunks you pass, and measure whether more context actually helps for your use case.
**Ignoring chunk boundaries.** This one catches almost everyone. Your retrieval returns a chunk that starts mid-sentence or cuts off a critical detail. Implement “context stitching”—when you retrieve a chunk, also fetch its immediate neighbors and merge them before sending to the model.
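A character-level approximation of context stitching, assuming each chunk’s metadata records its position in the source document so neighbors are retrievable (the `overlap` trim is a rough dedup for chunks created with overlapping windows):

```python
def stitch_context(hit_index: int, chunks: list[str], overlap: int = 0) -> str:
    """Merge a retrieved chunk with its immediate neighbors.

    `chunks` is the ordered list of chunks for the source document, and
    `hit_index` is the position of the retrieved chunk. Trimming `overlap`
    characters from the start of each subsequent chunk avoids duplicating
    the shared region when chunks were created with overlap.
    """
    start = max(0, hit_index - 1)
    end = min(len(chunks), hit_index + 2)
    pieces = []
    for i in range(start, end):
        piece = chunks[i]
        if overlap and i > start:
            piece = piece[overlap:]  # drop the re-duplicated overlap region
        pieces.append(piece)
    return "".join(pieces)
```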
**Single-embedding assumption.** Some documents benefit from multiple embeddings: a summary embedding for high-level retrieval and a detail embedding for precise lookup. This adds complexity but significantly improves retrieval for long-form documents.
## Wrapping Up
The language model is the easy part. The work is in the retrieval infrastructure: how you chunk, embed, store, filter, retrieve, and rank—and how you measure all of it.
Start with a clear evaluation baseline before you touch anything else. Without it, you’re optimizing blind. Then instrument your production system so you’re learning from real traffic, not just from the scenarios you thought to test.
The teams shipping reliable RAG at scale aren’t using fundamentally different technology. They’re just being more rigorous about the engineering around it.
If you found this useful, the next step is setting up your evaluation baseline. Pick 50 representative questions from your target use case, build a simple RAGAS pipeline against them, and run it before your next retrieval change. That single practice will do more for your system quality than any other optimization.
Have questions about a specific part of your RAG stack? Drop them in the comments—retrieval tuning and embedding model selection are topics worth going deeper on.