RAG in the Wild: What I Learned After Two Weeks of Chunking Experiments

Three months ago I shipped a RAG pipeline that I was genuinely proud of. Semantic search over our internal docs, OpenAI embeddings, Pinecone on the backend. It felt modern. Then someone on our team asked it “what’s our parental leave policy?” and it returned a confident three-paragraph answer that was completely fabricated — stitched together from an old HR doc, a Confluence page about PTO, and what I can only assume was vibes.

That was my wake-up call. The embedding model wasn’t broken. The vector DB wasn’t broken. The retrieval step — the part I had basically copy-pasted from a tutorial and moved on — was the problem. I spent the next two weeks obsessively fixing it, and this is what I found.


Your Chunk Size Is Probably Wrong (Mine Was)

Most tutorials tell you to chunk at 512 tokens and call it a day. I did that. It worked okay for short factual lookups but fell apart the moment a question required synthesizing information across a longer document — like, say, a policy that spans three sections with cross-references.

The tension: small chunks improve retrieval precision (the relevant sentence actually makes it into the top-k results) but hurt answer quality because you’ve stripped the context that makes that sentence meaningful. Large chunks keep the context but hurt precision — suddenly your top result is a 1,200-token blob where the relevant info is buried in the middle.

I ran a controlled experiment on our documentation corpus (~800 documents, mix of Markdown and PDFs). Three strategies:

Fixed-size chunking at 512 tokens with 50-token overlap. Baseline. Easy to implement, predictable performance. Also where I started.
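
For reference, fixed-size chunking with overlap needs no framework at all; a minimal sketch operating on a pre-tokenized list (a whitespace split or any real tokenizer's output works):

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into fixed-size chunks, each overlapping
    the previous chunk by `overlap` tokens."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```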

Semantic chunking — splitting on sentence boundaries and then grouping sentences until you hit a semantic shift, measured by cosine distance between consecutive sentence embeddings. I used LangChain's SemanticChunker (in langchain_experimental as of v0.2.x). This produced chunks ranging from 80 to 600 tokens depending on document structure.
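
The core mechanism is simple enough to sketch without the library; a toy version, assuming an embed() callable that maps a sentence to a vector (the 0.3 threshold is an arbitrary choice you would tune, and the real SemanticChunker adds percentile-based thresholds on top of this basic idea):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def semantic_chunks(sentences, embed, threshold=0.3):
    """Group consecutive sentences; start a new chunk when the next
    sentence's embedding drifts past the distance threshold."""
    if not sentences:
        return []
    embeddings = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```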

Hierarchical / parent-document retrieval — store small chunks for retrieval, but when a chunk is retrieved, return its larger parent chunk to the LLM. This is the one that actually moved the needle.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Child chunks — what gets embedded and searched
# (note: chunk_size here counts characters, not tokens)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Parent chunks — what the LLM actually sees
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=800)

store = InMemoryStore()

# vectorstore: any initialized LangChain vector store (e.g. Chroma, Qdrant)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add your docs — this indexes child chunks but stores parent chunks
retriever.add_documents(docs)

# At query time, retrieval happens on child embeddings
# but the returned context is the full parent chunk
results = retriever.invoke("what is the parental leave policy?")

On my evaluation set (100 manually labeled QA pairs), parent-document retrieval improved answer accuracy from 61% to 79%. Semantic chunking got me to 68% — better than fixed-size, but a smaller gain than I expected given how much more complex it is to implement.

Don’t overthink chunking at the start. Get a fixed-size baseline at 512 tokens, then try parent-document retrieval before investing in semantic chunking. The complexity-to-benefit ratio on semantic chunking genuinely disappointed me for most real-world corpora.


Picking a Vector Database Without Losing Your Mind

I tested four options: Pinecone, Qdrant, Weaviate, and pgvector. My setup was a single-node deployment for a team of ~30 people — not a million-user product — so take the performance numbers with appropriate context.

Pinecone is genuinely the easiest to get started with. Fully managed, no infrastructure headaches, the Python SDK is clean. The frustration hits when you want to do anything beyond basic ANN search — metadata filtering has gotchas around cardinality, and the pricing model punishes you for storing large metadata payloads. I also hit a subtle bug where filtered queries with high-cardinality string fields returned inconsistent results (this was a known issue in early 2025). For a small internal tool? Fine. For something you’ll iterate on heavily, the control limitations start to chafe.

Qdrant became my favorite after about a week. Open source, runs locally with Docker in 30 seconds, the query API is expressive, and sparse+dense hybrid search is a first-class feature. The Rust core means it’s fast. My one complaint: the documentation has some gaps, and the Python client’s async support felt slightly rough around the edges as of v1.7. But the GitHub issues are responsive and the community is active.

Weaviate has an impressive feature set — built-in BM25, native hybrid search, a GraphQL query interface. Honestly it might be the best choice if you’re building something with complex multi-modal retrieval needs. For a straightforward RAG pipeline it felt like a lot of surface area to learn when I didn’t need most of it.

pgvector is the “use what you have” option, and it’s more viable than people give it credit for. If you’re already on Postgres, pgvector with an HNSW index gets you surprisingly far — we used it for a prototype and latency was acceptable up to a few hundred thousand vectors. Past that, I genuinely don’t know; I didn’t test at that scale.

My call: fully managed and don’t need hybrid search → Pinecone. Want control and hybrid search out of the box → Qdrant. Already on Postgres with a corpus under 500k chunks → pgvector is a legitimate choice, not a fallback.


The Retrieval Step Is Where Most RAG Pipelines Leave Performance on the Table

Basic vector search — embed the query, find the nearest neighbors — is a reasonable starting point. It’s also where people stop, and it shows.

Hybrid search (sparse + dense) made a surprisingly large difference. Dense embeddings capture semantics but struggle with exact keyword matches — product names, error codes, specific version strings. Sparse retrieval (BM25) nails those. Combining them with reciprocal rank fusion (RRF) gives you the best of both.

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Assumes a collection already configured with named "dense" and "sparse" vectors

results = client.query_points(
    collection_name="docs",
    prefetch=[
        # Dense vector search (semantic)
        models.Prefetch(
            query=dense_embedding,  # your query embedding
            using="dense",
            limit=20,
        ),
        # Sparse vector search (BM25-style)
        models.Prefetch(
            query=models.SparseVector(
                indices=sparse_indices,
                values=sparse_values,
            ),
            using="sparse",
            limit=20,
        ),
    ],
    # RRF fusion happens here
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
)
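
Reciprocal rank fusion itself is tiny if you ever need it client-side; a sketch (k=60 is the conventional constant from the original RRF paper, and the inputs are just lists of doc IDs in ranked order):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs: each doc scores the sum
    of 1 / (k + rank) over every list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A doc that ranks decently in both the dense and sparse lists beats one that ranks at the top of only one, which is exactly the behavior you want for keyword-plus-semantics queries.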

One thing I noticed: hybrid search helped most on technical documentation with product-specific terminology. On more conversational or policy-style content, the improvement was modest. If your corpus is dense with jargon or version numbers, it’s worth the implementation overhead.

Reranking is the other lever that moved things significantly. After your initial retrieval (say, top-20 chunks), run a cross-encoder reranker to reorder them before passing to the LLM. The intuition: bi-encoders (what you use for initial retrieval) encode query and document independently for speed. Cross-encoders look at query+document jointly and are much more accurate — just too slow to run at retrieval scale, which is why you do it on the reduced candidate set.
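
The reranking step itself is just score-and-sort; a sketch with a pluggable score_fn (in production, score_fn would wrap a cross-encoder, e.g. sentence-transformers' CrossEncoder.predict over (query, text) pairs; the toy word-overlap scorer in the test is purely illustrative):

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-order retrieved chunks by cross-encoder score and keep top_k.
    score_fn takes a list of (query, text) pairs, returns one score each."""
    scores = score_fn([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```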

I used cross-encoder/ms-marco-MiniLM-L-6-v2 from HuggingFace. It added about 80ms to latency on a CPU for reranking 20 candidates, which was acceptable for us. Cohere’s Rerank API is the managed alternative — I haven’t used it in production but have heard good things from people who have.

The MMR gotcha: I added Maximal Marginal Relevance to reduce redundancy in retrieved chunks, thinking it would help. For some queries it did. But it also filtered out a chunk containing the exact relevant detail because a more general chunk was ranked higher and deemed “too similar.” My recall numbers actually dropped. I ended up disabling MMR and addressing redundancy through chunking strategy instead. Don’t assume it’s free until you’ve tested it on your specific dataset.
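
For context, MMR greedily balances relevance against redundancy with a lambda weight; a minimal sketch over raw vectors (the cosine helper and the lambda values are illustrative, not any library's API). Lowering lam is exactly what causes the failure mode above: a highly relevant near-duplicate gets penalized out in favor of a more diverse but less relevant chunk.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def mmr(query_vec, doc_vecs, k=3, lam=0.5):
    """Greedily pick k doc indices maximizing
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already_selected)."""
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```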

If your corpus is keyword-heavy, implement hybrid search. If retrieval quality still feels off after that, add a reranker — it’s often the single highest-ROI improvement available. Be skeptical of MMR.


Evaluating Whether Any of This Actually Helps

Nobody talks about this enough: you need an eval harness before you start tuning, or you’re flying blind. I built mine with ragas (v0.1.x) and ~100 manually curated QA pairs from our actual documentation.

Four metrics I tracked:
Faithfulness — does the answer stick to what’s in the retrieved context?
Answer relevancy — is the answer actually responsive to the question?
Context precision — are the retrieved chunks relevant?
Context recall — is the relevant information making it into the retrieved chunks at all?

My initial pipeline had fine faithfulness (the LLM wasn’t hallucinating beyond what was in the retrieved docs) but terrible context recall — I was only surfacing the relevant chunk ~60% of the time. That’s why the parental leave answer was wrong: the relevant doc wasn’t making it into the top-5 results. Once I identified that, the fix was obvious — better chunking plus hybrid search to catch “parental leave” as a keyword match.
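
That recall number comes from a check you can write in a dozen lines; a sketch assuming a retrieve(question, k) function returning doc IDs and QA pairs labeled with the gold doc ID (both names are assumptions about your setup, not a ragas API):

```python
def context_recall_at_k(qa_pairs, retrieve, k=5):
    """Fraction of questions whose gold doc ID appears in the
    top-k retrieved IDs."""
    hits = 0
    for question, gold_doc_id in qa_pairs:
        if gold_doc_id in retrieve(question, k):
            hits += 1
    return hits / len(qa_pairs)
```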

Without the eval setup I would have kept tweaking the prompt. That’s the trap.


What I’d Actually Build With Today

Start with pgvector if the team is already on Postgres. It removes an infrastructure dependency and is plenty capable for most internal tools. Once you hit scale issues or need hybrid search badly, migrate to Qdrant — the data migration is not that painful.

For embeddings, text-embedding-3-large from OpenAI (3072 dims) or nomic-embed-text if you want a solid open-source option. I’m not convinced the latest embedding models are worth the cost premium over text-embedding-3-large for most RAG use cases — though I haven’t benchmarked the most recent releases.

Parent-document retrieval over semantic chunking. Simpler to implement, easier to debug, better performance in my tests.

Hybrid search from day one if you control your vector DB choice. BM25 is not dead.

One cross-encoder reranker pass before sending context to the LLM. The latency cost is worth it.

And — build your eval harness before anything else. Even 50 QA pairs is enough to tell you whether your changes are helping or hurting. Without it you’re just iterating on vibes, and I spent two weeks learning that the hard way.
