{"id":160,"date":"2026-03-08T23:09:19","date_gmt":"2026-03-08T23:09:19","guid":{"rendered":"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/rag-deep-dive-chunking-strategies-vector-databases\/"},"modified":"2026-03-18T22:00:07","modified_gmt":"2026-03-18T22:00:07","slug":"rag-deep-dive-chunking-strategies-vector-databases","status":"publish","type":"post","link":"https:\/\/blog.rebalai.com\/en\/2026\/03\/08\/rag-deep-dive-chunking-strategies-vector-databases\/","title":{"rendered":"RAG in the Wild: What I Learned After Two Weeks of Chunking Experiments"},"content":{"rendered":"<p>Three months ago I shipped a RAG pipeline that I was genuinely proud of. Semantic search over our internal docs, OpenAI embeddings, Pinecone on the backend. It felt modern. Then someone on our team asked it &#8220;what&#8217;s our parental leave policy?&#8221; and it returned a confident three-paragraph answer that was completely fabricated \u2014 stitched together from an old HR doc, a Confluence page about PTO, and what I can only assume was vibes.<\/p>\n<p>That was my wake-up call. The embedding model wasn&#8217;t broken. The vector DB wasn&#8217;t broken. The retrieval step \u2014 the part I had basically copy-pasted from a tutorial and moved on \u2014 was the problem. 
I spent the next two weeks obsessively fixing it, and this is what I found.<\/p>\n<hr \/>\n<h2>Your Chunk Size Is Probably Wrong (Mine Was)<\/h2>\n<p>Most tutorials tell you to chunk at 512 tokens and call it a day. I did that. It worked okay for short factual lookups but fell apart the moment a question required synthesizing information across a longer document \u2014 like, say, a policy that spans three sections with cross-references.<\/p>\n<p>The tension: small chunks improve retrieval precision (the relevant sentence actually makes it into the top-k results) but hurt answer quality because you&#8217;ve stripped the context that makes that sentence meaningful. Large chunks keep the context but hurt precision \u2014 suddenly your top result is a 1,200-token blob where the relevant info is buried in the middle.<\/p>\n<p>I ran a controlled experiment on our documentation corpus (~800 documents, mix of Markdown and PDFs). Three strategies:<\/p>\n<p><strong>Fixed-size chunking<\/strong> at 512 tokens with 50-token overlap. Baseline. Easy to implement, predictable performance. Also where I started.<\/p>\n<p><strong>Semantic chunking<\/strong> \u2014 splitting on sentence boundaries and then grouping sentences until you hit a semantic shift, measured by cosine distance between consecutive sentence embeddings. I used <code>langchain<\/code>&#8217;s <code>SemanticChunker<\/code> (LangChain v0.2.x). 
This produced chunks ranging from 80 to 600 tokens depending on document structure.<\/p>\n<p><strong>Hierarchical \/ parent-document retrieval<\/strong> \u2014 store small chunks for retrieval, but when a chunk is retrieved, return its larger parent chunk to the LLM. This is the one that actually moved the needle.<\/p>\n<pre><code class=\"language-python\">from langchain.retrievers import ParentDocumentRetriever\nfrom langchain.storage import InMemoryStore\nfrom langchain_text_splitters import RecursiveCharacterTextSplitter\nfrom langchain_chroma import Chroma\nfrom langchain_openai import OpenAIEmbeddings\n\n# Any LangChain vector store works here; Chroma is just what I had handy\nvectorstore = Chroma(collection_name=&quot;docs&quot;, embedding_function=OpenAIEmbeddings())\n\n# Child chunks \u2014 what gets embedded and searched\nchild_splitter = RecursiveCharacterTextSplitter(chunk_size=200)\n\n# Parent chunks \u2014 what the LLM actually sees\nparent_splitter = RecursiveCharacterTextSplitter(chunk_size=800)\n\nstore = InMemoryStore()\nretriever = ParentDocumentRetriever(\n    vectorstore=vectorstore,\n    docstore=store,\n    child_splitter=child_splitter,\n    parent_splitter=parent_splitter,\n)\n\n# Add your docs (a list of Document objects) \u2014 this indexes child chunks\n# in the vector store but keeps the parent chunks in the docstore\nretriever.add_documents(docs)\n\n# At query time, retrieval happens on child embeddings\n# but the returned context is the full parent chunk\nresults = retriever.invoke(&quot;what is the parental leave policy?&quot;)\n<\/code><\/pre>\n<p>On my evaluation set (100 manually-labeled QA pairs), parent-document retrieval improved answer accuracy from 61% to 79%. 
Semantic chunking got me to 68% \u2014 better than fixed-size, but not as large a gain as I expected given how much more complex it is to implement.<\/p>\n<p>Don&#8217;t overthink chunking at the start. Get a fixed-size baseline at 512 tokens, <em>then<\/em> try parent-document retrieval before investing in semantic chunking. The complexity-to-benefit ratio on semantic chunking genuinely disappointed me for most real-world corpora.<\/p>\n<hr \/>\n<h2>Picking a Vector Database Without Losing Your Mind<\/h2>\n<p>I tested four options: Pinecone, Qdrant, Weaviate, and pgvector. My setup was a single-node deployment for a team of ~30 people \u2014 not a million-user product \u2014 so take the performance numbers with appropriate context.<\/p>\n<p><strong>Pinecone<\/strong> is genuinely the easiest to get started with. Fully managed, no infrastructure headaches, the Python SDK is clean. The frustration hits when you want to do anything beyond basic ANN search \u2014 metadata filtering has gotchas around cardinality, and the pricing model punishes you for storing large metadata payloads. I also hit a subtle bug where filtered queries with high-cardinality string fields returned inconsistent results (this was a known issue in early 2025). For a small internal tool? Fine. For something you&#8217;ll iterate on heavily, the control limitations start to chafe.<\/p>\n<p><strong>Qdrant<\/strong> became my favorite after about a week. 
Open source, runs locally with Docker in 30 seconds, the query API is expressive, and sparse+dense hybrid search is a first-class feature. The Rust core means it&#8217;s fast. My one complaint: the documentation has some gaps, and the Python client&#8217;s async support felt slightly rough around the edges as of v1.7. But the maintainers are responsive on GitHub issues and the community is active.<\/p>\n<p><strong>Weaviate<\/strong> has an impressive feature set \u2014 built-in BM25, native hybrid search, a GraphQL query interface. Honestly it might be the best choice if you&#8217;re building something with complex multi-modal retrieval needs. For a straightforward RAG pipeline it felt like a lot of surface area to learn when I didn&#8217;t need most of it.<\/p>\n<p><strong>pgvector<\/strong> is the &#8220;use what you have&#8221; option, and it&#8217;s more viable than people give it credit for. If you&#8217;re already on Postgres, <code>pgvector<\/code> with an HNSW index gets you surprisingly far \u2014 we used it for a prototype and latency was acceptable up to a few hundred thousand vectors. Past that, I genuinely don&#8217;t know; I didn&#8217;t test at that scale.<\/p>\n<p>My call: fully managed and don&#8217;t need hybrid search \u2192 Pinecone. Want control and hybrid search out of the box \u2192 Qdrant. Already on Postgres with a corpus under 500k chunks \u2192 pgvector is a legitimate choice, not a fallback.<\/p>\n<hr \/>\n<h2>The Retrieval Step Is Where Most RAG Pipelines Leave Performance on the Table<\/h2>\n<p>Basic vector search \u2014 embed the query, find the nearest neighbors \u2014 is a reasonable starting point. 
It&#8217;s also where people stop, and it shows.<\/p>\n<p><strong>Hybrid search (sparse + dense)<\/strong> made a surprisingly large difference. Dense embeddings capture semantics but struggle with exact keyword matches \u2014 product names, error codes, specific version strings. Sparse retrieval (BM25) nails those. Combining them with reciprocal rank fusion (RRF) gives you the best of both.<\/p>\n<pre><code class=\"language-python\">from qdrant_client import QdrantClient, models\n\n# Assuming you've set up a collection with both dense and sparse vectors\n\nresults = client.query_points(\n    collection_name=&quot;docs&quot;,\n    prefetch=[\n        # Dense vector search (semantic)\n        models.Prefetch(\n            query=dense_embedding,  # your query embedding\n            using=&quot;dense&quot;,\n            limit=20,\n        ),\n        # Sparse vector search (BM25-style)\n        models.Prefetch(\n            query=models.SparseVector(\n                indices=sparse_indices,\n                values=sparse_values,\n            ),\n            using=&quot;sparse&quot;,\n            limit=20,\n        ),\n    ],\n    # RRF fusion happens here\n    query=models.FusionQuery(fusion=models.Fusion.RRF),\n    limit=5,\n)\n<\/code><\/pre>\n<p>One thing I noticed: hybrid search helped most on technical documentation with product-specific terminology. On more conversational or policy-style content, the improvement was modest. If your corpus is dense with jargon or version numbers, it&#8217;s worth the implementation overhead.<\/p>\n<p><strong>Reranking<\/strong> is the other lever that moved things significantly. After your initial retrieval (say, top-20 chunks), run a cross-encoder reranker to reorder them before passing to the LLM. 
The intuition: bi-encoders (what you use for initial retrieval) encode query and document independently for speed. Cross-encoders look at query+document jointly and are much more accurate \u2014 just too slow to run at retrieval scale, which is why you do it on the reduced candidate set.<\/p>\n<p>I used <code>cross-encoder\/ms-marco-MiniLM-L-6-v2<\/code> from HuggingFace. It added about 80ms to latency on a CPU for reranking 20 candidates, which was acceptable for us. Cohere&#8217;s Rerank API is the managed alternative \u2014 I haven&#8217;t used it in production but have heard good things from people who have.<\/p>\n<p><strong>The MMR gotcha:<\/strong> I added Maximal Marginal Relevance to reduce redundancy in retrieved chunks, thinking it would help. For some queries it did. But it also filtered out a chunk containing the <em>exact<\/em> relevant detail because a more general chunk was ranked higher and deemed &#8220;too similar.&#8221; My recall numbers actually dropped. I ended up disabling MMR and addressing redundancy through chunking strategy instead. Don&#8217;t assume it&#8217;s free until you&#8217;ve tested it on your specific dataset.<\/p>\n<p>If your corpus is keyword-heavy, implement hybrid search. If retrieval quality still feels off after that, add a reranker \u2014 it&#8217;s often the single highest-ROI improvement available. 
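<\/p>\n<p>For what it&#8217;s worth, the RRF step itself is only a few lines if your database doesn&#8217;t do the fusion for you. Here is a minimal sketch of the standard formulation (the doc IDs are placeholders, and k=60 is the conventional constant, not something I tuned):<\/p>\n<pre><code class=\"language-python\">def rrf_fuse(rankings, k=60, top_n=5):\n    &quot;&quot;&quot;Fuse several ranked lists of doc IDs via reciprocal rank fusion.&quot;&quot;&quot;\n    scores = {}\n    for ranking in rankings:\n        for rank, doc_id in enumerate(ranking, start=1):\n            # Each list contributes 1 \/ (k + rank) for every doc it returned\n            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 \/ (k + rank)\n    return sorted(scores, key=scores.get, reverse=True)[:top_n]\n\n# Hypothetical top results from dense and sparse retrieval\ndense = [&quot;doc_a&quot;, &quot;doc_b&quot;, &quot;doc_c&quot;]\nsparse = [&quot;doc_c&quot;, &quot;doc_a&quot;, &quot;doc_d&quot;]\n\n# doc_a ranks well in both lists, so it comes out of the fusion first\nfused = rrf_fuse([dense, sparse])\n<\/code><\/pre>\n<p>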
Be skeptical of MMR.<\/p>\n<hr \/>\n<h2>Evaluating Whether Any of This Actually Helps<\/h2>\n<p>Nobody talks about this enough: you need an eval harness <em>before<\/em> you start tuning, or you&#8217;re flying blind. I built mine with <code>ragas<\/code> (v0.1.x) and ~100 manually curated QA pairs from our actual documentation.<\/p>\n<p>Four metrics I tracked:<br \/>\n&#8211; <strong>Faithfulness<\/strong> \u2014 does the answer stick to what&#8217;s in the retrieved context?<br \/>\n&#8211; <strong>Answer relevancy<\/strong> \u2014 is the answer actually responsive to the question?<br \/>\n&#8211; <strong>Context precision<\/strong> \u2014 are the retrieved chunks relevant?<br \/>\n&#8211; <strong>Context recall<\/strong> \u2014 is the relevant information making it into the retrieved chunks at all?<\/p>\n<p>My initial pipeline had fine faithfulness (the LLM wasn&#8217;t hallucinating beyond what was in the retrieved docs) but terrible context recall \u2014 I was only surfacing the relevant chunk ~60% of the time. That&#8217;s why the parental leave answer was wrong: the relevant doc wasn&#8217;t making it into the top-5 results. Once I identified that, the fix was obvious \u2014 better chunking plus hybrid search to catch &#8220;parental leave&#8221; as a keyword match.<\/p>\n<p>Without the eval setup I would have kept tweaking the prompt. That&#8217;s the trap.<\/p>\n<hr \/>\n<h2>What I&#8217;d Actually Build With Today<\/h2>\n<p>Start with <code>pgvector<\/code> if the team is already on Postgres. It removes an infrastructure dependency and is plenty capable for most internal tools. Once you hit scale issues or need hybrid search badly, migrate to Qdrant \u2014 the data migration is not that painful.<\/p>\n<p>For embeddings, <code>text-embedding-3-large<\/code> from OpenAI (3072 dims) or <code>nomic-embed-text<\/code> if you want a solid open-source option. 
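<\/p>\n<p>One small mental model that helped me: whichever model you pick, retrieval ultimately reduces to a cosine-similarity comparison over those vectors (and OpenAI&#8217;s v3 embeddings come back unit-normalized, so cosine is effectively just a dot product). A toy illustration, with 3-dim stand-ins for the real 3072-dim vectors:<\/p>\n<pre><code class=\"language-python\">import math\n\ndef cosine(a, b):\n    &quot;&quot;&quot;Cosine similarity between two embedding vectors.&quot;&quot;&quot;\n    dot = sum(x * y for x, y in zip(a, b))\n    norm_a = math.sqrt(sum(x * x for x in a))\n    norm_b = math.sqrt(sum(y * y for y in b))\n    return dot \/ (norm_a * norm_b)\n\n# Toy stand-ins for real 3072-dim embeddings\nquery = [1.0, 0.0, 0.0]\ndoc_close = [0.9, 0.1, 0.0]\ndoc_far = [0.0, 1.0, 0.0]\n\n# The nearer vector scores higher; that ordering is all ANN search ranks on\nassert cosine(query, doc_close) &gt; cosine(query, doc_far)\n<\/code><\/pre>\n<p>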
I&#8217;m not convinced the latest embedding models are worth the cost premium over <code>text-embedding-3-large<\/code> for most RAG use cases \u2014 though I haven&#8217;t benchmarked the most recent releases.<\/p>\n<p>Parent-document retrieval over semantic chunking. Simpler to implement, easier to debug, better performance in my tests.<\/p>\n<p>Hybrid search from day one if you control your vector DB choice. BM25 is not dead.<\/p>\n<p>One cross-encoder reranker pass before sending context to the LLM. The latency cost is worth it.<\/p>\n<p>And \u2014 build your eval harness before anything else. Even 50 QA pairs is enough to tell you whether your changes are helping or hurting. Without it you&#8217;re just iterating on vibes, and I spent two weeks learning that the hard way.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Three months ago I shipped a RAG pipeline that I was genuinely proud 
of.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1],"tags":[],"class_list":["post-160","post","type-post","status-publish","format-standard","hentry","category-general"],"_links":{"self":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/160","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/comments?post=160"}],"version-history":[{"count":12,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/160\/revisions"}],"predecessor-version":[{"id":511,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/posts\/160\/revisions\/511"}],"wp:attachment":[{"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/media?parent=160"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/categories?post=160"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.rebalai.com\/en\/wp-json\/wp\/v2\/tags?post=160"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}