LangChain vs LlamaIndex vs Haystack: What Two Weeks in Production Actually Taught Me

My team got handed a RAG project earlier this year — 40,000 documents, mix of PDFs and Confluence exports, users who would notice if answers were wrong. I’d used LangChain for smaller stuff before, but this was the first time I actually ran all three major frameworks against real data, under real pressure, with a client watching the error rates.

Quick context: four-person eng team, Qdrant running on-prem, Claude as the LLM. The client’s tolerance for hallucinated answers was basically zero. Not a toy project.


LangChain’s Composition Model Is Great Until Something Goes Quietly Wrong

I’ve been using LangChain off and on since early 2023, and by now — v0.3+, LCEL as the standard — it genuinely is good at what it promises. The expression language makes wiring things together fast and readable:

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_qdrant import QdrantVectorStore

# Assumes `vectorstore` is an already-initialized QdrantVectorStore.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 6, "fetch_k": 20}
)

def format_docs(docs):
    # Join retrieved documents into a single context string for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the context below.\n\nContext:\n{context}"),
    ("human", "{question}")
])

# This part is clean. The problem shows up later.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatAnthropic(model="claude-sonnet-4-6", temperature=0)
    | StrOutputParser()
)

response = chain.invoke("What's the refund policy for enterprise contracts?")

That code is clean. I actually like it.

The trouble showed up on day four. A retrieval step was returning empty results for certain query types, but only intermittently — maybe 8% of requests. The chain kept running. Returned a confident, fully hallucinated answer with zero retrieved context, and nothing in the output flagged it. I spent half an afternoon chasing this before realizing LangChain was silently passing an empty string as context to the prompt template.

You can guard against this. Callbacks exist. LangSmith is genuinely useful for tracing if you’re paying for it. But the default behavior when something fails upstream in a chain is to carry on — and for production RAG that’s a real problem I hadn’t budgeted time to solve. I ended up writing a custom runnable that validates retrieval counts before the context hits the prompt. Not hard, but it’s defensive scaffolding you don’t anticipate until it bites you.
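The guard itself is simple. Here's a framework-agnostic sketch of the idea (the `RetrievalError` name and `min_docs` threshold are mine, not LangChain's); in LangChain you'd wrap something like this in a `RunnableLambda` and slot it between the retriever and the context formatter:

```python
class RetrievalError(Exception):
    """Raised when retrieval returns too little context to answer safely."""

def guard_retrieval(docs, min_docs=1):
    # Fail loudly instead of letting an empty context reach the prompt,
    # which is exactly the silent-hallucination path described above.
    if len(docs) < min_docs:
        raise RetrievalError(
            f"retrieved {len(docs)} documents, expected at least {min_docs}"
        )
    return docs
```

The point is not the three lines of logic; it's that the failure now has a name and a stack trace instead of a confident answer with no sources behind it.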

The ecosystem advantage is real, though. When I hit a weird edge case with metadata filtering on Qdrant, there was a GitHub issue with a working fix posted five days earlier. That’s community size, not luck. If you’re integrating anything unusual — a niche vector store, custom document loaders, tool use patterns — LangChain almost certainly has it already.

LangChain is fast to start with, and the integrations will save you. Just write explicit failure guards around your retrieval steps, because the framework won’t.


LlamaIndex’s Node Model Finally Clicked for Me in Week Two

I’ll admit: I bounced off LlamaIndex about eighteen months ago. The “index everything” abstraction felt strange coming from LangChain’s chain-centric thinking, and the docs had this habit of showing four different ways to accomplish something without indicating which was current or preferred.

The v0.12 line is much better. But the real shift was accepting that LlamaIndex thinks in nodes, not documents — each chunk carries metadata forward through the whole pipeline. Once I stopped fighting that model and started working with it, things that had felt awkward suddenly made sense.

What genuinely surprised me — stopped me for a moment, honestly — was the SentenceWindowNodeParser. Found it while looking for something else, almost by accident:

from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.llms.anthropic import Anthropic
from llama_index.postprocessor.cohere_rerank import CohereRerank

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

Settings.llm = Anthropic(model="claude-sonnet-4-6")
Settings.node_parser = node_parser

index = VectorStoreIndex.from_vector_store(qdrant_store)
query_engine = index.as_query_engine(
    similarity_top_k=8,
    node_postprocessors=[
        # Swap each retrieved sentence for its stored window before reranking.
        MetadataReplacementPostProcessor(target_metadata_key="window"),
        CohereRerank(top_n=4),
    ],
)

response = query_engine.query("What changed in the Q3 enterprise pricing tier?")
# response.source_nodes — exact retrieval, no hunting around
print(response.source_nodes[0].metadata)

The SentenceWindowNodeParser stores a small chunk for embedding but retrieves a larger surrounding window at query time. You get the precision of small embeddings with the readability of larger context. I had been implementing something like this manually in LangChain. It worked fine. But this was already built in, already tuned, and it took about three minutes to add to the pipeline.
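The mechanism is easy to reproduce by hand, which is a good way to see why it works. A toy version (names and structure are mine, not LlamaIndex's internals):

```python
def sentence_window_nodes(sentences, window_size=3):
    """Pair each sentence with its surrounding window of neighbors.

    The sentence itself is what gets embedded; the window is stored
    alongside it as metadata and swapped in at query time.
    """
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sent,                          # embedded: small, precise
            "window": " ".join(sentences[lo:hi]),  # retrieved: larger, readable
        })
    return nodes
```

Embedding the single sentence keeps the vector tight and specific; handing the LLM the window keeps the context coherent. That's the whole trick.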

response.source_nodes was another thing I didn’t realize I’d care about until the client asked for citations in the UI. In LangChain I was doing gymnastics with callbacks to surface source metadata. Here it’s just… on the response object. Saved probably half a day of plumbing work.
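The remaining plumbing is a small loop. A sketch of what turning retrieved nodes into UI citation entries can look like, assuming each node exposes its text plus a metadata dict with a file name and page (the field names here are illustrative, not LlamaIndex's):

```python
def build_citations(source_nodes, max_citations=4):
    """Collapse retrieved nodes into deduplicated citation entries for a UI."""
    citations, seen = [], set()
    for node in source_nodes:
        meta = node["metadata"]
        key = (meta.get("file_name"), meta.get("page"))
        if key in seen:  # same file + page already cited
            continue
        seen.add(key)
        citations.append({
            "source": meta.get("file_name", "unknown"),
            "page": meta.get("page"),
            "snippet": node["text"][:200],
        })
        if len(citations) >= max_citations:
            break
    return citations
```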

Where it frustrated me: the query engine abstraction goes opaque fast when you need to customize retrieval logic significantly. I spent a day confused about why my custom retriever wasn’t applying a metadata filter I’d set — turned out to be a precedence issue in how the query engine assembles its retrieval components internally. Found the answer in a GitHub issue (#14337, two months old), but that hidden behavior cost me real time. When LlamaIndex misbehaves, the error usually isn’t the helpful kind.

That said: if the core of your project is document-heavy retrieval with complex chunking requirements, the built-in primitives here are ahead of the defaults in the other frameworks. You’ll feel the difference.


Haystack Is Boring and I Mean That as High Praise

Before this project, I associated Haystack with enterprise teams who’d chosen it because procurement required something with a company behind it. I was wrong, and I’m correcting that publicly.

Haystack 2.x restructured around explicit, typed pipelines — every component declared, connections explicit, nothing implicit. Setting it up felt verbose. More boilerplate than either of the others. I figured I’d move through the eval phase quickly and move on.

Then something broke in all three frameworks on the same day (my fault — I’d changed the Qdrant schema without updating the retriever configs). In LangChain, I got a runtime error deep in the chain with a stack trace pointing at internal LangChain code, not mine. In LlamaIndex, it silently returned empty results — I only caught it because I was checking source_nodes counts. In Haystack: component name, expected input type, received input type, and the line in my pipeline definition where the mismatch was. Fixed in under ten minutes.

That’s not an accident. The Haystack architecture is designed for exactly this — you can inspect the pipeline graph, each component logs its inputs and outputs clearly, and the type system catches mismatches before they become runtime surprises. For teams maintaining this code six months from now, that’s worth a lot.
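You can see why this catches mistakes early with a toy version of the idea: validate declared input/output types when components are wired together, not when data flows. (This is a deliberate simplification to show the principle, not Haystack's actual implementation.)

```python
class PipelineTypeError(TypeError):
    pass

def connect(producer, consumer):
    """Refuse a connection whose declared types don't line up.

    Each component declares what it emits and what it accepts; a mismatch
    fails at wiring time with both component names in the message.
    """
    if producer["output_type"] is not consumer["input_type"]:
        raise PipelineTypeError(
            f"{producer['name']} produces {producer['output_type'].__name__}, "
            f"but {consumer['name']} expects {consumer['input_type'].__name__}"
        )
    return (producer["name"], consumer["name"])

# Illustrative components: a retriever emits a list of documents,
# a prompt builder consumes that list and emits a prompt string.
retriever = {"name": "retriever", "output_type": list, "input_type": str}
builder = {"name": "prompt_builder", "output_type": str, "input_type": list}
llm = {"name": "llm", "output_type": str, "input_type": str}
```

Wiring the retriever straight into the llm fails immediately, with a message that names both ends of the bad connection. That's the class of error report I got from Haystack when the Qdrant schema broke.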

The deepset team also ships Hayhooks, which wraps your pipeline in a REST API with minimal extra work. For this specific project — where the eventual owners are not Python developers — that mattered during handoff. Showing up with a running API and readable pipeline graphs is a different conversation than handing someone a Python repo and wishing them luck.

What I didn’t love: the community is smaller, and if you need an integration that LangChain has but Haystack doesn’t, you’re writing a custom component. I needed to pull data from an internal API with non-standard auth, and the LangChain loader already existed. In Haystack I wrote it from scratch — maybe three hours, not catastrophic, but real time.

For long-lived projects, regulated environments, or teams where the codebase needs to be maintainable by people who didn’t build it — Haystack’s verbosity pays dividends. For move-fast prototyping, it costs you upfront.


What the Retrieval Numbers Actually Looked Like

I ran an eval against 200 questions the client’s domain expert had written — real questions about real content. Not a rigorous academic study, but real enough to be useful. All three frameworks used identical Qdrant backends.

Retrieval precision (did the right document appear in the top 5?):
– LangChain (recursive text splitter, default settings): 71%
– LlamaIndex (SentenceWindowNodeParser + Cohere rerank): 84%
– Haystack (BM25 hybrid + Cohere rerank): 82%

LangChain’s number was dragged down by document categories where the default splitter was cutting badly — a smarter node parser probably closes most of that gap. The retrieval quality difference between frameworks is mostly about defaults, not fundamental architecture. Which means: don’t pick a framework because you think it retrieves better. Pick it based on your team’s ability to tune the retrieval configuration you actually need.
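The metric itself is nothing fancy. A minimal sketch of how I'd compute it, assuming each eval item pairs a question's gold document id with the ranked ids retrieval returned (the data shapes here are illustrative):

```python
def precision_at_k(eval_items, k=5):
    """Fraction of questions whose gold document appears in the top-k results.

    Each item: {"gold_doc": doc id, "retrieved": [doc ids, ranked best-first]}.
    """
    if not eval_items:
        return 0.0
    hits = sum(
        1 for item in eval_items
        if item["gold_doc"] in item["retrieved"][:k]
    )
    return hits / len(eval_items)
```

Run the same 200 questions through each framework's retriever, feed the results through this, and you have the table above. The hard part is the 200 expert-written questions, not the arithmetic.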

The more useful metric was time-to-working-pipeline:
– LangChain: 3 days (fast start, debugging tax after)
– Haystack: 4 days (slower setup, then very stable)
– LlamaIndex: 4.5 days (steeper start, paid off during tuning)

I’m genuinely not sure these numbers scale to a larger team — the debugging tax on LangChain probably distributes across more engineers and gets less painful. Your mileage will vary.


What I’m Actually Running in Prod

LlamaIndex.

Not because it’s perfect — it isn’t — but because for this specific problem (document-heavy RAG, retrieval quality as the primary metric, citation UI as a hard requirement), its built-in primitives were a better fit than what I assembled elsewhere. The node model matches how I was already thinking about the chunking problem. Source attribution is clean enough to build on directly. The retrieval pipeline felt less fragile than my equivalent LangChain setup.

If this had been a general-purpose AI application — agents, tool use, lots of different LLM calls, light retrieval — I’d probably still be on LangChain. The ecosystem advantage is real for that class of problem.

And if I were handing this project to a team that didn’t build it, or if we had a compliance requirement around logging every retrieval step, I’d have chosen Haystack and not second-guessed it. The verbosity is a feature in those contexts.

One thing none of these frameworks solved cleanly: eval tooling. I ended up running RAGAS externally regardless of which framework I was using. None of them have a good embedded eval story yet, and that gap keeps showing up in production. That’s a separate post — but worth knowing going in.

Pick the framework that maps to how you think about your problem, get a working pipeline running in a day, and then spend your optimization budget on retrieval strategy and eval. That’s where the quality actually comes from.
