RAG vs Fine-Tuning: What I Actually Learned After 6 Months of Building LLM Apps

Six months ago my team was building an internal support tool for a B2B SaaS company — about 120 employees, docs spread across Notion, Confluence, and a half-dead SharePoint instance from 2019. The ask was simple: a chatbot that could answer questions about internal processes without making stuff up.

Simple, right.

I had to make the call: RAG or fine-tune a model. I’d read the think pieces. I’d watched the YouTube explainers. None of them answered the questions I actually had: which approach fits this specific situation, and what will break first? So I spent about six months running both approaches across three different projects, and here’s what I actually found.


Why Most Comparisons Miss the Point

The framing of “RAG vs fine-tuning” is a bit of a false dichotomy, but before I get to that — the techniques solve genuinely different problems, and conflating them leads to expensive mistakes.

Here is the thing: fine-tuning changes how a model thinks. RAG changes what a model knows at query time. That distinction sounds obvious written out, but in practice it’s easy to reach for fine-tuning when you actually need RAG, because fine-tuning feels more “serious.” More ML-ish. More like you’re doing real AI work.

I made that mistake on my first project. More on that in a bit.

RAG — retrieval-augmented generation — keeps the base model frozen and instead pulls relevant chunks of text into the prompt at inference time. Your vector database stores embeddings of your documents; at query time you embed the user’s question, find the nearest neighbors, and stuff them into context. The model never “learns” your data — it just reads it fresh every time.
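Stripped of the library machinery, that retrieval step is just nearest-neighbor search over embedding vectors. A toy sketch in plain Python, with hand-made 3-dimensional "embeddings" standing in for the ~1500-dimensional real ones (a real system would call an embedding model and a vector store, but the mechanics are the same):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "index": chunk text -> made-up embedding vector
index = {
    "Remote work expenses are reimbursed up to $50/month.": [0.9, 0.1, 0.0],
    "The deploy pipeline runs on merge to main.":           [0.1, 0.8, 0.3],
    "Expense reports are due by the 5th.":                  [0.7, 0.2, 0.1],
}

def retrieve(query_embedding, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]

# A query embedded near the "expenses" region pulls both expense chunks
print(retrieve([0.85, 0.15, 0.05]))
```

The retrieved chunks then get pasted into the prompt, and the model answers by reading them, not by recalling them from its weights.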

Fine-tuning takes a pre-trained model and continues training it on your dataset. The weights change. The model bakes your domain knowledge into its parameters. It becomes a different model.

Both have their place. The problem is figuring out which place that is.


Where RAG Actually Shines (And It’s Not Just “Knowledge Updates”)

Most articles will tell you to use RAG when your data changes frequently. That’s true, but it undersells the technique. RAG shines in a few other scenarios that I didn’t fully appreciate until I was deep in the weeds.

When your source of truth is authoritative and you need citations. The internal support tool I mentioned — legal and HR docs, policy PDFs, process guides — RAG was almost the only sensible answer. Users needed to know where the answer came from, not just what the answer was. With RAG you can return source chunks alongside the response. With fine-tuning, the model just… says things. Confidently. With no provenance.

When your corpus is large and heterogeneous. Fine-tuning on 10,000 Confluence pages would require careful curation, cleaning, formatting into training examples, and a training run that costs real money. With RAG, I ingested everything into a Chroma instance in a few hours and had a working prototype by end of day.

When you can’t afford to be wrong about freshness. Fine-tuned models go stale. If your pricing changes or your API specs update, a fine-tuned model will confidently give old information. A RAG system — if your ingestion pipeline is solid — serves fresh data.

Here’s a simplified version of the ingestion + query loop I used on that project (this was with LangChain 0.2.x, which had actually cleaned up the API considerably from the 0.1 chaos):

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Chunking strategy matters more than people think.
# 512 tokens with 64 overlap worked well for our Confluence-style docs.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "]
)

# raw_docs: the Document list produced by your loaders (Confluence, Notion exports, etc.)
docs = splitter.split_documents(raw_docs)
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")

# At query time
retriever = vectorstore.as_retriever(
    search_type="mmr",          # maximal marginal relevance — reduces redundant chunks
    search_kwargs={"k": 6}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    retriever=retriever,
    return_source_documents=True  # this is the killer feature for trust
)

result = qa_chain.invoke({"query": "What's the policy on remote work expenses?"})
print(result["result"])
print([doc.metadata["source"] for doc in result["source_documents"]])

One thing I noticed: the chunk size and overlap parameters had way more impact on quality than I expected. I spent probably three days tuning those alone. Too small and the model lacks context; too large and you’re burning tokens on irrelevant text and the retrieval precision tanks. Your mileage may vary — it depends heavily on your document structure.
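To get a feel for the trade-off, here's a deliberately naive fixed-size splitter (a simplification of what RecursiveCharacterTextSplitter does, counting characters rather than tokens and ignoring separators) showing how size and overlap change how many chunks the same document produces. All numbers are illustrative:

```python
def split_text(text, chunk_size, overlap):
    """Naive fixed-size splitter with overlap between consecutive chunks."""
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2000  # stand-in for a 2000-character document

for size, overlap in [(256, 32), (512, 64), (1024, 128)]:
    chunks = split_text(doc, size, overlap)
    print(f"chunk_size={size:4d} overlap={overlap:3d} -> {len(chunks)} chunks")
```

Smaller chunks mean more of them, each with less context; larger chunks mean fewer, fatter ones that dilute retrieval precision. The sweet spot depends on how your documents are actually structured.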

The practical takeaway: If your problem is “the model doesn’t know my data,” try RAG first. It’s faster to iterate, cheaper to run at prototype stage, and gives you provenance for free.


Fine-Tuning: When the Pain Is Actually Worth It

Fine-tuning has a deserved reputation for being annoying to get right. Dataset curation, training runs, eval frameworks, versioning model checkpoints — it’s a lot. So when is it worth it?

Honestly, the answer I’ve landed on is narrower than most people think: fine-tune when you need to change behavior, not just knowledge.

A few cases where I’ve seen it work well:

Consistent output format. If your application needs the model to always return structured JSON in a very specific schema — and prompt engineering alone keeps slipping — fine-tuning on examples of correct behavior is surprisingly effective. I worked on a data extraction pipeline where we needed the model to extract entities from unstructured text in a precise schema. After two weeks of prompt engineering gymnastics, a fine-tune on ~800 labeled examples fixed it in one training run.

Domain-specific tone and terminology. A medical or legal application where specific phrasing matters, where “patient” vs “client” vs “subject” carries meaning. Fine-tuning on domain-specific text can bake in the right register in a way that’s hard to reliably achieve via prompting.

Latency and cost at scale. This one surprised me. A fine-tuned smaller model (say, GPT-4o mini on a specific task) can outperform a larger general model on that task while costing a fraction of the price. If you’re doing millions of inferences a month on a well-defined task, the economics shift significantly.
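The economics are easy to sanity-check on the back of an envelope. The prices below are placeholders, not real rates; plug in the current numbers from your provider's pricing page:

```python
# PLACEHOLDER prices per 1M tokens -- substitute your provider's real rates.
PRICE_PER_1M_INPUT  = {"large_general": 2.50, "small_finetuned": 0.30}
PRICE_PER_1M_OUTPUT = {"large_general": 10.00, "small_finetuned": 1.20}

def monthly_cost(model, calls, in_tokens_per_call, out_tokens_per_call):
    """Total monthly spend for a given call volume and token profile."""
    inp = calls * in_tokens_per_call / 1e6 * PRICE_PER_1M_INPUT[model]
    out = calls * out_tokens_per_call / 1e6 * PRICE_PER_1M_OUTPUT[model]
    return inp + out

# 2M calls/month, ~800 input tokens and ~200 output tokens per call
for model in PRICE_PER_1M_INPUT:
    print(model, monthly_cost(model, 2_000_000, 800, 200))
```

Even with made-up prices, the shape of the result holds: at high volume on a narrow task, a small fine-tuned model that matches the big model's quality can cut the bill by close to an order of magnitude.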

Here’s a stripped-down example of what a fine-tuning dataset entry looks like for OpenAI’s API — the format has been stable since late 2023:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a data extraction assistant. Extract entities and return valid JSON only."
    },
    {
      "role": "user",
      "content": "Contract signed by Meridian Holdings LLC on 2025-11-14 for $240,000 annual service."
    },
    {
      "role": "assistant",
      "content": "{\"party\": \"Meridian Holdings LLC\", \"date\": \"2025-11-14\", \"value\": 240000, \"currency\": \"USD\", \"term\": \"annual\"}"
    }
  ]
}

You need this format repeated hundreds to thousands of times with varied examples. The curation process is tedious. I’m not going to pretend otherwise.
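One thing that makes the tedium slightly less painful: validate every JSONL entry programmatically before uploading, since a single malformed line can sink a training job. A minimal checker, including the task-specific rule that the assistant's content must itself parse as JSON (that last check is specific to our extraction schema, not a general requirement):

```python
import json

def validate_entry(line):
    """Return a list of problems with one JSONL training entry."""
    problems = []
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    messages = entry.get("messages", [])
    roles = [m.get("role") for m in messages]
    if "assistant" not in roles:
        problems.append("no assistant message to learn from")
    for m in messages:
        if m.get("role") not in ("system", "user", "assistant"):
            problems.append(f"unknown role: {m.get('role')!r}")
        if not isinstance(m.get("content"), str):
            problems.append("content must be a string")
        # Task-specific: our assistant outputs must themselves be valid JSON
        if m.get("role") == "assistant":
            try:
                json.loads(m["content"])
            except (json.JSONDecodeError, TypeError):
                problems.append("assistant content is not valid JSON")
    return problems

good = '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "{\\"a\\": 1}"}]}'
bad  = '{"messages": [{"role": "user", "content": "hi"}]}'
print(validate_entry(good))  # []
print(validate_entry(bad))
```

Run it over the whole file before every training run; it catches the copy-paste mistakes that hand curation inevitably introduces.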

The practical takeaway: Fine-tune when the model’s behavior is wrong, not when its knowledge is lacking. If you catch yourself writing thousand-word system prompts to control output format, that’s usually a signal that fine-tuning would clean things up.


The Mistake That Cost Me Two Weeks

So — the mistake I promised. This is the part I wish someone had told me.

On my first LLM project (internal documentation assistant, different company, late 2024), I convinced myself we needed to fine-tune. The reasoning seemed solid: we had proprietary terminology, a specific tone of voice, and hundreds of internal documents. I spent two weeks building training data, ran a fine-tune on gpt-4o-mini, and… it was worse than the base model with a decent system prompt.

The problem: I had confused “the model doesn’t know our docs” with “the model behaves wrong.” Those are different problems. Fine-tuning injects style and behavior patterns from your training examples. It doesn’t inject factual document content reliably. I trained it on our documents formatted as Q&A pairs, and the model learned to sound like it knew things, while actually hallucinating details it didn’t retain from training.

What I should have done was RAG, immediately, for the knowledge problem — and maybe a thin layer of fine-tuning later if the tone was still off. Instead I spent two weeks and a non-trivial API bill going the wrong direction.

The tell? When I tested the fine-tuned model on questions about specific internal processes, it answered confidently but incorrectly about 30% of the time. The base model with RAG got the same questions right about 85% of the time, because it was reading the actual document.

Fine-tuning does not reliably make models memorize facts from your training data. That’s what RAG is for. This is probably the single most important thing to understand about the two techniques.


What I’d Actually Do, Given a Choice

Here’s my honest recommendation, not hedged with “it depends” because that’s a non-answer.

Start with RAG, almost always. It’s faster, it’s more auditable, it handles data freshness gracefully, and it’s easier to debug. When a RAG response is wrong, you can look at which chunks got retrieved and understand why. When a fine-tuned model is wrong, good luck unpacking that.

Add fine-tuning if — and only if — you have a specific behavioral problem. Inconsistent output format, wrong tone, poor performance on a narrow well-defined task. And make sure you have at least a few hundred high-quality training examples before you start, or you’re wasting a training run.

Consider both together. This is actually where things get interesting. A fine-tuned model that’s better at structured output plus RAG for knowledge retrieval can be a legitimately powerful combination. For the entity extraction pipeline I mentioned, we eventually combined a fine-tuned extraction model with a small RAG component that retrieved entity type definitions. The combination outperformed either approach alone by a meaningful margin.

I’m not 100% sure this combination scales elegantly beyond the mid-sized corpus we were working with — I’d want to see more data before recommending it universally. But on our use case (a few thousand documents, a specific extraction schema, ~50k inferences per month), it was worth the added complexity.

One more thing: whatever you choose, invest in your evaluation setup before you invest in your technique. If you can’t measure whether the model is right, you can’t know if your approach is working. I use a small golden dataset — 50-100 questions with verified correct answers — that I run against every new approach. It’s not glamorous. It’s probably the most valuable 20 hours I’ve spent on any of these projects.
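That golden-set loop is simple enough to sketch. `ask_model` and `judge` here are stand-ins for whatever system and scoring rule you plug in (your RAG chain, a fine-tuned endpoint, exact match, an LLM judge), not any particular library's API:

```python
def evaluate(golden_set, ask_model, judge):
    """Run a golden set through a model and report accuracy.

    ask_model(question) -> answer string
    judge(answer, expected) -> bool
    """
    results = []
    for item in golden_set:
        answer = ask_model(item["question"])
        results.append({
            "question": item["question"],
            "answer": answer,
            "correct": judge(answer, item["expected"]),
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy, results

# Usage with a trivial stand-in model and a keyword judge:
golden = [
    {"question": "Expense cap?", "expected": "$50"},
    {"question": "Report deadline?", "expected": "5th"},
]
fake_model = lambda q: "$50 per month" if "cap" in q.lower() else "unknown"
keyword_judge = lambda ans, exp: exp.lower() in ans.lower()
acc, details = evaluate(golden, fake_model, keyword_judge)
print(acc)  # 0.5
```

The point is less the code than the discipline: the same golden set, run against every candidate approach, is what let me put real numbers (30% vs 85%) on the fine-tune-vs-RAG comparison earlier.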

The field is moving fast, and a lot of the received wisdom from 2023 is already outdated. But the fundamental question — does your problem need updated knowledge or changed behavior — that one’s stayed stable. Start there.
