RAG vs Fine-Tuning: What I Learned After Choosing Wrong Twice

Six months ago my team shipped a customer support bot that confidently told users our return window was 60 days. It’s 30. The model had been fine-tuned on product documentation from 2023, and nobody — including me — thought to check whether that specific policy had changed. Three hundred support tickets later, we rebuilt the thing using RAG, and it’s been solid ever since.

That experience cost us real time and embarrassed a colleague in a quarterly review. So when I see blog posts framing this as a purely academic “it depends” discussion, I get a little annoyed. The choice between RAG and fine-tuning has concrete consequences, and I think developers are better served by hearing about those consequences than by another comparison table.

Here is what I actually know after building three LLM-powered features across two products and roughly 18 months of obsessing over this stuff.

Why Fine-Tuning Is Harder to Justify Than It Looks

Fine-tuning sounds like the obvious move when you want a model to behave in a very specific way. Train it on your data, bake in your domain knowledge, done. I thought exactly this when we built an internal code review assistant. We had 18 months of PR comments from senior engineers — surely that was good training signal.

We used OpenAI’s fine-tuning API (gpt-3.5-turbo at the time; this was mid-2024) and went through maybe four training runs before the outputs started feeling right. The model did genuinely get better at matching our team’s review style. Short, direct comments. No unnecessary praise. Links to internal style guide sections.

But then we hired three new engineers and updated the style guide. Suddenly the fine-tuned model was teaching bad habits — old conventions we’d explicitly deprecated. Retraining costs money and takes days of iteration. More importantly, it requires someone to curate a new training set, which is not a zero-effort task.

The gotcha I didn’t fully appreciate going in: fine-tuning teaches a model how to respond, not what to know. If your use case is “speak in a specific tone” or “always format output as JSON with these exact fields,” fine-tuning earns its complexity cost. If your use case is “answer questions accurately about our product,” you’re fighting an uphill battle against data staleness.
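For concreteness, here is roughly what behavioral training data looks like in OpenAI’s chat fine-tuning format — each JSONL line is one complete example conversation showing the model how to respond. The review comment below is made up for illustration, not from our actual training set:

```python
import json

# One training example in OpenAI's chat fine-tuning format.
# The "messages" list is a full conversation ending with the
# assistant turn you want the model to imitate.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a code reviewer. Be short and direct."},
            {"role": "user", "content": 'def get_user(id):\n    return db.query(f"SELECT * FROM users WHERE id={id}")'},
            {"role": "assistant", "content": "SQL injection risk: use a parameterized query. See style guide section 4."},
        ]
    },
]

# JSONL: one JSON object per line, no trailing commas.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Notice what’s in there: tone and structure, not facts. That is exactly why this format works for style and fails for knowledge.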

One other thing I noticed: the evaluation problem is real. How do you know your fine-tuned model is actually better? We ended up spending almost as much time on eval infrastructure as on the training itself. Without that, you’re basically guessing.
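To make that concrete, here is a stripped-down sketch of what I mean by eval infrastructure: pair each prompt with a programmatic check instead of eyeballing outputs. The cases and checks below are invented for illustration, not our real suite:

```python
from typing import Callable

# Each case: (prompt, check on the model's output).
# Checks encode the behavior you fine-tuned for — here, brevity
# and no filler praise. These examples are illustrative.
EVAL_CASES = [
    ("Review: def f(x): return x+1", lambda out: len(out) < 300),
    ("Review this formatting change", lambda out: "great job" not in out.lower()),
]

def run_evals(ask_model: Callable[[str], str]) -> float:
    """Return the pass rate of `ask_model` over EVAL_CASES."""
    passed = sum(1 for prompt, check in EVAL_CASES if check(ask_model(prompt)))
    return passed / len(EVAL_CASES)
```

Injecting the model as a plain `ask_model` callable means you can run the same cases against the base model, each fine-tuned checkpoint, and a prompted baseline, and compare pass rates directly.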

Practical takeaway: Fine-tuning makes sense when the problem is behavioral — consistent format, tone, task structure — and when that behavior is unlikely to change frequently. If you need the model to stay current with evolving information, you’re going to be retraining constantly or accepting staleness.

RAG Is Not a Silver Bullet Either (I Learned This the Hard Way)

After the return-policy disaster, I overcommitted to RAG as the answer for everything. I am not 100% proud of this. For about two months I was the person in every architecture discussion saying “just use RAG” — until we tried it for a use case where it genuinely struggled.

We were building a contract summarization tool. The idea was to chunk contracts into a vector store, retrieve relevant clauses, and have the model summarize or answer questions about them. Sounded straightforward. The retrieval part worked fine. The problem was that legal documents have complex cross-references — “as defined in Section 4.2(b)” — and our chunking strategy was splitting those definitions away from the clauses that referenced them. The model was answering based on incomplete context it couldn’t even recognize as incomplete.

RAG quality lives and dies on your chunking and retrieval strategy. This is not something most introductory tutorials spend enough time on. We eventually moved to a hybrid approach with larger chunks, overlap, and a reranking step using Cohere’s rerank API, which helped considerably. But it took weeks to get there.
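The overlap idea itself is simple to sketch. Something like this, character-based and with illustrative sizes — our real splitter was more involved:

```python
def chunk_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 400) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters,
    so a definition near a chunk boundary also appears in the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Each chunk starts `step` characters after the previous one,
    # leaving `overlap` characters duplicated across the boundary.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Overlap is a blunt instrument — it helps with definitions that sit near a boundary, but it does nothing for a cross-reference ten pages away, which is why we still needed reranking on top.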

Here is the thing: RAG also adds latency. Every query hits a vector database, does a similarity search, retrieves chunks, stuffs them into the context window, and then makes the LLM call. In our setup (Pinecone + GPT-4o), that added roughly 800ms to 1.2s on top of the model’s inference time. For an async task? Fine. For something that feels interactive? You notice it.

# A retrieval setup that bit us before we added reranking
from pinecone import Pinecone
from openai import OpenAI

pc = Pinecone(api_key="...")
index = pc.Index("contracts-v2")
client = OpenAI()

def retrieve_and_answer(query: str, top_k: int = 5) -> str:
    # Embed the query
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    ).data[0].embedding

    # Fetch top chunks — naive approach, no reranking
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(
        r["metadata"]["text"] for r in results["matches"]
    )

    # The model only knows what we put in context here.
    # If a critical cross-reference got split across chunks, it's invisible.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based only on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )

    return response.choices[0].message.content

We eventually replaced that top_k=5 naive retrieval with a two-stage approach: fetch 20 candidates, rerank to 5. That single change improved answer quality more than anything else we tried.
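The two-stage shape looks like this. I’ve swapped in a trivial token-overlap scorer where the real reranker goes, since the point here is the structure rather than the scoring — in production that score function is a cross-encoder or a hosted rerank API like Cohere’s:

```python
def two_stage_retrieve(query: str, candidates: list[str], final_k: int = 5) -> list[str]:
    """Stage 2 of retrieval: rerank a wide candidate set down to final_k.

    `candidates` is the output of a cheap stage-1 fetch (e.g. top 20 from
    the vector index). The token-overlap scorer below is a stand-in for a
    real reranker, which does a much deeper query-document comparison.
    """
    q_tokens = set(query.lower().split())

    def score(doc: str) -> int:
        return len(q_tokens & set(doc.lower().split()))

    return sorted(candidates, key=score, reverse=True)[:final_k]
```

The reason this structure wins: the vector index optimizes for recall over millions of chunks, while the reranker optimizes for precision over twenty. Asking one component to do both is what the naive `top_k=5` version was getting wrong.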

Practical takeaway: RAG is excellent when your data changes, when you need source attribution, or when you’re working with a large document corpus. But the retrieval pipeline is real engineering work. Budget for it.

When the Combination Actually Makes Sense

So after getting burned by both approaches in isolation, I started paying attention to cases where teams use them together — and honestly, this is where things get interesting.

The pattern that’s clicked for me: fine-tune for behavior and style, use RAG for knowledge. Think of it as separating what the model knows from how it responds.

A concrete example: we built a documentation assistant for an internal platform. The model needed to (a) always respond in a specific structured format, (b) stay current with documentation that changes every sprint, and (c) gracefully say “I don’t know” rather than hallucinating. We fine-tuned a smaller model on examples of well-formatted responses with appropriate uncertainty expressions. Then we plugged that fine-tuned model into a RAG pipeline that retrieves from our docs at query time.

The result was better than either approach alone. The model stopped hallucinating API signatures (because correct ones were in the context), and the output formatting was consistent (because we’d trained that in). The “I don’t know” behavior — which is notoriously hard to get right through prompting alone — was much more reliable post-fine-tune.

# The combined setup, simplified
def answer_with_finetuned_rag(query: str) -> str:
    # Retrieve relevant docs (retrieve_docs and format_context are helpers elided here)
    docs = retrieve_docs(query, top_k=5)
    context = format_context(docs)

    # Use fine-tuned model for stylistically consistent, uncertainty-aware response
    # ft:gpt-3.5-turbo-0125:our-org:docs-assistant:abc123 — our fine-tuned checkpoint
    response = client.chat.completions.create(
        model="ft:gpt-3.5-turbo-0125:our-org:docs-assistant:abc123",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a documentation assistant. "
                    "Answer only from the provided context. "
                    "If the context doesn't contain the answer, say so explicitly."
                )
            },
            {
                "role": "user",
                "content": f"Documentation context:\n{context}\n\nQuestion: {query}"
            }
        ],
        temperature=0.2  # lower temp for more consistent formatting
    )

    return response.choices[0].message.content

Your mileage may vary here — I am not suggesting this is always worth the operational complexity. Running a fine-tuned model means you own that checkpoint. When OpenAI deprecates the base model, you need to retrain. That’s real overhead.

Practical takeaway: The combination earns its complexity only when you have a clear behavioral problem that prompting alone can’t solve, and a dynamic knowledge base that needs to stay current. If one of those isn’t true, pick the simpler path.

The Decision I Actually Use Now

After all of this, my mental model has gotten pretty simple. I ask three questions in order:

1. Does the model need to know things that change faster than you can retrain?
If yes, you need RAG. This covers most real-world product use cases: support bots, documentation assistants, anything touching a database of content that teams are actively updating.

2. Does the model have a consistent behavioral problem that prompting can’t fix?
By this I mean: you’ve tried multiple system prompt variations, you’ve tried few-shot examples, and the output still isn’t reliable. If yes, fine-tuning is worth exploring. If you haven’t tried serious prompt engineering first, do that — it’s cheaper and faster to iterate.

3. Is the performance difference worth the operational cost?
Fine-tuned models on older base checkpoints can actually be cheaper per token than the latest flagship models. If you’re doing millions of calls a day, that math might work out in your favor even factoring in retraining costs. For most teams building internal tools, it doesn’t.
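If you want to sanity-check that math for your own workload, it is three lines of arithmetic. Every number below is a placeholder — look up current pricing for the models you are actually comparing:

```python
# Back-of-the-envelope cost comparison. All figures are hypothetical.
CALLS_PER_DAY = 2_000_000
TOKENS_PER_CALL = 1_500          # prompt + completion, rough average

FLAGSHIP_PRICE_PER_MTOK = 5.00   # placeholder blended $/1M tokens, flagship model
FT_PRICE_PER_MTOK = 2.00         # placeholder blended $/1M tokens, fine-tuned small model
RETRAIN_COST_PER_MONTH = 3_000   # placeholder: data curation + training runs

daily_tokens = CALLS_PER_DAY * TOKENS_PER_CALL
monthly_savings = 30 * daily_tokens * (FLAGSHIP_PRICE_PER_MTOK - FT_PRICE_PER_MTOK) / 1_000_000

# Fine-tuning wins on cost only when inference savings beat retraining overhead.
worth_it = monthly_savings > RETRAIN_COST_PER_MONTH
```

At the placeholder volumes above the savings dwarf the retraining cost; drop `CALLS_PER_DAY` by a couple of orders of magnitude — typical internal-tool traffic — and the inequality flips.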

For the vast majority of applications I see in the wild — and I am being specific here, not hedging — RAG is the right starting point. The data freshness problem alone disqualifies fine-tuning for most knowledge-retrieval use cases. Spend your energy on chunking strategy, embedding model selection, and retrieval quality before you reach for fine-tuning.

The exception is when you’re building something that needs to behave in a way that’s very hard to specify in a prompt — highly constrained output schemas, specialized reasoning patterns, tone requirements that need to be rock-solid. Then fine-tuning pays for itself.

And if someone on your team says “we should just fine-tune our docs into the model” — which I have heard more than once — ask them how they plan to keep it updated. That question tends to clarify things quickly.
