I Benchmarked AI Coding Assistants Against Real Work for Three Weeks

Three months ago my team lead asked me to pick one AI coding tool for our five-person team to standardize on. We’re a fintech startup — TypeScript on the frontend, Django on the backend, a fair amount of gnarly financial calculation logic. We couldn’t have everyone on different tools. License costs aside, the context switching and “wait, how did you do that?” conversations were killing velocity. So I spent three weeks doing what I normally hate doing: structured testing.

I tested GitHub Copilot (using the Claude Sonnet backend, which is now the default for most plans), Cursor running claude-sonnet-4-6, Claude Code (Anthropic’s CLI tool, v1.3.x at the time), and Windsurf. I deliberately left out Continue.dev — it’s excellent for teams that want full control over their model routing, but the setup overhead wasn’t realistic for us right now.

The Test Suite I Used (And Why Synthetic Benchmarks Are Mostly Useless)

Every “AI benchmark” I’ve read lists things like HumanEval scores or pass@k on code competitions. Those numbers are fine for comparing models in the abstract, but they tell you almost nothing about whether a tool will help you ship features faster. Writing a recursive Fibonacci function is not the same as untangling a 400-line Django serializer that someone bolted payment logic onto over two years.

My test suite had four categories:

Greenfield completion — writing a new API endpoint from a spec, starting from scratch. Closest to the synthetic benchmarks, least interesting.

Refactoring with context — taking a real module from our codebase (anonymized) and asking each tool to split it according to single-responsibility principles. The module was ~600 lines, touched six other files, and had some subtle shared state I’d deliberately left in.

Bug diagnosis — three actual bugs from our git history. A race condition in our webhook handler, a Decimal precision error in a fee calculation, and a React state update that was causing double-renders during checkout.

Multi-file edits — adding a new feature (a discount code system) end-to-end: Django model, serializer, view, frontend hook, and tests.

The last two categories are where tools diverged dramatically.

Code Completion: Good Across the Board, Differentiated at the Margins

Completion quality is no longer the main differentiator. All four tools get basic function implementations right most of the time. Where they split apart is context awareness — specifically, how far they look when filling in the blank.

Copilot (with the Claude backend) felt slightly more conservative than Cursor in its suggestions. It rarely suggested something wildly wrong, but it also occasionally gave me completions that ignored a type I’d defined three files away. That’s not a new complaint — GitHub issue #6821 covers a version of this — and it’s improved noticeably since late 2025, but it still happens.

Cursor’s completions were more aggressive, which is a double-edged thing. When it was right, it felt almost psychic — it would complete a function body in a way that matched our codebase conventions without me having to tell it anything. When it was wrong, it confidently introduced patterns we don’t use. I caught it generating a custom retry decorator when we already have one in utils/http.py. Twice.
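For context, the helper we already had is roughly this shape — a hypothetical reconstruction, not the actual contents of utils/http.py:

```python
import time
from functools import wraps

def retry(times=3, delay=0.0, exceptions=(Exception,)):
    # Roughly the shape of our existing utils/http.py helper
    # (hypothetical reconstruction, not the real code).
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: re-raise the last error
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"n": 0}

@retry(times=3)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(flaky())  # succeeds on the third attempt, prints "ok"
```

Nothing exotic — which is exactly why it stings when a tool regenerates it from scratch instead of finding it.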

Claude Code behaved differently because it’s CLI-first rather than inline-completion-first. You’re explicitly asking it to do things rather than accepting or rejecting suggestions. That changes the workflow in a meaningful way — less passive, more intentional. Whether that’s better depends entirely on how you work.

Windsurf sat somewhere in the middle. Completions felt a bit behind Cursor in aggressiveness, which I actually appreciated for our codebase. Less noise.

For pure completion quality, you’re not making a bad choice with any of these. Pick based on your workflow style, not completion accuracy metrics.

Where Multi-File Understanding Actually Gets Tested

This is the section that matters. The multi-file feature test (discount codes, end-to-end) exposed real gaps.

Here’s the Django model I started with for context:

# discounts/models.py
from django.db import models
from django.utils import timezone


class DiscountCode(models.Model):
    code = models.CharField(max_length=32, unique=True)
    discount_type = models.CharField(
        max_length=16,
        choices=[("percent", "Percent"), ("fixed", "Fixed")],
    )
    value = models.DecimalField(max_digits=10, decimal_places=2)
    max_uses = models.IntegerField(null=True, blank=True)
    times_used = models.IntegerField(default=0)
    valid_from = models.DateTimeField()
    valid_until = models.DateTimeField(null=True, blank=True)
    active = models.BooleanField(default=True)

    def is_valid(self):
        now = timezone.now()
        if not self.active:
            return False
        if self.max_uses and self.times_used >= self.max_uses:
            return False
        if self.valid_until and now > self.valid_until:
            return False
        return now >= self.valid_from

When I asked each tool to generate the corresponding serializer, view, and frontend hook, here’s roughly what happened:

Cursor got the serializer right immediately, with appropriate read-only fields for times_used. The DRF view it generated was functional but didn’t hook into our existing BaseAPIView class — it used vanilla APIView. It had no way to know about BaseAPIView without me pointing it to that file, which I hadn’t done. Fair enough.
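Reconstructed from memory rather than copied from Cursor's literal output, the serializer it produced looked something like this:

```python
# discounts/serializers.py — reconstructed sketch, not Cursor's exact output
from rest_framework import serializers

from .models import DiscountCode


class DiscountCodeSerializer(serializers.ModelSerializer):
    class Meta:
        model = DiscountCode
        fields = [
            "code", "discount_type", "value",
            "max_uses", "times_used",
            "valid_from", "valid_until", "active",
        ]
        # times_used is server-managed; clients must never set it
        read_only_fields = ["times_used"]
```

The read-only handling of times_used is the detail that matters — an AI-generated serializer that lets clients write that field is a subtle bug waiting to happen.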

Claude Code, because I was explicitly working in a shell session with broader codebase context, found BaseAPIView on its own and used it correctly. That was a genuine surprise. The generated view matched our existing patterns (including our custom error response format) without any prompting from me. The tradeoff is that I was running commands, reviewing output, and iterating — a slower loop than inline completion.

For the React hook, Cursor won cleanly. Its TypeScript inference was accurate, it followed our existing hook patterns (it had picked those up from the file I had open), and the generated code needed only minor tweaks.

Windsurf struggled on the multi-file task more than I expected. The serializer it generated was fine in isolation but didn’t match the field naming conventions we use elsewhere. Minor thing, but multiplied across a codebase it becomes friction.

The Gotcha That Cost Me an Afternoon

I want to be specific about a mistake I made during testing that taught me something real.

Early in week two, I was testing bug diagnosis — specifically the Decimal precision error. I gave Cursor the stack trace and the relevant calculation function and asked it to find the issue. It correctly identified that we were passing floats into a Decimal() constructor (classic Django money bug — Decimal(0.1) gives you Decimal('0.1000000000000000055511151231257827021181583404541015625')). It even suggested the fix:

# Wrong — float precision bleeds through
fee = Decimal(transaction_amount * 0.025)

# Correct — convert from string representation
fee = Decimal(str(transaction_amount)) * Decimal("0.025")

# Or better, if you control the inputs:
fee = Decimal("0.025") * transaction_amount  # assuming transaction_amount is already Decimal
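To make the failure mode concrete, here's a runnable illustration (19.99 is an arbitrary amount, not from our codebase):

```python
from decimal import Decimal

amount = 19.99  # arrives as a float — the root of the problem

# Float contamination: the binary representation of the product leaks in
wrong = Decimal(amount * 0.025)

# String round-trip preserves the decimal digits you actually meant
right = Decimal(str(amount)) * Decimal("0.025")

print(wrong)  # a long binary-fraction expansion, not 0.49975 exactly
print(right)  # exactly 0.49975
assert right == Decimal("0.49975")
assert wrong != right
```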

Good catch. So I got comfortable. I started accepting suggestions more quickly, especially for the webhook race condition test. Cursor proposed wrapping the handler in a database-level lock using select_for_update(). Technically correct in isolation. But it didn’t account for the fact that our webhook handler runs inside a Celery task, and we had a separate transaction management setup that made this approach cause a deadlock under load. I didn’t catch it until I ran the integration tests.
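For the record, the proposed pattern looked roughly like this — reconstructed from memory, with WebhookEvent and process() standing in as hypothetical names:

```python
# Reconstructed sketch of Cursor's suggestion — correct in isolation,
# but deadlock-prone for us because the handler already runs inside a
# Celery task with its own transaction management.
from django.db import transaction

def handle_webhook(event_id):
    with transaction.atomic():
        # Row lock serializes concurrent deliveries of the same event
        event = WebhookEvent.objects.select_for_update().get(pk=event_id)
        if event.processed:
            return  # another worker already handled it
        process(event)
        event.processed = True
        event.save(update_fields=["processed"])
```

In a plain request/response view, this is a textbook idempotency pattern. Nested inside our Celery transaction setup, the lock acquisition ordering changed and it deadlocked under load.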

The lesson isn’t “AI tools are dangerous” — I should have reviewed that suggestion more carefully. What the episode actually illustrated is that these tools are genuinely good at the isolated correct answer and genuinely unreliable about your system’s broader constraints. The more context you explicitly provide, the better. Assuming the tool “knows” your architectural decisions because it has read a few files is overconfident.

I’m not 100% sure any of these tools would handle that case correctly even with full codebase access — it requires understanding deployment topology, not just code.

Day-to-Day Feel Matters More Than You’d Think

The benchmarks tell one story. Using these things for eight hours straight tells another.

Copilot inside VS Code is the most invisible, in a good way. It doesn’t demand attention. The UX is mature. The chat panel has gotten genuinely useful over the past few months — the “explain this” and “fix this” workflows are smooth. But it’s also the most conservative tool in this comparison, and there are moments where I wanted it to take more initiative.

Cursor is the tool I’d recommend to someone who wants to feel like they’re moving fast. The composer/agent mode for multi-file edits is impressive when it works. When it doesn’t work, you can end up with a half-edited codebase that’s harder to reason about than where you started. I learned to commit before any significant Cursor agent run.

Claude Code rewards a different kind of developer. If you’re comfortable in the terminal and you like explicit, conversational interaction with your tools, it’s excellent. The context management is the best I’ve seen — you can point it at specific files, give it architectural constraints, and it actually uses that information. But if you want something that lives in your editor and disappears into the background, this is not it.

Windsurf is underrated. I think it gets less attention because Codeium as a company doesn’t dominate the discourse the way Anthropic and Microsoft do, but the product is solid. Autocomplete felt fast (noticeably lower latency than Cursor on my M3 MacBook Pro), and the context awareness in the IDE was better than I expected. If you or your team is cost-sensitive, this is worth a real look.

What I Actually Recommended

We landed on Cursor for the team, with one important caveat I built into our workflow: all agent/composer runs start with a fresh git commit. No exceptions. This took about a day to get everyone doing consistently, but it’s eliminated the “what did Cursor even do to these files” confusion.
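The habit itself is nothing fancier than a checkpoint commit before letting the agent loose — shown here in a throwaway repo; the commit message convention is just ours:

```shell
# Demo in a throwaway repo: checkpoint before any agent run.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "dev"
echo "work in progress" > notes.txt

# The actual habit: stage everything and commit before the agent touches files
git add -A
git commit -qm "checkpoint: pre-agent run"
git rev-parse --short HEAD  # proof the checkpoint exists
```

Afterward, `git diff HEAD` shows exactly what the agent changed, and `git reset --hard HEAD` undoes a bad run in one command.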

For my own work specifically — solo exploration, architectural spikes, code I really need to understand deeply — I keep Claude Code installed alongside it. The two tools aren’t redundant. Cursor handles the execution; Claude Code handles the understanding.

Copilot is genuinely good and I’d have no objection if someone on the team preferred it, but the slightly lower ceiling on multi-file tasks made it hard to recommend as the team default. Windsurf is my honest runner-up — if Cursor’s pricing changes or our needs shift, it’s the first place I’d look.

One thing I’d push back on in most comparisons I’ve read: people spend too much time benchmarking completion quality and not enough time thinking about error recovery. The question isn’t just “does the AI write good code?” It’s “when it writes bad code — and it will — how fast can you figure that out and fix it?” Tools that are more conservative, or that give you more explicit output to review, often win on that second metric even when they lose on the first.

Your setup is probably different from mine. Five-person fintech team, mixed TypeScript and Django, compliance considerations that make us careful about what context we send to third-party services — all of that shaped what mattered in my testing. But if your situation is similar, I’d go Cursor with disciplined git hygiene, and Claude Code for the sessions where you need to actually think.
