The $340 invoice showed up on a Tuesday in late January. Not wild money in isolation, but this was for an internal RAG tool serving exactly 11 people — my team at a fintech startup — and when I did the math, that was $30 per user per month for what amounted to a document Q&A system sitting on top of our internal wikis and runbooks.
I’d chosen Pinecone back in mid-2024 because it was the obvious choice at the time. Every RAG tutorial pointed at it, the docs were solid, and I got something working in an afternoon. But “working in an afternoon” and “worth $340/month for 11 users” are different questions.
So I spent two weeks running pgvector in parallel, measuring what actually mattered for my use case, and eventually pulled the plug on Pinecone. Here’s what I found.
What Pinecone Was Actually Good At — And Where It Started to Annoy Me
Our RAG setup was pretty standard: chunk internal documents into 512-token segments, embed with OpenAI’s text-embedding-3-small, store in Pinecone, retrieve top-5 chunks at query time, stuff them into a GPT-4o prompt. The whole thing ran as a FastAPI service our team accessed through a Slack bot.
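As a point of reference, the chunking step looks roughly like this. The helper is my sketch, not the production code: it approximates the 512-token windows with word counts (the real pipeline would use the embedding model's tokenizer), and the overlap value is an assumption the post doesn't specify:

```python
def chunk_document(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping word windows.

    Words stand in for tokens here; a real pipeline would count
    model tokens instead. `overlap` keeps context across boundaries.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```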
Pinecone’s managed infrastructure meant I spent zero time thinking about index configuration in those early months. The Python SDK is genuinely pleasant, and the whole thing just worked.
The metadata filtering is where things started to crack. Our documents have metadata attached — source system, department, last-updated date — and I needed to filter on these at query time. Pinecone supports it, but the behavior when you combine strict filters with semantic search gets strange. I filed a support ticket in November when I noticed that filtering by last_updated > 2024-01-01 was silently reducing recall in ways I couldn’t explain. The response was essentially “this is expected behavior with how we handle filtered ANN search.” Technically valid, but not exactly useful when you’re trying to figure out why your RAG tool keeps surfacing stale content.
The other irritant: pricing. We were on Starter at $70/month, hit a namespace limit, and got bumped to the next tier. That’s when $340 entered the picture.
Pinecone is solid if you have a simple retrieval use case and the budget to match. The pain shows up when you want fine-grained control over how filtering interacts with vector search — which, in my experience, is exactly what every real-world RAG implementation eventually needs.
The Setup Was Suspiciously Easy
I’d assumed moving to pgvector would mean standing up a new PostgreSQL instance somewhere. But we already had a Postgres 15 database running for application data on Railway. Adding pgvector was literally one command and a schema change:
-- Enable the extension (PostgreSQL 15, pgvector 0.7.x)
CREATE EXTENSION IF NOT EXISTS vector;

-- Document chunks table
CREATE TABLE doc_chunks (
    id           BIGSERIAL PRIMARY KEY,
    content      TEXT NOT NULL,
    source       TEXT NOT NULL,
    department   TEXT,
    last_updated TIMESTAMPTZ,
    metadata     JSONB,
    embedding    VECTOR(1536)  -- dimensions for text-embedding-3-small
);

-- HNSW index for approximate nearest neighbor search
-- Chose HNSW over IVFFlat because we do continuous inserts;
-- IVFFlat's clusters are fixed at build time, so it needs periodic
-- reindexing to stay accurate as inserts shift the data distribution
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
I picked HNSW over IVFFlat after reading through the pgvector README and a few GitHub issues — the discussion in pgvector issue #285 is worth reading if you’re facing this choice. Short version: IVFFlat is faster to build and leaner on memory, but it needs reindexing as you insert new rows. HNSW handles incremental inserts without that, which mattered since we were adding new documents daily.
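Worth knowing alongside m and ef_construction: pgvector also exposes a search-time parameter, hnsw.ef_search (default 40), that trades latency for recall without rebuilding the index. A sketch; the value 100 is illustrative, not something from my tuning:

```sql
-- Widen the HNSW candidate list at query time (default is 40);
-- higher values improve recall at the cost of latency
SET hnsw.ef_search = 100;

-- Or scope it to one transaction so it doesn't leak to other queries:
BEGIN;
SET LOCAL hnsw.ef_search = 100;
-- ... similarity query here ...
COMMIT;
```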
The query side ended up cleaner than I expected:
import asyncpg
from openai import AsyncOpenAI
# pgvector's asyncpg helper; without it, asyncpg can't encode a Python
# list as the Postgres vector type. Call `await register_vector(conn)`
# once per connection, e.g. in the pool's init hook.
from pgvector.asyncpg import register_vector

client = AsyncOpenAI()

async def retrieve_chunks(
    query: str,
    conn: asyncpg.Connection,
    department: str | None = None,
    top_k: int = 5,
) -> list[dict]:
    response = await client.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    )
    query_embedding = response.data[0].embedding

    # The <=> operator is cosine distance: lower means more similar
    where_clause = "WHERE 1 - (embedding <=> $1::vector) > 0.72"
    params: list = [query_embedding, top_k]
    if department:
        where_clause += " AND department = $3"
        params.append(department)

    rows = await conn.fetch(
        f"""
        SELECT content, source, department, last_updated,
               1 - (embedding <=> $1::vector) AS similarity
        FROM doc_chunks
        {where_clause}
        ORDER BY embedding <=> $1::vector
        LIMIT $2
        """,
        *params,
    )
    return [dict(row) for row in rows]
One thing worth flagging: the similarity threshold (0.72 in my case) took real tuning. I started at 0.8 and kept getting empty results for vague queries. Dropped it to 0.7 and started getting noise. Landed on 0.72 after hand-checking about 200 test queries over a few days. Your mileage will vary; it depends heavily on your embedding model and how semantically clustered your documents are.
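To make the threshold concrete: <=> returns cosine distance, so the WHERE clause keeps rows whose cosine similarity, i.e. 1 minus that distance, clears the cutoff. A pure-Python illustration of the quantity being thresholded, with made-up three-dimensional vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1 - cosine_distance: what the SQL computes as 1 - (embedding <=> query)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A chunk clears the 0.72 cutoff only if it points in nearly
# the same direction as the query embedding
query = [0.9, 0.1, 0.4]
relevant = [0.8, 0.2, 0.5]
unrelated = [-0.3, 0.9, 0.1]

assert cosine_similarity(query, relevant) > 0.72
assert cosine_similarity(query, unrelated) < 0.72
```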
I also hit a dumb mistake during migration: I initially wired up pgvector with a synchronous psycopg2 connection pool instead of asyncpg. As soon as more than a couple team members hit the Slack bot simultaneously, latency spiked and I couldn’t figure out why. Took me a few hours to realize I’d just serialized all my async queries through a blocking connection. Obvious in retrospect. asyncpg is the right call here.
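The failure mode is easy to reproduce without a database: simulate a 50ms query and run five of them through a blocking call versus concurrently. This simulation is mine, not the original bot code, but it shows the shape of the problem:

```python
import asyncio
import time

QUERY_TIME = 0.05  # pretend each vector query takes 50ms

def blocking_query() -> None:
    time.sleep(QUERY_TIME)  # psycopg2-style: blocks the whole event loop

async def async_query() -> None:
    await asyncio.sleep(QUERY_TIME)  # asyncpg-style: yields while waiting

async def main() -> tuple[float, float]:
    # Five "simultaneous" Slack-bot requests through a blocking driver:
    start = time.perf_counter()
    for _ in range(5):
        blocking_query()  # each call serializes behind the last
    serial = time.perf_counter() - start

    # The same five requests with a non-blocking driver:
    start = time.perf_counter()
    await asyncio.gather(*(async_query() for _ in range(5)))
    concurrent = time.perf_counter() - start
    return serial, concurrent

serial, concurrent = asyncio.run(main())
assert serial > concurrent  # roughly 250ms vs 50ms
```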
Migration from Pinecone took about half a day total. Mostly waiting on OpenAI’s API to re-embed our 14,000 document chunks. The actual backfill script was 40 lines.
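For flavor, the core of such a backfill is just batching: OpenAI's embeddings endpoint accepts a list of inputs per request, so the 14,000 chunks go through in a few hundred calls. The helper and batch size below are mine; the commented outline marks where the API and INSERT calls would sit:

```python
from collections.abc import Iterator

def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield successive fixed-size batches (the last one may be short)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Sketch of the backfill loop (API/DB calls elided, names assumed):
# for batch in batched(all_chunk_texts, 100):
#     resp = await client.embeddings.create(
#         input=batch, model="text-embedding-3-small"
#     )
#     await conn.executemany(
#         "INSERT INTO doc_chunks (content, embedding) VALUES ($1, $2)",
#         list(zip(batch, (d.embedding for d in resp.data))),
#     )
```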
Two Weeks of Real Traffic — The Result I Didn’t Expect
I ran both systems in parallel for two weeks: every query hit Pinecone, then hit pgvector. I logged latency, top results, and did spot-checks on whether returned chunks were actually relevant.
Latency: pgvector averaged 28ms per query (p50) vs Pinecone at 22ms. Pinecone is faster. Not by a margin that matters for our use case — the OpenAI embedding call at the start takes 180-300ms anyway — but the gap is real. If you’re running high-frequency retrieval without an upstream embedding step, pay attention to this number.
Recall on simple factual queries: roughly equivalent. I wasn’t running formal benchmarks, just spot-checking real queries from the team, so take that with appropriate skepticism.
Here’s the part that genuinely surprised me. I expected metadata filtering to be a wash, or maybe slightly worse on pgvector. It was substantially better. Because pgvector lives inside PostgreSQL, you get the full WHERE clause with proper query planning. I could combine semantic similarity with:
WHERE department = 'engineering'
  AND last_updated > NOW() - INTERVAL '90 days'
  AND 1 - (embedding <=> $1::vector) > 0.72
And PostgreSQL would actually execute it efficiently, using the right indexes for each part of the predicate. Pinecone’s metadata filtering works inside their ANN search — filters reduce the candidate pool during the vector search itself, which can hurt recall in non-obvious ways depending on how selective your filter is. For structured retrieval where freshness and department filtering mattered a lot, this made a meaningful difference in result quality.
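If you want to see this for yourself, EXPLAIN ANALYZE on the combined predicate shows how the planner splits the work; a sketch against the doc_chunks schema above, with the embedding literal elided:

```sql
EXPLAIN ANALYZE
SELECT content,
       1 - (embedding <=> '[0.01, ...]'::vector) AS similarity  -- literal elided
FROM doc_chunks
WHERE department = 'engineering'
  AND last_updated > NOW() - INTERVAL '90 days'
ORDER BY embedding <=> '[0.01, ...]'::vector
LIMIT 5;
```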
I thought I was making a cost/quality tradeoff. Turns out for our specific use case, pgvector was both cheaper and more accurate on filtered queries. Didn’t see that coming.
One Fewer Service, Which Matters More Than It Should
The cost drop — $340 down to roughly $40 in incremental Railway Postgres spend — was what prompted me to even look at this. But the reason I’m staying on pgvector is simpler: one fewer external service.
Before: PostgreSQL for application data, Pinecone for vectors, Redis for caching, OpenAI for embeddings. Now: PostgreSQL, Redis, OpenAI. One fewer credential to rotate. One fewer status page to check when something’s slow and I can’t tell which component is to blame.
I also get real transactions. When a document is deleted in our app, I delete from doc_chunks in the same transaction. With Pinecone, I had to maintain a separate cleanup job to keep the vector store in sync — and it wasn’t always right. I found orphaned vectors in Pinecone pointing to documents we’d deleted months earlier. We were occasionally surfacing content from dead sources. Not a security issue for us, just subtle noise that was hard to notice unless you were actively looking.
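The cleanup logic collapses into a single transaction; the `documents` table and the `doc_id` metadata key below are my stand-ins, since the post doesn't show the application schema:

```sql
BEGIN;
-- `documents` and the doc_id linkage are assumed names, not the real schema
DELETE FROM documents WHERE id = $1;                    -- application row
DELETE FROM doc_chunks WHERE metadata->>'doc_id' = $1;  -- its vectors go with it
COMMIT;
```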
One honest caveat on scale: I’m not confident pgvector holds up beyond a few million vectors without more aggressive index tuning. Our 14k chunks indexed in 3 seconds; I’ve read reports of memory pressure at multi-million scale, and the behavior there depends a lot on your server specs and HNSW configuration. At 50 million vectors, I’d want proper benchmarks before committing.
My Actual Recommendation
If you’re a solo dev or small team already running PostgreSQL, and your vector store is under a few million embeddings: stop paying for Pinecone and use pgvector. The operational simplicity is worth it on its own, and if you care about metadata filtering — which most real-world RAG use cases do — you’ll probably get better results anyway.
If you’re at a company running dedicated ML infrastructure at scale, or you specifically need Pinecone’s pod autoscaling or multi-tenant namespace management at thousands of namespaces, keep Pinecone. The managed experience is real and the enterprise features exist for a reason.
Anyway. The $340 invoice was the nudge I needed to actually evaluate this rather than stick with the first thing that worked. The first thing that worked wasn’t the right thing. That’s usually how it goes.