Our PostgreSQL cluster crossed 10TB sometime in late 2024. I remember because that was also the week our reporting service started returning HTTP 504s intermittently, and the on-call rotation became a lot less fun. Seven engineers on the team, one database, and a growing pile of angry Slack messages from the analytics team.
This is what I learned over the next two months of digging.
How a 40-Second Query Killed Our Reporting Service
The first sign was a query in our reporting pipeline that had been running in under 500ms for months — suddenly averaging 40 seconds. Not always. Just… sometimes. Which is almost worse than always, because intermittent slowness is genuinely hard to reproduce.
My first instinct was “missing index.” Wrong. Or — okay, not entirely wrong, but that wasn’t the root cause. The table in question (user_events, ~800M rows, partitioned by month) had indexes. The query planner was using them. But the EXPLAIN ANALYZE output showed something I hadn’t seen before at this scale:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
user_id,
event_type,
COUNT(*) AS event_count
FROM user_events
WHERE
created_at >= '2024-10-01'
AND created_at < '2024-11-01'
AND event_type = ANY(ARRAY['click', 'purchase', 'view'])
GROUP BY user_id, event_type;
Index Scan using idx_user_events_created_at on user_events_2024_10
(cost=0.57..892341.22 rows=2847293 width=24)
(actual time=0.843..38421.337 rows=2847293 loops=1)
Buffers: shared hit=12847 read=1829403 dirtied=0 written=0
I/O Timings: read=36891.442 write=0.000
See that read=1829403 vs hit=12847? Buffer hit ratio under 1%. Everything was going to disk. But the index existed and was supposedly being used, so I spent two days assuming the problem was work_mem or shared_buffers being too low. Tuned both. Query was still slow.
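Before zooming in on one query, it's worth checking the same ratio at the database level. This is a standard query against pg_stat_database (not from our runbook specifically; the counters accumulate since the last stats reset, so recent resets skew it):

```sql
-- Rough cache hit ratio for the current database. On a hot OLTP
-- workload, values well below ~0.99 usually mean you're paying
-- for disk reads somewhere.
SELECT
  datname,
  blks_hit,
  blks_read,
  round(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 4) AS hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
```

A healthy database-wide ratio with one terrible query, like ours, points at a localized problem rather than undersized shared_buffers.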
The actual culprit was index bloat from dead tuples. Autovacuum hadn’t kept up with a particularly write-heavy period a few weeks earlier, and the index had swelled with dead entries. The planner was traversing enormous amounts of garbage to reach live rows.
What pg_stat_statements Actually Told Me (vs. What I Expected)
Before I arrived at the autovacuum diagnosis, I spent a few days inside pg_stat_statements. If you’re not already querying this regularly, you’re missing probably the most useful thing Postgres ships with.
SELECT
query,
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(total_exec_time::numeric, 2) AS total_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
rows / NULLIF(calls, 0) AS avg_rows
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_exec_time DESC
LIMIT 20;
A few things surprised me. The query I was fixated on — the 40-second monster — wasn’t even in the top 5 by total execution time. Three other queries were collectively burning more CPU, each running thousands of times per day. Short calls, individually fast, but hammering the database constantly from a background job polling every 15 seconds. Nobody had noticed because no single call looked alarming.
So the reporting crisis was user-visible and urgent. But the background polling was quietly degrading our read replicas, and I hadn’t known about it until I looked here.
The stddev_exec_time column is something I now look at almost more than mean. A query with 200ms mean and 800ms stddev concerns me more than a flat 300ms mean. High variance usually signals something inconsistent — sometimes cache-warm, sometimes cold, sometimes hitting lock contention.
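A query I've found useful for surfacing those variance outliers. The thresholds here are arbitrary placeholders; tune them to your workload:

```sql
-- Statements with high run-to-run variance relative to their mean.
-- cv (coefficient of variation) > 1 means the spread exceeds the average.
SELECT
  left(query, 60) AS query_head,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  round((stddev_exec_time / NULLIF(mean_exec_time, 0))::numeric, 2) AS cv
FROM pg_stat_statements
WHERE calls > 100
  AND mean_exec_time > 10   -- skip trivially fast statements
ORDER BY cv DESC NULLS LAST
LIMIT 20;
```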
One thing I learned after a planned failover: pg_stat_statements only persists its counters across clean shutdowns (controlled by pg_stat_statements.save, which defaults to on), and a crash, an immediate shutdown, or a promotion to a replica starts you from scratch regardless. I lost two weeks of history and had to start fresh. If the history matters to you, snapshot the view into a table on a schedule rather than trusting the restart behavior.
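For reference, wiring the extension up takes a couple of settings in postgresql.conf. The parameter names are the stock ones; the max value below is illustrative, not a recommendation:

```ini
# postgresql.conf -- shared_preload_libraries needs a server restart
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000   # distinct statements tracked
pg_stat_statements.save = on     # persist across *clean* shutdowns (the default)
```

Then run `CREATE EXTENSION pg_stat_statements;` in each database you want to query the view from.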
Why My Indexes Were Wrong, and What Index Bloat Actually Does to You
Here is the thing about partial indexes: I knew they existed. I’d used them on smaller databases. But at 800M rows, the difference between a full index and a well-scoped partial index is enormous — in query time and in ongoing maintenance cost.
We had:
CREATE INDEX idx_user_events_event_type
ON user_events (event_type, created_at);
Meanwhile, roughly 90% of our queries touched only the last 90 days and a handful of event types, so the index was covering three years of data we almost never queried. After analyzing access patterns, I replaced it with:
CREATE INDEX CONCURRENTLY idx_user_events_recent_active
ON user_events (event_type, user_id, created_at)
WHERE created_at >= '2024-01-01'
AND event_type IN ('click', 'purchase', 'view', 'signup');
Smaller index, faster scans, less bloat surface. Query time on the reporting query dropped from 40 seconds to about 1.2 seconds on the first run. I was genuinely shocked — I’d been expecting maybe a 50% improvement, not 97%.
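To quantify that kind of win, I now check index sizes and scan counts before and after the surgery. This is a standard catalog query, nothing specific to our schema; on a partitioned table each partition's indexes show up as separate rows:

```sql
-- On-disk size and scan count for every index on the table.
SELECT
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan AS scans_since_stats_reset
FROM pg_stat_user_indexes
WHERE relname LIKE 'user_events%'
ORDER BY pg_relation_size(indexrelid) DESC;
```

Two caveats with the partial index itself: the planner only uses it when a query's WHERE clause provably implies the index predicate, and that hardcoded created_at cutoff goes stale over time, so the index needs rebuilding with a fresh date on a schedule.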
One warning: I pushed this on a Friday afternoon. Specifically, a Friday at 4pm. Building an index concurrently on a table this size runs for hours (ours took 3.5 in total; and note that on a partitioned table like user_events, CREATE INDEX CONCURRENTLY has to run per partition, with the results attached to a parent index created with ON ONLY). It never takes an exclusive lock, so reads and writes keep flowing, but it has to wait out long-running transactions at phase boundaries and it hammers disk I/O for the duration. Production slowed noticeably for about 20 minutes near the end. The analytics team filed a P2. I learned my lesson about Friday deployments the hard way.
For the bloat specifically — I diagnosed it with pgstattuple:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('idx_user_events_created_at');
The dead_tuple_percent was sitting at 31%. Anything over roughly 10% is worth addressing. A REINDEX CONCURRENTLY brought it down under 2% and immediately improved our buffer hit ratio. That single operation recovered most of the query time regression before I'd even touched the index structure itself.
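The rebuild itself is one statement on PostgreSQL 12 and later. Like CREATE INDEX CONCURRENTLY, it takes longer than a plain REINDEX and competes for I/O while it runs, so schedule it off-peak:

```sql
-- Rebuild the bloated index in place without blocking writes.
REINDEX INDEX CONCURRENTLY idx_user_events_created_at;

-- Then re-check bloat with pgstattuple to confirm the rebuild helped.
SELECT * FROM pgstattuple('idx_user_events_created_at');
```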
Autovacuum Was the Thing Nobody on My Team Wanted to Talk About
Autovacuum is mentioned in every PostgreSQL tutorial. Most teams then ignore it until something breaks. We ignored it. Something broke.
user_events was receiving about 2 million writes per day. The default autovacuum settings — sized for much smaller tables — couldn’t keep up. Dead tuples accumulated, bloated the indexes, and eventually pushed the planner toward worse choices.
The fix was per-table autovacuum configuration:
ALTER TABLE user_events SET (
autovacuum_vacuum_scale_factor = 0.01, -- trigger when 1% of rows are dead (default 0.2)
autovacuum_analyze_scale_factor = 0.005,
  autovacuum_vacuum_cost_delay = 2 -- 2ms sleep between cost-limited rounds; this has been the default since PG 12 (it was 20ms before, so lower it explicitly on PG 11 and older)
);
The default autovacuum_vacuum_scale_factor of 0.2 on an 800M-row table means autovacuum doesn’t trigger until 160 million dead rows have accumulated. That’s completely unworkable for a write-heavy table. Dropping it to 0.01 triggers at 8 million rows — still substantial, but manageable. Combined with bumping autovacuum_max_workers from 3 to 6 globally, the bloat stabilized within a week.
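You can sanity-check that arithmetic straight from the catalog. Autovacuum triggers at threshold + scale_factor × reltuples, with a default autovacuum_vacuum_threshold of 50; the query below hardcodes the default scale factor and our tuned one for illustration:

```sql
-- Approximate dead-row counts at which autovacuum kicks in.
-- reltuples is the planner's row estimate, refreshed by ANALYZE.
SELECT
  relname,
  reltuples::bigint AS approx_rows,
  (50 + 0.2  * reltuples)::bigint AS trigger_at_default_0_2,
  (50 + 0.01 * reltuples)::bigint AS trigger_at_tuned_0_01
FROM pg_class
WHERE relname = 'user_events';
```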
I’m not 100% sure this exact configuration scales beyond our specific write pattern. If your table takes large bulk deletes rather than steady row churn, the math changes. Your mileage may vary.
What genuinely caught me off guard: none of our monitoring dashboards were surfacing any of this. We had query time alerts, CPU alerts, disk alerts. We had zero visibility into n_dead_tup on our largest tables. I added a Grafana panel pulling from pg_stat_user_tables — n_dead_tup, last_autovacuum, n_tup_del — and within two days it revealed three other tables with similar bloat building up, none of which had degraded yet into visible slowness.
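The panel is driven by a query along these lines. Mine also joins in table sizes; this is the minimal version using only the columns mentioned above:

```sql
-- Dead-tuple pressure per table, biggest offenders first.
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0), 3)
    AS dead_ratio,
  last_autovacuum,
  n_tup_del
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```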
That dashboard has saved us more than any other single change from this whole project.
What I’d Do Differently Starting from Zero
The biggest mistake wasn’t any specific technical choice. It was treating database performance as a reactive problem — queries slowed, we investigated; bloat grew, we responded. We had no proactive visibility into what was quietly building up.
Honest answer: I’d flip the order of everything I actually did. The Grafana panel pulling from pg_stat_user_tables should have been week one, not week eight. Slow query logging should have been set to 500ms from the start, not 5 seconds — at 5 seconds you’re only seeing things that are already on fire. At 500ms you catch them while they’re still cheap to fix. I knew this in theory. I didn’t actually do it.
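Turning that logging on is two statements; log_min_duration_statement is the stock parameter and takes effect on a config reload, no restart needed:

```sql
-- Log every statement slower than 500ms. 0 logs everything, -1 disables.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();
```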
pg_stat_statements should be persisted across restarts and reviewed weekly — by a human, not just alert rules. Alerts catch acute failures. The queries that gradually degrade over three months, barely over your SLA at first and then slowly worse, only show up if someone actually sits down and looks at stddev_exec_time outliers on a schedule.
On indexes: treat them as a liability until proven otherwise. Every index is a maintenance contract autovacuum has to honor on every write. We dropped 14 of the 40+ indexes on user_events and write performance measurably improved — which still surprises me a little when I think back on it. Model access patterns first, use partial indexes for tables with temporal skew, and resist the urge to index your way out of a slow query before you’ve checked everything else.
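Our drop candidates came out of pg_stat_user_indexes: anything with zero scans since the last stats reset is suspect. Two cautions before dropping: the counters are per server, so check your replicas too, and keep indexes that back unique or primary key constraints regardless of scan count:

```sql
-- Indexes never used for scans on this server, largest first.
SELECT
  schemaname,
  relname,
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```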
And get comfortable with EXPLAIN (ANALYZE, BUFFERS), not just EXPLAIN. Buffer hit statistics — shared hits versus reads from disk — usually explain more than the row estimates do. Row estimates are frequently wrong anyway; buffer stats are ground truth.
We’re now running at roughly 95ms p95 on the reporting queries that were timing out four months ago. Same hardware, no infrastructure changes. Just profiling, index surgery, and finally taking autovacuum seriously.