My PostgreSQL Wake-Up Call: A 50ms Query Turned Into 30 Seconds
February 12, 2026

Let me tell you about a bug that made me question everything I thought I knew about databases.
It started like any normal day.
Initially, the dashboard loads quickly and seems to work fine, with query times around 50ms and normal CPU temperatures, keeping the reader interested in the normal state before the issue arises.
On next day, same dashboard and query, but now the query takes 30 seconds, and the CPU is overheating. Users are already complaining, and here’s the strange thing: we haven’t deployed anything, and traffic was normal. So what changed?
🔍 Step 1: The Investigation
First instinct — check the query.
I ran:
And boom:
Hold on a minute… what?
Why is it performing a sequential scan? There’s already an index on user_id. This doesn’t make sense.
⚡ The “Magic Fix” That Felt Like Luck
After some digging (and a bit of panic-Googling), I ran:
And just like that, the query time returned to 50ms and the CPU performance normalised. It felt like magic but honestly, it was just luck. I hadn’t fully grasped the underlying problem yet.
💥 What Was Actually Happening
Things took a turn for the interesting. The problem wasn’t the query or the index; it was PostgreSQL’s internals silently failing.
🧨 Problem #1: Table Bloat (The Silent Killer)
PostgreSQL does NOT delete rows immediately. Instead, it marks them as “dead” (because of MVCC).
Over time:
- Dead rows pile up
- Table size explodes
- Query planner gets confused
- Performance tanks
In my case:
- Actual data: ~10GB
- Table size: ~300GB 😳
So when PostgreSQL estimated the cost, it thought:
"Scanning the whole table is cheaper than using the index."
That’s why it switched to sequential scan.
✅ Fix
- Run VACUUM ANALYZE
- Ensure autovacuum is actually working
💡 Better solution (what I implemented later)
Monitor
Tune:
- autovacuum_vacuum_scale_factor
- autovacuum_analyze_scale_factor
🧨 Problem #2: Autovacuum Was Basically Useless
At first I thought:
“Isn’t autovacuum supposed to handle this?”
Yes… but here’s the catch:
- Autovacuum doesn’t run if there are long-running transactions.
And guess what we had? A background job holding a transaction open for hours.
That means:
- Dead rows couldn't be cleaned
- Bloat kept growing
- Performance kept degrading
✅ Fix
Find long-running transactions:
Kill them if needed.
🧨 Problem #3: Migration Lock Hell
Another day, another surprise. We ran a simple migration:
Should take seconds, right?
Nope.
- It got stuck for minutes.
Then everything slowed down.
Then APIs started timing out.
Root cause?
- Migration needed a lock
- Some old transaction was still open
- Lock couldn’t be acquired
And here’s the worst part:
- All new queries started waiting behind the migration
It created a full system blockage.
✅ Fix
Before running migrations:
Also monitor blockers:
🧨 Problem #4: Connection Exhaustion
We also hit this classic:
What happened?
- Each API request opened a DB connection
- One slow query held connections for ~30 seconds
- All connections got used up
Game over.
✅ Fix
We added connection pooling using PgBouncer (transaction mode).
Now:
- Fewer active DB connections
- Better throughput
- More stability under load
🧨 Problem #5: Index Bloat (The Hidden One)
Another confusing issue:
- Index existed
- Query used index
- Still slow
After checking:
- Index size was insanely large.
Why?
Same reason as table bloat — dead entries inside index.
✅ Fix
No downtime.
🧨 Problem #6: Random Latency Spikes
We saw something like:
- 50ms… 50ms… 50ms…
- suddenly 2 seconds 😐
Every few minutes. This one took time to understand.
Root cause: Checkpoints
PostgreSQL periodically flushes data to disk. During that time:
- Disk I/O spikes
- Queries slow down
✅ Fix
Tune:
This spreads disk writes instead of spiking.
💡 What I Changed After This Incident
After going through all this, I stopped treating PostgreSQL like a “set and forget” system.
Here’s what I now do in every project:
🔍 Monitoring (Non-Negotiable)
- pg_stat_activity → active queries
- pg_stat_user_tables → dead tuples
- pg_stat_statements → slow queries
⚙️ Preventive Setup
- Enable and tune autovacuum
- Use PgBouncer
- Set lock_timeout for migrations
- Schedule periodic:
- VACUUM
- REINDEX CONCURRENTLY
🧠 Query Awareness
I no longer just “write queries”. I always check:
- Is it using index?
- Is row estimate correct?
- Is planner making wrong decisions?
If You’re Facing Something Similar…
Let me ask you:
- Did your query suddenly slow down without code changes?
- Is your table size much larger than actual data?
- Are queries randomly fast and slow?
If yes…
You’re probably not dealing with a query problem.
- You’re dealing with a PostgreSQL maintenance problem.
Other Possible Solutions (Beyond What I Did)
Depending on your scale, you can also consider:
- Partitioning: Split large tables → faster scans, better vacuum efficiency
- Read Replicas: Offload heavy reads
- Better Index Strategy: Avoid over-indexing (causes more bloat)
- Archiving Old Data: Keep tables small and manageable
🚀 Final Thought
- PostgreSQL doesn’t usually fail loudly.
- It fails silently… slowly… until one day everything breaks.
- And when it does, it looks like:
- "Nothing changed… but everything is slow."
- That’s the trap.