My PostgreSQL Wake-Up Call: A 50ms Query Turned Into 30 Seconds

February 12, 2026

Let me tell you about a bug that made me question everything I thought I knew about databases.

It started like any normal day.

Initially, the dashboard loads quickly and seems to work fine, with query times around 50ms and normal CPU temperatures, keeping the reader interested in the normal state before the issue arises.

On next day, same dashboard and query, but now the query takes 30 seconds, and the CPU is overheating. Users are already complaining, and here’s the strange thing: we haven’t deployed anything, and traffic was normal. So what changed?

🔍 Step 1: The Investigation

First instinct — check the query.

I ran:

And boom:

Hold on a minute… what?

Why is it performing a sequential scan? There’s already an index on user_id. This doesn’t make sense.

⚡ The “Magic Fix” That Felt Like Luck

After some digging (and a bit of panic-Googling), I ran:

And just like that, the query time returned to 50ms and the CPU performance normalised. It felt like magic but honestly, it was just luck. I hadn’t fully grasped the underlying problem yet.

💥 What Was Actually Happening

Things took a turn for the interesting. The problem wasn’t the query or the index; it was PostgreSQL’s internals silently failing.

🧨 Problem #1: Table Bloat (The Silent Killer)

PostgreSQL does NOT delete rows immediately. Instead, it marks them as “dead” (because of MVCC).

Over time:

Dead rows pile up
Table size explodes
Query planner gets confused
Performance tanks

In my case:

Actual data: ~10GB
Table size: ~300GB 😳

So when PostgreSQL estimated the cost, it thought:

"Scanning the whole table is cheaper than using the index."

That’s why it switched to sequential scan.

✅ Fix

Run VACUUM ANALYZE
Ensure autovacuum is actually working

💡 Better solution (what I implemented later)

Monitor

Tune:

autovacuum_vacuum_scale_factor
autovacuum_analyze_scale_factor

🧨 Problem #2: Autovacuum Was Basically Useless

At first I thought:

“Isn’t autovacuum supposed to handle this?”

Yes… but here’s the catch:

Autovacuum doesn’t run if there are long-running transactions.

And guess what we had? A background job holding a transaction open for hours.

That means:

Dead rows couldn't be cleaned
Bloat kept growing
Performance kept degrading

✅ Fix

Find long-running transactions:

Kill them if needed.

🧨 Problem #3: Migration Lock Hell

Another day, another surprise. We ran a simple migration:

Should take seconds, right?

Nope.

It got stuck for minutes.

Then everything slowed down.

Then APIs started timing out.

Root cause?

Migration needed a lock
Some old transaction was still open
Lock couldn’t be acquired

And here’s the worst part:

All new queries started waiting behind the migration

It created a full system blockage.

✅ Fix

Before running migrations:

Also monitor blockers:

🧨 Problem #4: Connection Exhaustion

We also hit this classic:

What happened?

Each API request opened a DB connection
One slow query held connections for ~30 seconds
All connections got used up

Game over.

✅ Fix

We added connection pooling using PgBouncer (transaction mode).

Now:

Fewer active DB connections
Better throughput
More stability under load

🧨 Problem #5: Index Bloat (The Hidden One)

Another confusing issue:

Index existed
Query used index
Still slow

After checking:

Index size was insanely large.

Why?

Same reason as table bloat — dead entries inside index.

✅ Fix

No downtime.

🧨 Problem #6: Random Latency Spikes

We saw something like:

50ms… 50ms… 50ms…
suddenly 2 seconds 😐

Every few minutes. This one took time to understand.

Root cause: Checkpoints

PostgreSQL periodically flushes data to disk. During that time:

Disk I/O spikes
Queries slow down

✅ Fix

Tune:

This spreads disk writes instead of spiking.

💡 What I Changed After This Incident

After going through all this, I stopped treating PostgreSQL like a “set and forget” system.

Here’s what I now do in every project:

🔍 Monitoring (Non-Negotiable)

pg_stat_activity → active queries
pg_stat_user_tables → dead tuples
pg_stat_statements → slow queries

⚙️ Preventive Setup

Enable and tune autovacuum
Use PgBouncer
Set lock_timeout for migrations
Schedule periodic:
- VACUUM
- REINDEX CONCURRENTLY

🧠 Query Awareness

I no longer just “write queries”. I always check:

Is it using index?
Is row estimate correct?
Is planner making wrong decisions?

If You’re Facing Something Similar…

Let me ask you:

Did your query suddenly slow down without code changes?
Is your table size much larger than actual data?
Are queries randomly fast and slow?

If yes…

You’re probably not dealing with a query problem.

You’re dealing with a PostgreSQL maintenance problem.

My PostgreSQL Wake-Up Call: A 50ms Query Turned Into 30 Seconds

🔍 Step 1: The Investigation

🧨 Problem #1: Table Bloat (The Silent Killer)

🧨 Problem #2: Autovacuum Was Basically Useless

🧨 Problem #3: Migration Lock Hell

🧨 Problem #4: Connection Exhaustion

🧨 Problem #5: Index Bloat (The Hidden One)

🧨 Problem #6: Random Latency Spikes

💡 What I Changed After This Incident

Other Possible Solutions (Beyond What I Did)