My PostgreSQL Wake-Up Call: A 50ms Query Turned Into 30 Seconds

February 12, 2026

PostgresSQL Performance Tuning

Let me tell you about a bug that made me question everything I thought I knew about databases.

It started like any normal day.

Initially, the dashboard loads quickly and seems to work fine, with query times around 50ms and normal CPU temperatures, keeping the reader interested in the normal state before the issue arises.

On next day, same dashboard and query, but now the query takes 30 seconds, and the CPU is overheating.  Users are already complaining, and here’s the strange thing: we haven’t deployed anything, and traffic was normal.  So what changed?

🔍 Step 1: The Investigation

First instinct — check the query.

I ran:

And boom:

Hold on a minute… what?

Why is it performing a sequential scan?  There’s already an index on user_id. This doesn’t make sense.

⚡ The “Magic Fix” That Felt Like Luck

After some digging (and a bit of panic-Googling), I ran:

And just like that, the query time returned to 50ms and the CPU performance normalised. It felt like magic but honestly, it was just luck. I hadn’t fully grasped the underlying problem yet.


💥 What Was Actually Happening

Things took a turn for the interesting. The problem wasn’t the query or the index; it was PostgreSQL’s internals silently failing.

🧨 Problem #1: Table Bloat (The Silent Killer)

PostgreSQL does NOT delete rows immediately. Instead, it marks them as “dead” (because of MVCC).

Over time:

  • Dead rows pile up
  • Table size explodes
  • Query planner gets confused
  • Performance tanks

In my case:

  • Actual data: ~10GB
  • Table size: ~300GB 😳

So when PostgreSQL estimated the cost, it thought:

"Scanning the whole table is cheaper than using the index."

That’s why it switched to sequential scan.

Fix

  • Run VACUUM ANALYZE
  • Ensure autovacuum is actually working

💡 Better solution (what I implemented later)

Monitor

Tune:

  • autovacuum_vacuum_scale_factor
  • autovacuum_analyze_scale_factor

🧨 Problem #2: Autovacuum Was Basically Useless

At first I thought:

“Isn’t autovacuum supposed to handle this?”

Yes… but here’s the catch:

  • Autovacuum doesn’t run if there are long-running transactions.

And guess what we had? A background job holding a transaction open for hours.

That means:

  • Dead rows couldn't be cleaned
  • Bloat kept growing
  • Performance kept degrading

Fix

Find long-running transactions:

Kill them if needed.


🧨 Problem #3: Migration Lock Hell

Another day, another surprise. We ran a simple migration:

Should take seconds, right?

Nope.

  • It got stuck for minutes.

Then everything slowed down.

Then APIs started timing out.

Root cause?

  • Migration needed a lock
  • Some old transaction was still open
  • Lock couldn’t be acquired

And here’s the worst part:

  • All new queries started waiting behind the migration

It created a full system blockage.

Fix

Before running migrations:

Also monitor blockers:


🧨 Problem #4: Connection Exhaustion

We also hit this classic:

What happened?

  • Each API request opened a DB connection
  • One slow query held connections for ~30 seconds
  • All connections got used up

Game over.

Fix

We added connection pooling using PgBouncer (transaction mode).

Now:

  • Fewer active DB connections
  • Better throughput
  • More stability under load

🧨 Problem #5: Index Bloat (The Hidden One)

Another confusing issue:

  • Index existed
  • Query used index
  • Still slow

After checking:

  • Index size was insanely large.

Why?

Same reason as table bloat — dead entries inside index.

Fix

No downtime.


🧨 Problem #6: Random Latency Spikes

We saw something like:

  • 50ms… 50ms… 50ms…
  • suddenly 2 seconds 😐

Every few minutes. This one took time to understand.

Root cause: Checkpoints

PostgreSQL periodically flushes data to disk. During that time:

  • Disk I/O spikes
  • Queries slow down

Fix

Tune:

This spreads disk writes instead of spiking.


💡 What I Changed After This Incident

After going through all this, I stopped treating PostgreSQL like a “set and forget” system.

Here’s what I now do in every project:

🔍 Monitoring (Non-Negotiable)

  • pg_stat_activity → active queries
  • pg_stat_user_tables → dead tuples
  • pg_stat_statements → slow queries

⚙️ Preventive Setup

  • Enable and tune autovacuum
  • Use PgBouncer
  • Set lock_timeout for migrations
  • Schedule periodic:
    • VACUUM
    • REINDEX CONCURRENTLY

🧠 Query Awareness

I no longer just “write queries”. I always check:

  • Is it using index?
  • Is row estimate correct?
  • Is planner making wrong decisions?

If You’re Facing Something Similar…

Let me ask you:

  • Did your query suddenly slow down without code changes?
  • Is your table size much larger than actual data?
  • Are queries randomly fast and slow?

If yes…

You’re probably not dealing with a query problem.

  • You’re dealing with a PostgreSQL maintenance problem.

Other Possible Solutions (Beyond What I Did)

Depending on your scale, you can also consider:

  • Partitioning: Split large tables → faster scans, better vacuum efficiency
  • Read Replicas: Offload heavy reads
  • Better Index Strategy: Avoid over-indexing (causes more bloat)
  • Archiving Old Data: Keep tables small and manageable

🚀 Final Thought

  • PostgreSQL doesn’t usually fail loudly.
  • It fails silently… slowly… until one day everything breaks.
  • And when it does, it looks like:
  • "Nothing changed… but everything is slow."
  • That’s the trap.