A Case Study: When Staging Takes Down Prod

A week or two ago I came across a post on Reddit titled “Today I caused a production incident with a stupid bug”.

The engineer describes a harmless refactor around Redis serialization that accidentally corrupts shared cache data. And because staging and production used the same Redis instance, when staging wrote incompatible data, every production server suddenly failed to deserialize configuration, fell back to the database, and triggered a spike of full-scan queries.

Mistakes like this are not rare. If the correctness of your production environment depends on every single engineer never making a slip, the system is already broken.

High reliability engineering starts from two core assumptions:

  1. People will make mistakes

  2. Systems must be designed so those mistakes don’t become incidents

This case is a perfect example of what happens when those assumptions are not built into the architecture and the process.

The Short Version

  • A refactor moved Avro deserialization off a shared event loop into the caller thread (a good idea)

  • During the change the engineer accidentally used the wrong serializer

  • The error appeared once locally; after the corrupted format was written to Redis, it didn’t reproduce again

  • The dev environment was logging errors, but alerting was broken

  • The change was deployed to staging

  • Staging shared the same Redis as production (here we could end our story)

  • Staging wrote incompatible objects

  • Production failed to deserialize them and fell back to DB

  • DB CPU hit 100%, full scans flooded it, p99 latency spiked

The team reacted quickly and removed the staging server from rotation; the system then recovered.

Where Things Went Wrong

1. Staging was not isolated from production

This is the most dangerous (and most common) anti-pattern in the entire story. If staging can mutate production state, there is no staging — it’s another production with lower traffic. Well-designed systems isolate environments by default.

2. Cache formats were not versioned or validated

Serializer mistakes happen. Refactors happen. Mistakes happen.

Without versioned formats, compatibility checks, integration tests, or controlled rollouts of data mutations, any change can become production-breaking.

3. Alerting didn’t catch the early signals

Dev was logging errors, but the alerts were never sent. Prod suffered a slow degradation, but without strong SLO-based monitoring it was up to humans to notice.

SRE basics:

  • Alert on symptoms (latency / error rate / DB load), not causes

  • Treat broken alerts as a P0 incident — it means you’re flying blind

4. There were no database fallback guardrails

After cache reads failed, production services immediately hammered the database. Circuit breakers, throttling, or even a “fail closed” path could have limited the damage. Instead, every reader fell back all at once — the classic cascading-failure pattern.
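To make the idea concrete, here is a minimal circuit-breaker sketch in Python; the thresholds, the synchronous style, and the function names are assumptions for illustration, not details from the incident.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after enough consecutive failures, stop calling
    the protected function for a cool-down period instead of adding more load."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: not calling the database")
            # Cool-down over: allow one trial call ("half-open").
            self.opened_at = None
            self.failures = self.failure_threshold - 1  # one more failure reopens it
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping the fallback read in something like breaker.call(load_config_from_db, key) (a hypothetical function) means that once the database starts failing or timing out, most readers fail fast instead of piling on.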

5. The company relied on engineer perfection

The author mentions that the company does not do detailed code reviews and has almost no tests. It's the classic “we move fast” culture that works until it doesn’t.

An engineer’s job is not to write perfect code. It’s to build systems where imperfect code doesn’t take production down.

How It Could Be Prevented

Isolate environments. Staging should never be able to modify production data. Key namespaces or separate Redis instances do the trick.
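A minimal sketch of isolation by default in Python with redis-py; the endpoints and the APP_ENV variable are invented placeholders.

```python
import os

import redis  # redis-py client

# Hypothetical per-environment endpoints: staging physically cannot reach
# the production instance, so a bad write stays inside staging.
REDIS_URLS = {
    "production": "redis://redis-prod.internal:6379/0",
    "staging": "redis://redis-staging.internal:6379/0",
}

client = redis.Redis.from_url(REDIS_URLS[os.environ["APP_ENV"]])

# Weaker but still useful fallback if separate instances aren't an option:
# prefix every key with the environment so writes can never collide.
def cache_key(name: str) -> str:
    return f"{os.environ['APP_ENV']}:{name}"
```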

Version cache schemas. Readers must accept both old and new formats during rollouts. Deploy new readers first, then new writers.
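One possible shape for this, sketched in Python with plain JSON (the real system used Avro) and invented field names: wrap every cached value in a small envelope that carries an explicit schema version, and make the reader understand both the previous and the current version.

```python
import json

SCHEMA_VERSION = 2

def encode(payload: dict) -> bytes:
    # Always write the current schema version alongside the data.
    return json.dumps({"v": SCHEMA_VERSION, "data": payload}).encode("utf-8")

def decode(raw: bytes) -> dict:
    envelope = json.loads(raw)
    version = envelope.get("v", 1)  # treat legacy unversioned entries as v1
    if version == 1:
        return _upgrade_v1(envelope["data"])
    if version == 2:
        return envelope["data"]
    # Unknown version: surface it as a cache miss, not a crash loop.
    raise LookupError(f"unsupported cache schema version: {version}")

def _upgrade_v1(data: dict) -> dict:
    # Illustrative migration: the old format stored the config fields flat.
    return {"config": data}
```

With readers like this deployed everywhere first, turning on writers that emit version 2 becomes a boring, ordinary rollout.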

SLO-driven alerting. p95/p99 latency, error budget burn, DB pressure — these catch issues early.
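As a rough illustration of error-budget burn in Python; the 99.9% target and the 14x page threshold are illustrative numbers, not a recommended policy.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float = 0.999) -> float:
    """How fast the error budget is being spent in the observed window:
    1.0 means exactly on budget, much higher means it will run out early."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_rate / error_budget

# Example: 2% of requests failing against a 99.9% SLO burns budget 20x too fast.
if burn_rate(bad_events=2_000, total_events=100_000) > 14:
    print("PAGE: error budget burning too fast, someone should look now")
```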

DB fallback must be guarded. Never allow unlimited DB reads under cache failure. Add rate limits, circuit breakers, shadow reads, etc.
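Here is a minimal token-bucket sketch in Python that caps how many cache-miss reads per second may reach the database; the limits and the get_config/db.load names are assumptions for illustration.

```python
import threading
import time

class TokenBucket:
    """Caps how many cache-miss reads per second are allowed to hit the DB."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens for the elapsed time, up to the burst capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

db_fallback_limiter = TokenBucket(rate_per_s=50, burst=100)

def get_config(key: str, cache, db):
    cached = cache.get(key)
    if cached is not None:
        return cached
    if not db_fallback_limiter.allow():
        # Shed load instead of joining the thundering herd on the database.
        raise RuntimeError("DB fallback rate limit exceeded; serve stale or fail")
    return db.load(key)
```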

Human error is expected — build systems for it.

The Takeaway

The system failed to protect itself from a completely expected and ordinary human mistake. And that’s the core of pragmatic SRE thinking: human error is normal; design your system and processes for the reality that mistakes happen.

This case study is a reminder that resilience isn’t about preventing bugs — it’s about ensuring they never become outages.
