How “Safe” Defaults Blow Up Production
Incidents rarely need exotic triggers or edge cases. More often, human factors lead to configuration choices, including the decision to stick with defaults, that cause serious damage. We tend to treat defaults as "obviously working" and therefore safe: surely someone thought this through, so we can follow the beaten path. But where does that paved road lead? Are those defaults helpful or harmful?
With that in mind, a few ground rules.
General Rules of Thumb
If a value is sensible for 80% or more of users, make it the default. If no single value fits the majority, do not set a default at all; force an explicit choice.
Reasonable defaults prevent tons of stupid mistakes. Fewer people touch the config when it already works for them.
Design for human beings under pressure. Default values should be safe when people are tired, stressed, or just lack the necessary knowledge in a particular area; they exist to protect people from making mistakes.
Bad defaults are worse than no defaults. A wrong paved road silently guides everyone over a cliff.
Defaults in the Wild (unexpected consequences)
1. CI branch filter omitted -> feature branch hit production
Setting. Friday, lunchtime. We keep a manual prod deployment job for hotfixes. It isn't tied to a branch (no main-only guard) and defaults to the commit/artifact of the page you run it from. From the feature/refactor-pricing branch, someone clicked Run after a clean test run.
Observations. The canary silently takes 10% of traffic. Error graphs bloom: the new code expects a feature flag that isn’t enabled in prod. Average order value plunges; dozens of $0-price carts appear in logs; Slack wakes up.
What happened. Because the prod job was not bound to a branch, it defaulted to the commit on the page it was launched from.
Lesson. Defaults for CI/CD prod jobs must be prohibitive: an explicit branch allow-list, environment approvals, and canary shifts gated on config/flags. The 10% canary did what it should, but the missing branch guard and approval turned a regular feature build into a surprise prod release.
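As an illustration, here is a minimal pre-deploy guard sketch in Python. It assumes a GitLab-style CI_COMMIT_BRANCH variable and a DEPLOY_APPROVED flag set by a protected environment; both names are illustrative, so adapt them to your CI system.

```python
# deploy_guard.py - refuse to deploy unless the branch is explicitly allowed.
# Assumes a GitLab-style CI_COMMIT_BRANCH env var and a hypothetical
# DEPLOY_APPROVED flag set by a protected environment/approval step.
import os
import sys

ALLOWED_BRANCHES = {"main"}  # explicit allow-list, not an implicit default

def main() -> int:
    branch = os.getenv("CI_COMMIT_BRANCH", "")
    if branch not in ALLOWED_BRANCHES:
        print(f"Refusing to deploy: branch '{branch or '<unknown>'}' is not in {sorted(ALLOWED_BRANCHES)}")
        return 1
    if os.getenv("DEPLOY_APPROVED") != "true":
        print("Refusing to deploy: no explicit approval flag set")
        return 1
    print(f"Guard passed: deploying {branch}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run it as the first step of the prod job so the safe outcome (refusing to deploy) is the one you get when someone launches the job from the wrong page.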
2. Flag service outage -> experiment went to 100%
Setting. Cart-personalization A/B at 5%: a new checkout path (personalized pricing/recommendations), gated by a client-side feature flag and targeted at a small cohort. Our flag provider blips for a few minutes. The client SDK is fail-open: when the provider is unreachable, it treats flags as ON.
Observations. p95 latency +120 ms, cache miss rate +30%, and cloud costs went to the moon. Support tickets: "the cart is slow".
What happened. A well‑meant "keep the UI working" fallback quietly enrolled the entire user base as test subjects.
Lesson. Fallbacks must fail closed by default: fall back to the last-known safe state, never to "all flags on".
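A minimal sketch of a fail-closed flag client, assuming a generic provider object with a get_flag(name) method; the method and flag names are illustrative, not a specific SDK.

```python
# A fail-closed wrapper around a generic flag provider client (names are
# illustrative). On provider errors it returns the last value that was
# successfully fetched, or the hard-coded safe default -- never "everything on".
from typing import Any, Dict

class SafeFlagClient:
    def __init__(self, provider: Any, safe_defaults: Dict[str, bool]):
        self._provider = provider            # assumed to expose get_flag(name) -> bool
        self._safe_defaults = dict(safe_defaults)
        self._last_known: Dict[str, bool] = {}

    def is_enabled(self, name: str) -> bool:
        try:
            value = bool(self._provider.get_flag(name))
        except Exception:
            # Fail closed: last-known safe state first, then the explicit safe default (OFF).
            return self._last_known.get(name, self._safe_defaults.get(name, False))
        self._last_known[name] = value
        return value

# Usage sketch:
# flags = SafeFlagClient(provider=my_provider, safe_defaults={"personalized_checkout": False})
# if flags.is_enabled("personalized_checkout"): ...
```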
3. Empty partition key -> one broker melted
Setting. An internal publisher added a partition_key field with a default value of the empty string. Most callers left it blank. Hashing "" routed every event to the same partition on a single broker.
Observations. One partition ran 10x hotter; the hot broker was squeezed dry: CPU maxed, disks thrashing. Producers timed out, and consumer lag on the hot shard grew into hours.
What happened. The empty string was treated as a valid key. A zero-effort default that turned the cluster into a single-partition system.
Lesson. Reject ambiguous values ("", *, null) and fall back to a safe partitioning strategy (sticky/round-robin) by default.
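A sketch of a publish wrapper that enforces this, assuming a kafka-python-style producer where send(topic, value=..., key=...) is available and a missing key lets the client's partitioner balance the load.

```python
# A thin publish wrapper that rejects ambiguous partition keys instead of hashing them.
# Assumes a kafka-python-style producer (send(topic, value=..., key=...)); with key=None
# the client's default partitioner spreads events across partitions (round-robin/sticky).
AMBIGUOUS_KEYS = {"", "*", "null"}

def publish(producer, topic: str, value: bytes, partition_key=None):
    if partition_key in AMBIGUOUS_KEYS:
        raise ValueError(f"Ambiguous partition key {partition_key!r}: pass a real key or omit it")
    # No key at all: let the partitioner balance the load instead of hashing a sentinel.
    key_bytes = partition_key.encode("utf-8") if partition_key is not None else None
    return producer.send(topic, value=value, key=key_bytes)
```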
4. Airflow catchup=True -> surprise two‑year backfill
Setting. A new DAG is created with a start_date two years in the past and no explicit catchup. Airflow's default (catchup=True) queues 750 historical runs at the first scheduler tick.
Observations. The upstream provider's API rate-limits us, warehouse load jumps 4x, and the on-call team spends a magical night chasing phantom backfills.
What happened. A convenience default ("always catch up") becomes a time bomb when start_date is historical.
Lesson. Backfill should be opt-in, with a visible preview ("this change enqueues N runs"), strict parallelism limits, and guard windows for historical jobs.
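A sketch of what the explicit version looks like in an Airflow 2.x-style DAG; the dag_id, schedule, dates, and task are illustrative.

```python
# A DAG skeleton that makes backfill behaviour explicit instead of inherited.
# catchup=False and max_active_runs are the important lines; everything else is placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow 2.3+; older versions use DummyOperator

with DAG(
    dag_id="pricing_daily",
    start_date=datetime(2023, 1, 1),   # a historical start_date is fine...
    schedule_interval="@daily",
    catchup=False,                     # ...because backfill is now a deliberate, explicit action
    max_active_runs=1,                 # cap parallelism if someone does trigger a backfill
) as dag:
    EmptyOperator(task_id="extract")
```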
Some will say the cases above look dumb. They’re right. This is exactly what happens in the real world.
And I am sure each of you can recall more than one story where default settings played a cruel joke on you.
Defaults — Quick Reminder
Implicit "all" scope -> Require explicit scope; dry‑run on; hard caps.
Production defaults to write access -> Read‑only by default.
Fail‑open fallbacks (flags/auth/config) -> Fail closed with last‑known safe state; explicit override only.
Ambiguous sentinels ("", *, null) -> Reject; choose safe strategies (sticky/round‑robin; explicit include lists).
No timeouts / unlimited retries -> Conservative timeouts; bounded retries; circuit breakers (see the sketch after this list).
Short backup retention -> Tiered retention; regular restore checks.
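For the timeouts/retries row, a sketch of conservative client defaults using requests and urllib3; the specific numbers are illustrative starting points, not recommendations for every service.

```python
# Conservative HTTP client defaults: explicit timeouts and a bounded retry budget.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    retry = Retry(
        total=3,                                      # hard cap, not "retry forever"
        backoff_factor=0.5,                           # exponential backoff between attempts
        status_forcelist=(502, 503, 504),             # only retry transient server errors
        allowed_methods=frozenset({"GET", "HEAD"}),   # never blindly retry writes
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

# Every call gets an explicit (connect, read) timeout -- there is no safe "no timeout" default.
# response = make_session().get("https://api.example.com/health", timeout=(3.05, 10))
```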
Final Thoughts
Bad defaults are worse than no defaults. Use the 80% rule, fail closed, require explicit scopes, make defaults visible (previews, counts, next run times), and instrument default‑driven risk (partition skew, mass‑delete attempts, backfill volume).
Good defaults make life much easier and protect us from many mistakes. Take the time to choose them carefully.
Bonus story
Back in 2020 I arrived in Georgia by ferry and had to spend a 10-day quarantine in a "government-certified COVID hotel." Mine was in Shekvetili, right on the seaside with a gorgeous view. The Wi-Fi, however, barely worked, and mobile data didn't reach at all because the beach's magnetic sand acted like a natural signal jammer.
Thanks to sheer luck (and the local sysadmin), the hotel’s network hardware still used default credentials. A quick nmap and some googling, and I was in. I set QoS to prioritize the MAC addresses of my devices and finally got a relatively stable connection. To cut the boredom and maintain sanity while sitting in a hotel room for 10 days in a row, I even wandered into the CCTV console.
Being a decent guest, I rolled everything back before checkout, so future visitors would enjoy internet access as it was.