Kubernetes Upgrades for Startups: A No-Drama Playbook
Every startup has features to ship, revenue to chase, and bugs that won’t fix themselves. Infrastructure work, and Kubernetes upgrades in particular, reliably slides to the backlog. Then one day the EOL notification for your cluster’s version arrives, and with it the news that you are four minor versions behind and everything on the dashboard is red. Suddenly you’re staring at a multi-version leap with breaking API changes, incompatible Helm charts, CRDs you paid for in blood, and a nervous release manager. Upgrading Kubernetes is not like bumping a library in one service; it is surgery on the whole platform, and one slip can bring everything down.
This is the no-drama playbook: field-tested, light on ceremony, and biased toward actions that cut risk and shrink the failure domain.
Core principles
Upgrade small, upgrade often. Respect version-skew policies and don’t hoard minor versions until you’re forced into a multi-step jump. The goal is to turn upgrades into routine hygiene, not annual heroics.
Measure first, then cut. Test on lower environments that mirror production infra (not just app stack). Use metrics and logs as your guide on what will break — not gut feelings.
Rollback is a feature. Have a clear path to “put things back” (etcd snapshot + object/volume backups + traffic failback). If you can’t roll back within minutes, you’re already gambling, not upgrading.
Use guardrails. Enforce rules that prevent old, known mistakes (no deprecated APIs, sane PDBs, no :latest images) from the beginning, not in a postmortem.
Own the compatibility matrix. Treat “K8s <-> CNI <-> ingress <-> autoscaler <-> CSI <-> operators <-> CRDs” as a living file that you update before every move.
Preflight checklist (pasteable into a ticket)
Target version + compatibility matrix. Pin K8s and every critical dependency (CNI, Ingress Controller, Cluster Autoscaler, CSI, operators, CRDs). Note minimums and breaking changes.
Find deprecated API usage via the apiserver_requested_deprecated_apis metric and audit logs; generate a hit list by kind/namespace.
Fix manifests at scale with kubectl convert; lint with pluto and kube-no-trouble (kubent); schema-validate with kubeconform.
Verify CRDs. Serve multiple versions where needed, wire up a conversion webhook, and plan a Storage Version Migration post-upgrade (a quick version check follows after this list).
Review PDBs and deployment strategies (maxUnavailable / maxSurge) for critical workloads.
Prepare backups. Take an etcd snapshot and a Velero backup of objects/PVs.
Book a maintenance window, declare code freeze, set go/no-go and rollback criteria.
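For the CRD item above, a quick audit before you touch anything can look like the sketch below (it assumes cluster-admin access); it lists which versions each CRD declares and which versions are still recorded as stored in etcd:

```bash
# List every CRD with its declared versions and the versions currently recorded as stored in etcd.
kubectl get crd -o custom-columns='NAME:.metadata.name,VERSIONS:.spec.versions[*].name,STORED:.status.storedVersions'
```

Any CRD whose stored versions still include a version you plan to drop needs conversion and a storage migration before that version can be removed.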
Preflight, for real
Let’s look at the practical part. Turn the checklist into a runnable preflight: find deprecations, confirm compatibility, rehearse on a lower environment, set up rollback, and tune the rollout. When these pass, start the upgrade.
Recon: find what will break before it breaks
Watch the metrics:
Add a dashboard for apiserver_requested_deprecated_apis. It tells you exactly which deprecated groups/versions are still in use and how often. Combine that with audit logs of deprecated calls and you have a hit list of objects to fix.
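If Prometheus scrapes the API server, a query along these lines surfaces the offenders; the Prometheus URL below is a placeholder for your own endpoint:

```bash
# Sum deprecated-API hits by group/version/resource, including the release that removes them.
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum by (group, version, resource, removed_release) (apiserver_requested_deprecated_apis)'
```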
Update manifests at scale:
- kubectl convert -f path/ gives you updated API versions for built-in resources.
- pluto and kubent scan your repos and Helm charts for deprecated APIs.
- kubeconform quickly validates YAMLs (including CRDs) against upstream schemas.
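A rough linting pass might look like this; the file names, directory, and target version are placeholders to adapt, and kubectl convert requires the kubectl-convert plugin:

```bash
# Rewrite an old manifest to a current API version (placeholder file name).
kubectl convert -f old-deployment.yaml --output-version apps/v1 > deployment.yaml

# Scan static manifests and in-cluster Helm releases for deprecated APIs.
pluto detect-files -d manifests/ --target-versions k8s=v1.29.0
pluto detect-helm -o wide
kubent --target-version 1.29.0

# Validate everything against upstream schemas for the target version.
kubeconform -strict -summary -kubernetes-version 1.29.0 manifests/
```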
Block regressions from the start:
Use Kyverno or Gatekeeper (OPA) policies to deny deprecated API versions, reject :latest images, and require PDBs on certain labels. If it can’t be applied, it can’t break your upgrade.
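As one example, a Kyverno ClusterPolicy along these lines blocks :latest tags at admission time. This is a sketch based on Kyverno’s sample policies; adjust the match scope and failure action before enforcing it:

```bash
kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce   # start with Audit for a dry run
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
EOF
```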
Compatibility is not just Kubernetes
CNI and Ingress. Calico/Cilium/others and Nginx/Envoy/AWS/GCP ingress controllers all have minimum supported K8s versions and their own upgrade notes with deprecated flags and annotation behaviours. Pin versions, and upgrade these components ahead of the control plane when the target K8s version requires newer releases of them.
Cluster Autoscaler. Keep it aligned with your cluster version. Out-of-sync autoscalers can behave weirdly under real traffic or ignore node groups entirely.
Operators. Upgrade them to versions that support the target K8s and CRD versions, verify their validating/mutating webhooks, RBAC and leader‑election settings, and follow their upgrade path before touching workloads.
Runtimes & node images. Standardize on containerd, keep AMIs/base images current, and ensure kubelet/kube-proxy image compatibility.
Staging that helps
Common situation: the "stage" cluster mirrors the app but not the platform. Fix: bring staging onto the same controller stack, CNI, node images, autoscaler settings, and ingress providers as production.
On that staging:
- Run e2e/integration tests.
- Run load tests; watch scheduler queue times, kube-proxy health, CoreDNS latency, eviction counts, and Pending pods.
- Kill a node on purpose; ensure PDBs and the autoscaler do what you expect them to do.
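A crude but effective drill, with the node selection left as a placeholder; the point is to watch PDBs and the autoscaler react, not to be graceful:

```bash
# Pick a victim node, drain it, and watch what happens to the workloads it hosted.
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=5m
kubectl get pods -A --field-selector=status.phase=Pending   # should empty out as replacements land
kubectl get pdb -A                                          # ALLOWED DISRUPTIONS should never paint you into a corner
kubectl uncordon "$NODE"
```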
Backup and rollback
It takes thirty minutes and you will never regret it:
- etcd snapshot before control-plane upgrades (or follow your provider’s control-plane snapshot procedure).
- Velero backup of cluster objects and PVs.
- Traffic failback plan: keep a known-good node pool (or a sibling cluster) warm and be ready to switch traffic if SLOs degrade beyond thresholds.
Avoid Schrödinger's backup: prove you can actually perform a restore before moving on.
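For a self-managed control plane, the backup pass can be as small as the sketch below; the etcd cert paths follow kubeadm defaults and the backup names are placeholders, so adjust for your provider and Velero storage location:

```bash
# Snapshot etcd directly on a control-plane node.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-pre-upgrade.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Back up cluster objects and volumes, then rehearse a restore into a scratch namespace.
velero backup create pre-upgrade --include-namespaces '*' --snapshot-volumes
velero restore create --from-backup pre-upgrade --namespace-mappings prod:prod-restore-test
```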
Managed clusters
GKE/AKS/EKS support surge/rolling upgrades for node pools. Tune maxSurge and maxUnavailable to move faster without violating PDBs. Start with the canary pool; expand if SLOs hold. On large fleets, confirm that HPA/Cluster Autoscaler can backfill under load, then raise surge gradually to shorten the window without lighting up PagerDuty.
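The knobs differ per provider; the commands below are illustrative sketches with placeholder cluster, zone, and pool names:

```bash
# GKE: surge upgrades on a node pool: add one extra node at a time, keep zero unavailable.
gcloud container node-pools update default-pool \
  --cluster my-cluster --zone us-central1-a \
  --max-surge-upgrade 1 --max-unavailable-upgrade 0

# EKS managed node group: cap how many nodes may be unavailable during the rollout.
aws eks update-nodegroup-config \
  --cluster-name my-cluster --nodegroup-name workers \
  --update-config maxUnavailable=1
```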
The playbook (step by step)
0) Preparation (a week out)
Lock the target K8s and component versions; fill the compatibility matrix with links to release notes and “min supported” notes.
Add the deprecated-APIs dashboard + alerts.
Run kubectl convert, pluto, kubent, and kubeconform across all repos/charts; open tickets for required fixes.
Audit CRDs: confirm multiple served versions and conversion webhooks where needed; plan the storage migration for after the upgrade.
Validate PDBs and deployment strategies by actually draining in staging.
Take etcd + Velero backups (and make sure you know how to restore).
1) Trial run (staging)
Execute the upgrade procedure on staging exactly as you plan to run it in prod, and time it.
Force-test the bad scenarios: kill a node, spike CPU, degrade DNS, pause the autoscaler (be realistic and creative). Adjust the upgrade plan where reality differs.
Record metrics and logs as a “green baseline” to compare against during the prod run.
2) Upgrade (prod)
Code freeze. Announce the plan. Open a “war room” with clear roles and a single timeline owner.
Final backups: fresh etcd snapshot + Velero.
Upgrade order: control plane -> workers. On self-managed kubeadm: drain -> upgrade kubelet -> uncordon, one node at a time (a per-node sketch follows after this list).
Node waves: start with the canary pool, watch SLOs, then roll remaining pools with tuned surge/parallelism.
Observe everything: user SLOs first (latency/error rate), then platform (CoreDNS, kube-proxy, scheduler, etcd alarms, eviction rates, event flood).
Rollback criteria: define thresholds; e.g., p95 latency +X% for Y minutes, error rate > Z% -> immediate stop and rollback. Don't negotiate this in the moment.
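For self-managed kubeadm workers, the per-node loop referenced above looks roughly like this; the node name and package version are placeholders, and the apt commands assume Debian-style packages:

```bash
# From an admin machine: evict workloads from the node.
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# On the node itself: upgrade kubeadm, apply the node upgrade, then the kubelet.
sudo apt-get update
sudo apt-get install -y kubeadm='1.29.x-*'              # replace x with the target patch release
sudo kubeadm upgrade node
sudo apt-get install -y kubelet='1.29.x-*' kubectl='1.29.x-*'
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# Back on the admin machine: return the node to service and watch it go Ready.
kubectl uncordon worker-1
kubectl get nodes -w
```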
3) Aftercare
Take a new etcd snapshot on the upgraded control plane (so DR matches reality).
Run the storage version migration for CRDs if applicable (a minimal sketch follows after this list).
Re-run pluto/kubent to catch any remaining deprecated resources.
Update the compatibility matrix and the runbook with “we learned / next time” notes.
Remove temporary canary rules and surge overrides if you changed them.
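One pragmatic way to do that storage migration, if you don’t run the storage-version-migrator, is a no-op rewrite of every object of the affected CRD; the resource name below is a placeholder:

```bash
# Read every object and write it straight back, so the API server re-encodes it
# at the current storage version.
kubectl get widgets.example.com -A -o json | kubectl replace -f -

# Afterwards, confirm status.storedVersions no longer lists the retired version
# before you patch it out of the CRD.
kubectl get crd widgets.example.com -o jsonpath='{.status.storedVersions}'
```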
Common traps
Everything is green until load spikes. Your PDBs are too strict or the autoscaler/controller versions don’t match. Rehearse on staging under load and confirm node-group scaling actually works.
CRD objects decode incorrectly after the upgrade. You skipped conversion or storage migration. Serve multiple versions and migrate stored objects; don’t leave etcd holding objects written in old versions.
Ingress “mostly works”. You upgraded the cluster without bumping the ingress controller to a version that supports the new APIs. Upgrade ingress first and validate annotations/flags.
OS/image drift. Nodes with stale images or mixed runtimes look fine in daylight — then boom at 3 AM. Standardize on containerd; automate security patches with controlled reboots (kured).
Final notes
No one pays you for “running the latest Kubernetes”. Business pays for uptime and for product delivery. Regular, small upgrades are the cheapest way to buy both. Either you invest a few predictable hours in a ritual, or you face a storm once a year where one bad move costs the company a day of revenue and costs you a week of sleep.
My short recipe:
Treat upgrades as a process, not an event.
Good preparation makes upgrades safe and predictable — invest in preflight, trial runs, and a rollback plan.
Monitor deprecated APIs, identify the callers via audit logs, and block regressions with policies and linters.
Keep DR one click away (etcd/Velero).
Move little and often so upgrades never block the business.
A stable platform is a startup multiplier. Fewer flaky incidents, fewer 3 AM war rooms, and more time building the thing you’re actually here to build. Upgrade small and often — until it’s boring.
Boring infra wins.