Progressive delivery: canary with automated rollback using Flagger

Canary releases used to be an “experimental practice” that only a few teams committed to. Setting up traffic shifting, collecting metrics, comparing them by hand, and deciding whether to promote or roll back was separate work that rarely got finished. Flagger takes that work off people’s plates: you add one Canary custom resource (CR) on top of an ordinary Deployment, and from there the controller walks the release forward step by step, checks metrics at each step, and decides on its own whether to promote or roll back.

Table of contents

The problem: a rollout nobody checks
What Flagger does
Three strategies — which to pick
Metric analysis: how “bad” is defined
Traffic shifting without a mandatory service mesh
Webhooks: a load test on every step
Wiring it into Flux and GitOps
Comparison at a glance
What you need to turn on canary with auto-rollback
How to verify the rollback actually works
Common pitfalls
Bottom line

The problem: a rollout nobody checks

A standard Kubernetes Deployment with a RollingUpdate strategy does one thing: it gradually swaps old pods for new ones. It knows nothing about error rate, latency, or business metrics. The only gate is pod readiness: the rollout advances in batches sized by maxSurge and maxUnavailable (25% each by default), and once a new pod passes its readiness probe, nothing at any step asks “is this still okay?”

If the new version has a bug that only shows up under load, then by the time alerts or user complaints surface it, all the traffic is already on the new version. There are no running pods to fall back to: the previous ReplicaSet has been scaled to zero, so a rollback means starting the old version’s pods again, which turns what could have been a seconds-long traffic switch into minutes of degraded service.
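For reference, these are the only knobs a RollingUpdate exposes. The manifest below is a minimal sketch: the name web, the image, and the /healthz path are assumptions for illustration, and the percentages are the Kubernetes defaults written out explicitly.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical app name
  namespace: prod
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # extra pods allowed above the desired count
      maxUnavailable: 25%    # pods allowed to be missing during the update
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:    # the only gate between batches
            httpGet:
              path: /healthz # assumed health endpoint
              port: 8080
```

Nothing in this spec can refer to an error rate or a latency percentile, which is the gap the rest of this post is about.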

Figure: a plain rollout with no metric analysis; the error is visible only after 100%.

What Flagger does

Flagger is a controller from the Flux (CNCF) ecosystem that takes on exactly what a bare Deployment is missing: gradual traffic shifting and a metrics-driven decision about whether to continue. Everything is managed through one custom resource, Canary.

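Here’s how it works. When the Canary is first applied, Flagger creates a copy of your Deployment named <name>-primary, sends all traffic to it, and scales the original Deployment to zero. From then on you deploy as usual (update the image in the Deployment). Flagger notices the change and, instead of letting the new pods replace the old ones, scales the updated Deployment up as the canary next to the primary. It then gradually shifts a percentage of traffic to the canary pods and at every step runs a metric analysis, using either built-in checks or MetricTemplate queries. If the metrics are within range, the step advances. If not, Flagger aborts on its own and shifts 100% of traffic back to the primary, which still runs the stable version, with no human involved.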

Figure: Flagger's canary loop (step → analyze → promote or roll back).

The key difference from a manual canary: the promote/rollback call isn’t made by an on-call engineer, it’s made by a controller against a threshold that’s already defined in git — faster, and with no “heroic hands” on the pager.

Three strategies — which to pick

Flagger supports more than canary. The right strategy depends on what risk you’re willing to take and what you can actually measure:

Canary — a gradual traffic shift (10% → 30% → 50% → 100%) with analysis at each step. The default choice for services with enough traffic that metrics at 10% are statistically meaningful.
Blue-green — the new version is deployed in full, tested against shadow traffic or a separate endpoint, and the switchover happens all at once. Useful when a gradual traffic split isn’t possible (no service mesh or ingress with weighted routing) or when even 10% of erroneous traffic is unacceptable.
A/B (header/cookie-based) — traffic is routed not by percentage but by rule: a specific header, cookie, or region goes to the new version. This isn’t about reliability, it’s about experimentation — useful when who sees the new version matters more than how much traffic it gets.

For SLO-based rollback (this post’s topic), canary is the usual pick — it gives the most signal for the smallest blast radius at each step.
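For comparison, here is roughly how the analysis block differs for the other two strategies. These are fragments of a Canary spec rather than complete manifests. The header name, cookie value, and iteration counts are illustrative assumptions, and header matching only works with traffic providers that support it.

```yaml
# Blue-green: no weights, just a fixed number of checks before the switch
spec:
  provider: kubernetes      # plain Kubernetes services, no weighted routing
  analysis:
    interval: 1m
    threshold: 3
    iterations: 10          # checks to pass before traffic is switched
---
# A/B: route by request attributes instead of by percentage
spec:
  analysis:
    interval: 1m
    threshold: 3
    iterations: 10
    match:
      - headers:
          x-canary:         # hypothetical header
            exact: "insider"
      - headers:
          cookie:
            regex: "^(.*?;)?(canary=always)(;.*)?$"
```

In both cases iterations replaces the weight ramp: the analysis runs a fixed number of times against a fixed audience instead of a growing share of traffic.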

Metric analysis: how “bad” is defined

Analysis is described by a MetricTemplate resource — essentially a query against Prometheus (or Datadog, CloudWatch, New Relic — Flagger is source-agnostic) with a threshold. A typical web-service set is error rate (share of 5xx) and p95 latency. Both are computed only against canary traffic, separately from the stable version — otherwise noise from the old version would drown out the signal.
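As a sketch of the latency half of that pair, a p95 template could look like the following. It assumes the app exports a Prometheus histogram named http_request_duration_seconds (a common convention, not something Flagger requires) and reuses the pod-name filter style from the error-rate template later in this post. {{ namespace }}, {{ target }}, and {{ interval }} are Flagger's template variables.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: latency-p95          # hypothetical name
  namespace: prod
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    histogram_quantile(
      0.95,
      sum(
        rate(http_request_duration_seconds_bucket{namespace="{{ namespace }}",kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)?"}[{{ interval }}])
      ) by (le)
    ) * 1000
```

The trailing * 1000 converts seconds to milliseconds, so a threshold such as max: 500 reads as 500 ms.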

At every step Flagger waits for interval (e.g. one minute), queries the metric, and compares it to the threshold range. Each breach counts as a failed check; once the number of failed checks reaches threshold, the canary is marked failed and Flagger aborts. If the canary weight reaches maxWeight without hitting that limit, the canary is promoted: Flagger copies the new spec to the primary Deployment (the stable copy it manages), routes all traffic back to the primary, and scales the canary down to zero.

Metrics don’t have to be purely infrastructural. MetricTemplate is an arbitrary PromQL query, so you can put a business signal in there too — the share of successful payments, checkout conversion, latency on one specific heavy endpoint. A canary that’s technically flawless but tanks conversion is just as much a reason to roll back, and Flagger doesn’t distinguish between “HTTP broke” and “the funnel broke” as long as both are expressed as a metric with a threshold.
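A business-signal check might be sketched like this. The payments_total counter and its result label are hypothetical; substitute whatever your service actually exports. Note that the threshold uses min rather than max, because here a lower value is the bad direction.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: payment-success-rate    # hypothetical name
  namespace: prod
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    sum(rate(payments_total{namespace="{{ namespace }}",kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)?",result="success"}[{{ interval }}]))
    /
    sum(rate(payments_total{namespace="{{ namespace }}",kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)?"}[{{ interval }}]))
    * 100
---
# fragment of the Canary's spec.analysis that references the template
metrics:
  - name: payment-success-rate
    templateRef:
      name: payment-success-rate
    thresholdRange:
      min: 99        # fail the check when fewer than 99% of payments succeed
    interval: 1m
```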

Traffic shifting without a mandatory service mesh

A common misconception is that progressive delivery requires Istio or Linkerd. Flagger can shift traffic through:

Gateway API — the modern standard, supports weighted routing out of the box, no mesh required;
ingress-nginx — via canary annotations, also mesh-free;
service mesh (Istio, Linkerd, App Mesh, Open Service Mesh) — for finer control (retries, circuit breaking) layered on the same canary mechanism.

That lowers the barrier to entry: a team with no mesh in the cluster still gets automated metric-based rollback — traffic just shifts at the Gateway API or ingress-controller level instead of through a sidecar proxy.
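The mesh-free options differ mainly in the provider field and in how the Canary points at the existing entry point. The fragments below are a sketch: the Gateway, Ingress, and hostname are hypothetical, and the version suffix on the Gateway API provider depends on the Flagger and Gateway API versions installed.

```yaml
# Gateway API: Flagger manages an HTTPRoute attached to an existing Gateway
spec:
  provider: gatewayapi:v1
  service:
    port: 80
    targetPort: 8080
    hosts:
      - web.example.com          # placeholder hostname
    gatewayRefs:
      - name: public-gateway     # hypothetical Gateway
        namespace: gateway-system
---
# ingress-nginx: Flagger creates a canary Ingress next to the referenced one
spec:
  provider: nginx
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: web                    # hypothetical existing Ingress
  service:
    port: 80
    targetPort: 8080
```

The analysis block stays the same in both cases; only the routing layer underneath changes.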

Webhooks: a load test on every step

Canary has hooks that fire at different stages of the analysis — pre-rollout, rollout, post-rollout. The most common example is a load-test webhook that triggers a load run (e.g. hey or k6 in a separate Job) at every canary step, so there’s actually traffic to analyze instead of silence from a mere 10% of organic traffic. Without load at a low traffic percentage, metrics can be too sparse to show anything.
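The other two hook types can be sketched as follows, complementing the rollout-type load test. This is a fragment of spec.analysis; the /healthz endpoint and the notifier URL are assumptions, while the loadtester address matches the one used in the full example later in this post.

```yaml
# fragment of spec.analysis
webhooks:
  - name: smoke-test
    type: pre-rollout              # runs before any traffic is shifted
    url: http://flagger-loadtester.prod/
    timeout: 30s
    metadata:
      type: bash
      cmd: "curl -sf http://web-canary.prod:80/healthz"   # assumed endpoint
  - name: notify
    type: post-rollout             # runs after promotion or rollback
    url: http://release-notifier.prod/hook   # hypothetical receiver
```

A failing pre-rollout hook keeps the canary at 0% weight, so a version that cannot even pass a smoke test never receives user traffic.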

Wiring it into Flux and GitOps

There’s nothing extra to set up — the Canary manifest is just YAML in the same git repository as the Deployment, MetricTemplate, or HelmRelease. Flux applies it along with everything else on sync. Metric-based rollback runs on top of the usual GitOps cycle: a CI pipeline bumps the image tag in git → Flux applies the change → Flagger picks up the new Deployment version and starts the canary analysis. Git stays the source of truth; Flagger adds a runtime check before the change actually reaches 100% of traffic.
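On the Flux side, that can be as little as one Kustomization pointing at the directory that holds the Deployment, MetricTemplate, and Canary. The sketch below assumes a GitRepository named platform, a path of ./apps/web, and a separate Kustomization named flagger that installs the controller and its CRDs; all three names are hypothetical.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: web
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/web          # hypothetical directory with the app manifests
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform          # hypothetical repository object
  dependsOn:
    - name: flagger         # apply only after the Flagger CRDs exist
```

The dependsOn entry matters on a fresh cluster: without it, Flux may try to apply the Canary before its CRD is installed.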

Comparison at a glance

| Aspect | `Deployment` + `RollingUpdate` | Flagger `Canary` |
| --- | --- | --- |
| Traffic shift | on a pod schedule | in steps (10/30/50…%) |
| Metric analysis | none, only after-the-fact alerts | automatic, at every step |
| Rollback decision | a human, by hand | the controller, by threshold |
| Time to rollback | minutes to tens of minutes (someone noticed) | seconds to minutes (next analysis tick) |
| Service mesh required | no | no (Gateway API/ingress-nginx also work) |
| Config source | same git as everything else | same git (`Canary` CR) |

What you need to turn on canary with auto-rollback

At minimum you need: Flagger installed, a Deployment + Service for the app, a MetricTemplate with a threshold, and the Canary itself. Here’s a working example with 10% steps up to a 50% canary weight and analysis on error rate and request latency.

First, an error-rate metric template against Prometheus:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: prod
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    100 - sum(
      rate(http_requests_total{namespace="prod",kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)?",status!~"5.."}[1m])
    )
    /
    sum(
      rate(http_requests_total{namespace="prod",kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)?"}[1m])
    ) * 100

Then the Canary itself, which takes over an existing Deployment and Service:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m
    threshold: 3          # failed checks allowed before abort
    maxWeight: 50         # upper bound on canary weight before promote
    stepWeight: 10        # traffic shift step: 10, 20, 30, 40, 50%
    metrics:
      - name: error-rate
        templateRef:
          name: error-rate
        thresholdRange:
          max: 1           # no more than 1% errors
        interval: 1m
      - name: request-duration
        interval: 1m
        thresholdRange:
          max: 500         # built-in request-duration check (P99), in milliseconds
    webhooks:
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.prod/
        timeout: 15s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://web-canary.prod:80/"

threshold: 3 means that three failed checks (on either metric) flip the canary to Failed, and Flagger rolls traffic back to the stable version with no human involved. stepWeight: 10 with maxWeight: 50 gives steps of 10 → 20 → 30 → 40 → 50%, after which, with no breaches, the final promote takes it to 100%.
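If you would rather ramp unevenly, for example 10 → 30 → 50%, the weights can be listed explicitly instead of derived from a fixed step. This fragment is a sketch that assumes a Flagger version with support for the stepWeights list; it takes the place of stepWeight and maxWeight in the analysis block.

```yaml
# fragment of the Canary spec
analysis:
  interval: 1m
  threshold: 3
  stepWeights: [10, 30, 50]   # explicit canary weights, applied in order
```

Fewer steps shorten the rollout, at the cost of a larger jump in exposure between checks.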

How to verify the rollback actually works

Checking on production after the fact isn’t an option — you need a controlled negative test. Deploy a version that deliberately breaks some requests (say, a pod with a 20% chance of returning 500) and watch the Canary status:

kubectl -n prod get canary web -w

The expected sequence: Progressing → Weight climbs to the first step → the following checks find error-rate above max: 1 and the weight stops advancing → after three failed checks (threshold: 3) the status flips to Failed → Weight drops back to 0. You can also check the events:

kubectl -n prod describe canary web

The events will include Halt advancement entries that name the breached metric and the observed value, followed by an explicit Rolling back entry once the failed-check threshold is reached. If you don’t see those events and Weight keeps climbing despite an obviously broken canary, check that the MetricTemplate is actually looking at canary traffic and not the whole namespace. The canary pods are the pods of the target Deployment itself: {{ target }} expands to its name, so the kubernetes_pod_name regex matches web-<hash>-<hash> but not the stable web-primary-<hash>-<hash> pods.

Common pitfalls

Metric threshold too lax. If max for error rate is set to 5–10%, the canary will “survive” a real degradation — set the threshold against your historical error baseline, not a guess.
No load on the canary. With stepWeight: 10 and low overall traffic, 10% of canary traffic can amount to a couple of requests a minute, so the metric ends up either empty or noisy. On a low-traffic service, add a load-test webhook; without it the analysis has too few samples to mean anything.
MetricTemplate computed over the whole namespace. If the PromQL query doesn’t filter to the canary pod specifically (via the {{ target }} substitution), the metric mixes stable and new-version traffic, and the degradation gets lost in the noise.
maxWeight below 100%. With stepWeight: 10 and maxWeight: 50, the analysis stops at 50%; promotion then updates the stable version and moves the remaining traffic over in one step rather than gradually. That’s expected behavior, but it’s often mistaken for a stuck rollout.

Bottom line

Automated metric-based rollback removes the most expensive part of a release from the process: a human’s reaction time and their availability right at the moment of deploy. Flagger delivers that on top of an ordinary Deployment, without requiring a service mesh or a separate configuration system; the same Canary manifest ships from git along with everything else. The cost of entry is one CR and a MetricTemplate with a sane threshold; the payoff is a rollback within a few analysis intervals instead of an alert, an investigation, and a manual kubectl rollout undo in the middle of the night.