Post-OTA verification

The keystone feature. Diagnostics are useful on their own — operators can run them on demand — but the moment you set run_after_ota: true on one, you've turned it into a release-gate. The platform automatically runs it on every device after every successful OTA, aggregates the results, and rolls back the release if any device's check fails.

This page covers the full lifecycle, the auto-rollback behavior, and the safety mechanisms that keep a buggy diagnostic from cascading into a fleet-wide outage.

The lifecycle, end to end

release.applied (per gateway, fired by service-app on AppliedReport success)
    │
    ▼
service-app queries diagnostic_definitions WHERE run_after_ota = true
    │
    ├── (zero results) ─────► gateway flips to "applied" (current behavior; no verify step)
    │
    └── (one or more)
        │
        ▼
    service-app mints run_id, inserts pending diagnostic_runs rows
    service-app publishes one cmd per diagnostic to the gateway
    gateway flips to "verifying"
        │
        ▼
    chip dispatches each diagnostic, publishes results back on
    {ns}/{gw}/diagnostics/result with envelope v2 (run_id + triggered_by="ota_verify")
        │
        ▼
    service-app result-consumer inserts each result
    after each insert: count pending; if 0, decide:
        │
        ├── any failed/timeout/error ─► gateway flips to "verification_failed"
        │                                     │
        │                                     ▼
        │                              auto-rollback fires (see below)
        │
        └── all pass ────────────────► gateway flips to "verified"
                                              │
                                              ▼
                                       code_project.current_release_id is set
                                       (the "in production" badge flips green)
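
A minimal Python sketch of that finalize decision, using an in-memory DiagnosticRun stand-in for the diagnostic_runs rows. The names here (finalize_gateway, the sample diagnostics) are illustrative only, not the service-app API:

    from dataclasses import dataclass

    @dataclass
    class DiagnosticRun:
        diagnostic: str
        status: str   # "pending" | "pass" | "failed" | "timeout" | "error"

    def finalize_gateway(runs: list[DiagnosticRun]) -> str | None:
        """Decide the gateway's verification outcome once nothing is pending."""
        if any(r.status == "pending" for r in runs):
            return None                     # still waiting; decide again on the next insert
        if any(r.status in ("failed", "timeout", "error") for r in runs):
            return "verification_failed"    # triggers auto-rollback (see below)
        return "verified"                   # every diagnostic passed

    # Example: the last outstanding result lands as a pass.
    runs = [DiagnosticRun("sensor_selftest", "pass"),
            DiagnosticRun("radio_link_check", "pending")]
    assert finalize_gateway(runs) is None
    runs[1].status = "pass"
    assert finalize_gateway(runs) == "verified"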

A separate reaper runs every 30 s in service-app. It finds rows still pending past max(timeout_secs) * 1.5 and flips them to timeout, which triggers the same finalize check. With the largest allowed timeout_secs of 300, the wall-clock cap is 7.5 minutes per gateway before verification gives up.
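
A sketch of one reaper pass under the same stand-in, driven by plain timestamps rather than the real 30 s scheduler; max_timeout_secs would be the largest timeout_secs among the run's diagnostics:

    import time

    def reap_stale_runs(runs, started_at: float, max_timeout_secs: int,
                        now: float | None = None) -> bool:
        """Flip runs still pending past max(timeout_secs) * 1.5 to "timeout".

        Returns True if anything was reaped, so the caller can re-run the same
        finalize check (a timeout counts as a failure there).
        """
        now = time.time() if now is None else now
        if now < started_at + max_timeout_secs * 1.5:
            return False                    # still inside the verification window
        reaped = False
        for run in runs:
            if run.status == "pending":
                run.status = "timeout"
                reaped = True
        return reaped

    # With timeout_secs=300 the window is 450 s: the 7.5-minute cap per gateway.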

Per-gateway granularity, per-release rollup

Each gateway's verification is independent. They share a release_id but each has its own run_id, its own diagnostic_runs rows, and its own status transition.

The release-level status flips to verified only when all applied gateways reach verified. The release flips to verification_failed the moment any applied gateway reaches verification_failed. So a 1,000-gateway rollout doesn't have to wait for everyone to finish — the bad release self-aborts on the first failure and the rest of the fleet stops getting it.
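
A sketch of that rollup rule, with gateway_states as a hypothetical map from gateway id to its per-gateway verification status:

    def rollup_release(gateway_states: dict[str, str]) -> str:
        """Fold per-gateway statuses into the release-level verdict."""
        states = gateway_states.values()
        if any(s == "verification_failed" for s in states):
            return "verification_failed"    # first failure aborts the whole rollout
        if all(s == "verified" for s in states):
            return "verified"               # every applied gateway passed
        return "verifying"                  # some gateways still in flight

    # One bad gateway fails the release even while others are still verifying:
    print(rollup_release({"gw-1": "verified", "gw-2": "verification_failed", "gw-3": "verifying"}))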

Auto-rollback fans out: when the release-level state flips to verification_failed, the cloud rolls back every gateway that received the release, not just the ones that failed verification. This is the point of the feature — you don't want some gateways stranded on a known-bad release.

Auto-rollback

Always-on by default. When verification fails on a gateway:

  1. The cloud looks up that gateway's previous successfully-verified release.
  2. It re-pins the gateway to that release with force=true, fanning out a fresh release.apply.
  3. It emits gateway.release.auto_rolled_back to the audit log (severity 3).
  4. The dashboard surfaces a banner on the gateway and on the release detail page: "Auto-rolled back gateway X to release Y due to verification failure."

The previous release was already verified at last-apply, so the platform trusts it. The chip downloads + installs again, re-runs its run_after_ota diagnostics for the previous release (yes, even on rollback), and lands back in verified.
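
A sketch of the handler, with every collaborator injected as a stand-in rather than the real cloud services:

    def auto_rollback(gateway_id: str, find_previous_verified, apply_release,
                      emit_audit, show_banner):
        """Hypothetical handler for a gateway that just hit verification_failed."""
        previous = find_previous_verified(gateway_id)      # 1. last successfully-verified release
        apply_release(gateway_id, previous, force=True)    # 2. re-pin + fresh release.apply
        emit_audit("gateway.release.auto_rolled_back",     # 3. audit log, severity 3
                   gateway=gateway_id, release=previous, severity=3)
        show_banner(f"Auto-rolled back gateway {gateway_id} to release {previous} "
                    "due to verification failure.")        # 4. gateway + release detail banners

    # Usage with print/no-op stand-ins:
    auto_rollback("gw-7",
                  find_previous_verified=lambda gw: "rel-41",
                  apply_release=lambda gw, rel, force: None,
                  emit_audit=lambda event, **kw: print(event, kw),
                  show_banner=print)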

Loop guard

If the rolled-back release also fails verification, the platform does not cycle further. Instead:

  • The gateway's release status flips to verification_storm.
  • An audit event gateway.release.verification_storm is emitted.
  • A red banner appears in the gateway detail page.
  • Manual operator intervention is required to re-enable auto-deploys for that gateway.

The loop guard exists because there are real failure modes (a sensor died after being installed; a firmware change made an existing sensor's reading invalid) where every release in your history would fail the same check on the same device. Without the guard, the platform would cycle forever between two releases. With it, the platform halts loudly the second time and asks for human input.
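
A sketch of the guard's decision; the is_rollback flag is a stand-in for however the platform actually marks an apply that was itself triggered by auto-rollback:

    def on_gateway_verification_failed(is_rollback: bool) -> str:
        """Pick the next action when a gateway's verification fails."""
        if is_rollback:
            # The release we already rolled back to failed the same check:
            # stop cycling, flip to verification_storm, ask for a human.
            return "verification_storm"
        # First failure for this apply: roll back to the previous verified release.
        return "auto_rollback"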

Storm protection

If a project hits 3 consecutive verification failures across distinct releases within 24 hours, the platform auto-disables the project's auto_rollback_on_verification_failure setting. This is the global kill-switch — once it trips:

  • Future verification failures still flip releases to verification_failed (the diagnostic information is preserved).
  • The cloud no longer auto-rolls back. The fleet stays on whatever's running.
  • An audit event project.auto_rollback.storm_disabled is emitted.
  • A banner appears on the project's release page asking the operator to investigate.
  • A verification_storm_disabled_until timestamp is set; auto-rollback re-enables 24 h later, OR the operator can hit the Re-enable button in the verification panel banner immediately.

This is the safety net behind the "always-on" default. It exists for cases where you wrote a diagnostic that's flaky in a way you didn't catch in testing — without it, three buggy releases in a day would cycle every device on the project. With it, the platform stops auto-acting after the third failure and forces you to look at what's happening.
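
A sketch of the trip condition, approximated here as three distinct failing releases inside a rolling 24 h window (the real counter also requires the failures to be consecutive). The project record shape is hypothetical; only the two setting names come from this page:

    from datetime import datetime, timedelta, timezone

    def record_verification_failure(project: dict, release_id: str,
                                    now: datetime | None = None) -> bool:
        """Track failing releases; returns True when the storm-disable trips."""
        now = now or datetime.now(timezone.utc)
        window_start = now - timedelta(hours=24)
        failures = [f for f in project.setdefault("recent_failures", [])
                    if f["at"] >= window_start]
        if release_id not in {f["release_id"] for f in failures}:
            failures.append({"release_id": release_id, "at": now})
        project["recent_failures"] = failures

        if len(failures) >= 3 and project.get("auto_rollback_on_verification_failure", True):
            project["auto_rollback_on_verification_failure"] = False
            project["verification_storm_disabled_until"] = now + timedelta(hours=24)
            return True   # emit project.auto_rollback.storm_disabled + show the banner
        return False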

State summary

The release status state machine, including the new states this feature adds:

requested ──► applied
              │
              ├─► (no run_after_ota diagnostics) ───────► (terminal: applied)
              │
              └─► verifying ──► verified ───────────────► (terminal: verified)
                            │
                            └─► verification_failed ────► auto-rolled back to previous
                                                          │
                                                          └─► (rollback also fails)
                                                                  │
                                                                  ▼
                                                          verification_storm
                                                          (terminal; manual recovery)

verifying, verified, verification_failed, and verification_storm are the new states. Projects with zero run_after_ota: true diagnostics never see them — the gateway flips straight from pending to applied as it always has.

What you control vs. what the platform handles

You control:

  • Which diagnostics run after OTA (the run_after_ota: true field in each diagnostic's YAML).
  • The timeout_secs per diagnostic (max 300; the verification window is max(timeout_secs) * 1.5).
  • Whether auto-rollback is enabled per project (defaults true; your namespace settings expose it).
  • When to re-enable after a storm-disable (button in the verification panel).

The platform handles:

  • Fanning out one diagnostic.run cmd per gateway per run_after_ota diagnostic on every apply.
  • Aggregating the per-gateway results into the per-release verdict.
  • Maintaining the loop guard and storm-disable counters.
  • Rolling back automatically (when enabled) without operator action.
  • Surfacing every state transition in the dashboard verification panel and the audit log.

Edge cases

Gateway is offline at OTA time
    The verification hook only fires after RecordApplied succeeds, which only fires when the chip publishes its AppliedReport. Offline chips queue the OTA notification at the broker; verification starts when they come back online and successfully apply.

Gateway applies but disconnects before publishing diagnostic results
    Pending rows time out per the reaper window above. The release transitions to verification_failed and auto-rollback fires. The chip will get the rollback whenever it next reconnects.

New release promoted while verification is in flight on a previous release
    Allowed. The new release supersedes the in-flight verification — pending verify rows are abandoned (status flipped to superseded), and verification starts fresh against the new release once it applies.

Operator hits "Retry promote" during verification
    Same as a fresh promote — supersede + restart. The chip is on the right release; the cloud re-publishes verification cmds with a new run_id.

Customer's run_after_ota: true diagnostic itself has a bug, fails forever
    The first failure rolls back. The rollback may fail too (the loop guard catches it), and three failures across distinct releases within 24 h trip the storm-disable. After that the platform stops cycling and surfaces the banner.

Project has zero run_after_ota: true diagnostics
    Behavior unchanged from pre-v0.3.0: the gateway flips straight from pending to applied on AppliedReport success. No verifying step.

Re-enabling after a storm

Two paths:

  1. Auto. Once the verification_storm_disabled_until timestamp passes (24 h after the trip), auto-rollback re-enables automatically. The next verification failure will roll back as normal.
  2. Manual. The verification panel banner has a Re-enable auto-rollback button. Click it to lift the flag immediately. You'll typically do this after fixing whatever was producing the bad releases (a buggy diagnostic, a regression in firmware, a flaky sensor on a known device).

The storm-disable doesn't disable verification itself — your run_after_ota diagnostics still run on every OTA, and the dashboard still shows the per-gateway result grid. Only the automatic rollback step is paused.
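
A sketch combining both paths into one check, using the same hypothetical project record as the storm sketch above; that the manual button clears both fields is an assumption:

    from datetime import datetime, timezone

    def auto_rollback_active(project: dict, now: datetime | None = None) -> bool:
        """True when a verification failure should trigger an automatic rollback."""
        now = now or datetime.now(timezone.utc)
        until = project.get("verification_storm_disabled_until")
        if until is not None and now >= until:
            # Path 1: the 24 h window elapsed, so the disable lifts on its own.
            project["verification_storm_disabled_until"] = None
            project["auto_rollback_on_verification_failure"] = True
        # Path 2 (manual): the Re-enable button would clear both fields immediately.
        return project.get("auto_rollback_on_verification_failure", True)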

See also