# Post-OTA verification
The keystone feature. Diagnostics are useful on their own — operators can run them on demand — but the moment you set `run_after_ota: true` on one, you've turned it into a release gate. The platform automatically runs it on every device after every successful OTA, aggregates the results, and rolls back the release if any device's check fails.
This page covers the full lifecycle, the auto-rollback behavior, and the safety mechanisms that keep a buggy diagnostic from cascading into a fleet-wide outage.
## The lifecycle, end to end
```
release.applied (per gateway, fired by service-app on AppliedReport success)
        │
        ▼
service-app queries diagnostic_definitions WHERE run_after_ota = true
        │
        ├── (zero results) ─────► gateway flips to "applied" (current behavior; no verify step)
        │
        └── (one or more)
                │
                ▼
        service-app mints run_id, inserts pending diagnostic_runs rows
        service-app publishes one cmd per diagnostic to the gateway
        gateway flips to "verifying"
                │
                ▼
        chip dispatches each diagnostic, publishes results back on
        {ns}/{gw}/diagnostics/result with envelope v2 (run_id + triggered_by="ota_verify")
                │
                ▼
        service-app result-consumer inserts each result
        after each insert: count pending; if 0, decide:
                │
                ├── any failed/timeout/error ─► gateway flips to "verification_failed"
                │                                       │
                │                                       ▼
                │                               auto-rollback fires (see below)
                │
                └── all pass ────────────────► gateway flips to "verified"
                                                        │
                                                        ▼
                                        code_project.current_release_id is set
                                        (the "in production" badge flips green)
```
A separate reaper runs every 30 s in service-app. It finds `diagnostic_runs` rows still pending past `max(timeout_secs) * 1.5` and flips them to `timeout`, which triggers the same finalize check. With the largest allowed `timeout_secs` of 300, the wall-clock cap means verification gives up after 7.5 minutes per gateway.
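The reaper's timeout math can be sketched like this. It is a minimal sketch under assumed data shapes; the `max(timeout_secs) * 1.5` window is from the text above, everything else is illustrative.

```python
def reaper_deadline_secs(timeout_secs_list):
    """Pending rows older than this (per run) are flipped to timeout."""
    return max(timeout_secs_list) * 1.5

def reap(pending_ages, timeout_secs_list):
    """Return which pending rows (given their ages in seconds) the reaper
    would flip to timeout on this pass.
    """
    deadline = reaper_deadline_secs(timeout_secs_list)
    return [age for age in pending_ages if age > deadline]

# With the largest allowed timeout_secs of 300, the window is 450 s (7.5 min).
window = reaper_deadline_secs([60, 300])
stale = reap([120, 500], [60, 300])
```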
## Per-gateway granularity, per-release rollup
Each gateway's verification is independent. They share a `release_id` but each has its own `run_id`, its own `diagnostic_runs` rows, and its own status transition.
The release-level status flips to `verified` only when all applied gateways reach `verified`. The release flips to `verification_failed` the moment any applied gateway reaches `verification_failed`. So a 1,000-gateway rollout doesn't have to wait for everyone to finish — the bad release self-aborts on the first failure and the rest of the fleet stops getting it.
Auto-rollback fans out: when the release-level state flips to verification_failed, the cloud rolls back every gateway that received the release, not just the ones that failed verification. This is the point of the feature — you don't want some gateways stranded on a known-bad release.
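The rollup rule above can be sketched as a pure function over gateway states (state names from the doc; the function itself is an illustrative assumption, not the platform's code):

```python
def release_verdict(gateway_states):
    """Roll per-gateway verification states up into a release-level verdict.

    Fails fast: one verification_failed gateway aborts the release even
    while others are still verifying; "verified" requires every gateway
    to finish successfully.
    """
    if any(s == "verification_failed" for s in gateway_states):
        return "verification_failed"
    if gateway_states and all(s == "verified" for s in gateway_states):
        return "verified"
    return "verifying"

aborted = release_verdict(["verified", "verifying", "verification_failed"])
done = release_verdict(["verified", "verified"])
in_flight = release_verdict(["verified", "verifying"])
```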
## Auto-rollback
Always-on by default. When verification fails on a gateway:
- The cloud looks up that gateway's previous successfully-verified release.
- It re-pins the gateway to that release with `force=true`, fanning out a fresh `release.apply`.
- It emits `gateway.release.auto_rolled_back` to the audit log (severity 3).
- The dashboard surfaces a banner on the gateway and on the release detail page: "Auto-rolled back gateway X to release Y due to verification failure."

The previous release was already verified at last apply, so the platform trusts it. The chip downloads and installs it again, re-runs its `run_after_ota` diagnostics for the previous release (yes, even on rollback), and lands back in `verified`.
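Picking the rollback target amounts to "latest release before the failing one that previously reached verified". A sketch, assuming an ordered release history; the real query shape is not specified by the doc:

```python
def rollback_target(history, failing_release_id):
    """history: list of (release_id, final_status) in apply order, oldest
    first. Returns the most recent release before the failing one that
    reached "verified", or None if no verified predecessor exists.
    """
    candidates = []
    for release_id, status in history:
        if release_id == failing_release_id:
            break  # only releases applied before the failing one count
        if status == "verified":
            candidates.append(release_id)
    return candidates[-1] if candidates else None

target = rollback_target(
    [("r1", "verified"), ("r2", "verification_failed"),
     ("r3", "verified"), ("r4", "verifying")],
    failing_release_id="r4",
)
```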
## Loop guard
If the rolled-back release also fails verification, the platform does not cycle further. Instead:
- The gateway's release status flips to `verification_storm`.
- An audit event `gateway.release.verification_storm` is emitted.
- A red banner appears on the gateway detail page.
- Manual operator intervention is required to re-enable auto-deploys for that gateway.
The loop guard exists because there are real failure modes (a sensor died after being installed; a firmware change made an existing sensor's reading invalid) where every release in your history would fail the same check on the same device. Without the guard, the platform would cycle forever between two releases. With it, the platform halts loudly the second time and asks for human input.
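The loop guard is effectively a depth-one rollback rule. A sketch (state names from the doc; the tracking mechanism is an assumption):

```python
def next_state(verify_failed, is_rollback_release):
    """Decide the gateway's next state after a verification result.

    A failure on a normal release triggers auto-rollback; a failure on
    the release we just rolled back to halts in verification_storm
    instead of cycling between the two releases forever.
    """
    if not verify_failed:
        return "verified"
    if is_rollback_release:
        return "verification_storm"   # loop guard: no second rollback
    return "verification_failed"      # auto-rollback fires

first_failure = next_state(verify_failed=True, is_rollback_release=False)
second_failure = next_state(verify_failed=True, is_rollback_release=True)
```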
## Storm protection
If a project hits 3 consecutive verification failures across distinct releases within 24 hours, the platform auto-disables the project's `auto_rollback_on_verification_failure` setting. This is the global kill-switch — once it trips:
- Future verification failures still flip releases to `verification_failed` (the diagnostic information is preserved).
- The cloud no longer auto-rolls back. The fleet stays on whatever's running.
- An audit event `project.auto_rollback.storm_disabled` is emitted.
- A banner appears on the project's release page asking the operator to investigate.
- A `verification_storm_disabled_until` timestamp is set; auto-rollback re-enables 24 h later, or the operator can hit the Re-enable button in the verification panel banner immediately.
This is the safety net behind the "always-on" default. It exists for cases where you wrote a diagnostic that's flaky in a way you didn't catch in testing — without it, three buggy releases in a day would cycle every device on the project. With it, the platform stops auto-acting after the third failure and forces you to look at what's happening.
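The storm trigger can be sketched as a window check over failure events. The thresholds (3 failures, distinct releases, 24 hours) come from the text above; the data shapes and function are illustrative assumptions:

```python
def storm_tripped(failures, now, window_secs=24 * 3600, threshold=3):
    """failures: list of (timestamp_secs, release_id) verification
    failures for a project, oldest first. Trips when the last
    `threshold` failures hit distinct releases and all fall inside the
    window ending at `now`.
    """
    recent = [(t, r) for t, r in failures if now - t <= window_secs]
    last = recent[-threshold:]
    if len(last) < threshold:
        return False
    return len({r for _, r in last}) == threshold  # distinct releases

h = 3600
tripped = storm_tripped([(0 * h, "r1"), (2 * h, "r2"), (5 * h, "r3")], now=6 * h)
same_release = storm_tripped([(0 * h, "r1"), (2 * h, "r1"), (5 * h, "r1")], now=6 * h)
```

Note the distinct-release requirement: repeated failures of one release re-trigger the per-gateway loop guard instead, not the project-wide kill-switch.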
## State summary
The release status state machine, including the new states this feature adds:
```
requested ──► applied
                │
                ├─► (no run_after_ota diagnostics) ───────► (terminal: applied)
                │
                └─► verifying ──► verified ───────────────► (terminal: verified)
                        │
                        └─► verification_failed ────► auto-rolled back to previous
                                │
                                └─► (rollback also fails)
                                        │
                                        ▼
                                verification_storm
                                (terminal; manual recovery)
```
`verifying`, `verified`, `verification_failed`, and `verification_storm` are the new states. Projects with zero `run_after_ota: true` diagnostics never see them — the gateway flips straight from `pending` to `applied` as it always has.
## What you control vs. what the platform handles
You control:
- Which diagnostics run after OTA (the `run_after_ota: true` field per YAML).
- The `timeout_secs` per diagnostic (max 300; the verification window is `max(timeout_secs) * 1.5`).
- Whether auto-rollback is enabled per project (defaults to true; your namespace settings expose it).
- When to re-enable after a storm-disable (button in the verification panel).
The platform handles:
- Fanning out one `diagnostic.run` cmd per gateway per `run_after_ota` diagnostic on every apply.
- Aggregating the per-gateway results into the per-release verdict.
- Maintaining the loop guard and storm-disable counters.
- Rolling back automatically (when enabled) without operator action.
- Surfacing every state transition in the dashboard verification panel and the audit log.
## Edge cases
| Case | Behavior |
|---|---|
| Gateway is offline at OTA time | The verification hook only fires after `RecordApplied` succeeds, which only fires when the chip publishes its `AppliedReport`. Offline chips queue the OTA notification at the broker; verification starts when they come back online and successfully apply. |
| Gateway applies but disconnects before publishing diagnostic results | Pending rows time out per the reaper window above. The release transitions to `verification_failed` and auto-rollback fires. The chip will get the rollback whenever it next reconnects. |
| New release promoted while verification is in flight on a previous release | Allowed. The new release supersedes the in-flight verification — pending verify rows are abandoned (status flipped to `superseded`), and verification starts fresh against the new release once it applies. |
| Operator hits "Retry promote" during verification | Same as a fresh promote — supersede + restart. The chip is on the right release; the cloud re-publishes verification cmds with a new `run_id`. |
| Customer's `run_after_ota: true` diagnostic itself has a bug, fails forever | The first failure rolls back. The rollback may fail too (the loop guard catches it), and three failures within 24 h trip the storm-disable. After that the platform stops cycling and surfaces the banner. |
| Project has zero `run_after_ota: true` diagnostics | Behavior unchanged from pre-v0.3.0: the gateway flips straight from `pending` to `applied` on `AppliedReport` success. No `verifying` step. |
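The supersede behavior from the table can be sketched as follows (statuses from the doc; the function and row representation are illustrative assumptions):

```python
def supersede_pending(runs, old_run_id):
    """Abandon still-pending rows of a superseded verification run.

    runs: dict mapping (run_id, diagnostic) -> status. Returns a new
    dict where pending rows of old_run_id become "superseded"; completed
    rows keep their results, and other runs are untouched.
    """
    return {
        key: ("superseded" if status == "pending" and key[0] == old_run_id
              else status)
        for key, status in runs.items()
    }

runs = {
    ("run-1", "sensor_health"): "pass",     # already completed: kept
    ("run-1", "link_check"): "pending",     # abandoned
    ("run-2", "sensor_health"): "pending",  # new run: untouched
}
updated = supersede_pending(runs, "run-1")
```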
## Re-enabling after a storm
Two paths:
- Auto. The `verification_storm_disabled_until` timestamp lifts auto-rollback after 24 h. The next verification failure will roll back as normal.
- Manual. The verification panel banner has a Re-enable auto-rollback button. Click it to lift the flag immediately. You'll typically do this after fixing whatever was producing the bad releases (a buggy diagnostic, a regression in firmware, a flaky sensor on a known device).
The storm-disable doesn't disable verification itself — your `run_after_ota` diagnostics still run on every OTA, and the dashboard still shows the per-gateway result grid. Only the automatic rollback step is paused.
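Whether auto-rollback may fire reduces to a timestamp check. A sketch, assuming Unix-seconds timestamps; the field name `verification_storm_disabled_until` is from the text, the function is illustrative:

```python
def rollback_allowed(now, disabled_until, manually_reenabled=False):
    """Auto-rollback fires only when no storm-disable window is active.

    disabled_until: the verification_storm_disabled_until timestamp in
    seconds, or None if the storm-disable never tripped. The manual
    Re-enable button clears the window early.
    """
    if manually_reenabled or disabled_until is None:
        return True
    return now >= disabled_until

DAY = 24 * 3600
during = rollback_allowed(now=1000, disabled_until=1000 + DAY)
after = rollback_allowed(now=1000 + DAY, disabled_until=1000 + DAY)
manual = rollback_allowed(now=1000, disabled_until=1000 + DAY,
                          manually_reenabled=True)
```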
## See also
- Config schema: how to set `run_after_ota: true`.
- C API: the firmware side of every diagnostic that runs in this loop.
- Dashboard walkthrough: where to see the verification panel and the storm banner.
- OTAs and rollouts: the underlying OTA mechanism that this feature builds on.
