The generation token: verifying a dataset that races its own reload

A completeness check that runs for tens of minutes can collide with the next scheduled reload and report a false 'everything is missing.' An immutable token stamped at load time makes long verification safe.

Published: June 2026

Length: 3 min read

Topics: Architecture · Azure · Distributed Systems

A data load and the job that verifies it are usually written as if they take turns. Load finishes, verification starts, verification finishes, repeat. Real systems don't take turns. Rate limits stretch a verification pass into tens of minutes, schedules overlap, and sooner or later the next load fires while the last verification is still running. If you didn't design for that overlap, the result isn't a small glitch — it's a false alarm at full volume.

The collision

Picture a pipeline that truncates its target table and reloads it, and a separate verifier that walks the source and confirms every record reached the target. The verifier is slow because the source is rate-limited to three requests a minute. Halfway through its pass, the scheduled reload fires and truncates the table out from under it. Every lookup the verifier makes from then on finds nothing, because the table is mid-rebuild. It concludes, with total confidence, that the entire dataset is missing, and kicks off a back-fill storm against data that's actually fine and already being reloaded.

The verifier wasn't wrong about what it saw. It was wrong about which load it was looking at.

Bind the work to a generation, not a clock

The fix is to give every load an identity and make the verifier loyal to one. Stamp an immutable LoadId — a generation token — inside the same transaction that performs the truncate. The verifier captures the current LoadId when it starts, and before every continuation and every corrective write, it re-reads the token. If it changed, a newer load has superseded this one; the verifier marks itself Superseded and stops. It never writes against a generation it didn't begin with.
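One way to picture this is a minimal sketch in Python against an in-memory SQLite database. Everything named here is an assumption made for illustration rather than the pipeline's actual schema: the records and load_state tables, the begin_load, load_rows and verify helpers, and the between_checks hook that stands in for the rate-limit wait. The truncate and the new LoadId commit together, and the verifier re-reads the token before each step and once more before it reports.

```python
import sqlite3
import uuid


def init_schema(conn):
    # Hypothetical schema: one data table plus a single-row table for the token.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, payload TEXT);
        CREATE TABLE IF NOT EXISTS load_state (
            singleton INTEGER PRIMARY KEY CHECK (singleton = 1),
            load_id   TEXT NOT NULL
        );
    """)


def begin_load(conn):
    """Truncate the target and stamp a fresh LoadId in one transaction."""
    load_id = str(uuid.uuid4())
    with conn:  # commits both statements together, or rolls both back
        conn.execute("DELETE FROM records")
        conn.execute(
            "INSERT OR REPLACE INTO load_state (singleton, load_id) VALUES (1, ?)",
            (load_id,),
        )
    return load_id


def load_rows(conn, rows):
    """Reload step; runs after the truncate, so the table is briefly partial."""
    with conn:
        conn.executemany("INSERT INTO records (id, payload) VALUES (?, ?)", rows)


def current_load_id(conn):
    row = conn.execute(
        "SELECT load_id FROM load_state WHERE singleton = 1"
    ).fetchone()
    return row[0] if row else None


def verify(conn, expected_ids, between_checks=None):
    """Check expected ids, standing down if the generation moves."""
    my_load = current_load_id(conn)  # captured once, at start
    missing = []
    for record_id in expected_ids:
        if between_checks is not None:
            between_checks()  # stands in for the rate-limited source call
        if current_load_id(conn) != my_load:
            return "Superseded", []  # a newer load owns the table now
        found = conn.execute(
            "SELECT 1 FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        if found is None:
            missing.append(record_id)
    if current_load_id(conn) != my_load:  # final check before reporting
        return "Superseded", []
    return "Complete", missing
```

In this sketch a superseded verifier returns an empty missing list on purpose, so nothing downstream can mistake a half-rebuilt table for lost data.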

Long-running asynchronous work should be loyal to the generation it started with, not to wall-clock time. Stamp the generation at the source, check it before every side effect, and stand down when it moves.

This is the same idea as a fencing token in distributed locking, applied to data freshness instead of mutual exclusion. The point isn't to prevent the overlap — overlap is fine and often unavoidable. It's to make the stale worker recognize itself as stale before it does damage.
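The fencing analogy can be taken one step further, as a sketch rather than a description of the original pipeline: instead of reading the token and then writing, the corrective write can carry the token and let the database reject it if the generation has moved. The example below assumes the same hypothetical records and load_state tables, a made-up fenced_backfill helper, and SQLite as the store. The guard lives inside the INSERT itself, so there is no gap between the check and the write.

```python
import sqlite3


def demo_connection():
    """Hypothetical setup: same illustrative schema, stamped with 'gen-1'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE records (id TEXT PRIMARY KEY, payload TEXT);
        CREATE TABLE load_state (
            singleton INTEGER PRIMARY KEY CHECK (singleton = 1),
            load_id   TEXT NOT NULL
        );
        INSERT INTO load_state (singleton, load_id) VALUES (1, 'gen-1');
    """)
    return conn


def fenced_backfill(conn, load_id, record_id, payload):
    """Apply a corrective write only if load_id is still the current generation.

    Returns True when the row was written, False when the write was fenced off.
    """
    with conn:
        cur = conn.execute(
            """
            INSERT OR REPLACE INTO records (id, payload)
            SELECT ?, ?
            WHERE (SELECT load_id FROM load_state WHERE singleton = 1) = ?
            """,
            (record_id, payload, load_id),
        )
    # Zero rows are produced by the SELECT when the token no longer matches.
    return cur.rowcount == 1
```

A stale worker calling fenced_backfill with an old token gets False back and can mark itself Superseded, which matches the behavior described above without relying on the timing of a separate read.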

Fire-and-forget needs a watchdog

The generation token kills the false alarm. It doesn't solve the other failure mode of these pipelines: the dropped hand-off. Self-chaining functions continue by invoking the next link over HTTP, and a common helper treats a timeout on that call as "probably started, moving on." Usually true. Occasionally the next link never actually ran, and the chain just stops, mid-stream, with no error anywhere.

Persisting in-progress state — a row marked InProgress, a non-terminal status — makes recovery possible. It does not make recovery happen. Something has to come along, notice the stalled generation, and resume or restart it. That something is a watchdog: a timer that scans for non-terminal work older than it should be, and kicks it. Without one, a dropped hand-off waits until the next scheduled run to self-heal — which on a biweekly job is a long time to be quietly broken.
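A watchdog pass along these lines might look like the following sketch. The Run record, its last_heartbeat field, the set of terminal statuses (including a Failed status the article does not mention), and the resume callback are all assumptions for illustration; in practice the scan would run from a timer and read persisted state rather than an in-memory list. Stalled work from an older generation is marked Superseded instead of being resumed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed terminal statuses; anything else counts as still in flight.
TERMINAL = {"Complete", "Superseded", "Failed"}


@dataclass
class Run:
    run_id: str
    load_id: str              # generation this run was started against
    status: str               # e.g. "InProgress"
    last_heartbeat: datetime  # last time the run recorded progress


def find_stalled(runs, now, max_silence):
    """Non-terminal runs that have been silent for longer than max_silence."""
    return [
        r for r in runs
        if r.status not in TERMINAL and now - r.last_heartbeat > max_silence
    ]


def watchdog_tick(runs, now, max_silence, current_load_id, resume):
    """One scan: retire stale-generation runs, kick the rest. Returns kicked ids."""
    kicked = []
    for run in find_stalled(runs, now, max_silence):
        if run.load_id != current_load_id:
            run.status = "Superseded"  # a newer load exists; do not resume
            continue
        resume(run)                    # hypothetical hook that re-invokes the chain
        run.last_heartbeat = now       # avoid kicking it again on the next scan
        kicked.append(run.run_id)
    return kicked
```

Resetting the heartbeat after a kick is a design choice in this sketch: it gives the resumed run one full max_silence window to report progress before the watchdog tries again.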

The short version

Overlap between a load and its verifier is inevitable; design for it. Stamp an immutable generation token in the truncate transaction, have the verifier re-check it before every write and stand down if it moved, and add a watchdog that actively resumes stalled non-terminal work. Persisted state makes recovery possible — the watchdog is what makes it real.
