
Full reload beats the delta you can't trust

Incremental sync is the reflex. But it rests on a premise many source systems quietly fail to meet: that you can actually find out what changed — including what was deleted.

Published: June 2026

Length: 3 min read

Topics: Integration · Azure · Data Engineering

Incremental sync is the default everyone reaches for. Pull only what changed since last time: less data, faster runs, lower cost. The instinct is right often enough that it's become reflexive. But it rests on a premise that a lot of source systems quietly fail to meet — that you can actually find out what changed.

The premise most APIs don't honor

A trustworthy change feed has to report three things: what was created, what was modified, and what was deleted. The first two are common. The third is rare. Plenty of APIs expose a "modified since" timestamp and no way at all to learn that a record was removed at the source.

Sync incrementally against a feed like that and your copy drifts. Deletions never propagate. Your database slowly fills with rows the source no longer believes in. And the drift is invisible — every individual run looks successful, because nothing errored. You don't discover the problem until someone asks why a number is too high and the answer is "we've been counting ghosts for three months."
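The drift is easy to reproduce in miniature. The sketch below is an illustration only: the source and target are plain dictionaries, and `sync_modified_since` is a hypothetical stand-in for a sync job against an API that exposes nothing but a "modified since" filter.

```python
from datetime import datetime, timedelta

def sync_modified_since(source, target, since):
    """Upsert every source row modified at or after `since` into `target`."""
    for row_id, (value, modified_at) in source.items():
        if modified_at >= since:
            target[row_id] = value
    # Nothing here can remove a row: the feed never mentions deletions.

t0 = datetime(2026, 1, 1)
source = {1: ("a", t0), 2: ("b", t0), 3: ("c", t0)}
target = {}

sync_modified_since(source, target, since=t0)   # initial load: three rows

# A day later, row 2 is deleted at the source and row 3 is edited.
t1 = t0 + timedelta(days=1)
del source[2]
source[3] = ("c2", t1)

sync_modified_since(source, target, since=t1)   # runs cleanly, no error

ghosts = sorted(set(target) - set(source))
print(ghosts)  # [2]
```

The second run picks up the edit to row 3 and reports nothing wrong, yet row 2 stays in the target indefinitely.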

When I hit exactly this on a recent pipeline — an API with no reliable delta and no deletion signal — the honest move was to stop pretending. We switched to truncate-and-reload: each run clears the target and pulls the full current set. It sounds heavier, and it is. It's also correct, which the incremental version was not.
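As a rough sketch of the pattern, assuming a SQLite target and a hypothetical `items` table (neither is the actual pipeline), truncate-and-reload can look like this. Wrapping both steps in one transaction is an added assumption here: it means a run that fails partway rolls back to the previous contents instead of leaving the table empty.

```python
import sqlite3

def truncate_and_reload(conn, rows):
    """Replace the whole target table with `rows` inside one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM items")
        conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()

# First run: the source currently holds three rows.
truncate_and_reload(conn, [(1, "a"), (2, "b"), (3, "c")])

# Second run: row 2 was deleted at the source and row 3 was edited.
truncate_and_reload(conn, [(1, "a"), (3, "c2")])

current = conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
print(current)  # [(1, 'a'), (3, 'c2')]
```

The deletion propagates without any deletion signal, simply because the row is absent from the full set the second run loads.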

If you can't trust the source to tell you what was deleted, you can't trust a delta. Reload the whole thing, and spend your cleverness on proving the reload was complete.

The effort doesn't vanish — it moves

Choosing full reload doesn't make the engineering easier; it relocates it. With a delta, the hard part is computing the right diff. With a reload, the hard part is proving the load actually landed everything — because a half-finished reload doesn't leave you with stale data, it leaves you with missing data, which is worse.

So completeness verification becomes the real design problem. And it has a trap that cost real time to see clearly.

Gaps in a key are sparsity, not loss

The intuitive way to check for missing rows is to look at the ID column: take the minimum and the maximum, and flag every integer in between that isn't present. That check is wrong, and confidently wrong.

Natural keys are sparse. On one dataset of about a thousand rows, that method flagged 399 "missing" IDs — every one of them a number the source had simply never issued. The check would have launched an endless, futile back-fill chasing records that never existed.

Absent integers in a key range are not evidence of loss. They're evidence that keys aren't contiguous, which they almost never are. The only sound completeness check is against ground truth from the source itself: the count the API reports, or the actual set of IDs it returns — never a synthetic range you reconstructed from the endpoints.
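The two checks can be put side by side. This is a minimal sketch with made-up IDs; `range_gaps` and `verify_against_source` are hypothetical helper names, and it assumes the source can return its full ID set and, optionally, a reported count.

```python
def range_gaps(loaded_ids):
    """The misleading check: every integer between min and max that is absent."""
    present = set(loaded_ids)
    if not present:
        return []
    return [i for i in range(min(present), max(present) + 1) if i not in present]

def verify_against_source(loaded_ids, source_ids, source_count=None):
    """Compare what was loaded with what the source itself reported."""
    loaded, expected = set(loaded_ids), set(source_ids)
    return {
        "missing": sorted(expected - loaded),
        "unexpected": sorted(loaded - expected),
        "count_ok": source_count is None or len(loaded) == source_count,
    }

source_ids = [1, 2, 5, 9, 10]   # sparse: 3, 4, 6, 7, 8 were never issued
loaded_ids = [1, 2, 5, 10]      # this reload dropped id 9

gaps = range_gaps(loaded_ids)
report = verify_against_source(loaded_ids, source_ids, source_count=5)

print(gaps)    # [3, 4, 6, 7, 8, 9]
print(report)  # {'missing': [9], 'unexpected': [], 'count_ok': False}
```

The range check buries the one real loss among five IDs that never existed. The comparison against the source's own ID set reports only id 9.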

When to keep the delta

None of this means incremental sync is wrong. When the source genuinely has a reliable change feed — real modification timestamps and real deletion signals: tombstones, a deletions endpoint, change-data-capture — incremental is the better design, and reloading everything is waste. The decision rule is just narrower than the reflex makes it: use a delta when the source can be trusted to report all three kinds of change, and full-reload-plus-verification when it can't.
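For contrast, here is roughly what consuming a trustworthy feed looks like. The change format is an assumption for illustration (a list of dictionaries with `op`, `id`, and `value`), not the shape of any particular API; the point is only that deletions arrive as explicit events.

```python
def apply_changes(target, changes):
    """Apply a change feed to `target` in order, including deletions."""
    for change in changes:
        op = change["op"]
        if op in ("create", "update"):
            target[change["id"]] = change["value"]
        elif op == "delete":
            # pop with a default keeps replayed deletions harmless
            target.pop(change["id"], None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return target

target = {1: "a", 2: "b", 3: "c"}
changes = [
    {"op": "update", "id": 3, "value": "c2"},
    {"op": "delete", "id": 2},   # the signal a modified-since feed lacks
    {"op": "create", "id": 4, "value": "d"},
]

apply_changes(target, changes)
print(target)  # {1: 'a', 3: 'c2', 4: 'd'}
```

With all three kinds of change reported, the copy converges on the source without a full reload.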

The short version

Don't sync incrementally against a feed that can't tell you about deletions — you'll drift, silently, and look fine the whole time. Reload the full set instead, and move your engineering effort to proving the reload was complete. When you verify, check against the source's own count or ID set, never against a filled-in integer range: a missing number is almost always a number that was never issued, not a row you lost.
