AI Coding Agents: The Verification Loop

Notes on using AI coding agents with a verification loop so speed does not turn into hidden reliability debt.

Fast generation is useful. Blind acceptance is expensive.

I treat AI-generated code as a draft until proven otherwise.

That framing keeps the speed without pretending the model owns production risk. An agent can take a patch from nothing to plausible very quickly. The engineering job is to turn plausible into safe, explainable, and observable.

Minimal loop I rely on

Generate the patch.
Ask "what can break?"
Test the contracts and edge paths.
Add observability where failure would hurt.
Merge only when the behavior is explainable.

No loop, no trust.

Where the loop pays off

The loop matters most when the code touches authorization, permissions, money, billing, credits, inventory, async jobs, retries, idempotency, public APIs, backwards compatibility, data migrations, destructive operations, or visible user state.

For these areas, a working happy path is not enough. The author needs to show what happens when inputs are missing, state is stale, a dependency times out, or the same action runs twice.
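
As one illustration of the "same action runs twice" case, here is a minimal sketch of an idempotency test. The CreditLedger class, its apply_credit method, and the request-id scheme are hypothetical names made up for this example; a real system would keep the seen ids in durable storage rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    """Hypothetical in-memory ledger, used only to illustrate the check."""

    balance: int = 0
    seen_request_ids: set = field(default_factory=set)

    def apply_credit(self, request_id: str, amount: int) -> bool:
        """Apply a credit once per request_id. Returns True if it was applied."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if request_id in self.seen_request_ids:
            # Duplicate delivery: change nothing and report that nothing changed.
            return False
        self.seen_request_ids.add(request_id)
        self.balance += amount
        return True


def test_same_action_twice_changes_state_once():
    ledger = CreditLedger()
    assert ledger.apply_credit("req-1", 50) is True
    assert ledger.apply_credit("req-1", 50) is False  # replayed request
    assert ledger.balance == 50


def test_invalid_amount_is_rejected():
    ledger = CreditLedger()
    try:
        ledger.apply_credit("req-2", 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a non-positive amount")
    assert ledger.balance == 0


# Run the checks directly; a test runner such as pytest would also collect them.
test_same_action_twice_changes_state_once()
test_invalid_amount_is_rejected()
```

The point is less the ledger than the shape of the test: replay the same request and assert that state changed exactly once.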

What usually slips through

The things that slip through are boring and painful:

happy-path-only logic
subtle contract mismatch
thin error handling
retries and timeouts nobody thought through
code that looks fine in review but is hard to operate

Example: the dangerous one-line fix

Suppose an agent updates a retry loop from three attempts to unlimited retries because a flaky integration keeps failing. The diff may look small and even reasonable. But the real questions are operational:

Can the downstream system handle the retry volume?
Is the operation idempotent?
Does the queue have a dead-letter path?
Will support see duplicate customer actions?
Which metric tells us the retry loop is now unhealthy?

The verification loop forces those questions before merge instead of during an incident.
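
For contrast with the unlimited-retry diff, here is a rough sketch of a bounded retry that answers some of those questions in code. Everything named here is an assumption for illustration: RetryStats stands in for a real metrics client, a plain list stands in for a dead-letter queue, and ConnectionError stands in for whatever the flaky integration actually raises.

```python
import time
from typing import Callable, List, Optional


class RetryStats:
    """Hypothetical stand-in for a real metrics client."""

    def __init__(self) -> None:
        self.attempts = 0
        self.dead_lettered = 0


def run_with_bounded_retries(
    operation: Callable[[], str],
    dead_letter: List[str],
    stats: RetryStats,
    job_id: str,
    max_attempts: int = 3,
    base_delay_s: float = 0.0,
) -> Optional[str]:
    """Try the operation up to max_attempts times, then park the job.

    Retrying is only safe if the operation itself is idempotent.
    """
    for attempt in range(1, max_attempts + 1):
        stats.attempts += 1  # the metric that shows retry volume
        try:
            return operation()
        except ConnectionError:
            if attempt < max_attempts:
                # Exponential backoff: base, 2x base, 4x base, ...
                time.sleep(base_delay_s * (2 ** (attempt - 1)))
    # Out of attempts: hand the job to the dead-letter path instead of looping.
    dead_letter.append(job_id)
    stats.dead_lettered += 1
    return None


def always_failing() -> str:
    raise ConnectionError("flaky integration")


dead_letter: List[str] = []
stats = RetryStats()
result = run_with_bounded_retries(always_failing, dead_letter, stats, job_id="job-42")

assert result is None
assert stats.attempts == 3
assert dead_letter == ["job-42"]
```

The cap bounds the load on the downstream system, the dead-letter list gives failed jobs somewhere to go, and the two counters are the signals an alert could watch.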

PR gate that keeps quality up

Before merge, the author should answer four questions:

What invariant must hold?
Which test checks it?
What is still untested?
How will production tell us it is broken?

If these are unclear, the patch is not done.
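
One way to make the gate mechanical is a small check on the PR description. This is a sketch under assumptions: the four field labels and the "Label: answer" format are a hypothetical template, not a convention from any particular tool.

```python
from typing import Dict, List

# Hypothetical template labels, one per question in the gate.
REQUIRED_FIELDS = ["Invariant", "Test", "Untested", "Production signal"]


def parse_answers(description: str) -> Dict[str, str]:
    """Read 'Label: answer' lines from a PR description."""
    answers: Dict[str, str] = {}
    for line in description.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            answers[key.strip()] = value.strip()
    return answers


def missing_answers(description: str) -> List[str]:
    """Return the required labels that are absent or left blank."""
    answers = parse_answers(description)
    return [name for name in REQUIRED_FIELDS if not answers.get(name)]


description = """Invariant: a credit is applied at most once per request
Test: test_same_action_twice_changes_state_once
Untested:
Production signal: duplicate_credit_rejected counter"""

print(missing_answers(description))  # prints ['Untested']
```

A check like this only proves the questions were answered, not that the answers are good. A blank "Untested" field is flagged on purpose: writing "none known" is itself a claim a reviewer can challenge.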

Test map I like

Risk	Verification
contract mismatch	unit or contract test around input/output shape
hidden branch	edge-case test for missing, empty, stale, or invalid state
unsafe side effect	idempotency test or explicit approval path
production blind spot	log, metric, or trace tied to the failure mode
rollback pain	migration or release note explaining how to roll back safely

This does not mean every AI-generated patch needs a giant test suite. It means the verification should match the risk introduced by the patch.
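
As a sketch of the first two rows of the test map, the tests below check an output shape and a few missing or empty inputs. The build_invoice function, its field names, and the "USD" default are all hypothetical and exist only to give the tests something to exercise.

```python
from typing import Any, Dict

EXPECTED_KEYS = {"order_id", "total_cents", "currency"}


def build_invoice(order: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical function under test."""
    if "order_id" not in order:
        raise KeyError("order_id is required")
    items = order.get("items") or []  # treat missing, None, and [] alike
    total = sum(item["price_cents"] * item["quantity"] for item in items)
    return {
        "order_id": order["order_id"],
        "total_cents": total,
        "currency": order.get("currency", "USD"),  # assumed default
    }


def test_output_shape_matches_contract():
    order = {"order_id": "o-1", "items": [{"price_cents": 250, "quantity": 2}]}
    invoice = build_invoice(order)
    assert set(invoice) == EXPECTED_KEYS
    assert isinstance(invoice["total_cents"], int)
    assert invoice["total_cents"] == 500


def test_empty_and_missing_items_total_zero():
    assert build_invoice({"order_id": "o-2", "items": []})["total_cents"] == 0
    assert build_invoice({"order_id": "o-3"})["total_cents"] == 0
    assert build_invoice({"order_id": "o-4", "items": None})["total_cents"] == 0


def test_missing_order_id_fails_loudly():
    try:
        build_invoice({"items": []})
    except KeyError:
        return
    raise AssertionError("expected KeyError when order_id is missing")


test_output_shape_matches_contract()
test_empty_and_missing_items_total_zero()
test_missing_order_id_fails_loudly()
```

Three short tests cover the contract and the hidden branches here, which is the scale this section has in mind: proportionate to the risk, not exhaustive.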

Human review still matters

AI coding agents are good at producing local coherence. Humans are still better at noticing system-level weirdness: an abstraction that does not fit the codebase, a dependency that creates operational drag, or a shortcut that violates a team standard nobody wrote down.

The healthiest workflow is not "AI writes, human rubber-stamps." It is "AI drafts, human interrogates, tests lock the behavior down."

A small team ritual

One ritual I like is asking the agent for a failure review before asking humans to review the patch:

List the top five ways this change could break production, then map each one to a test, log, metric, or manual review step.

The answer is not automatically correct, but it changes the review posture. The team starts from risk, not from diff aesthetics. That is especially useful when the patch is large enough that reviewers may skim.
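
If the team wants the agent's answer in a form that can be checked, one option is to record it as structured data. The sketch below assumes a hypothetical FailureMode record and treats the four verification kinds from the prompt as the only acceptable ones.

```python
from dataclasses import dataclass
from typing import List

# The four verification kinds named in the prompt.
ALLOWED_KINDS = {"test", "log", "metric", "manual review"}


@dataclass
class FailureMode:
    """Hypothetical record for one entry in the agent's failure review."""

    description: str
    verification_kind: str  # expected to be one of ALLOWED_KINDS
    verification_ref: str   # e.g. a test name or a metric name


def unmapped(failure_modes: List[FailureMode]) -> List[str]:
    """Return descriptions of failure modes with no usable verification."""
    return [
        fm.description
        for fm in failure_modes
        if fm.verification_kind not in ALLOWED_KINDS
        or not fm.verification_ref.strip()
    ]


review = [
    FailureMode("duplicate charge on retry", "test", "test_same_action_twice"),
    FailureMode("queue backs up", "metric", ""),
    FailureMode("stale cache served", "hope", "n/a"),
]

print(unmapped(review))  # prints ['queue backs up', 'stale cache served']
```

Anything the function returns is a risk the agent named but nobody has agreed how to verify, which is a good place for human review to start.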

I am happy to let AI make the first draft faster. I am not willing to let it remove the part where the team proves the change is safe.