AI Coding Agents: The Verification Loop
Notes on using AI coding agents with a verification loop so speed does not turn into hidden reliability debt.
Fast generation is useful. Blind acceptance is expensive.
I treat AI-generated code as a draft until proven otherwise.
That framing keeps the speed without pretending the model owns production risk. AI can get a patch from zero to plausible very quickly. The engineering job is to turn plausible into safe, explainable, and observable.
Minimal loop I rely on
1. Generate the patch.
2. Ask "what can break?"
3. Test the contracts and edge paths.
4. Add observability where failure would hurt.
5. Merge only when the behavior is explainable.
No loop, no trust.
Where the loop pays off
The loop matters most when the code touches authorization, permissions, money, billing, credits, inventory, async jobs, retries, idempotency, public APIs, backwards compatibility, data migrations, destructive operations, or visible user state.
For these areas, a working happy path is not enough. The tests that ship with the patch need to show what happens when inputs are missing, state is stale, a dependency times out, or the same action runs twice.
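As a rough illustration of what "the same action runs twice" and "inputs are missing" can look like as checks, here is a minimal Python sketch. `CreditLedger`, `apply_credit`, and the idempotency-key scheme are hypothetical names invented for this example, not part of any real system; a production version would persist the keys rather than hold them in memory.

```python
class CreditLedger:
    """Hypothetical in-memory ledger, used only for illustration."""

    def __init__(self):
        self.balance = 0
        self._applied = set()  # idempotency keys that were already processed

    def apply_credit(self, idempotency_key, amount):
        # Missing or invalid input fails loudly instead of being guessed at.
        if not idempotency_key:
            raise ValueError("idempotency_key is required")
        if amount is None or amount <= 0:
            raise ValueError("amount must be a positive number")
        # A duplicate delivery must not produce a second side effect.
        if idempotency_key in self._applied:
            return self.balance
        self._applied.add(idempotency_key)
        self.balance += amount
        return self.balance


def check_ledger_edge_paths():
    ledger = CreditLedger()
    ledger.apply_credit("req-1", 50)
    ledger.apply_credit("req-1", 50)  # the same action runs twice
    assert ledger.balance == 50

    # Missing key, missing amount, and negative amount are all rejected.
    for bad_key, bad_amount in [("", 10), ("req-2", None), ("req-3", -5)]:
        try:
            ledger.apply_credit(bad_key, bad_amount)
        except ValueError:
            continue
        raise AssertionError("invalid input was accepted")

    # Rejected calls must leave the balance untouched.
    assert ledger.balance == 50
    return True
```

The point is less the ledger than the shape of the checks: one assertion per failure mode named above, each of which fails if the patch quietly changes the behavior.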
What usually slips through
The things that slip through are boring and painful:
- happy-path-only logic
- subtle contract mismatch
- thin error handling
- retries and timeouts nobody thought through
- code that looks fine in review but is hard to operate
Example: the dangerous one-line fix
Suppose an agent updates a retry loop from three attempts to unlimited retries because a flaky integration keeps failing. The diff may look small and even reasonable. But the real questions are operational:
- Can the downstream system handle the retry volume?
- Is the operation idempotent?
- Does the queue have a dead-letter path?
- Will support see duplicate customer actions?
- Which metric tells us the retry loop is now unhealthy?
The verification loop forces those questions before merge instead of during an incident.
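One way those questions could show up in code is a retry helper that stays bounded, backs off, reports each failure, and hands exhausted work to a dead-letter path. The sketch below is an assumption-heavy illustration: `call_with_retries`, `on_failure`, and `dead_letter` are hypothetical names, and the default limits are placeholders rather than recommendations.

```python
import time


class RetryExhausted(Exception):
    """Raised when every allowed attempt has failed."""


def call_with_retries(operation, max_attempts=3, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, on_failure=None, dead_letter=None):
    """Run `operation` with a bounded number of attempts and exponential backoff.

    `on_failure(attempt, exc)` is a hook for a metric or log line.
    `dead_letter(exc)` receives the last error once attempts run out.
    Both hooks are hypothetical and optional.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # real code should catch only transient errors
            last_error = exc
            if on_failure is not None:
                on_failure(attempt, exc)
            if attempt < max_attempts:
                # Delay doubles per attempt and is capped at max_delay.
                sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    if dead_letter is not None:
        dead_letter(last_error)
    raise RetryExhausted(f"gave up after {max_attempts} attempts") from last_error
```

The helper does not make the operation idempotent. That still has to be verified separately before any retry count goes up.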
PR gate that keeps quality up
Before merge, the author should answer four questions:
- What invariant must hold?
- Which test checks it?
- What is still untested?
- How will production tell us it is broken?
If these are unclear, the patch is not done.
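A team that wants a mechanical reminder could check that a PR description at least contains an answer to each question. The sketch below assumes a hypothetical template with four labels (`Invariant:`, `Test:`, `Untested:`, `Production signal:`); the labels and the function name are made up for illustration. It can only detect missing answers, not weak ones.

```python
# Hypothetical labels a PR template might use for the four questions.
REQUIRED_ANSWERS = ("Invariant:", "Test:", "Untested:", "Production signal:")


def missing_gate_answers(pr_description):
    """Return the labels that are absent or have an empty answer."""
    answers = {}
    for line in pr_description.splitlines():
        stripped = line.strip()
        for label in REQUIRED_ANSWERS:
            # Match the label at the start of the line, ignoring case.
            if stripped.lower().startswith(label.lower()):
                answers[label] = stripped[len(label):].strip()
    return [label for label in REQUIRED_ANSWERS if not answers.get(label)]


example = """\
Invariant: a credit is applied at most once per request id
Test: test_apply_credit_is_idempotent
Untested:
Production signal: duplicate_credit_total counter
"""
print(missing_gate_answers(example))  # prints ['Untested:']
```

Whether the answers are any good remains a reviewer's call; the check only makes silence visible.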
Test map I like
| Risk | Verification |
|---|---|
| contract mismatch | unit or contract test around input/output shape |
| hidden branch | edge-case test for missing, empty, stale, or invalid state |
| unsafe side effect | idempotency test or explicit approval path |
| production blind spot | log, metric, or trace tied to the failure mode |
| rollback pain | migration/release note explaining safe rollback |
This does not mean every AI-generated patch needs a giant test suite. It means the verification should match the risk introduced by the patch.
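For the first row of the table, a contract check can be very small. The sketch below compares a payload against an expected shape; `contract_violations` and the `order_id` / `amount_cents` fields are hypothetical, and a real project might use a schema library instead.

```python
def contract_violations(payload, contract):
    """Compare a dict payload against {field: expected_type}.

    Returns a list of human-readable problems; an empty list means the
    payload matches the contract.
    """
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        # Exact type match, so True is not silently accepted as an int.
        elif type(payload[field]) is not expected_type:
            actual = type(payload[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    for field in sorted(payload):
        if field not in contract:
            problems.append(f"unexpected field: {field}")
    return problems


# Hypothetical contract for an order payload.
ORDER_CONTRACT = {"order_id": str, "amount_cents": int}
```

A test would then assert that `contract_violations(response, ORDER_CONTRACT)` is empty for the happy path and non-empty for each malformed variant the patch claims to handle.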
Human review still matters
AI coding agents are good at producing local coherence. Humans are still better at noticing system-level weirdness: an abstraction that does not fit the codebase, a dependency that creates operational drag, or a shortcut that violates a team standard nobody wrote down.
The healthiest workflow is not "AI writes, human rubber-stamps." It is "AI drafts, human interrogates, tests lock the behavior down."
A small team ritual
One ritual I like is asking the agent for a failure review before asking humans to review the patch:
"List the top five ways this change could break production, then map each one to a test, log, metric, or manual review step."
The answer is not automatically correct, but it changes the review posture. The team starts from risk, not from diff aesthetics. That is especially useful when the patch is large enough that reviewers may skim.
I am happy to let AI make the first draft faster. I am not willing to let it remove the part where the team proves the change is safe.