Agent Evals and Guardrails
How I think about improving production reliability for AI agents with evals, policy guardrails, and escalation paths.
I no longer treat evals and guardrails as platform extras. For agentic products, they are part of core feature quality.
The shift happens when the agent can affect a customer, a workflow, or a system of record. At that point, "the model is usually right" is not a launch bar. The product needs a way to prove expected behavior before release and a way to constrain behavior during runtime.
The split
Evals check capability and correctness. Guardrails enforce policy boundaries at runtime.
Either one alone leaves a gap.
Evals without guardrails can only tell you, after the fact, that the agent failed. Guardrails without evals can block obviously bad actions while still letting the agent make low-quality decisions. Production reliability needs both loops.
What I want evals to cover
I want evals for normal tasks, edge cases, adversarial or prompt-injection style cases, and regressions from real incidents.
If a scenario can break in production, it should exist in evals.
How I split eval cases
I prefer four buckets. Golden path means the agent should complete the task with minimal ambiguity. Ambiguous path means the agent should ask for clarification instead of guessing. Unsafe path means it should refuse, escalate, or request approval. Regression path means it should not repeat failures already seen in logs, support tickets, or incident reviews.
This keeps the eval set from becoming a happy-path demo. It also makes failures easier to triage: capability issue, instruction issue, policy issue, or missing context.
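To make the four buckets concrete, here is a minimal sketch of how I might tag eval cases and report pass rates per bucket. The names (`Bucket`, `EvalCase`, `pass_rate_by_bucket`) and the outcome labels are assumptions for illustration, not part of any specific eval framework.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    GOLDEN = "golden"
    AMBIGUOUS = "ambiguous"
    UNSAFE = "unsafe"
    REGRESSION = "regression"


# Hypothetical outcome labels an eval harness might assign to a run.
COMPLETE, CLARIFY, REFUSE, ESCALATE = "complete", "clarify", "refuse", "escalate"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    bucket: Bucket
    prompt: str
    allowed_outcomes: frozenset  # any of these outcomes counts as a pass


def grade(case: EvalCase, observed_outcome: str) -> bool:
    """A case passes when the observed outcome is one of the allowed ones."""
    return observed_outcome in case.allowed_outcomes


def pass_rate_by_bucket(cases, observed):
    """Return {bucket: pass rate}. `observed` maps case_id -> outcome label.

    A case with no observed outcome is counted as a failure.
    """
    totals, passed = Counter(), Counter()
    for case in cases:
        totals[case.bucket] += 1
        if grade(case, observed.get(case.case_id, "")):
            passed[case.bucket] += 1
    return {bucket: passed[bucket] / totals[bucket] for bucket in totals}
```

Reporting by bucket is the design choice that matters here: a drop in the ambiguous bucket points at instructions or missing context, while a drop in the unsafe bucket points at policy.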
Guardrails I usually enforce
I usually enforce tool allow/deny policies, scoped data access, destination and action restrictions, approval gates for high-risk actions, and a full audit trail.
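As a rough sketch of the first and last items in that list, a deny-by-default tool policy that records every decision could look like the following. `ToolPolicy`, its fields, and the tool names in the usage note are hypothetical; a real system would persist the audit trail rather than hold it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    allowed_tools: frozenset
    denied_tools: frozenset = frozenset()
    approval_required: frozenset = frozenset()
    audit_log: list = field(default_factory=list)

    def decide(self, tool: str) -> str:
        """Return "allow", "needs_approval", or "deny" and log the decision."""
        # Deny wins over allow, and anything not explicitly allowed is denied.
        if tool in self.denied_tools or tool not in self.allowed_tools:
            decision = "deny"
        elif tool in self.approval_required:
            decision = "needs_approval"
        else:
            decision = "allow"
        # Every decision is recorded, including allows, so the trail is complete.
        self.audit_log.append({"tool": tool, "decision": decision})
        return decision
```

For example, with `allowed_tools` set to `{"calendar.read", "calendar.write"}` and `approval_required` set to `{"calendar.write"}`, a call to `decide("email.send")` is denied because the tool was never allowed.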
Example: Calendar Agent
A calendar agent that reads availability is low risk. A calendar agent that books external meetings, edits invitees, or sends follow-up notes has a different risk profile.
For that workflow, I would usually split actions by risk:
| Action | Guardrail |
|---|---|
| Read free/busy slots | allowed with scoped calendar access |
| Draft a meeting proposal | allowed, but mark as draft |
| Invite external attendees | require confirmation |
| Cancel or move a customer meeting | require explicit approval and audit log |
The point is not to slow every action down. The point is to make the riskier boundary visible.
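The table above can be expressed as a small lookup. This is only a sketch: the action keys and the `Gate` names are my own labels for the rows, and sending unknown actions to the strictest gate is an assumption I would make rather than something the table states.

```python
from enum import Enum


class Gate(Enum):
    ALLOW_SCOPED = "allowed with scoped calendar access"
    ALLOW_AS_DRAFT = "allowed, but mark as draft"
    CONFIRM = "require confirmation"
    APPROVE_AND_AUDIT = "require explicit approval and audit log"


# Hypothetical action identifiers, one per row of the table.
ACTION_GATES = {
    "read_free_busy": Gate.ALLOW_SCOPED,
    "draft_meeting_proposal": Gate.ALLOW_AS_DRAFT,
    "invite_external_attendees": Gate.CONFIRM,
    "cancel_or_move_customer_meeting": Gate.APPROVE_AND_AUDIT,
}


def gate_for(action: str) -> Gate:
    """Look up the guardrail for an action.

    Assumption: an action missing from the table gets the strictest gate,
    so a newly added tool cannot run ungated by accident.
    """
    return ACTION_GATES.get(action, Gate.APPROVE_AND_AUDIT)
```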
Runtime shape that works
- Pre-check policy layer
- Agent execution layer
- Post-check validation layer
- Human escalation layer
- Feedback to eval set
Reliability improves when production failures become test cases.
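One way to wire those layers together, as a minimal sketch: the `Runtime` class, its callables, and the status strings are all hypothetical. I am assuming that a pre-check failure blocks the request outright, and that a post-check failure both escalates to a human and queues a candidate eval case.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunResult:
    status: str  # "completed", "blocked", or "escalated"
    output: str = ""
    reason: str = ""


@dataclass
class Runtime:
    pre_check: Callable[[str], bool]   # policy check on the incoming request
    execute: Callable[[str], str]      # the agent itself
    post_check: Callable[[str], bool]  # validation of the agent's output
    escalations: list = field(default_factory=list)
    new_eval_cases: list = field(default_factory=list)

    def run(self, request: str) -> RunResult:
        # Pre-check policy layer: stop before the agent does anything.
        if not self.pre_check(request):
            return RunResult("blocked", reason="pre-check rejected request")

        # Agent execution layer.
        output = self.execute(request)

        # Post-check validation layer.
        if not self.post_check(output):
            reason = "post-check rejected output"
            # Human escalation layer: pass along enough context to decide.
            self.escalations.append(
                {"request": request, "output": output, "reason": reason}
            )
            # Feedback to the eval set: the failure becomes a candidate case.
            self.new_eval_cases.append({"prompt": request, "failure": reason})
            return RunResult("escalated", output=output, reason=reason)

        return RunResult("completed", output=output)
```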
Signals I watch after launch
After launch, I watch refusal rate by task type, human-approval rate by action type, tool-call error rate, rollback or correction count, new eval cases created from real failures, and whether escalations carry enough context for a human to decide quickly.
If none of these are measured, the team is relying on vibes. That might be fine for a prototype. It is too weak for a production agent.
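Most of these signals reduce to "rate of some flag, grouped by some key" over an event log. A small sketch, assuming each agent run is logged as a dict with hypothetical fields such as `task_type`, `action_type`, `refused`, and `needed_approval`:

```python
from collections import defaultdict


def rate_by_key(events, key, flag):
    """Return {group: share of events in that group where `flag` is truthy}.

    `events` is an iterable of dicts; `key` names the grouping field and
    `flag` names a boolean field. Events missing the flag count as False.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for event in events:
        group = event[key]
        totals[group] += 1
        if event.get(flag):
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}
```

Refusal rate by task type would be `rate_by_key(events, "task_type", "refused")`, and human-approval rate by action type would be `rate_by_key(events, "action_type", "needed_approval")`.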
Common anti-pattern
The anti-pattern I see is building a powerful agent first and adding evals only when stakeholders ask, "How do we know this is safe?" By then, the team has already shipped prompts, tools, and product flows that are hard to test separately.
I prefer starting with a small eval set before the first demo. Even 20 concrete cases force useful design choices: what the agent may do, what it must refuse, what it should escalate to a human, and what success actually means.
Trend signals behind this note
- OpenAI launched AgentKit on October 6, 2025 with first-party eval/trace/guardrail patterns: Introducing AgentKit.
- Stack Overflow 2025 data shows broad AI adoption while confidence/trust still lag in many workflows: AI section, 2025 survey.
Takeaway
If agents can take actions in production, eval and policy systems are part of the product, not optional infrastructure. Ship the guardrails with the feature, not as a cleanup project after the first scary failure.