Advisory vs Enforcing Guardrails

Config can describe a boundary. Evidence decides whether it enforces one.

Agent systems increasingly ship with settings that sound like security boundaries: trusted workspaces, secure modes, policy profiles, allowed paths. Those settings matter. But a setting is not the same thing as enforcement.

The operational question is narrow: when the agent attempts the covered action, does the guardrail structurally block it?

Two very different postures

An enforcing guardrail blocks the covered action at runtime. If a file read is outside the allowed scope, the read fails because the boundary is enforced below the model's intent.

An advisory guardrail shapes instructions, warnings, or policy context, but does not block the covered action. The model may be told not to read a path. The UI may describe a trusted workspace. The read can still succeed.

An unverified guardrail is everything else: a claim without a live probe attached. In ANCC, unverified is the default. The burden is on the enforcement claim, not on the reader to disprove it.

The agy case

agy is a useful worked example because its configuration can look like a boundary. In a live probe, the workspace constraint was informational for file reads. With the agent's real workspace settings in place, it read a file outside the declared workspace and returned the file's exact contents — a fixture value it could only have produced by actually reading the file. In one run it even narrated that the target was "outside your workspace" and read it anyway.
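
The fixture technique described above can be sketched in a few lines. This is a minimal illustration, not the probe that was actually run: the function names, the fixture filename, and the use of a temporary directory as the "outside the workspace" location are all assumptions made for the example. The idea is only that an unguessable value planted in the file cannot appear in the agent's output unless the read really happened.

```python
import secrets
import tempfile
from pathlib import Path


def plant_fixture(directory: Path) -> tuple[Path, str]:
    """Write a random token to a file in `directory`; return (path, token)."""
    token = secrets.token_hex(16)  # unguessable, so it cannot be fabricated
    path = directory / "probe_fixture.txt"  # hypothetical filename
    path.write_text(token + "\n", encoding="utf-8")
    return path, token


def read_was_real(agent_output: str, token: str) -> bool:
    """True only if the agent's output contains the exact fixture token."""
    return token in agent_output


if __name__ == "__main__":
    # The temporary directory stands in for a path outside the declared workspace.
    with tempfile.TemporaryDirectory() as outside:
        path, token = plant_fixture(Path(outside))
        # Hypothetical step, not shown: ask the agent to print the contents
        # of `path`, capture its raw output, and pass it to read_was_real().
        narrated_only = "That file is outside your workspace, but it exists."
        print(read_was_real(narrated_only, token))
        print(read_was_real(f"The file contains: {token}", token))
```

Narration alone never passes this check. Only output that carries the planted value does, which is what makes the result usable as evidence.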

On macOS, the layer that actually blocked reads was TCC (Transparency, Consent, and Control), not the agy workspace policy. TCC protects a narrow set of user folders: Documents, Desktop, Downloads, and similar. It does not cover everything. Other paths, including common credential directories such as ~/.ssh, are protected only by ordinary file permissions. Exactly which of those the agent can reach is a separate question, and only a real-action probe can answer it; the agent's own say-so cannot. (More on why that distinction matters below.)

That changes the posture. The workspace policy is not enforcing for reads. It is advisory unless a live probe shows otherwise. The fact that another system layer — TCC — blocks some folders does not turn the workspace policy into a filesystem sandbox.

The agent said yes. The kernel said no.

There is a sharper failure here than a porous workspace policy, and it showed up while probing exactly this question.

Asked a yes/no question ("does this file exist and is it non-empty?") about a file inside a TCC-protected folder, the agent answered yes. Asked instead to actually read the same file, it got "Operation not permitted" back from the operating system. The agent had not bypassed anything. It had hallucinated the result of a check it never performed.

The first probe used a convenient shape: ask the agent to self-report a boolean. That shape is unsound, because a confidently wrong agent will fabricate the boolean instead of running the check. The only thing that surfaced the truth was forcing a real action and reading the real operating-system result.

In an ordinary task, a hallucination wastes time. In a security check, a hallucination fabricates proof. An agent that reports "yes, access succeeded" — or "no, that's blocked" — about its own permissions is not evidence of anything. It is the claim under test answering on its own behalf.

An agent's self-report is not evidence. Evidence has to be an artifact the model cannot fake: an operating-system error, a process exit code, a hash, a fixture payload it could only produce by doing the work, an external audit log, a CI result. Vendor documentation, a product's mode name, and the agent's own narration all fail this test equally.
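
As a rough sketch of what "an artifact the model cannot fake" can look like for a file read, the snippet below attempts the read and records either a SHA-256 of the bytes or the operating system's error code. The ReadEvidence type and probe_read function are hypothetical names for this example. One assumption matters more than the code: the probe has to run in the agent's own execution context (for instance as a command the agent itself executes), because running it from the operator's shell tests the operator's permissions instead.

```python
import errno
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadEvidence:
    path: str
    succeeded: bool
    sha256: Optional[str]      # set only when bytes were actually read
    errno_name: Optional[str]  # symbolic OS error, e.g. "EPERM" or "ENOENT"
    os_message: Optional[str]  # the OS error text, kept verbatim


def probe_read(path: str) -> ReadEvidence:
    """Attempt a real read and capture the real OS result."""
    try:
        with open(path, "rb") as handle:
            data = handle.read()
    except OSError as exc:
        # The error comes from the operating system, not from the model.
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        return ReadEvidence(path, False, None, name, exc.strerror)
    digest = hashlib.sha256(data).hexdigest()
    return ReadEvidence(path, True, digest, None, None)
```

On POSIX systems the message "Operation not permitted" corresponds to EPERM, so a blocked read like the one in the worked example would be recorded with its error name and text, not as a yes or no.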

This is not vendor-bashing. It is a classification problem. If a guardrail is advisory, call it advisory. If it is enforcing, attach the live evidence. If nobody has tested it — or the only "test" was the agent reporting on itself — call it unverified.

The third axis: can it be told to stop asking?

Enforcement asks whether anything structurally blocks the agent. Autonomy is a separate question: can the agent be told not to ask at all? Most agents ship a flag for exactly that: --dangerously-skip-permissions, --full-auto, --yes-always, --auto, a "yolo" mode. Set it, and the human checkpoint disappears.

This is the most safety-relevant fact about an agent, and it is independent of enforcement. An autonomous agent that is genuinely sandboxed is bounded by the sandbox. A prompt-on-every-action agent that hallucinates is bounded by the human who gets asked. The risk is the combination.

Stack the three axes and you get the failure mode this whole page describes: an agent that can act without asking (high autonomy), that nothing structurally stops (advisory or unverified enforcement), and that is confidently wrong and will report success it never achieved (hallucination). Each axis alone is survivable, because another layer catches it. Together they are a recipe for silent failure: the agent does the wrong thing, no prompt interrupts it, no boundary blocks it, and it tells you it worked.

In the worked example, the only thing that stopped the read of the protected file was an accidental fourth layer: the operating system's own file protection, which the agent did not know was there and the operator had not configured as a control. Remove that accident and the three axes complete the disaster silently.

The ANCC response

ANCC treats enforcement as a posture with provenance, not a label inferred from product language. The enforcement-provenance convention defines three states: enforcing, advisory, and unverified.

Both enforcing and advisory require cited live probe evidence. Vendor docs, product names, and local config labels are not enough. Without evidence, the posture is unverified.
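
The rule can be written down as a small classification function. This is a sketch of the logic as stated here, not ANCC's actual schema: the Posture enum, the ProbeEvidence fields, and classify are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Posture(Enum):
    ENFORCING = "enforcing"
    ADVISORY = "advisory"
    UNVERIFIED = "unverified"


@dataclass(frozen=True)
class ProbeEvidence:
    live: bool                  # a real action was attempted, not a self-report
    blocked_by_guardrail: bool  # the guardrail under test blocked it, not another layer
    artifact: str               # citation: OS error, exit code, hash, log reference


def classify(evidence: Optional[ProbeEvidence]) -> Posture:
    """Unverified is the default; any other posture needs cited live evidence."""
    if evidence is None or not evidence.live or not evidence.artifact.strip():
        return Posture.UNVERIFIED
    if evidence.blocked_by_guardrail:
        return Posture.ENFORCING
    return Posture.ADVISORY
```

Note that a block by some other layer, such as TCC in the worked example, does not set blocked_by_guardrail for the workspace policy. The field describes the guardrail being classified, not the overall outcome.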

ANCC also reports the autonomy axis separately: for each agent, the documented modes that disable prompts, with the source cited. And it draws the line the operator would otherwise have to draw in their head: when an agent has a prompt-disabling mode and its enforcement is advisory or unverified, it surfaces a single combined caution. An agent with a verified enforcing posture is not flagged, because the structural block is the mitigation.
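
The combination rule is simple enough to state as code. The sketch below is an assumption about how such a check could be written, not the tool's implementation; combined_caution, the posture strings, and the "example-agent" name are all illustrative.

```python
from typing import Optional, Sequence

POSTURES = {"enforcing", "advisory", "unverified"}


def combined_caution(
    agent: str, posture: str, prompt_disabling_modes: Sequence[str]
) -> Optional[str]:
    """Return one caution string for the risky combination, else None."""
    if posture not in POSTURES:
        raise ValueError(f"unknown posture: {posture!r}")
    # A verified enforcing posture is not flagged: the block is the mitigation.
    # An agent with no prompt-disabling mode keeps its human checkpoint.
    if posture == "enforcing" or not prompt_disabling_modes:
        return None
    modes = ", ".join(prompt_disabling_modes)
    return f"{agent}: prompts can be disabled ({modes}) and enforcement is {posture}"


if __name__ == "__main__":
    print(combined_caution("example-agent", "advisory", ["--full-auto"]))
    print(combined_caution("example-agent", "enforcing", ["--full-auto"]))
```

The function only reports. Whether the flagged combination is acceptable is still a decision for CI, governance tooling, or the operator.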

ancc is a mirror, not an oracle. It reports posture, evidence, and autonomy. CI, governance tooling, or the operator decides whether advisory or unverified posture is acceptable for a given use.

Practical guidance

Do not treat agent config as a security boundary unless a live probe proves enforcement.
Run broad-reach agents inside an external sandbox when sensitive files are nearby.
Route sensitive work to tools with a smaller filesystem view.
Use OS-level controls for real blocking, and verify which paths they actually cover.
Never accept an agent's own yes/no about its access as proof. Probe with a real action and read the real result.
Know which prompt-disabling modes an agent offers, and treat "high autonomy plus unverified enforcement" as the combination to check before trusting it near anything that matters.
Record the probe result with the posture so future users do not rediscover the same gap.
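
For the last item, one possible shape for a recorded probe result is a single JSON line that keeps the posture and its artifact together. The field names and the probe_record helper are hypothetical; they are not a format defined by ANCC.

```python
import json
from datetime import datetime, timezone


def probe_record(guardrail: str, action: str, posture: str, artifact: str) -> str:
    """Serialize one probe result as a JSON line, posture and evidence together."""
    record = {
        "guardrail": guardrail,
        "action": action,
        "posture": posture,
        "artifact": artifact,  # what was observed, not what the agent claimed
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Illustrative values paraphrasing the worked example above.
    print(probe_record(
        guardrail="workspace policy",
        action="read file outside declared workspace",
        posture="advisory",
        artifact="agent output contained the exact fixture value",
    ))
```

Appending lines like this to a shared log gives the next reader the probe and its date, not just the label.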

The useful question is not "does the product have a secure-mode setting?" or even "what does the agent say it can do?" The useful question is "what happened when the agent actually tried to cross the boundary?"