Who Authorized That Decision?
Who in your organization decided that the AI could approve that expense?
Not “who set up the system” — that’s the engineering or data team. Who made the organizational decision that this class of policy judgment could be delegated to a model? Who signed off on what the AI was allowed to conclude, under what conditions, with what accountability if it got it wrong?
For most organizations running AI on policy-adjacent workflows, the honest answer is: nobody. The decision wasn’t made. It happened by omission.
The gap between pointing and enforcing
The omission usually takes the same form, and it is one that every organization believes it has handled. The policy document is in the knowledge base. The system prompt says “follow company policy.” The AI was tested against the handbook and answered correctly. That feels like enforcement. It isn’t.
A language model can read your expense policy and quote it back accurately. In a demo, it usually applies it correctly too. The problem surfaces at the edges.
In production, the model interprets. It reads the specific request in front of it, retrieves relevant context, and produces an answer that reflects its best probabilistic judgment. When the case is clean and the context is complete, the answer is often correct. When the case is slightly unusual — someone cites a verbal approval, a policy is silent on a specific circumstance, or two rules appear to conflict — the model reasons through the gap.
This is not a complaint about language models — it is a description of what they are. The problem is that organizations point a probabilistic interpreter at their policy and mistake fluent output for authorized enforcement. Those two things are not the same, and the gap between them is where unauthorized decisions accumulate.
The decisions that carry real authorization weight
Not every AI decision is equally consequential. A model that summarizes an intake request, drafts a customer message, or extracts structured fields from a document is doing interpretive work where probabilistic quality is acceptable and reviewers can catch exceptions.
The risk is in a different category: decisions with actual policy weight, where the answer is not a matter of quality or nuance but of what the organization has already decided. Who qualifies for a benefit exception. Whether a transaction requires escalation. What a claimant is entitled to. Whether a submitted expense is reimbursable.
These questions have correct answers — answers derivable from explicit rules that the organization has already established. When you route them through a language model without explicit logic, you are not just accepting some error rate. You are delegating policy-making authority to a system that was never formally given it, in a way that produces no record of when or where that delegation happened.
What the improvisation looks like
A finance team deploys an AI assistant to help review expense submissions. The model is prompted with the expense policy. It performs well in testing.
Six months in, someone notices a pattern: a category of client entertainment expenses requiring manager pre-authorization above a certain dollar threshold has been clearing automatically. Employees had been submitting with notes like “approved verbally” or “standing client relationship,” and the model had been interpreting those notes as satisfying the intent of the policy.
The policy rule was not missing from the knowledge base. The model could quote it. What was missing was explicit logic: if threshold exceeded and no documented authorization exists, route for human review. That condition was in the policy document. It was not in the system. The gap was filled by model judgment — consistently, at volume, for six months — and nobody had authorized that.
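As a minimal sketch, the missing condition could look something like the following. The threshold value, the field names, and the `review_client_entertainment` function are all hypothetical; the real figures and record format would come from the written policy and the expense system in use. The sketch also assumes that anything under the threshold, or anything with documented authorization, clears.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Hypothetical threshold; the real figure would come from the written policy.
PREAUTH_THRESHOLD = Decimal("500.00")


@dataclass(frozen=True)
class Expense:
    category: str
    amount: Decimal
    note: str = ""                          # free text from the submitter
    authorization_id: Optional[str] = None  # reference to a documented approval


def review_client_entertainment(expense: Expense) -> str:
    """Return 'approve' or 'human_review' for a client entertainment expense."""
    if expense.category != "client_entertainment":
        raise ValueError("this rule only covers client entertainment expenses")

    over_threshold = expense.amount > PREAUTH_THRESHOLD
    documented = bool(expense.authorization_id)

    # The free-text note is deliberately never read here:
    # only a documented authorization satisfies the rule.
    if over_threshold and not documented:
        return "human_review"
    return "approve"
```

In this sketch a note such as “approved verbally” has no way to change the outcome, because the rule never looks at it.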
This is the failure mode that is hardest to see because it does not look like a failure. The system is running. The outputs are plausible. The edge cases do not generate errors. They generate reasonable-sounding answers. The organization discovers the gap the same way it usually discovers quiet failures: when someone looks at a pattern of outcomes and realizes decisions were being made that no one had approved.
The fix is not a better prompt
When this surfaces, the reflex is to update the prompt. Make the rule more explicit in the system message. Add a more specific instruction. Sometimes that suppresses the symptom for a few weeks, until the next edge case that the updated prompt did not anticipate comes along.
The underlying issue is that a policy-weight decision is in the wrong layer. Decisions where the organization has already determined the correct outcome — where the question is whether the rules are being applied, not what the rules mean — belong in code. Actual conditional logic: if this condition, then this outcome. Written, tested, auditable, and owned by someone.
This is not an argument for rigid, rules-only systems that cannot handle complexity. Language models handle ambiguity and nuance well, and there is real value in giving them interpretive latitude where interpretation is what is needed. The design question is where interpretation ends and enforcement begins. Reading a request and understanding what the user is asking — that is interpretation. Deciding whether the outcome they are asking for is authorized — that is enforcement.
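One way to picture that boundary is as two separate functions, sketched here under stated assumptions. `stub_interpreter` stands in for a real model call and returns canned values; `ParsedRequest`, `enforce`, `handle`, and the threshold are illustrative names and numbers, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ParsedRequest:
    amount_cents: int
    has_documented_authorization: bool
    model_suggestion: str  # the model's opinion; advisory only


def stub_interpreter(text: str) -> ParsedRequest:
    """Stand-in for a model call that turns free text into structured fields."""
    return ParsedRequest(
        amount_cents=80_000,
        has_documented_authorization=False,
        model_suggestion="approve",
    )


THRESHOLD_CENTS = 50_000  # hypothetical pre-authorization threshold


def enforce(parsed: ParsedRequest) -> str:
    """Enforcement: plain conditional logic that never reads model_suggestion."""
    if parsed.amount_cents > THRESHOLD_CENTS and not parsed.has_documented_authorization:
        return "human_review"
    return "approve"


def handle(text: str, interpret: Callable[[str], ParsedRequest] = stub_interpreter) -> str:
    parsed = interpret(text)  # interpretation: understand what is being asked
    return enforce(parsed)    # enforcement: decide whether it is authorized
```

The model's suggestion travels with the parsed request so a reviewer can see it, but in this sketch it has no path into the outcome.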
When organizations conflate the two, they have not built an AI system with good judgment. They have built a very articulate way of making policy decisions that no one vetted, in a system that keeps no record of when the model deviated from actual policy.
Two questions worth asking this week
For any AI system your organization runs that touches policy decisions (expense review, benefits administration, compliance routing, approval workflows), two questions surface the exposure.
First: Where does the policy live in this system? If the answer is “the model is instructed to follow our policy,” that is a description of a prompt, not an enforcement mechanism. A prompt is an intention. An enforcement mechanism is logic with explicit conditions, explicit outcomes, and a record of every evaluation.
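A rough illustration of what “a record of every evaluation” might mean in practice follows. The `evaluate_with_record` helper, the rule identifier, and the record fields are assumptions made for this sketch, and a real system would write to durable storage rather than an in-memory list.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


def evaluate_with_record(
    rule_id: str,
    inputs: Dict[str, Any],
    condition: Callable[[Dict[str, Any]], bool],
    outcome_if_true: str,
    outcome_if_false: str,
    log: List[str],
) -> str:
    """Apply one explicit rule and append a JSON record of the evaluation."""
    matched = bool(condition(inputs))
    outcome = outcome_if_true if matched else outcome_if_false
    record = {
        "rule_id": rule_id,
        "inputs": inputs,
        "matched": matched,
        "outcome": outcome,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }
    log.append(json.dumps(record, sort_keys=True))
    return outcome


def needs_preauth(inputs: Dict[str, Any]) -> bool:
    # Hypothetical condition: over 500 with no documented authorization.
    return inputs["amount"] > 500 and not inputs["authorization_id"]


audit_log: List[str] = []
decision = evaluate_with_record(
    "EXP-PREAUTH-1",  # hypothetical rule identifier
    {"amount": 800, "authorization_id": None},
    needs_preauth,
    "human_review",
    "approve",
    audit_log,
)
```

Each entry names the rule, the inputs it saw, and the outcome it produced, which is the kind of trail a prompt alone does not leave.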
Second: What happens when the model encounters a case the policy did not anticipate? In a well-designed system, there is an explicit path: escalate, hold, or route to human review. In most systems built around a prompted model, what happens is that the model figures something out. That answer becomes the effective policy for that class of case — invisibly, without organizational review, with no one having decided that the model was authorized to make that call.
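The explicit path for unanticipated cases can be sketched as a fallback that fires when no rule claims the case. The rule list, the category names, the threshold, and the `decide` function below are hypothetical; the point is only that “no rule matched” becomes a named, visible outcome instead of something the model resolves on its own.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Case = Dict[str, Any]
Rule = Callable[[Case], Optional[str]]  # returns an outcome, or None if not applicable


def preauth_rule(case: Case) -> Optional[str]:
    """Hypothetical rule for client entertainment expenses."""
    if case.get("category") != "client_entertainment":
        return None  # this rule does not cover the case
    amount = case.get("amount")
    if amount is None:
        return None  # not enough information to apply the rule
    if amount > 500 and not case.get("authorization_id"):
        return "human_review"
    return "approve"


RULES: List[Tuple[str, Rule]] = [("EXP-PREAUTH-1", preauth_rule)]


def decide(case: Case) -> Tuple[str, str]:
    """Return (outcome, reason). Cases no rule covers are escalated, never guessed."""
    for rule_id, rule in RULES:
        outcome = rule(case)
        if outcome is not None:
            return outcome, rule_id
    return "escalate", "no_rule_matched"
```

Counting how often the `no_rule_matched` reason appears would also show where the written policy has gaps that someone needs to decide on.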
The logic layer is not the interesting part of an AI system. It does not show well in demos. But it is the part that determines whether your organization has actually made a governance decision, or whether it has simply let one accumulate.
Making that decision deliberately — and designing the system architecture that enforces it — requires understanding how the logic layer interacts with interpretation, context, measurement, and human judgment. Collaborative AI covers that full picture for teams building AI systems they can actually trust in production.