What Actually Makes AI Work in Production
A thunderstorm line rolls across the Northeast at 5:40 p.m. Flights start slipping, then canceling.
By 7:15, an airline’s service operation is under real load. Customers are stranded in Charlotte, LaGuardia, and Boston. Some need hotel vouchers. Some just need the earliest route out. Others look simple until you notice the international connection, the special service request, or the fare rule that makes the obvious replacement invalid. The queue is growing faster than the service team can absorb it.
This is the kind of moment when an AI demo looks convincing. A model can read a disruption notice, interpret traveler intent, summarize options, and draft a calm, competent message. If you watch only that part, it feels like the problem is solved.
It isn’t. The answer will sound better than it is if the system lacks the full itinerary, current airport status, fare constraints, and partner-airline inventory. Give it that context without clear rules for what can be covered, what can be changed, and when the case needs a handoff, and it will still make bad calls. Add rules and context but no measurement, and the failure will spread quietly. Remove the human path for the cases that do not fit the pattern, and it will break exactly where the customer experience matters most.
That is the larger point. In production, AI is never the whole system.
The mistake underneath most AI projects
The first post argued that the tool was not the point. The second post argued that relevance and reliability are different problems. This is the next step. Reliability is not one thing.
Most organizations still evaluate AI as if the model carries the whole load. They ask whether it can interpret the request, draft the response, or recommend the action. Those questions matter, but they cover only one part of the problem. Once the environment gets real, everything around the model determines whether the output becomes useful work or expensive cleanup.
Travel rebooking makes this obvious because the stakes show up fast. A wrong answer is not a bad paragraph. It is an operational problem, and sometimes a public one. It can leave a high-value customer telling the world your “AI-powered service” left them stranded.
When teams treat the model as the product, they invest in the visible part. The work goes into prompts, retrieval, and more polished responses. That can help, but only within the boundaries the rest of the system defines. A well-worded answer cannot compensate for a missing business rule. A retrieved paragraph cannot replace a routing decision.
For a system like this to work, five different jobs have to be done well: interpretation, boundaries, context, measurement, and judgment.
First, something has to handle interpretation
Something has to read the inbound message, infer what the traveler is actually asking, and turn a messy situation into a structured next step.
“My daughter and I are stuck in Charlotte and the app keeps sending me in circles” is not a database query. It is a real request wrapped in frustration, ambiguity, and implied urgency. This is where generative AI earns its keep. It can turn that mess into something the rest of the operation can act on.
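To make that concrete, here is a minimal sketch of what an interpretation layer might hand to the rest of the system. Everything in it is illustrative: the field names, the prompt, and the `call_model` stand-in are assumptions, not a reference implementation. The point is only that the model's output here is structure, not a decision.

```python
# Minimal sketch of an interpretation layer: the model's only job is to turn a
# free-text message into a structured request the rest of the system can act on.
# The model call itself is abstracted behind `call_model` (a hypothetical callable),
# so any provider could sit behind it.
import json
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RebookingRequest:
    intent: str                # e.g. "rebook", "refund", "hotel_voucher", "unclear"
    passengers: int
    origin: Optional[str]      # airport code, if the traveler mentioned one
    urgency: str               # "same_day", "flexible", "unknown"
    needs_human: bool          # the model flags anything it cannot classify

PROMPT = (
    "Read the traveler message and return JSON with keys: "
    "intent, passengers, origin, urgency, needs_human.\n\nMessage:\n{message}"
)

def interpret(message: str, call_model: Callable[[str], str]) -> RebookingRequest:
    raw = call_model(PROMPT.format(message=message))
    data = json.loads(raw)  # in production: validate against a schema, retry on failure
    return RebookingRequest(
        intent=data.get("intent", "unclear"),
        passengers=int(data.get("passengers", 1)),
        origin=data.get("origin"),
        urgency=data.get("urgency", "unknown"),
        needs_human=bool(data.get("needs_human", True)),
    )
```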
This is the flexible layer. It is not the whole system.
If the flight is canceled because of weather, the model can explain the options well. It still cannot decide what the airline owes, what itinerary changes are allowed, or when the case has crossed into an operational exception. Those are not language problems. They are policy and workflow problems.
This is why so many teams overestimate what they have built. They see the fluency of the AI layer and assume the whole operation has become intelligent. Usually they have built a strong interpreter inside a weak system.
Second, something has to enforce boundaries
In a real rebooking system, the most important decisions are often the least visible.
The questions here are not flashy, but they decide whether the operation stays under control. Can this booking move? Does policy allow compensation? Is this still safe to automate, or does it need human review?
Those decisions should not live inside model improvisation. They belong in deterministic logic. This is the part many teams underbuild because it does not look like AI. It looks like software engineering, business-rule enforcement, and process design.
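Here is a sketch of what that boundary layer can look like: plain, deterministic checks that run before any model output is acted on. The policy values and case fields are invented for illustration, and a real rule engine would be far richer, but the shape is the point: no language model anywhere in this code path.

```python
# Sketch of a boundary layer: deterministic routing decisions made before any
# model output is acted on. Thresholds and field names are illustrative only.
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    AUTO_REBOOK = "auto_rebook"
    AUTO_WITH_COMPENSATION = "auto_with_compensation"
    HUMAN_REVIEW = "human_review"

@dataclass
class CaseFacts:
    fare_allows_change: bool
    is_weather_cancellation: bool
    has_international_segment: bool
    has_special_service_request: bool
    estimated_compensation_usd: float

def route_case(facts: CaseFacts, compensation_limit_usd: float = 200.0) -> Route:
    # Hard stops: cases the system must not decide on its own.
    if facts.has_international_segment or facts.has_special_service_request:
        return Route.HUMAN_REVIEW
    if not facts.fare_allows_change:
        return Route.HUMAN_REVIEW
    # Compensation is a policy question, not a language question.
    if facts.estimated_compensation_usd > compensation_limit_usd:
        return Route.HUMAN_REVIEW
    if facts.estimated_compensation_usd > 0 and not facts.is_weather_cancellation:
        return Route.AUTO_WITH_COMPENSATION
    return Route.AUTO_REBOOK
```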
In the previous post, the distinction was relevance versus reliability. This is where reliability becomes operational. A reliable system knows what it is allowed to do, what it must not do, and when a case needs a different path.
Without that boundary layer, the model becomes a very articulate way to make unauthorized decisions.
Third, something has to supply context
Now imagine the system has strong logic but weak context.
The rules are correct, but the underlying facts are stale. The weather feed lags. Inventory changed five minutes ago. Loyalty status has not refreshed. The model and the rule engine may both behave exactly as designed and still produce the wrong outcome because the world they are acting on has already moved.
This is part of what people mean when they talk about retrieval, but the issue is broader than that. The operation needs current, usable reality, not just documents. It needs the case history, the latest constraints, and the exceptions that make this traveler different from the default case.
Data makes the system situationally aware. Without it, the AI sounds informed while reasoning in a vacuum, and the rule layer applies clean logic to the wrong facts.
This is also why “we connected it to the knowledge base” is rarely enough. A handbook might explain compensation policy. It does not tell you that the customer is traveling with a lap infant, already missed one rebooked connection, and is now trying to reach a city with only two remaining same-day options.
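A sketch of the context layer makes the difference visible. The feeds, field names, and the five-minute staleness threshold below are assumptions for illustration; what matters is that the system assembles current, case-specific facts and refuses to proceed when they are stale.

```python
# Sketch of a context layer: assemble the facts this case actually depends on,
# with freshness checks, instead of handing the model a handbook excerpt.
# Feed names, fields, and the 5-minute threshold are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

MAX_AGE = timedelta(minutes=5)

@dataclass
class CaseContext:
    itinerary: dict        # segments, fare rules, special service requests
    airport_status: dict   # current delays and closures
    inventory: dict        # remaining same-day options, including partner airlines
    loyalty_tier: str
    fetched_at: datetime

def is_fresh(feed: dict, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    # Each feed is assumed to carry its own "as_of" timestamp (tz-aware datetime).
    return now - feed["as_of"] <= max_age

def build_context(record_locator: str,
                  fetch: Callable[[str, str], dict]) -> Optional[CaseContext]:
    """`fetch(source, record_locator)` stands in for real operational feeds.
    Returning None means: do not let the model answer in a vacuum."""
    now = datetime.now(timezone.utc)
    feeds = {name: fetch(name, record_locator)
             for name in ("itinerary", "airport_status", "inventory", "loyalty")}
    # Refuse to act on facts the world may have already moved past.
    if not all(is_fresh(feed, now) for feed in feeds.values()):
        return None
    return CaseContext(
        itinerary=feeds["itinerary"],
        airport_status=feeds["airport_status"],
        inventory=feeds["inventory"],
        loyalty_tier=feeds["loyalty"].get("tier", "unknown"),
        fetched_at=now,
    )
```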
If the system is starved of context, everything else becomes brittle.
Fourth, something has to measure outcomes
Suppose the AI is interpreting requests well. The rules are mostly sound. The context feeds are present. And still, as the storm rolls on, the queue keeps getting worse.
If nobody is watching outcomes, you will not know why. One metric improves while the real cost simply moves somewhere else. The queue looks faster, but leakage rises. Automation goes up, but so do supervisor escalations. The dashboard says efficiency improved while airport agents quietly route around the tool.
Most teams treat measurement as a reporting afterthought. In practice, it is what keeps the operation from drifting out of balance while everyone assumes it is fine.
Production systems need active measurement. You need to know where people override the tool, where time-to-resolution gets worse, where recovery fails, and where certain disruptions produce the same bad outcome over and over. If you do not measure those things, the operation can degrade while still looking healthy in a slide deck.
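A small sketch shows how little it takes to start. The metric names below are illustrative, not a monitoring standard; the point is that they track outcomes per disruption type, not uptime.

```python
# Sketch of outcome measurement: counters that track whether the system is doing
# useful work, not just whether it is up. Metric names are illustrative.
from collections import Counter
from dataclasses import dataclass, field
from statistics import median

@dataclass
class OutcomeTracker:
    resolution_minutes: list[float] = field(default_factory=list)
    overrides_by_disruption: Counter = field(default_factory=Counter)
    handled: Counter = field(default_factory=Counter)

    def record(self, disruption_type: str, minutes_to_resolve: float,
               agent_overrode: bool) -> None:
        self.handled[disruption_type] += 1
        self.resolution_minutes.append(minutes_to_resolve)
        if agent_overrode:
            self.overrides_by_disruption[disruption_type] += 1

    def override_rate(self, disruption_type: str) -> float:
        # A rising override rate for one disruption type is an early warning
        # that the same bad outcome is repeating.
        total = self.handled[disruption_type]
        return self.overrides_by_disruption[disruption_type] / total if total else 0.0

    def median_resolution(self) -> float:
        return median(self.resolution_minutes) if self.resolution_minutes else 0.0
```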
A lot of organizations discover this late. They monitor uptime and latency and think the system is fine. But a rebooking engine can be available, fast, and wrong in exactly the cases that drive real cost.
Fifth, someone still has to carry judgment
Some travelers cannot be reduced cleanly to policy and pattern.
Sometimes the case itself no longer fits the template. A family is split across bookings. A medical accommodation disappears during the disruption. A weather delay overlaps with a maintenance issue, and the standard compensation path stops matching what actually happened.
These are not failures of automation. They are the reason boundaries exist.
A strong system routes routine cases efficiently and exceptional cases intelligently. It does not try to flatten all judgment into the model. It gives human agents the context they need, the case history behind it, and a clear reason the handoff happened.
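What a handoff carries can be sketched just as simply. The fields below are assumptions for illustration; the essential ones are the explicit reason automation stopped and the record of what was already tried, so the agent does not start from zero.

```python
# Sketch of a handoff packet: the agent receives the case, the history, and the
# reason automation stopped, rather than a bare transfer. Field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Handoff:
    record_locator: str
    reason: str                    # e.g. "split family booking", "medical accommodation"
    attempted_actions: list[str]   # what the system already tried
    case_summary: str              # the model's summary, clearly labeled as a draft
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def escalate(record_locator: str, reason: str,
             attempted: list[str], summary: str) -> Handoff:
    # The reason is explicit so the agent, and later the measurement layer,
    # can see why automation stopped here.
    return Handoff(record_locator, reason, attempted, summary)
```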
Trust usually does not break on the easy path. It breaks when a customer has a real problem and the system has no graceful way to admit that this case needs a human.
The human role is not decorative. It is where accountability, discretion, and empathy remain intact.
What this changes
Once you see the system this way, a lot of bad AI strategy starts looking obvious.
It also gives you a cleaner diagnosis:
- If the model is weak, improve interpretation.
- If the operation is making unauthorized decisions, strengthen the boundaries.
- If the answers are detached from the case, fix the context.
- If the business impact is unclear, invest in evaluation.
- If edge cases keep stalling out, redesign the human path.
That is the operational value of thinking this way. It prevents the lazy diagnosis that every problem is “the AI needs to be better.” Often the AI is doing its job. Something else around it is not.
The travel example is just one scenario. The same pattern shows up anywhere AI has to operate inside real constraints. Claims, support, benefits, underwriting, and procurement all run into the same failure mode. When one of the five jobs is weak, the whole system degrades. When all five are done well, the result stops feeling like a demo and starts feeling dependable.
In my book, Collaborative AI, I develop this idea more fully and show how the pieces interact as a framework. For now, the practical point is enough: if you’re building for production, stop asking whether the model works. Ask whether the environment around it does.