The Cheaper Half of Oversight
A few weeks into running Compound, I delegated a research synthesis task and approved the plan without really reading it. The agents ran, I got a polished output, and it was wrong in a way I couldn’t fix by editing. The scope had been set on a question I hadn’t actually asked — close enough that I could see what the system had inferred, far enough off that the output was useless for the decision I needed to make. No amount of revision was going to fix it. The work had to be done again, from the right starting point.
That’s when I understood what plan review was actually for.
What the Plan Actually Shows
The supervision instinct most people bring to AI is basically correct.
Read what came back. → Check whether it’s good. → Push back if it isn’t.
That sequence is reasonable — it’s roughly how you’d review work from a human collaborator too. The instinct to supervise is right. The timing is wrong.
When you review an output, you’re inspecting something that has already been built. The scope has been chosen, the agents have run, and the sequence has played out. You can edit the result. What you can’t do is redirect the framing, catch a wrong agent choice, or stop a well-executed task from going in a direction you didn’t actually want. That work is done.
The plan review happens earlier, when none of those decisions are locked in. In Compound, before any specialist agent runs, the project-manager agent produces a delegation plan: which specialists will work on the task, in what order, what each one will produce, and what depends on what. That plan is reviewable — and at that moment, every element of it is still free to change.
Here’s a representative example — the kind of plan project-manager produces for a substantive engineering task:
Task: Add semantic search over the product’s document library
Phase 1 — Spec & design
spec-generator: Turn the request into a spec with acceptance criteria, including what “good search” means for this library’s content types and expected queries. Deliverable: spec with acceptance criteria. No dependency.
software-architect: Design the retrieval approach (embedding model selection, chunking strategy, and vector store) and produce a phase plan for the build. Deliverable: architecture design and phase plan. Depends on: spec-generator output.
Phase 2 — Build
ai-engineer: Implement the embedding and retrieval pipeline per the architecture design. Deliverable: working semantic search over the document library. Depends on: software-architect output.
Phase 3 — Review
code-reviewer: Review the implementation for correctness, edge cases, and alignment with the architecture design. Deliverable: review findings and approval or revision requests. Depends on: ai-engineer output.
Critical path: spec-generator → software-architect → ai-engineer → code-reviewer
Risk: If the document library spans multiple content types or languages, the embedding model choice may need revisiting before the build phase. Flag for user decision if the spec surfaces significant content heterogeneity.
Reading a plan like this, a reviewer can catch something the plan itself doesn’t flag: the sequence moves directly from architecture into building the retrieval pipeline, with no step to verify that the existing library content is clean and structured enough to embed well. That’s a content-quality assumption baked silently into the design.
A single sentence of redirection (ask a data-engineer to assess the content before the build phase) changes the sequencing before ai-engineer has written a line. Skip it, and the build produces a polished retrieval layer over content too inconsistent to search effectively. That failure surfaces as bad search results after the work is done. Output review does catch it. What it can’t do is repair it cheaply, because edits to the output act on the retrieval layer while the defect lives in the content. You can tune the retrieval logic indefinitely, but reaching the actual problem means rebuilding from an earlier stage.
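One way to picture that redirection is as an edit to the plan's dependency graph. The sketch below is a hypothetical representation, not Compound's actual plan format: the `plan` dictionary, the `insert_step` helper, and the assumption that data-engineer works from the spec-generator output are all illustrative. It uses Python's standard-library `graphlib` to derive a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical representation: each step maps to the set of steps it depends on.
plan = {
    "spec-generator": set(),
    "software-architect": {"spec-generator"},
    "ai-engineer": {"software-architect"},
    "code-reviewer": {"ai-engineer"},
}


def execution_order(plan):
    """Return one valid run order in which every step follows its dependencies."""
    return list(TopologicalSorter(plan).static_order())


def insert_step(plan, name, depends_on, before):
    """Return a revised copy of the plan with `name` added ahead of `before`."""
    revised = {step: set(deps) for step, deps in plan.items()}
    revised[name] = set(depends_on)
    revised[before].add(name)  # `before` now also waits on the new step
    return revised


# The one-sentence redirect: assess the content before the build phase.
# Assumption: data-engineer only needs the spec to start its assessment.
revised = insert_step(
    plan, "data-engineer", depends_on={"spec-generator"}, before="ai-engineer"
)

print(execution_order(plan))
print(execution_order(revised))
```

Because nothing has run yet, the revision is just a new dictionary. No completed work has to be discarded, and the original plan is left untouched for comparison.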
The Cost Difference Between Plan Revision and Output Revision
There’s a structural way to understand why upstream review matters: the cost of a change is not constant across the workflow.
Changing a plan costs a sentence. You redirect the scope, resequence the agents, or narrow what each agent is asked to produce before any work begins. The change propagates forward through a system that hasn’t run yet. Nothing has to be undone.
Changing an output costs considerably more. You’re working against completed work: revising it, discarding it, or rebuilding parts of it. Worse, if the framing was wrong at the plan level, the error sits earlier in the workflow than any edit to the output can reach. The output is the symptom; the decision that produced it has already been made.
This isn’t an argument for endless planning. For small, fast tasks where re-running from a corrected prompt costs almost nothing, the output-and-rerun loop is genuinely sufficient, and lighter than reading a delegation plan every time. Plan review earns its overhead when execution is expensive or multi-agent, or when the framing error is the kind that only becomes visible by reading the plan, because a vaguely corrected re-run will hit the same buried assumption again. The argument is for catching scope and framing errors when they’re cheapest to catch, which is before execution, not after. That’s what plan review is for.
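The asymmetry can be made concrete with a rough sketch. It assumes a hypothetical dictionary representation of the example plan (each step mapped to the steps it depends on) and uses the number of completed steps that must be redone as a stand-in for cost, which is a simplification. The `downstream` and `rework` helpers are illustrative names, not part of Compound.

```python
# Hypothetical representation: each step maps to the set of steps it depends on.
plan = {
    "spec-generator": set(),
    "software-architect": {"spec-generator"},
    "ai-engineer": {"software-architect"},
    "code-reviewer": {"ai-engineer"},
}


def downstream(plan, changed):
    """Return every step that depends, directly or transitively, on `changed`."""
    affected = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for step, deps in plan.items():
            if current in deps and step not in affected:
                affected.add(step)
                frontier.append(step)
    return affected


def rework(plan, changed, completed):
    """Return the completed steps invalidated by changing the `changed` step."""
    invalidated = downstream(plan, changed) | {changed}
    return invalidated & set(completed)


# Revising the architecture decision at plan review: nothing has run yet.
before_execution = rework(plan, "software-architect", completed=set())

# Revising the same decision after the full run: everything built on it is redone.
after_execution = rework(plan, "software-architect", completed=plan.keys())

print(len(before_execution), "steps to redo before execution")
print(len(after_execution), "steps to redo after execution")
```

The same change touches zero completed steps at plan time and three after the run, which is the sense in which the cost of a change is not constant across the workflow.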
Why the Question Changes
Most AI oversight asks one question: is this output good enough to use? That’s a quality check, and it’s useful. But it’s only half the oversight problem.
The other half is a scope question: was this the right work to do? A flawlessly engineered retrieval layer over content too messy to search answers the quality question well and the scope question poorly. Output review will eventually surface that failure — bad results make the scope mismatch visible. But by the time you’re reviewing what was built, the scope is baked in. The plan review is the moment when that question is still answerable, because the plan makes the scope visible before any work has been committed to it.
What This Design Requires
Building a plan review into an agentic workflow isn’t automatic. It requires that the system actually produce a plan before executing — and that the plan be detailed enough to be reviewable, not just a high-level summary. A plan that says “research, then strategy, then writing” is a gesture, not a delegation document. A plan that names the specialist agents, defines each deliverable, maps the dependencies, and flags the decisions that belong to the human is something you can actually evaluate.
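As a minimal sketch of that distinction, the check below treats a plan as reviewable when every step names an agent, defines a deliverable, and depends only on steps that exist, and when at least one decision is flagged for the human. The `Step` and `Plan` classes and the `review_gaps` function are hypothetical, invented for illustration rather than taken from Compound, and the final check is a prompt to look again rather than a hard rule, since some plans legitimately have nothing to flag.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    agent: str
    deliverable: str
    depends_on: list[str] = field(default_factory=list)


@dataclass
class Plan:
    steps: list[Step]
    user_decisions: list[str] = field(default_factory=list)


def review_gaps(plan):
    """List what a reviewer would be missing in order to evaluate this plan."""
    gaps = []
    known = {step.agent for step in plan.steps}
    for step in plan.steps:
        if not step.agent.strip():
            gaps.append("step with no named agent")
        if not step.deliverable.strip():
            gaps.append(f"{step.agent}: no deliverable defined")
        for dep in step.depends_on:
            if dep not in known:
                gaps.append(f"{step.agent}: depends on unknown step {dep!r}")
    if not plan.user_decisions:
        gaps.append("no decisions flagged for the human reviewer")
    return gaps


# A gesture: stage names only, no deliverables, nothing flagged.
vague = Plan(steps=[
    Step("research", ""),
    Step("strategy", "", ["research"]),
    Step("writing", "", ["strategy"]),
])

# A delegation document: agents, deliverables, dependencies, a flagged decision.
detailed = Plan(
    steps=[
        Step("spec-generator", "spec with acceptance criteria"),
        Step("software-architect", "architecture design and phase plan",
             ["spec-generator"]),
        Step("ai-engineer", "working semantic search", ["software-architect"]),
        Step("code-reviewer", "review findings", ["ai-engineer"]),
    ],
    user_decisions=["Revisit embedding model if content is heterogeneous"],
)

for gap in review_gaps(vague):
    print(gap)
```

A check like this can only confirm that the plan gives the reviewer something to evaluate. Whether the scope is right is still a judgment the human has to make.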
It also requires that the human show up to the review. This sounds obvious, but it’s the friction point where many workflows fail. The plan review only moves judgment upstream if the person doing the reviewing actually reads the plan and exercises judgment — pushing back on scope, redirecting agents, narrowing artifacts, or approving the plan as-is with a clear understanding of what they’re authorizing. Rubber-stamping the plan is just slower output review. I know this because that’s how I got the wrong output in the first place.
When both conditions are met — a plan worth reviewing, and a reviewer who actually reviews it — the execution quality improves in ways better agents alone can’t replicate. The agents run on a well-scoped task. The outputs answer the right questions. The iteration happens at the cheapest moment in the workflow.
The plan review is one mechanism in a larger question: where should human judgment enter an agentic workflow, and in what form? Once execution begins, the question changes — it’s less about scope and more about whether the work is being done accountably. That’s territory worth its own post.