I Built a 25-Agent AI Operating System

Most people doing serious knowledge work with AI still start in the same place: a blank chat window.

Then the ritual begins. Paste in context. Describe the task. Explain what good looks like. Correct the first draft. Repeat the whole process on the next problem.

The models are already good enough to be useful. The drag is elsewhere. Every meaningful use starts with setup cost: restating context, reloading standards, and steering the model back toward your preferences when it slides toward the average case.

I got tired of paying that tax, so I replaced it with something better: a personal AI operating system with 25 specialized agents, 117 composable skills, and persistent memory that carries forward what each agent learns about how I work.

I use it every day for actual work: code review, research, competitive analysis, writing, and planning. It is not a demo or a toy. It is the working environment I increasingly prefer over starting from scratch with a blank chat window.

What the system actually is

At the surface level, the whole thing is surprisingly simple. Each agent is a markdown file with YAML frontmatter. The frontmatter defines things like name, model, tools, and configuration. The body defines the agent’s role, standards, workflow, and constraints. That’s the unit.
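To make that concrete, here is a minimal sketch of what one of these agent files could look like. The field names and body text are illustrative assumptions about the pattern described, not the author's actual schema:

```markdown
---
# Hypothetical frontmatter — field names are assumptions
name: code-reviewer
model: claude-sonnet
tools: [read_file, grep]
skills:
  - code-review-standards
  - naming-conventions
---

You are a code reviewer. Treat confusing names, leaky
abstractions, and brittle logic as real defects, not style
nits. For every finding, state the defect, why it matters,
and a concrete fix.
```

The frontmatter is configuration; the body is the stored judgment. Everything the agent needs to start at a high baseline lives in one readable file.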

The skills are separate markdown files. Each one defines a reusable capability or behavioral pattern: how to do competitive analysis, how to write in my voice, how to review code against a standard, how to structure a strategy memo, how to think through a financial decision. Agents reference those skills declaratively, so specialization compounds instead of getting rewritten from scratch every time.
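A skill in this scheme might be a standalone markdown file like the following. This is a hypothetical sketch of the pattern, with its content drawn from the standards described later in this post:

```markdown
# Skill: competitive-analysis

When analyzing a market:
- Weight primary signals (filings, pricing, job posts)
  above commentary.
- Separate claims from evidence; flag every inference.
- End with positioning implications useful for a decision,
  not a neutral market summary.
```

Because agents reference skills by name rather than inlining them, a fix to this one file propagates to every agent that declares it.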

Memory sits alongside that. Some memory is shared across the whole system. Some is agent-specific. Some is tied to a project. The effect is straightforward: when I refine how I like a particular kind of work done, the relevant agent doesn’t forget five minutes later.
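One plausible way to lay out the three memory scopes just described — the directory and file names here are assumptions for illustration, not the actual structure:

```text
memory/
├── shared/              # preferences that apply system-wide
│   └── writing-voice.md
├── agents/
│   └── code-reviewer/   # refinements this agent has learned
│       └── standards.md
└── projects/
    └── agent-factory/   # context tied to one project
        └── decisions.md
```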

The result is a team of specialists rather than one generic assistant pretending to be many things.

Why I built it this way

The obvious objection is simple: why not just use one strong model and write better prompts? I tried that version first. It breaks for a predictable reason. General-purpose assistants are optimized for broad usefulness. My work is not broad. It has preferences, standards, and recurring patterns that matter.

  • When I ask for competitive intelligence, I do not want a generic market summary. I want primary signals weighted above commentary, claims separated from evidence, and positioning analyzed in a way that is useful for decision-making.

  • When I ask for code review, I do not want a polite style pass. I want confusing names, leaky abstractions, and brittle logic treated as real defects.

  • When I ask for writing help, I do not want pleasant, high-variance prose. I want structure, compression, and a voice that sounds like me rather than like a serviceable AI assistant.

You can force a general assistant toward those standards with enough context. But you have to keep doing it, because the context decays, the setup repeats, and the behavior drifts. Specialization fixes that.

An agent definition is not just a prompt. It is a stored judgment boundary. It says “for this class of problem, this is how the work gets done.” That turns out to matter more than I expected.

What changed once the agents were real

The first change was speed, but not in the shallow sense of faster responses. The real gain was lower context-transfer overhead. I can open a competitive-intelligence agent and start at the level of the problem. I do not have to spend the first ten minutes teaching it what counts as signal, what kind of synthesis I want, or what I consider sloppy thinking. That work is already done.

The same is true for a personal-writer, a code-reviewer, a financial-strategist, or an outreach-strategist. Each one starts from a higher baseline because the standards are already built in. That baseline matters more than raw model capability. In practice, the difference between a strong generic assistant and a specialized one is often not intelligence. It is starting altitude.

The second change was consistency. One of the quiet frustrations of normal AI use is that you can get excellent output on Monday and strangely average output on Thursday from what is supposedly the same setup. With specialized agents, the variance drops. Not to zero, because these are still probabilistic systems, but enough that they begin to feel dependable.

That is the threshold that matters. If a system is occasionally brilliant but frequently mediocre, you still have to supervise it too closely to trust it as part of your workflow. Once the floor rises, the relationship changes.

What I learned building it

Specificity is expensive, but worth it.

Writing good agent definitions forced me to make my own standards explicit. “Be analytical” is useless. “Lead with the downside scenario, distinguish fact from inference, and state when the evidence is thin” is useful. The same pattern held everywhere. Vague definitions produced vague agents. Sharp definitions produced behavior I could actually rely on.

This was part AI configuration and part self-knowledge. A lot of what experienced professionals call judgment is really a stack of unspoken preferences, sequencing habits, and quality standards. Building agents forced me to surface those things.

Memory helps only when the abstraction is right.

An agent should remember durable preferences and recurring patterns. It should not remember every incidental detail. If memory is too broad, it becomes noise. If it is too narrow, it becomes brittle. The useful level is usually something like: “I prefer a direct opening” or “I want downside scenarios modeled first,” not “I wrote an email to this person in March.”

Generalists break down faster than specialists.

One of my early mistakes was building agents that were still too wide inside their own domain. A broad business-strategist sounds efficient until you realize it is trying to do market analysis, messaging, financial reasoning, and strategic prioritization with one prompt identity. In practice, that meant it had no sharp point of view on any of them.

Splitting that work into narrower specialists made the system noticeably better. Narrow agents are easier to define, easier to improve, and easier to trust.

The skills matter as much as the agents.

The 25 agents get the headline, but the deeper asset is the 117-skill library underneath them. That is where reusable behavior lives. When I improve a skill, every agent using it improves. When I create a new agent, it inherits years of accumulated structure instead of starting empty. That is when the system starts to compound.

What this is really evidence of

It would be easy to frame this as a productivity hack. That would miss the point. What I built for myself is a small proof of a larger idea: AI gets more useful when it stops being treated like a tool you invoke and starts being treated like a component inside a designed system.

That system does not need to be large. Mine is personal. But the underlying principle is the same one I wrote about in Collaborative AI: capability alone is not enough. Structure, boundaries, memory, evaluation, and specialization all matter.

A single model can be impressive. A well-structured system is dependable. That is the distinction I keep coming back to.

The interesting thing about Agent Factory is not that I have 25 agents. The interesting thing is that a specialist team, even when the specialists are AI, is often a better design than one general intelligence asked to do everything.

This post is only the starting point. I’ll go deeper into how one of these agents works, where the system fails, and why I built the agents as markdown files instead of code. The core idea is simple: I did not want a smarter chat window; I wanted a working environment.

More in the next post.

Subscribe to The Algorithm

Notes on building AI systems that actually work.