Specialists in the Factory

A week ago I wrote about the software factory I’ve been running for Hordes of Orcs 3. Claude, Copilot, and Codex in their right lanes. Five layers stacked on top to keep the output coherent. I said I’d come back in 30 days. I’m coming back in seven, because the process has continued to evolve quite rapidly.

This post will cover what’s changed in the process, and what’s been achieved with it.

The shift to explicit sub-agents

Out of the box, the orchestrator Claude session will happily pick up any task and run with it. Most of the time that works fine. The problems show up primarily as cost and speed: an Opus session burning Opus tokens on something a Haiku could have finished in a third of the time.

Defining sub-agents fixes both halves of that. Each one gets:

  1. A scope — what it’s for and, just as importantly, what it’s not for.
  2. A model — Opus, Sonnet, or Haiku, chosen for the work the agent does, not the work the orchestrator happens to be doing when it spawns one.
  3. A prompt — the conventions and rails specific to that role, instead of relying on the orchestrator to remember to apply them.
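
In Claude Code, each of these lands in a markdown file under `.claude/agents/`, with the scope and model in YAML frontmatter and the prompt as the body. A minimal sketch of what one looks like — the exact wording here is illustrative, not my actual file:

```markdown
---
name: merge-conflict-resolver
description: Rebases a task branch onto main and resolves merge conflicts.
  Use when a PR reports conflicts or a branch has fallen behind main.
model: sonnet
---

You resolve merge conflicts. Rebase the task branch onto main, preserve the
intent of both sides of each conflict, and verify the build before pushing.
You do not make feature changes; escalate anything that looks like a real
design conflict instead of guessing.
```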

The token-efficiency angle is what I underestimated. Running everything through Opus felt “safe” because my velocity never pushed me into my weekly Max20 limit. As long as I was the bottleneck, Opus was fine for everything.

The process got good enough that very suddenly I was no longer the bottleneck. I went from 3-ish sub-agents at a time to 10. I blew through my session limit on Tuesday, then burned a quarter of my free Extra Usage credits – about $50 – in 15 minutes. So, it was time to optimize things a bit.

With explicit sub-agents, the orchestrator stays on Opus where judgment matters, and the routine work routes to smaller models where it should have been all along.

Opus now accounts for just under half my token usage – a significant improvement.

The agents I’ve defined

The supervisor is directed to use the following agents, as appropriate:

  • Issue Triager: Sonnet. Triages a single GitHub issue or free-form request.
  • Implementer - Mechanical Change: Haiku. For rote refactoring and other work that doesn’t require a lot of nuanced understanding.
  • Implementer - Typical Feature Work: Sonnet. The bulk of feature work doesn’t involve much complexity, but it isn’t exactly “mechanical” either. Some understanding of the product and the domain is needed, so Sonnet comes into play here.
  • Implementer - Real Design Call: Opus. For features that are challenging, or otherwise have complex interactions with the rest of the system.
  • PR Review Responder: Sonnet. Handles one round of PR feedback from Copilot / Codex / human.
  • Merge Conflict Resolver: Sonnet. Rebases and reconciles work. With many sub-agents working in parallel, often in related feature areas, conflicts happen with some frequency, and having an agent specialized in resolving them keeps things moving quickly.

Additionally, I’ve devised a few more agents that I invoke explicitly:

  • Backlog Groomer: Opus. Reviews the GitHub issues list for issues that are duplicates, already completed, in need of further fleshing out, and so on. This requires knowledge of the product and the business, and some nuanced understanding.
  • Code Pruner: Opus. Claude is very good at adding code but doesn’t always see opportunities to remove it, which concerns me; this agent exists to look for them.
  • Codebase Auditor: Opus. This is the layer 4 / layer 5 process from my last post.
  • Game Design Critic: Opus. A new one. This is an attempt to tackle higher-level issues than just executing my vision.
  • PR Reviewer: Opus. A last-resort stand-in for Codex and Copilot. I still need to experiment with running this on a lower-end model. Copilot, a clearly inferior model to even Sonnet, can produce excellent findings for Sonnet to act on, which suggests the problem doesn’t require Opus-level intelligence. The challenge is getting an Anthropic model to be a good foil for another Anthropic model – group-think is absolutely a problem between instances of the same model!

The agents I reach for most are the ones with the narrowest scopes. The temptation when defining a new one is to make it flexible. Resist it: agents don’t do well with a broad remit, and they’re most reliable when given a narrow scope.

Sharpening up the tools a bit

Claude consistently struggles with a few things, like keeping its edits out of the main worktree. So, I’ve added a PreToolUse hook:

{
  "matcher": "Edit|Write|NotebookEdit",
  "hooks": [
    {
      "type": "command",
      "command": ".claude/hooks/guard-worktree-writes.sh"
    }
  ]
}

The script is a bit verbose so I won’t include it here, but it rejects any edit or write operation that targets the main worktree. So far, it seems to be helping.
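A stripped-down sketch of the shape such a guard can take – not the verbatim script, and the path handling here is illustrative:

```shell
#!/usr/bin/env bash
# Sketch of a worktree write guard (illustrative, not the real script).
# Claude Code pipes the pending tool call to the hook as JSON on stdin;
# a non-zero exit blocks the call, and stderr is fed back to the model.

MAIN_WORKTREE="${MAIN_WORKTREE:-/path/to/main/checkout}"  # placeholder path

# Returns 1 (block) when the target path falls inside the main worktree.
guard_worktree() {
  local main="$1" path="$2"
  case "$path" in
    "$main"|"$main"/*)
      echo "Blocked: write targets the main worktree; use a task worktree." >&2
      return 1
      ;;
  esac
  return 0
}

# In the real hook, the path comes out of the stdin JSON, e.g.:
#   file_path="$(jq -r '.tool_input.file_path // empty')"
#   guard_worktree "$MAIN_WORKTREE" "$file_path" || exit 2
```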

When Sonnet bit off more than it should have

The failure mode that drove the latest round of CLAUDE.md changes was a sub-agent pinned to Sonnet picking up a task it wasn’t qualified for.

It started with a vague ticket: I dimly recalled a technique for rendering translucent meshes that prevents overdraw from overlapping triangles, and I knew it had something to do with the depth buffer. That’s about all the detail the ticket contained.

What I got was… a translucent rim-lit shader with a Fresnel knob. It didn’t even do the rim-lighting all that well, and it did nothing about the overdraw. It wasn’t a case where a few PR revisions could get it over the line, either: it was so far off base that fixing it would have been more work than starting over. A complete waste of tokens.

It seems that shaders are just too deep a technical topic to hand to a model like Sonnet.

The lesson wasn’t “don’t use Sonnet.” Sonnet is the workhorse of this setup and increasingly does the bulk of the implementation. The lesson was that the orchestrator needs an explicit rule about which kinds of tasks require Opus, regardless of which sub-agent would normally pick them up. That rule now lives in CLAUDE.md, and basically amounts to “anything involving graphics or visuals goes to Opus”.

Managing cost is largely about picking the right model for each task, so it’s worth making your supervisory process specific and detailed on this point.
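
In CLAUDE.md, that rule ends up as a short routing section – something along these lines (wording illustrative):

```markdown
## Model routing overrides

- Any task touching shaders, rendering, VFX, or visuals goes to the Opus
  implementer, even if it otherwise looks like typical feature work.
- When unsure how deep a task runs, escalate the model, not the scope.
```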

So, about that auditing process

This week was the first week I ran the codebase auditor with a specific remit to focus on bigger-picture concerns. The end result was a considerable slew of PRs (with a couple more coming down the pipe) to head off naive patterns that emerge in Unity games and eventually become performance problems. Notably, calling Instantiate all over the place can become a garbage-collection problem. It’s fine for relatively “infrequent” things like placing towers, or possibly even spawning orcs. It’s less fine for projectiles and afflictions, where the operation can happen hundreds of times in a single frame.

Without prompting, Claude biased towards an incremental transition for every single architectural change. Object pooling landed as an opt-in change, so if I hit issues with the implementation and need a functional fix right now, I can just switch it off.
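
The opt-in shape is simple. A sketch of the idea in Unity C# – names and structure are hypothetical, not the merged implementation:

```csharp
using System.Collections.Generic;
using UnityEngine;

public class ProjectilePool : MonoBehaviour
{
    // The kill switch: flip to false and every call path falls back to
    // plain Instantiate/Destroy, exactly as before the change.
    public static bool PoolingEnabled = true;

    [SerializeField] private Projectile prefab;
    private readonly Stack<Projectile> pool = new Stack<Projectile>();

    public Projectile Spawn(Vector3 position)
    {
        if (PoolingEnabled && pool.Count > 0)
        {
            // Reuse an inactive instance: no allocation, no GC pressure.
            var p = pool.Pop();
            p.transform.position = position;
            p.gameObject.SetActive(true);
            return p;
        }
        return Instantiate(prefab, position, Quaternion.identity);
    }

    public void Despawn(Projectile p)
    {
        if (PoolingEnabled)
        {
            p.gameObject.SetActive(false);
            pool.Push(p);
        }
        else
        {
            Destroy(p.gameObject);
        }
    }
}
```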

I admit to being a little surprised at Claude’s good judgment here.

The week in numbers

So. Did the system keep producing? Last week’s post ended at 183 closed Issues and 82 open after 18 days. Over the past seven days:

  • Issues filed: 106
  • Issues closed: 101
  • PRs filed: 86
  • PRs merged: 84

Current totals:

  • Issues filed: 405
  • Issues closed: 323
  • PRs filed: 346
  • PRs merged: 333

What was actually achieved

I added a day/night cycle that de-buffs orcs during the day and towers at night. I added the Lightning and Fire towers. I made a ton of ergonomic and UX improvements, and a raft of internal structural improvements to head off potential performance and maintainability issues down the road. And most importantly, I made a pile of game balance and content adjustments.

What’s next

The sub-agent shift changed the economics of the factory more than anything I expected to find this week. The next thing to watch is whether the audit and documentation agents — the two I’ve used to formalize Layers 4 and 5 of the previous post — can be pinned to models that let me run them as routinely as I currently run the per-PR review loop. If they can, the coherence layers stop being something I personally schedule and become something the factory does on its own cadence.

Still 23 days until the 30-day check-in. The game continues to come along nicely, the process is getting stronger, and progress continues at a velocity I’ve never experienced in a project before.
