In Which I Intervene in the Code

(Image credit: Hang Xu)

Three issues emerged that warranted some manual intervention this week. All three have prompted me to refine the process a bit more.

  1. After auditing the codebase for needless defensive guard code, neither Claude nor Codex managed to find any.
  2. The overhaul to how sources / destinations are specified silently broke the Fleeing mechanic for burning orcs.
  3. Low usage of Haiku and Sonnet compared to Opus.

So here’s what happened, and what I did about it.

Apparently I’m doing a series of blog posts about this project!

Part 1: Six Days Equals Six Weeks

Part 2: Specialists in the Factory

Part 3: The Invisible Work Matters

Part 4: Progress, 37 Days in

Part 5: The Meta-Game Begins

Needless Guard Code, and Hallucinated Requirements

Claude loves writing defensive code. This came to a head after I prompted Claude and Codex to audit the code for potential needless guard code, with a hint that the serialized data might yield evidence of hallucinated requirements. This started when I noticed such things in a few places and got confirmation from other engineers that Claude loves defensive code.

Defensive code comes with drawbacks. It can mask real bugs (as seen before), and is corrosive to performance. In the case of Unity, there’s a tendency to write “self-healing” code that turns an oversight in the scene / prefab editor into a runtime cost. Wiring associations between components together can often be done at edit-time, or done via a query on the relevant GameObject at runtime. That query is not cheap. The defensive code turns a forgotten drag-and-drop operation into an operation that could become materially important if it happens frequently.

Frequently like, say, spawning a Projectile. An operation that can happen dozens of times every second in an especially intense battle. To be fair, that would be more of an issue before I switched over to object pooling. Still, if I’m seeing the pattern in one place it’s definitely showing up in a bunch of others.
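To make the cost concrete, here is a minimal sketch of the pattern in Python rather than Unity C#. Everything in it is a hypothetical stand-in (`Prefab`, `find_collider`, `spawn`), not the project's code; the counter simply stands in for the price of a runtime component query.

```python
class Prefab:
    """Hypothetical stand-in for a prefab whose collider may or may not be wired."""

    def __init__(self, collider, wired):
        self._collider = collider
        # None models a forgotten drag-and-drop in the editor.
        self.wired_collider = collider if wired else None
        self.query_count = 0  # how many expensive lookups have run

    def find_collider(self):
        # Stand-in for a runtime query on the object hierarchy.
        self.query_count += 1
        return self._collider


def spawn(prefab):
    """Each spawn is a fresh instance, so an unwired prefab pays the query again."""
    collider = prefab.wired_collider
    if collider is None:  # the "self-healing" guard
        collider = prefab.find_collider()
    return {"collider": collider}


forgotten = Prefab("capsule", wired=False)
wired = Prefab("capsule", wired=True)
for _ in range(100):
    spawn(forgotten)
    spawn(wired)
```

After the loop, the unwired prefab has paid for one lookup per spawn while the wired one has paid for none, and nothing in the output tells you which case you are in.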

The issue with Projectile runs a bit deeper though, and it rather neatly illustrates the image at the top of this article.

The way Projectile was written, on Awake it ran a GetComponentsInChildren to find any and all colliders that may be attached to it. It tucked this information away, and used it to ensure in OnCollisionEnter that the collision wasn’t a self-intersection.

This is wrong on multiple levels:

  1. There was no way for me to wire things up manually and elide this query because the colliders list wasn’t exposed as a serialized field.
  2. Had I been able to wire this up statically, the check could have been performed in make validate, instead of every time a Projectile is instantiated.
  3. The entire requirement for a Projectile to support multiple colliders is hallucinated. In the game, a Projectile is a very simple thing! Every single instance of Projectile uses a single CapsuleCollider.
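As a rough illustration of the second point, a static check could look something like the sketch below. It is written in Python for brevity and assumes a hypothetical input: a mapping from prefab name to the list of colliders wired on it. What `make validate` actually does internally is not shown here, so treat the function name and data shape as assumptions.

```python
def validate_projectile_prefabs(prefabs):
    """Return a list of error strings; an empty list means every prefab passes.

    `prefabs` is a hypothetical mapping of prefab name -> list of wired colliders.
    """
    errors = []
    for name, colliders in prefabs.items():
        if len(colliders) != 1:
            errors.append(
                f"{name}: expected exactly 1 collider, found {len(colliders)}"
            )
    return errors
```

A check like this runs once per build rather than once per spawned Projectile, and it fails loudly instead of quietly papering over the mistake.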

Unfortunately, neither Claude nor Codex made the connection that not a single Projectile instance uses multiple colliders, and that nothing in the game design docs suggests such a case was ever needed.

I wound up rubbing Claude’s nose in it a bit by pointing out a few specific examples of wasteful patterns that were emerging.

The end result was some guidance changes in CLAUDE.md / sub-agent definitions, and a few PRs to fix up the things I found. To date, the only thing it found without being pointed directly at it is… a feature that is in the game design docs, for which code has been written, but which hasn’t been wired into the game yet – and is scheduled to be wired in in the current sprint. Hold onto that, because it’ll become relevant later.

I’m still working on how to have Claude / Codex identify opportunities for cleanup. For now, I’m keeping an eye out to see if the process changes result in less of this kind of cruft showing up. I suspect they won’t, and I’ll need to look at a somewhat more structured approach to system design and the protocol for that between Claude and me.

This Is Fine, Said the Orc

I was adding impact particle effects for Fireballs and (ice) Shards. I’m pretty happy with the visual impact (pun intended). While capturing some demo video, I noticed something… odd. Burning orcs seemed rather unperturbed about their circumstances. At first I thought they might be dying before reaching a decision point to change their route. Upping their health did not result in different behavior. So I dove into the code.

I discovered that the recent change to how sources / destinations are specified (to make map creation easier) had been handled in a very… interesting… way. Claude had implemented the work over several PRs in a non-breaking way. There were the old codepaths, and there were new codepaths, and NavAgent would select the appropriate path based on how the grid was configured.

The problem? Code that looked like this, just above where the key logic happened:

if (UsingGateMode()) {
  DoGatePlanning();
  // Early return: nothing below this block runs in gate mode.
  return;
}

Set aside for a moment that the old codepaths are defunct now that the game is fully cut over to the new ones. That turned out to be handy for undoing this.

The DoGatePlanning method was basically a copy-paste of everything that followed below the snippet above, adapted for the change in logic. Everything, that is, except for the special-case handling of isFleeing. I am genuinely baffled by this, because the fleeing logic was already well-established before the structural change to sources and destinations was started.

Fixing the code was as simple as copying a block gated on if (isFleeing) into DoGatePlanning. No further changes required. I did not delegate this to an LLM.
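The shape of the bug and of the fix can be reduced to a few lines. This is a simplified Python model, not the NavAgent code: the function names and destination labels are hypothetical, and the real planning logic is far more involved.

```python
FLEE = "flee_target"  # hypothetical destination labels
GATE_GOAL = "gate_goal"
LEGACY_GOAL = "legacy_goal"


def plan_buggy(is_fleeing, gate_mode):
    """Shape of the regression: the early return skips the fleeing check."""
    if gate_mode:
        return GATE_GOAL  # copied planning logic, fleeing case left out
    if is_fleeing:
        return FLEE
    return LEGACY_GOAL


def plan_fixed(is_fleeing, gate_mode):
    """Shape of the fix: the gate path carries its own fleeing check."""
    if gate_mode:
        if is_fleeing:
            return FLEE
        return GATE_GOAL
    if is_fleeing:
        return FLEE
    return LEGACY_GOAL


def fleeing_is_honored(plan):
    """Check the fleeing behavior under both grid configurations."""
    return all(plan(True, gate_mode) == FLEE for gate_mode in (False, True))
```

Here `fleeing_is_honored(plan_buggy)` is false and `fleeing_is_honored(plan_fixed)` is true. A check that exercises the same behavior in both modes is the kind of test that would have flagged the regression when the new codepath landed.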

I filed an issue tagged regression, because I am keeping metrics on regressions as a key indicator of how sustainable the software factory process is. I also decided to use the ticket as the home for an audit of what went wrong, and what process improvements can be made to mitigate it in the future.

I proceeded to have Claude and Codex perform their own respective post-mortems and file results on the issue.

Claude dutifully summarized the issue in one sentence, and gave me a clear timeline of events including specific commit SHAs and issue / PR numbers.

It then went into some depth on the mechanics of how this slipped through the cracks in its very careful refactoring dance. Those details are not terribly important. What’s more important is its tacit admission that a key part of the problem was… insufficient integration tests. It didn’t frame it that way, but that’s what it was describing.

The two LLMs had various specific recommendations. Add the missing integration tests. More stringent guidance about the process of handling refactoring. Fix the other feature broken by the same change. Remember that feature I told you to hold onto because it would become relevant later? It’s later now. Yeah, the whole feature another Claude identified as detritus was in fact also broken by the refactoring change. D’oh.

I’m still working on how to get more / better integration tests out of Claude, but this incident has highlighted it as a particularly critical weak spot in my current process.

Haiku, Sonnet, and Opus

I really like Opus. It’s a great model. It’s not perfect – there are still cases where Codex just flat out does better. But. Opus and Codex have the benefit of requiring considerably less guidance and oversight to get useful results. Even with a process designed to parcel work out to specific models under specific conditions, it’s easy to fall into the trap of working in a way that results in everything being punted to Opus.

I developed a simple tool called clauditor to help understand my model usage. It just churns through your local Claude session histories and gives a neat little report as a table, CSV, or JSON. It’ll even give you a cross-tab view of just Anthropic models. Token counts, and cost estimates, broken down by date, project, and model.
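The core of a report like that is a small aggregation. The sketch below is not clauditor's implementation; it assumes a hypothetical record shape (a `model` name plus a `usage` dictionary of token counts, cache tokens included) and placeholder model names, and only shows the per-model roll-up.

```python
from collections import defaultdict


def model_family(model_name):
    """Map a model name to a coarse family; the naming scheme is an assumption."""
    for family in ("haiku", "sonnet", "opus"):
        if family in model_name.lower():
            return family
    return "other"


def summarize_usage(records):
    """Sum every token counter in each record, grouped by model family."""
    totals = defaultdict(int)
    for rec in records:
        totals[model_family(rec["model"])] += sum(rec["usage"].values())
    return dict(totals)


# Hypothetical records; real session data would need to be parsed into this shape.
records = [
    {"model": "example-opus", "usage": {"input_tokens": 10, "output_tokens": 5, "cache_read_input_tokens": 100}},
    {"model": "example-haiku", "usage": {"input_tokens": 3, "output_tokens": 2}},
    {"model": "example-opus", "usage": {"input_tokens": 1, "output_tokens": 1}},
]
summary = summarize_usage(records)
```

Note how the single cache-read counter dominates the Opus total in this toy data, which is why leaving cache tokens out of the count would paint a very different picture.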

The results were a little embarrassing.

For the past 30 days (Claude only keeps a 30-day rolling window of session data in ~/.claude), my token usage (including, crucially, cache tokens!) was:

  • Haiku: 283.0M
  • Sonnet: 3.2B
  • Opus (all versions): 3.0B
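Expressed as shares of the total, using only the three figures above (in millions of tokens), the arithmetic works out like this:

```python
# Token counts from the list above, in millions.
tokens_millions = {"Haiku": 283.0, "Sonnet": 3200.0, "Opus": 3000.0}

total = sum(tokens_millions.values())  # 6483.0
shares = {
    model: round(100 * count / total, 1)
    for model, count in tokens_millions.items()
}
# shares == {"Haiku": 4.4, "Sonnet": 49.4, "Opus": 46.3}
```

So Haiku handled a bit over 4% of the tokens, with Sonnet and Opus splitting the rest almost evenly.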

Having Claude analyze the session history, I got confirmation of something I already knew courtesy of a friend: Having Claude supervise sub-agents is expensive. Claude will burn a lot of tokens on that supervision. My friend’s strategy is to have a bunch of running agents connect to an IRC server and have supervision happen via that channel. He’s been happy with the results, and has found it to be considerably more token-efficient.

But. Even if a lot of the Opus usage is supervision, a lot of it is not. And Haiku – the cheapest model – is just… starved for things to do.

Today, I had Claude update the processes to support Fable as a last-resort “this is too hard for Opus” implementer sub-agent. I also took the opportunity to have it review the processes and figure out how to break work down so more of it can go to Haiku and/or Sonnet.

The gist of the findings was that the process is too graceful about handling ambiguous requirements. By design, Opus is roped in for anything involving a “genuine design decision.” Since a lot of the issues I file are pretty minimal, there’s a whole set of steps around issue triage. Identify ambiguities and file them as comments on the issue. Include 2-3 options for how to proceed on each one, including tradeoffs and a recommendation. That sort of thing. It’s very convenient.

What happens when it has its answers? The supervisor tells an implementer-opus instance to go do the implementation.

D’oh.

So now, the triage process in particular has a stronger emphasis on “get the information, break the problem down into discrete units of work, have appropriate agents handle each unit.”
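In spirit, the change amounts to routing per unit of work rather than per issue. The sketch below is a loose Python illustration of that idea only: the unit fields, the file-count threshold, and the tier names are all hypothetical and do not reflect the actual sub-agent definitions.

```python
def route(unit):
    """Pick an implementer tier for one unit of work (criteria are hypothetical)."""
    if unit["open_design_questions"] > 0:
        return "opus"  # genuine design decisions still go to Opus
    if unit["files_touched"] > 3:
        return "sonnet"  # assumed threshold for moderately involved work
    return "haiku"  # small, fully specified units


def plan_issue(units):
    """Route each unit separately instead of sending the whole issue to one model."""
    return {unit["name"]: route(unit) for unit in units}


# Hypothetical breakdown of one triaged issue.
units = [
    {"name": "choose-save-format", "open_design_questions": 1, "files_touched": 0},
    {"name": "migrate-loaders", "open_design_questions": 0, "files_touched": 6},
    {"name": "rename-field", "open_design_questions": 0, "files_touched": 1},
]
assignment = plan_issue(units)
```

Once triage has answered the open questions for a unit, its count drops to zero and the same unit routes to a cheaper tier, which is exactly the step the old process skipped.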

We’ll see if the model usage ratios improve.

Progress Thus Far

So how goes the project overall? It continues to shape up at a quick clip. It’s gradually looking better and better. The first meta-game element is in, and working. I’m well on my way to having account linking working, supporting cross-device play and avoiding having people’s progress be lost if they lose / switch devices.

That said, it’s been a slow week because I was visiting my parents in Arkansas.

Anyway, here are some videos showing the hit-splash effects. With the fleeing mechanic fixed, of course.
