I Let the Harness Build Itself Overnight

James Spalding

Building in the open · Dublin

The short version

I had a benchmark harness. It was fine. 47% average on five ARC-AGI style tasks, one single model (qwen3:14b), one single pipeline, one single direction of execution. No self-modification. No critic. No adaptation. No nothing. It was, in the kindest possible sense, a while-loop with delusions of grandeur.

Two ideas had been rattling around in my head for weeks. One was HyperAgent, the adaptive multi-agent architecture out of the Meta researchers' paper, which describes fixed role ordering plus autonomous specialist agents beating centralised orchestration by a meaningful margin. The other was Autoresearch, Karpathy's offhand concept of agents that run overnight on their own research loop while you sleep. No one was telling me these two ideas belonged in the same sentence. I put them in the same sentence anyway.

Then I pointed it at my own code. Alongside Claude Code. And went to bed for three nights in a row.

This is what the harness looked like when I woke up.

The baseline I started from

Before anything adaptive existed, the harness looked like this. One model, one pipeline, one region, one script. I am not ashamed of it. Every autonomous pipeline starts here.

47%

avg score

0/5

self-modifying

30 min

per run

model (qwen3:14b)

3.6 GB

Ollama weights

$31

March Fly bill

88% of that $31 was factory idle time. The model was burning money to think about nothing. I was running a full Ollama stack on Fly.io because I hadn't separated the inference plane from the orchestration plane yet. That is a sentence I would not have understood four weeks ago.

Two ideas I couldn't stop thinking about

HyperAgent in one line: stop putting a central orchestrator in charge. Give each role (Analyst, Critic, Implementer) a fixed position in the pipeline, let each one be autonomous inside its lane, and route the output forward rather than upward. The paper is full of numbers. The one I cared about was that this architecture beats centralised orchestration by around 44% on multi-step reasoning. I read that and thought about how many of my own pipelines had a single model pretending to be three people and quietly contradicting itself.

Autoresearch in one line: the interesting question isn't what a model can do in one turn. The interesting question is what a research loop can do while the human sleeps. Karpathy wasn't giving a design spec. He was giving a mood. The mood was: let it cook.

Combine them and you get the thing I actually wanted. A harness where each role does its own job, hands off to the next role cleanly, and the whole pipeline iterates on itself overnight without me touching it. Alongside Claude Code as the sidekick actually editing the source tree. That is the part I still find funny. I was not writing the code. Claude Code was writing the code. I was writing the constraints.

The sequential pipeline I ended up with

By the third morning the harness had settled into this shape. This is Run D. Analyst reasons, Critic rejects garbage, Analyst produces a second draft, Implementer applies it. The important word there is rejects. The Critic is not a cheerleader. It rejected 6 of 15 initial proposals on the first pass. That is the part that makes the whole thing work.

Analyst v1

534-868 tokens reasoning → 15 candidate proposals

Critic

6/15 rejected (weak reasoning, unsafe edits, already tried)

Analyst v2

855-1170 tokens → revised proposals with failure context

Implementer

15-30 tokens → minimal surgical patch applied by Claude Code

If you take one thing from this post take that. The Critic has to be allowed to say no. Most pipelines I have seen fail because the loop is a yes-loop. A yes-loop is just a faster way to write worse code.

Run history: the shape of three nights

Here is what the runs actually looked like, in order. A and A2 were the last pure local runs. B was a bad refactor I should not have shipped. Opus-4B was an experiment with a different orchestrator that collapsed on me. D is where I combined HyperAgent and Autoresearch and let it run overnight.

Run A (baseline)47.5%

Run A240.0%

Run B (bad refactor)20.0%

Opus-4B (dead end)0.0%

Run D Iter 160.0%

Run D Iter 20.0%

Run D Iter 3 (self-mod pass)40.0%

Two things to notice. First, Run D Iter 2 scored zero. That is the point. The Critic told the Implementer to try a structural change that blew up the entire run. That failure informed Iter 3. A benchmark that never fails is a benchmark you cannot learn from.

Second, Run D Iter 3 scored 40% average, but it is the first run in the entire history of this harness to pass the self-modifying system task. That is the task that asks the pipeline to read its own source code, propose a change, apply it, and verify the change improved the score. Every previous run got a zero on it. Iter 3 got 70. I stared at that number for a full minute before I believed it.

Local versus cloud: the part that surprised me

I assumed local qwen3:14b on my own machine would always be the fastest thing available because there is no network. I was wrong. Cloud inference via the 8GI model proxy to an Amsterdam vessel running a free tier model was roughly 15 times faster per Analyst step.

37-123s

local analyst

2-7s

cloud analyst

15x

faster per step

98%

factory idle removed

The reason local was slow is obvious in retrospect. My laptop has one GPU. It was serving the Analyst, the Critic, and the Implementer from the same 14 billion parameter weight file, doing full reloads between calls. The cloud vessel had a free tier model with a hot cache and nothing else to do. I was paying for "local means private" with a tax called "local means serialised". Separating the inference plane from the orchestration plane was the single highest leverage change I have made in the last month.

The cost picture from Fly.io tells the same story from the finance side. March bill was $31.40 across 6 apps with 88% of it burning on idle factory workers. April is on track for around a tenth of that, from 4 apps, with a 59 MB lean vessel image that holds the orchestrator only and calls out to the model proxy for inference. That is not a frugality choice. That is a separation of concerns choice that happens to also be frugal.

The first self-modifying pass

This is the bit worth framing. The harness has five ARC-AGI style tasks. Four of them are variants of the usual suspects (grid transformation, sequence induction, concept binding, rule inference). The fifth is called self-modifying system. It hands the pipeline a deliberately wrong implementation of its own logic and asks it to fix the source file, rerun the benchmark, and show the score went up. Every previous run scored zero. Run D Iter 3 scored 70.

What actually happened is embarrassingly simple. The Analyst read the wrong implementation. The Critic rejected the first two proposed fixes as "surface level symptom hiding". The second Analyst pass produced a minimal patch. The Implementer wrote it out. Claude Code applied it to the file. The benchmark rerun passed. No one was in the loop. I was asleep.

The reason it worked this time and never before is the Critic. Earlier runs had an Analyst that was allowed to mark its own homework. A self-marking homework pipeline cannot debug itself because every wrong answer reads as right. The instant you insert a Critic that is genuinely allowed to say no, the whole thing becomes debuggable from the inside.

Why 8GI wins, I think, eventually

Every benchmark thread online at the moment is a score for a raw model. That is the wrong unit. The model is the engine. The engine is not the car. What Run D actually measured is the gap between a raw qwen3:14b result and the same model running inside a pipeline with Analyst / Critic / Adaptive HyperAgent / No-BS Critic wrapped around it. The gap is roughly double on hard tasks. On the self-modifying task it is infinity to zero, because zero times anything is zero.

That is the 8GI stack. Model proxy underneath. Sequential pipeline in the middle. Adaptive HyperAgent on top. No-BS Critic across the whole thing telling me when my own work is weak. It is four parts. None of them is the model. All of them are the reason the model does anything useful.

What I actually learned

The Critic is the pipeline. Everything else is negotiable. If you cannot afford a critic model, the Critic is a validator function that checks the last patch against the actual test output. Either way, something has to be allowed to say no.
Local is not free. Local means serialised unless you have more than one GPU, and most of us do not. Cloud with a free tier model on an adjacent vessel can be cheaper than running a 14B model on your laptop because your laptop is already trying to do seventeen other things.
Failure runs are data. Run D Iter 2 scored zero. That zero taught Iter 3 exactly what structural change not to make. I used to throw failure runs away. I do not do that anymore.
Overnight is the point. Autoresearch is not a framework. It is a permission slip. Claude Code is a paid colleague. The 8GI harness is an unpaid one. Both of them can keep working after I log off. I used to feel weird about that. I do not anymore.

Where this goes next

Seven more vessels, each running a different free tier model, deliberating against each other through the model proxy. A circuit breaker that kills any vessel costing more than $2 a day. A model queue where any new open weight model can be dropped in and scored against the current champion without code changes. An ablation harness where I can turn off the Critic, the HyperAgent layer, or the No-BS Critic individually and watch the score collapse.

The goal for Friday April 4 was to launch this publicly as the default way 8GI ships inference. It did launch. It is running now. The harness that built itself while I slept is the harness that scored every public benchmark I have posted since. That is the part that still sits weird with me. I did not write most of Run D. Claude Code and the harness wrote Run D. I wrote the conditions under which they were allowed to cook.

If you want the full picture

The deck with every graph, the full pipeline diagram, the Fly.io cost breakdown, the eight cloud nuances I had to debug at 1 AM, and the model queue lives here: /presentations/run-d-architecture-evolution. It is inside the gate. If you are reading this you probably already have access.

The model is the engine. 8GI is the car. And the car, it turns out, can build its own dashboard if you let it cook.

View the Run D architecture deck

By James Spalding · Dublin · 8GI