Coding agents can write code, but reliable multi-step work depends on the harness around the model.

Figure: AI agentic workflow paradigms (source: r/ClaudeCode)


The agent looked competent until the workflow got longer

Picture a small infrastructure team trying to standardize how it uses a coding agent across a growing repository.

At first the experience is encouraging. Someone asks the agent to add a CLI flag, wire a metrics endpoint, update a test, and explain the change. The answer is fast, the patch is plausible, and the team starts to feel that a large part of everyday engineering has become compressible.

Then the work stretches across time.

The next request is no longer one patch. It is a chain of tasks: inspect the repository, follow local conventions, split work across a few bounded roles, preserve intermediate findings, ask for approval before risky actions, run validation, and leave enough evidence that another engineer can reconstruct what happened tomorrow. The agent can still generate code, but the run starts to drift. A convention that lived in yesterday’s chat disappears. A review step is skipped because it was implied rather than encoded. One session writes useful notes into a temporary buffer; the next session cannot see them. Another runtime has different assumptions about where reusable instructions belong. Nothing is obviously broken in the model. The workflow around the model is simply too implicit.

That gap is where harness engineering enters the picture.

People often talk about coding agents as though the central question is whether the model can produce code. For short tasks, that framing is good enough. For repeated work in a real repository, it stops being the hard part. The harder question is how the model sees the codebase, what tools it can call, where intermediate state lives, how work is handed off, what gets reviewed, and which parts of the workflow survive the next session or the next runtime.

My project, Meta Harness, starts from that practical problem. It is an attempt to make agent workflows durable at the repository level instead of leaving them trapped inside one host’s hidden assumptions.


Agents work through loops

Conceptually, a coding agent is less mysterious than it first appears once you view it as a loop.

The model observes some state, chooses an action, receives the result, updates its view of the task, and continues until it reaches a stop condition or asks for help. The loop can be simple or elaborate, but the basic structure is stable:

state = observe(task, repo, tools, policy)      # initial view of the world

while not done(state):                          # stop condition, or escalate for help
  action = model.decide(state)                  # model proposes the next step
  result = run_tool_or_emit_output(action)      # harness executes it or emits output
  state = incorporate(state, action, result)    # result feeds back into the state

That sketch already explains why coding agents feel different from plain autocomplete, or more generally from one-shot LLM code generation. Autocomplete predicts the next tokens inside a file. An agent carries state across multiple steps, uses tools, reads the environment, produces artifacts, and sometimes revises its own plan after seeing what those tools return.
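To make the loop concrete, here is a self-contained, runnable toy version. The "model" is a scripted stand-in and the only tool is an in-memory note writer; every name here is invented for this sketch and is not part of any real agent runtime.

```python
# Toy agent loop. A real harness would call an LLM and real tools;
# here a scripted policy and an in-memory workspace stand in for both.

def run_tool(action, workspace):
    # Toy tool: append a note to an in-memory "file".
    if action["tool"] == "write_note":
        workspace.setdefault("notes.md", []).append(action["text"])
        return {"ok": True, "path": "notes.md"}
    return {"ok": False, "error": f"unknown tool {action['tool']}"}

def scripted_model(state):
    # Stand-in decision policy: record each pending step, then stop.
    if state["pending"]:
        return {"tool": "write_note", "text": state["pending"][0]}
    return {"tool": "stop"}

def agent_loop(task_steps, max_iters=10):
    state = {"pending": list(task_steps), "history": []}
    workspace = {}
    for _ in range(max_iters):  # hard iteration cap guards the loop
        action = scripted_model(state)
        if action["tool"] == "stop":
            break
        result = run_tool(action, workspace)
        state["history"].append((action, result))
        if result.get("ok"):
            state["pending"].pop(0)
    return state, workspace

state, workspace = agent_loop(["inspect repo", "add flag", "run tests"])
print(workspace["notes.md"])  # each step left a durable, inspectable note
```

Even at this toy scale, the important property is visible: the state and the workspace outlive any single model call, so the interesting engineering decisions are about how they are structured, not about any one completion.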

Once the loop grows beyond a few iterations, local omissions become expensive. Which files are canonical? Which directories are writable? Should the agent preserve intermediate reasoning as a report, a checklist, or a deterministic handoff file? When a risky action appears, should it stop, ask, retry, or route to review? If a run is split across roles, how does the next role know what the previous one actually found rather than what it vaguely intended?

All of these are workflow questions, not model questions.


The harness is the surrounding control plane

A harness is the control surface that turns that loop into a usable engineering workflow.

It defines what the agent can see, what it is allowed to call, how it should interpret repository policy, where reusable instructions live, how intermediate artifacts are named, how quality checks are enforced, and how the run is made legible after the fact. Context, permissions, artifact contracts, review edges, retries, and durable memory all live here.
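One way to see why this is a control surface rather than prose is to write the contract down as data. The field names below are assumptions for this sketch, not a real Meta Harness API; the point is that each harness concern becomes an explicit, checkable value.

```python
# Illustrative only: a harness contract expressed as data, so the loop
# can consult it before acting. Field names are invented for this sketch.
from dataclasses import dataclass

@dataclass
class HarnessContract:
    readable_paths: list      # what the agent can see
    writable_paths: list      # where intermediate artifacts may land
    allowed_tools: list       # what it is permitted to call
    policy_file: str          # repo-wide policy, e.g. AGENTS.md
    artifact_pattern: str     # naming contract for handoff files
    review_required: bool     # whether a review edge is mandatory
    max_retries: int = 2      # retry budget before routing to a human

contract = HarnessContract(
    readable_paths=["src/", "docs/harness/"],
    writable_paths=["_workspace/"],
    allowed_tools=["read_file", "write_file", "run_tests"],
    policy_file="AGENTS.md",
    artifact_pattern="_workspace/{phase}_{role}_{artifact}.md",
    review_required=True,
)

def is_write_allowed(contract, path):
    # Permission check the loop consults before any write action.
    return any(path.startswith(p) for p in contract.writable_paths)
```

Once the contract is data, a write outside `_workspace/` is a rejected action rather than a convention someone forgot to mention in chat.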

This matters because real engineering work is full of state that does not naturally fit into a single prompt. Some of it is stable and should be versioned. Some of it is temporary and should still be inspectable. Some of it is procedural and should be written down once rather than reconstructed from chat history over and over again.

You can ignore the harness in a demo. The more often the workflow is repeated, the less optional that layer becomes.

That is why I think the most useful way to model current coding agents is not “LLM plus tools” in the abstract. It is “LLM loop plus a harness that decides how the loop touches reality.” The model generates the next step. The harness determines whether the step is reproducible, reviewable, and worth trusting.


Why harness engineering is getting attention now

Harness engineering has become more visible because the bottleneck has shifted.

On February 11, 2026, OpenAI published “Harness engineering: leveraging Codex in an agent-first world”. The part that stood out to me was the description of engineering work moving upward into environment design, repository knowledge, and feedback loops. Once an organization can plausibly ship substantial software with agent help, repository instructions, tests, validation surfaces, and context boundaries stop being support material. They become part of the production system.

The operational consequence is straightforward. Teams no longer need only better prompts. They need a system of record that agents can read repeatedly without re-deriving the same conventions from scratch. If that system of record remains informal, every new run pays a tax in drift.

Another signal came on March 9, 2026, when the Model Context Protocol maintainers published “The 2026 MCP Roadmap”. The roadmap emphasizes transport evolution and scalability, agent communication, governance maturation, and enterprise readiness. Those are not the concerns of a toy tool-calling wrapper. They are the concerns that appear when agent workflows start touching production systems, teams, and organizations.

That change matters because once tools, transports, and agent-to-agent coordination become explicit protocol concerns, the harness stops being an internal implementation detail. It becomes part of how the system is designed, audited, and governed.

There is a third reason this layer is easier to see now: different agent ecosystems are converging on repository-local customization. The OpenHands repository customization docs center a .openhands directory, project-specific skills, setup scripts, and pre-commit checks. Even if the runtime differs, the engineering instinct is similar. Put durable instructions near the code. Preserve repeatable behavior as repository artifacts. Treat agent behavior as something a team can inspect and maintain, not as a secret living in one conversation.

Taken together, these shifts explain why harness engineering now has a name that more people recognize. The model got strong enough that the surrounding workflow became the visible constraint.


What I kept from the original Harness

Meta Harness is an adaptation of revfactory/harness rather than a reinvention from zero.

The original project already had the most important conceptual pieces. It treated multi-agent or multi-role work as something that should follow an explicit workflow rather than ad hoc improvisation. It preserved a six-phase structure: domain analysis, team architecture design, role and artifact definition generation, skill generation, integration and orchestration, and validation and testing. It also organized coordination around a useful pattern catalog: Pipeline, Fan-out/Fan-in, Expert Pool, Producer-Reviewer, Supervisor, and Hierarchical Delegation.

Those ideas travel well because they describe decomposition, review, and handoff, not one specific host application.

I also wanted to preserve the upstream emphasis on progressive disclosure. The main skill should stay readable. Bulky edge cases, templates, and pattern guidance should live in reference documents. That design choice matters more in agent workflows than in ordinary documentation because overloaded instruction files quickly become both harder to select and harder to maintain.

The same goes for QA. The upstream project clearly understood that agent workflows need explicit quality boundaries. A strong harness does not assume that generation quality will emerge from enthusiasm. It makes review a first-class part of the design.

What did not travel cleanly were the runtime-specific assumptions. The original Harness was shaped around Claude Code plugin and agent-team conventions. Inside that environment, assumptions about generated .claude/agents, .claude/skills, direct agent messaging, host-specific packaging, and other runtime conveniences were coherent. They were part of a working system.

My concern was durability. Those assumptions are fragile if the repository itself is supposed to carry the workflow across time, teams, clients, and agent tools.


Why I moved the contract into the repository

That concern is what led to Meta Harness.

The project keeps the portable core and rewrites the control surface so the repository, rather than the host runtime, carries the durable contract.

At the top level, AGENTS.md becomes repo-wide policy. Reusable behavior lives under .agents/skills/. Durable workflow specifications live under docs/harness/. Intermediate state is handed off through deterministic _workspace/ files. The main skill at .agents/skills/harness/SKILL.md defines when the harness should be used, what inputs it requires, what artifacts it generates, which defaults are meant to stay portable, and how the six-phase workflow and pattern selection should be applied.

That main skill stays intentionally lean. The details move into references such as agent-design-patterns.md, autonomous-experimentation.md, orchestrator-template.md, team-examples.md, skill-writing-guide.md, skill-testing-guide.md, and qa-agent-guide.md. I wanted the project to preserve serious workflow guidance without collapsing into one giant instruction file that every future user would need to mentally parse in a single pass.

The generated artifact model is equally important. Meta Harness encourages the smallest durable package that fits the domain: a team spec such as docs/harness/{domain}/team-spec.md, specialist skills when the behavior is stable enough to reuse, role briefs when a role needs a durable contract but not a full skill, and deterministic handoff files like _workspace/{phase}_{role}_{artifact}.md.
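The handoff naming contract is simple enough to sketch directly. This helper is hypothetical, not part of Meta Harness; it just shows what "deterministic" means here: the same phase, role, and artifact always map to the same path.

```python
# Hypothetical helper illustrating the deterministic handoff naming
# contract _workspace/{phase}_{role}_{artifact}.md described above.
import re

def handoff_path(phase: str, role: str, artifact: str) -> str:
    def slug(s: str) -> str:
        # Normalize free text into a stable, filesystem-safe token.
        return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
    return f"_workspace/{slug(phase)}_{slug(role)}_{slug(artifact)}.md"

print(handoff_path("domain analysis", "Researcher", "findings"))
# -> _workspace/domain-analysis_researcher_findings.md
```

Because the mapping is deterministic, the next role in the chain can compute where to look instead of asking the previous role what it happened to name the file.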

This is a different philosophy from live agent chatter. If the work matters, another person should be able to inspect the handoff surface without replaying an entire conversation from memory.

The same logic shapes the autonomous experimentation support. Meta Harness treats autonomous experimentation as a workflow profile, not a seventh architecture pattern. It composes with patterns such as Pipeline, Supervisor, or Producer-Reviewer. The key requirement is that the mutable surface, immutable evaluation surface, baseline run, comparison rule, keep or discard policy, and failure policy are declared up front. The run ledger then lives in deterministic files such as _workspace/experiments/{run}/results.tsv, along with baseline.md and final-summary.md. That design owes more to auditability than to glamour, which is exactly why I think it matters.
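The keep-or-discard policy can also be sketched as code. The ledger schema and comparison rule below are illustrative assumptions, not the actual Meta Harness format; the point is that the rule is declared up front and every run leaves a row behind.

```python
# Sketch of a declared keep-or-discard rule plus a tab-separated run
# ledger, under an assumed schema: run id, score, verdict.
import csv
import io

def keep_run(baseline_score: float, run_score: float, min_gain: float = 0.0) -> bool:
    # Declared comparison rule: keep only runs that beat baseline by min_gain.
    return run_score >= baseline_score + min_gain

def append_ledger_row(ledger, run_id: str, score: float, kept: bool):
    writer = csv.writer(ledger, delimiter="\t")
    writer.writerow([run_id, f"{score:.4f}", "keep" if kept else "discard"])

ledger = io.StringIO()  # stands in for _workspace/experiments/{run}/results.tsv
baseline = 0.82
for run_id, score in [("run-001", 0.80), ("run-002", 0.85)]:
    append_ledger_row(ledger, run_id, score, keep_run(baseline, score))
print(ledger.getvalue())
```

An auditor reading the ledger later does not need to trust anyone's memory of which runs were kept or why; the rule and its outcomes are both on disk.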

Two scripts make the project feel like engineering rather than prose.

The first is scripts/install_harness.py. It installs the canonical harness skill either at project scope or user scope and supports multiple layouts: standard, forgecode, droid, openhands, and aider. That is a small but important statement about the project. The goal is not to bless one runtime forever. The goal is to preserve a shared durable skill tree and add client-specific mirrors only where they are actually useful.

The second is scripts/validate_codex_port.py. It checks required files, internal links, skill headings, pattern coverage, compatibility guidance, and the absence of stale legacy runtime tokens such as old .claude/* paths or host-specific commands that no longer belong in the canonical docs. Many agent projects stop after describing the workflow. I wanted this one to verify that the workflow description itself had not drifted.
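To show the flavor of one such check, here is a minimal stale-token scan. This is a sketch, not the actual scripts/validate_codex_port.py, and the token list is an assumption; the real script checks far more than this.

```python
# Minimal illustration of one validator check: scanning canonical docs
# for stale legacy runtime tokens such as old .claude/* paths.
from pathlib import Path

STALE_TOKENS = (".claude/agents", ".claude/skills")  # assumed token list

def find_stale_tokens(root: Path):
    hits = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for token in STALE_TOKENS:
            if token in text:
                hits.append((str(path), token))
    return hits

# Usage: a CI step could fail the build whenever hits is non-empty,
# so drift back toward host-specific paths is caught mechanically.
```

The specific tokens matter less than the habit: the workflow description is itself an artifact, so it gets its own regression checks.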

Put differently, Meta Harness is trying to treat the harness as a repository contract with tests, not just as a clever prompt.


What teams gain when the workflow becomes inspectable

The most immediate gain is portability.

Teams are already mixing clients, sandboxes, and local conventions. One engineer may prefer a Codex-style environment. Another may run OpenHands. A third may want Aider-style local assistance. If the durable workflow only exists as one runtime’s internal abstraction, every change of environment becomes a partial rewrite. If the durable workflow lives in versioned repository artifacts, the host can change while the control surface remains legible.

The next gain is clearer review.

A lot of current agent enthusiasm still assumes that coordination will take care of itself. In practice, teams need to know which role owned which artifact, which review step was required, what evidence supported a recommendation, and what happened when something failed. File-based handoffs and explicit team specs are slower than magical hidden messaging in the narrow sense that they force the workflow to become visible. They are faster in the broader sense that they reduce reconstruction cost when something goes wrong.

Debugging improves for the same reason. An agent run that leaves behind structured intermediate artifacts is easier to audit than one that simply emits a final patch and a confident summary. If a specialist reached the wrong conclusion, a reviewer can inspect the actual handoff surface. If a workflow is repeatedly failing at the same phase, the failure is no longer buried in private runtime state.

This also changes how repository knowledge ages. Teams lose a surprising amount of agent effectiveness not because the model regresses, but because the instructions around the model remain soft. A convention exists, but only as team memory. A review policy exists, but only as a sentence in an old issue. A useful decomposition exists, but only in the head of the engineer who first got the workflow working. Turning those pieces into AGENTS.md, reusable skills, team specs, and validation logic makes them available to the next run and the next person.

That is why the OpenAI post’s emphasis on repository knowledge resonated with me. The interesting shift is not merely that agents can write more code. It is that repositories now need to carry more operational knowledge in a form agents can repeatedly consume. Meta Harness is my attempt to make that shift concrete.


The durable part of agent work

Coding agents can be modeled quite simply: repeated model calls, tool access, observations, and updates inside a loop. That model is useful, but it is not the full engineering story.

The practical difficulty begins one layer out, at the point where loops need structure. A harness decides how the loop touches the repository, how work is split, how state is preserved, how review happens, how failure is recorded, and which knowledge survives the session. That is why harness engineering has become easier to notice in 2026. The better the model gets, the more obvious the surrounding workflow becomes.

Meta Harness is my attempt to push that workflow into versioned, inspectable repository artifacts. It keeps the original Harness ideas that describe work well: phased design, explicit patterns, progressive disclosure, and QA. It drops the parts that were too tightly bound to one host runtime. In their place it uses repo-local skills, durable team specs, deterministic _workspace/ handoffs, compatibility layouts, and validation scripts that treat the harness itself as something worth maintaining carefully.

If this view is right, teams should spend less time asking only which prompt produced the best patch and more time deciding what belongs in AGENTS.md, which specialist behaviors deserve reusable skills, which handoffs should be visible in _workspace/, where a team spec is better than a chat habit, and what should be validated so the workflow does not decay.

The model can generate the next step, but the harness is what makes the step reusable.