After several months of agentic development, I found that generated code was abundant while confidence in the generated system stayed scarce.

[Image: agent vs harness]


The Review Started To Matter More Than The Patch

Picture the end of an agent-assisted coding session. The terminal says the tests pass. The agent has summarized the change with the usual confidence. A new module exists, the examples run, and the diff is larger than anything I would have written by hand in one sitting. I still have to read it.

In one project, I am looking at a numerical routine that converges on the small cases but still needs to behave sensibly on harder inputs. In another, I am reviewing an econometric workflow whose output looks plausible, with filtering and adjustment steps that need to be inspectable. In a third, I am checking an ML evaluation tool where one wrong assumption about data splits could make the entire result misleading. The agent has helped produce a lot of code. That part is visible immediately. What takes longer is deciding whether I understand the system well enough to trust it.

That review moment has shaped how I think about agentic development.

When I started experimenting more seriously with coding agents, my questions were practical. How capable are these systems? Can they help with the kinds of analytical, statistical, and scientific software I actually build? How expensive is the review? Can I keep the workflow modest?

Rather than going all-in, I used ordinary subscription tiers, personal projects, normal local development environments, and human review throughout. The obvious question at the beginning was how much code these systems could generate. After several months, a different question mattered more: how much of the generated system could I confidently understand, verify, and maintain?


The Projects Were Experiments, But They Were Also Real Tools

The projects I used for this were experiments in agentic workflow, closer to real tools than disposable demos.

Some were numerical or statistical tools. Some analyzed codebases and dependency graphs. Some were Rustlings-style learning systems. Others involved synthetic data generation, NLP utilities, or model evaluation workflows. The exact list matters less than the constraint they shared: I wanted to use them after the first generated version existed.

That changes the meaning of success for an agent-generated artifact.

A toy app can succeed if it runs once and shows the shape of an idea. A real tool has to survive contact with new inputs, future changes, and the unpleasant moment when a result looks wrong and someone has to explain why. If an agent generates a script that produces a number, that is useful. If the script hides the assumptions behind that number, the review cost moves downstream. I still have to understand the transformation, inspect the intermediate state, and decide whether the output deserves confidence.

One econometric workflow made this concrete. The agent produced a reasonable-looking table and a clean report, but the generated pipeline had handled missing values in two different places: once during feature construction and again before model fitting. Neither choice was bizarre on its own. Together they changed the denominator in a way that was easy to miss from the final output. The fix was not just a code change. I needed the workflow to expose row counts, filtering reasons, and adjustment metadata so I could tell which observations survived each stage.
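
As a rough illustration of what "exposing row counts and filtering reasons" can look like, here is a minimal Python sketch. It is not the workflow from that project: the names `StageAudit` and `filter_stage`, the column names, and the four sample rows are all hypothetical, chosen only to show two missing-value filters that each report what they dropped.

```python
from dataclasses import dataclass, field


@dataclass
class StageAudit:
    """Hypothetical audit record for one pipeline stage."""
    stage: str
    rows_in: int
    rows_out: int
    dropped: dict[str, int] = field(default_factory=dict)  # reason -> count


def filter_stage(stage, rows, rules):
    """Apply named drop rules and return (kept_rows, audit).

    `rules` maps a reason string to a predicate that returns True
    when a row should be dropped for that reason.
    """
    kept, dropped = [], {}
    for row in rows:
        # The first matching rule is recorded as the reason for the drop.
        reason = next((name for name, pred in rules.items() if pred(row)), None)
        if reason is None:
            kept.append(row)
        else:
            dropped[reason] = dropped.get(reason, 0) + 1
    return kept, StageAudit(stage, len(rows), len(kept), dropped)


# Made-up rows, for illustration only.
rows = [
    {"id": 1, "income": 50.0, "age": 40},
    {"id": 2, "income": None, "age": 31},
    {"id": 3, "income": 72.0, "age": None},
    {"id": 4, "income": 61.0, "age": 55},
]

features, audit_features = filter_stage(
    "feature_construction", rows,
    {"missing_income": lambda r: r["income"] is None},
)
fit_rows, audit_fit = filter_stage(
    "pre_fit", features,
    {"missing_age": lambda r: r["age"] is None},
)

for audit in (audit_features, audit_fit):
    print(audit.stage, audit.rows_in, "->", audit.rows_out, audit.dropped)
```

With both stages reporting their own counts, the second filter is no longer invisible: the denominator change shows up as two separate audit records instead of one unexplained final number.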

That is the kind of failure analytical software invites. A generated chart, coefficient table, graph summary, or evaluation report can look polished while carrying a questionable preprocessing choice. The dangerous case is often a plausible result with no easy path back to the assumptions that produced it, rather than a syntax error or a crashing test.

These projects became a useful testing ground because they were small enough for personal experimentation and serious enough to require independent review.


Generation Scaled Faster Than Review

The first surprise was how quickly implementation stopped being the only bottleneck.

Coding agents are good at producing volume. They can scaffold modules, wire command-line interfaces, add tests, write adapters, translate patterns across files, and perform refactors that would otherwise feel tedious. When the surrounding repository gives them enough context, they can move through ordinary implementation work at a pace that still feels strange if you remember doing every edit manually.

Review follows a different clock.

The time saved during implementation often reappeared as a different kind of work: reading generated architecture, validating statistical assumptions, checking algorithmic details, designing better tests, inspecting sample outputs, and looking for edge cases that stayed hidden in the first pass. Some of this is normal engineering work. The point is that agentic coding made the imbalance easier to see. The code arrived quickly, but confidence arrived slowly.

That observation reinforced my human-in-the-loop position. I hold that position because coding agents are strong enough to generate systems larger than I can review casually. Once that happens, supervision matters more, not less.

The review problem also changed what I valued in the surrounding process. A better prompt helped. A stronger model helped. More context helped. The largest gains came from making the work easier to inspect: clearer plans, narrower interfaces, better tests, more explicit failure states, and intermediate artifacts that showed how the output was produced.


The Model Was Only One Part Of The System

During this period I tried multiple model families and multiple agent harnesses. That experience is more useful here as calibration than as a benchmark section. The comparisons were informal, and the projects were too different to support clean ranking.

The useful observation was more practical. At first, it was tempting to attribute most differences in outcome to model capability. When one run worked and another wandered, the obvious explanation was that the first model was smarter or the second model was weaker. Sometimes that was true. Stronger models remained valuable, especially when the requirements were ambiguous, the design space was open-ended, or the repository context was large.

As my supervision improved, another pattern became harder to ignore. Some differences that looked like model-quality differences were really process-quality differences. A vague task with weak tests gave even a strong model too much room to produce a plausible but awkward answer. A crisp task with clear invariants, local examples, and a good validation loop made less expensive or less frontier models much more useful.

I saw this most clearly on tasks that touched data contracts. A stronger model could often infer that a new field needed to move through a parser, a validator, a transformation step, and a report. A smaller model sometimes updated only the obvious call sites and left one downstream artifact stale. Once I wrote the task as a contract change with expected input and output examples, the gap narrowed. The model had a path: update the type, update the transformation, update the report, run the workflow, compare the audit output.
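
A contract change written as examples might look something like the following sketch. Everything in it is an assumption made for illustration: the `Observation` type, the `unit` field standing in for "the new field", the four stage functions, and the two example pairs. The point is only that the expected input and output sit next to the code path they constrain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    region: str
    value: float
    unit: str  # hypothetical new field that must survive every stage


def parse(raw: dict) -> Observation:
    return Observation(raw["region"], float(raw["value"]), raw["unit"])


def validate(obs: Observation) -> Observation:
    if obs.unit not in {"usd", "eur"}:
        raise ValueError(f"unsupported unit: {obs.unit}")
    return obs


def transform(obs: Observation) -> dict:
    return {"region": obs.region.upper(), "value": obs.value, "unit": obs.unit}


def report_line(row: dict) -> str:
    return f"{row['region']}: {row['value']:.2f} {row['unit']}"


def run(raw: dict) -> str:
    """Full path: parser -> validator -> transformation -> report."""
    return report_line(transform(validate(parse(raw))))


# The contract, stated as (input, expected report line) pairs.
CONTRACT_EXAMPLES = [
    ({"region": "north", "value": "12.5", "unit": "usd"}, "NORTH: 12.50 usd"),
    ({"region": "south", "value": "3", "unit": "eur"}, "SOUTH: 3.00 eur"),
]

for raw, expected in CONTRACT_EXAMPLES:
    assert run(raw) == expected, (raw, expected)
```

If a stage forgets the new field, the example fails at the report line, which is the stale downstream artifact that is otherwise easy to miss.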

That changed the way I evaluated agentic development. The model mattered as one part of a larger system: harness, repository structure, instructions, tests, review standards, and the human’s ability to notice when the system had drifted. Improving that system made outcomes more predictable.

This also made me less interested in one-shot stories about which model “won.” For real work, the more durable question was where model judgment was actually needed and where the task could be made mechanical enough for several models to handle safely.


The Collaboration Protocol Learned Alongside The Projects

My early workflow was simple: prompt, generate, review, repeat.

That worked for short tasks. It became less satisfying as the projects became more serious. After each project, I found myself updating the surrounding protocol. I changed agent instructions. I tightened review standards. I wrote planning templates. I made coding preferences more explicit. I became more opinionated about when a design needed types, when errors needed to be represented directly, when outputs needed audit metadata, and when a test suite needed real data rather than only small synthetic cases.

The protocol changed through small failures. If an agent produced a clever implementation that was hard to review, the next instruction emphasized boring structure over cleverness. If a plan skipped failure modes, the next template required them. If a generated summary said “all tests pass” without naming the command, the review standard started asking for the actual command and result. Each adjustment patched the collaboration surface.

The collaboration process itself was learning from the work, alongside the model.

This is easy to miss because agentic coding often presents itself as a conversation with a model. In practice, repeated use turns into a small engineering system. The system includes written norms about architecture, testing, error handling, observability, and review. It includes repository instructions that survive a chat session. It includes examples that tell the agent what local quality looks like. It includes the human habits that decide when a generated answer is good enough and when it needs to be decomposed.

As those norms became more explicit, I relied less on the agent inferring them correctly every time.

This is the first place where ordinary software engineering practice started to look different to me. It organized code, but it also organized collaboration with a system that can act quickly and needs a visible contract.


Software Engineering Is Coordination Technology

Software engineering practices evolved to coordinate teams of humans. Requirements analysis aligns intent. Planning reduces ambiguity before implementation begins. Design documents preserve architecture across time. Tests align expectations about behavior. Documentation keeps context from living only in someone’s head.

Agentic development changes the participants while leaving the coordination problem intact.

An agent still needs to know what matters. It needs to know which constraints are binding, which conventions are local, which files are authoritative, and what counts as success. A human reviewer still needs to know what the agent thought it was doing. If those pieces remain implicit, the model fills gaps with plausible defaults. Some of those defaults are helpful; others are wrong in ways that look fluent until the system is exercised.

Older engineering habits give ambiguity somewhere to land before it turns into code. A plan communicates with another engineer and acts as a control surface for a system that will fill whatever gaps remain with its own defaults. A test catches regressions and defines part of the behavioral envelope the agent is trying to satisfy. Documentation onboards people and preserves context that can be consumed across sessions, tools, and future runs.

AI makes the cost of missing software engineering discipline show up in new places. Ambiguity becomes generated structure. Weak review becomes accepted code. Hidden assumptions become plausible behavior that no one remembers asking for.


Reviewability Became A Design Constraint

Over time, many of my technical choices began to pass through one question: how difficult will this be to review later?

That question pushed me toward FP-style structure more often. By that I mean practical habits rather than pure functional programming as a language identity: explicit data structures, narrow functions, visible inputs and outputs, immutable-ish transformations where possible, and side effects pushed toward boundaries.

Those habits helped because they changed the shape of review. Instead of reconstructing broad mutable state, I could inspect a transformation. What came in? What came out? Which assumptions were encoded in the type, the function signature, and the tests? That made the review surface smaller.
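
A small sketch of that shape, under my own assumptions: the `Series` type and the `clip` function are hypothetical names, and clipping is just a stand-in for any adjustment step. The transformation takes explicit input, returns a new value, and leaves the original untouched; printing happens only at the edge.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Series:
    """Hypothetical immutable container for a named column of values."""
    name: str
    values: tuple[float, ...]


def clip(series: Series, lower: float, upper: float) -> Series:
    """Pure transformation: returns a new Series, never mutates the input."""
    clipped = tuple(min(max(v, lower), upper) for v in series.values)
    return replace(series, values=clipped)


raw = Series("share", (-5.0, 0.5, 9.0))
adjusted = clip(raw, lower=0.0, upper=1.0)

# Side effect kept at the boundary: the only I/O is this final print.
print(raw.values, "->", adjusted.values)
```

Reviewing a change to `clip` means reading one signature and one expression. There is no shared state to reconstruct.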

Railway-oriented programming helped for a related reason. Analytical and scientific tools fail in ways that matter: invalid input, unstable convergence, missing columns, impossible configuration, degenerate data, solver warnings, and external dependency failures. Representing those states explicitly gives the reviewer something to inspect. Hiding them behind exceptions, logs, or optimistic happy-path code lets the agent generate something that works on the sample case while leaving the failure behavior under-specified.
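
For readers who have not seen the pattern, here is a minimal railway-style sketch in Python. It is a hand-rolled illustration, not a library API: `Ok`, `Err`, `bind`, and the three steps are hypothetical names, and the log-ratio calculation is only a placeholder for a real estimation step.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err:
    reason: str


Result = Union[Ok[T], Err]


def bind(result: Result, step: Callable[[Any], Result]) -> Result:
    """Run `step` on the success track; pass any Err through unchanged."""
    return step(result.value) if isinstance(result, Ok) else result


def require_columns(row: dict) -> Result:
    missing = [c for c in ("x", "y") if c not in row]
    return Err(f"missing columns: {missing}") if missing else Ok(row)


def require_positive(row: dict) -> Result:
    if row["x"] <= 0 or row["y"] <= 0:
        return Err("degenerate input: x and y must be positive")
    return Ok(row)


def log_ratio(row: dict) -> Result:
    return Ok(math.log(row["y"] / row["x"]))


def run(row: dict) -> Result:
    result: Result = Ok(row)
    for step in (require_columns, require_positive, log_ratio):
        result = bind(result, step)
    return result


print(run({"x": 2.0, "y": 2.0}))
print(run({"x": 2.0}))
print(run({"x": 0.0, "y": 1.0}))
```

Each failure state is a value with a stated reason, so a reviewer can enumerate the failure paths by reading the steps rather than by guessing which exceptions might escape.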

This is a reviewability argument rather than a purity argument. Real systems still need mutation, I/O, databases, files, subprocesses, and APIs. The question is whether those effects are visible enough that a reviewer can reason about them. A generated patch that changes a pure transformation is usually easier to inspect than one that changes a stateful service with shared caches, implicit configuration, and several call sites that mutate the same object.

Reviewability also changed how I thought about outputs. I became less satisfied with final answers alone. I wanted intermediate artifacts: transformation outputs, convergence diagnostics, solver warnings, filtering decisions, adjustment metadata, configuration snapshots, and JSON or YAML audit information. Those artifacts made the system inspectable rather than self-certifying, and inspection is where correctness can actually be argued about.
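
One way such an artifact could be produced is sketched below. The solver is a deliberately simple stand-in (Newton's method for a square root), and the function name `newton_sqrt`, the audit layout, and the config keys are my own assumptions rather than anything from the projects described here. The idea is that the routine returns its diagnostics together with the estimate, and the run writes both next to a snapshot of the configuration.

```python
import json


def newton_sqrt(target: float, tol: float = 1e-10, max_iter: int = 50) -> dict:
    """Newton's method for sqrt(target), returning estimate plus diagnostics."""
    if target <= 0:
        raise ValueError("target must be positive")
    x = max(target, 1.0)  # simple starting guess
    step_sizes = []
    converged = False
    for _ in range(max_iter):
        x_next = 0.5 * (x + target / x)
        step_sizes.append(abs(x_next - x))
        x = x_next
        if step_sizes[-1] < tol:
            converged = True
            break
    return {
        "estimate": x,
        "converged": converged,
        "iterations": len(step_sizes),
        "step_sizes": step_sizes,
    }


# Configuration snapshot and result are stored side by side in one artifact.
config = {"target": 2.0, "tol": 1e-10, "max_iter": 50}
audit = {"config": config, "result": newton_sqrt(**config)}
audit_json = json.dumps(audit, indent=2, sort_keys=True)
print(audit_json)
```

A run that hits `max_iter` still returns an answer, but the artifact records `"converged": false`, so the number cannot quietly pass as a finished result.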

That distinction mattered. Trust rarely came from a polished final output. It came from being able to walk backward through the process that produced it.


Integration Tests Connected Plans To Reality

Unit tests answered one kind of question: does this component behave the way we expect on a focused input?

That was necessary but incomplete. Many agent-generated issues appeared only when complete workflows ran end to end. A parser behaved correctly, then the next stage interpreted a missing value differently. A synthetic dataset passed through the pipeline, while a real dataset exposed a naming convention the code had not handled. A model evaluation script produced metrics, though the report hid the exact filtering decisions that changed the denominator.

Integration tests became the place where plans met reality, especially after the protocol had made the intended behavior explicit.

I increasingly wanted a mix of synthetic and real-world data. Synthetic data helped pin down controlled edge cases. Real data exposed messier assumptions. Together they gave the agent and the reviewer a better target than either one alone. When a workflow test failed, the failure often revealed a misunderstanding that local code review alone would have missed.

This changed the kind of evidence I accepted from an agent. A passing unit test around a parser was useful, but I cared more when the complete run showed the same number of records entering validation, leaving filtering, entering estimation, and appearing in the report footer. That kind of end-to-end check connected the specification to the artifact a user would actually see.
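
The record-count check described above can be sketched as a small reconciliation function. The stage names, the dictionary layout, and the counts are illustrative assumptions, not output from a real run. The function only verifies that each stage accounts for its drops, that adjacent stages agree, and that the report footer matches the last stage.

```python
def reconcile(stages: list[dict], report_footer_count: int) -> list[str]:
    """Return a list of problems; an empty list means the counts line up."""
    problems = []
    for s in stages:
        # Every row entering a stage is either kept or dropped for a reason.
        accounted = s["rows_out"] + sum(s["dropped"].values())
        if accounted != s["rows_in"]:
            problems.append(
                f"{s['stage']}: {s['rows_in']} in, but {accounted} accounted for"
            )
    for prev, nxt in zip(stages, stages[1:]):
        # What leaves one stage must be what enters the next.
        if prev["rows_out"] != nxt["rows_in"]:
            problems.append(
                f"{prev['stage']} -> {nxt['stage']}: "
                f"{prev['rows_out']} out vs {nxt['rows_in']} in"
            )
    if stages and stages[-1]["rows_out"] != report_footer_count:
        problems.append(
            f"report footer says {report_footer_count}, "
            f"last stage produced {stages[-1]['rows_out']}"
        )
    return problems


# Made-up counts, for illustration only.
stages = [
    {"stage": "validation", "rows_in": 100, "rows_out": 96,
     "dropped": {"bad_schema": 4}},
    {"stage": "filtering", "rows_in": 96, "rows_out": 90,
     "dropped": {"missing_income": 5, "duplicate": 1}},
    {"stage": "estimation", "rows_in": 90, "rows_out": 90, "dropped": {}},
]

print(reconcile(stages, report_footer_count=90))
print(reconcile(stages, report_footer_count=96))
```

Used as an integration test, the assertion is simply that `reconcile(...)` returns an empty list after a complete pipeline run on both the synthetic and the real fixtures.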

This also changed how I wrote tasks for agents. A request that included only an implementation goal was weaker than a request that included a workflow-level acceptance check. “Add support for this input shape” is useful. “Add support for this input shape and prove it by running the complete pipeline on these examples, preserving the audit output” is much better.

The second version gives the agent a path to demonstrate behavior. It gives the reviewer evidence. It also makes it harder for a plausible local patch to hide a workflow-level break.


Process Reduced Variability

The main payoff was lower variability rather than perfection.

The pattern held across different project types and different agent setups. The details varied, while the direction stayed consistent. Requirements reduced guessing. Plans made the implementation path inspectable before code existed. Tests gave generated code a behavioral boundary. Observability made hidden process state reviewable. Review standards made it harder to accept a patch just because it looked coherent.

This is a familiar idea outside software. Restaurants, laboratories, manufacturing lines, and clinical operations achieve consistency through procedures, checks, standards, documentation, and review. Individual skill still matters. The point of the process is to make the outcome less dependent on individual variation.

Agentic development started to feel similar. The goal was to make the workflow less fragile in the face of model differences rather than pretending those differences would disappear.

This is why I became more comfortable assigning substantial portions of work to smaller or less frontier models when the task was well bounded. The surrounding process reduced the amount of judgment they had to invent. If the data model was clear, the tests were specific, the failure behavior was explicit, and the review surface was narrow, the task became more tractable.

Model capability still mattered for open-ended design, large-context synthesis, and ambiguous requirements. The change was more specific: better process moved more work out of the “invent the right thing” category and into the “satisfy this visible contract” category.


The Code Still Had To Earn Its Place

Looking back, the experiment began with curiosity about model capability and ended with a stronger appreciation for supervision.

The tools could generate code. That became the less interesting fact. The harder question was whether I could understand the code, verify the behavior, explain the assumptions, inspect the failure paths, and maintain the system after the original session ended.

This still leaves plenty of room for strong models and lightweight work. Process complements model capability, and many tasks are small enough that a short prompt, a focused diff, and a test run are sufficient. The boundary appears when agentic development moves from one-off generation toward tools people intend to keep.

Over several months, I found myself spending less effort asking which model was smartest and more effort asking whether the workflow made generated work understandable, inspectable, and predictable. Once predictability improved, a much broader range of models became useful.

The practical lesson I keep returning to is that software engineering is more than what happens after the model writes code. In agentic development, it is the interface that lets humans decide whether the code deserves to stay.