Why I Design CLI-First Software

A command model gives human operators, human builders, coding agents, and operating agents a shared language for work.

[Figure: GUI vs. CLI]

The First Command Exposes The Real Design Problem

Suppose we are building a package that validates datasets before they enter a modeling pipeline. The first version sounds small. It should read a file, apply a collection of rules, report any violations, and optionally write a machine-readable result for a later pipeline stage.

A graphical interface is easy to picture. There is a file picker, a rule selector, a Run button, and a report view. Modern frontend tooling can produce a credible version quickly. A coding agent can scaffold much of it from a short description.

Yet the screen leaves the important questions unanswered. What counts as a validation rule? Can several rules run together? Does one malformed rule stop the entire job? What information belongs in the result? Can another program distinguish invalid data from an unavailable input file? What happens when the user asks for a dry run? Which parts of the operation should remain stable when the presentation changes?

I usually reach for a command-line interface early because it forces those questions into the open. A command needs a name. Its arguments need meanings. Its output needs a destination and a format. Failure needs an observable form. Once the command can run in a script, vague interaction ideas become an executable contract.

For a long time I treated this as a personal preference shaped by Stata, shell tools, and package development. My recent work with Codex and Claude Code has given me a broader explanation. Software interfaces now serve more participants than the familiar pair of developer and end user. Coding agents modify systems through files, commands, logs, and tests. Other agents use software as tools while completing a task for someone. Their presence changes the interface optimization problem.

CLI-first is useful shorthand for my response, though the terminal itself is secondary. The deeper design choice is to define the command model first: a stable set of operations, inputs, results, and failure states that can support a CLI, GUI, API, or agent tool. The CLI is often the earliest and clearest executable version of that model.

GUI Solved The Dominant Interface Problem Of Its Era

Any argument for CLI-first design should begin by taking graphical interfaces seriously. GUI became dominant because it made computers usable for far more people and supported forms of work that command languages handled poorly.

Ben Shneiderman’s classic account of direct manipulation emphasized continuous representations of objects, physical actions in place of complex syntax, and rapid, reversible operations whose effects remain visible. A spreadsheet makes this advantage concrete. A person can select cells, adjust a formula, and watch the result change in its surrounding context. The display carries part of the user’s memory. Menus reveal available actions. Immediate feedback shortens the distance between an intention and its visible consequence.

These properties remain valuable. Image editing, CAD, data visualization, exploratory analysis, and layout work depend on spatial relationships that become awkward when reduced to command strings. A well-designed GUI can also guide an occasional user through a workflow without requiring prior knowledge of its vocabulary. For a product aimed at nontechnical operators, that reduction in recall and setup burden may decide whether the product gets used at all.

Direct manipulation can also help discover the domain model. Watch someone hesitate before dropping a file into a rule panel, reverse an action, or compare two reports side by side, and the useful operations may look different from the ones the builder initially named. A throwaway GUI prototype can keep those questions fluid while a public command surface would make them feel settled. In unfamiliar domains, observing interaction may be the fastest route to learning which commands should eventually exist.

The historical success of GUI therefore says something precise about interface design. When the main problem was helping a human operate software interactively, direct manipulation offered a strong center of gravity. Developers built the underlying system, while users encountered a visual representation of its capabilities.

That two-party description now covers only part of the system. Human operators still matter, and visual work remains visual. At the same time, software increasingly participates in pipelines, automation, and agentic workflows. The interface has become a coordination boundary among several kinds of actors, each carrying different strengths and constraints.

Four Participants Now Share The Interface

The labels “developer” and “user” become ambiguous once software can act. I find it clearer to separate four roles.

Human builders design, implement, debug, and maintain the software. They care about automation, reproducibility, composition, observability, and compatibility across versions.

Human operators use the software to accomplish a domain task. They often value discoverability, visible state, low setup cost, safe defaults, and feedback that arrives in the language of their work.

Coding agents change the software. They inspect repositories, edit files, run commands, read diagnostics, and use tests to judge whether a change satisfies its instructions. Their effective interface includes the repository and its development tools.

Operating agents use the software. A research agent might validate a dataset before fitting a model. A deployment agent might request a build, inspect the result, and decide whether to continue. These agents need explicit operations and results whose structure survives beyond a single visual session.

One person can occupy several roles, and the same agent runtime can act as either kind of agent. The distinction concerns the task. An agent that changes the validator is acting as a coding agent; the same agent invoking the validator during an analysis is acting as an operating agent.

[Figure: software architecture diagram with four modern participants]

The diagram makes one architectural claim. Each participant can receive an interface suited to its work while the meaning of the operation remains centralized. A GUI can preserve visual context. A CLI can support scripts and inspection. An API can avoid subprocess overhead. An agent tool can publish a schema. They should still agree about what validation means, which inputs are accepted, and which result states can occur.

This is where a qualitative ranking of GUI, API, and CLI becomes misleading. “Discoverability: high” or “AI usability: medium” compresses too many design choices into a score. A CLI with poor help and unstable output is difficult for everyone. A GUI with a complete action log may offer excellent reproducibility. The useful comparison concerns artifacts and contracts: what representation does an interaction produce, and can another participant inspect or replay it?

A Command Model Makes Operations Concrete

Consider the dataset validator again. Its domain operation could begin with a typed command and a small result algebra. The Python below is illustrative architecture rather than production-ready validation code.

from dataclasses import dataclass
from pathlib import Path
from typing import Literal


@dataclass(frozen=True)
class ValidateDataset:
  input_path: Path
  rules_path: Path
  output_format: Literal["text", "json"] = "text"
  output_path: Path | None = None
  dry_run: bool = False


@dataclass(frozen=True)
class ValidationPassed:
  rows_checked: int
  rules_checked: int


@dataclass(frozen=True)
class ValidationFailed:
  rows_checked: int
  violations: tuple["Violation", ...]


@dataclass(frozen=True)
class InvalidRequest:
  code: Literal["input_not_found", "rules_not_found", "invalid_rules"]
  message: str
  retry: Literal["after_input_change", "never"]


ValidationResult = ValidationPassed | ValidationFailed | InvalidRequest

This model separates three outcomes that a single boolean would collapse. ValidationPassed means the operation ran and the data satisfied the rules. ValidationFailed means the operation ran and found domain violations. InvalidRequest means the request prevented validation from beginning. A real system might add I/O failures, unsupported rule versions, cancellation, or partial results as the domain requires.

The command also records choices that are easy to leave implicit in a screen. Input and rule locations are values. Dry-run behavior is part of the request. Output format is explicit, though a stricter architecture might move rendering preferences out of the domain command and into the adapter. That placement is a design decision a team can now discuss in code review.

At this stage, the model should remain provisional. “First” describes an architectural dependency, not a publication order: the other interfaces depend on the command model, but the model is published only after a team has exercised an internal CLI alongside a GUI prototype, observed where users struggle, and revised both before scripts or external tools depend on them. Early command models are probes; stable command models are products.

Execution can remain independent of the presentation:

def execute_validate(
  command: ValidateDataset,
  files: "FileStore",
  validator: "Validator",
) -> ValidationResult:
  request = load_request(command, files)
  if isinstance(request, InvalidRequest):
    return request

  return validator.validate(request)

The function receives explicit dependencies and returns a declared result. A GUI callback, shell parser, HTTP handler, or agent tool can all construct ValidateDataset and call the same execution path. The domain core has no reason to know which adapter initiated the work.

This structure is useful for human review as well. A maintainer can test the operation directly without starting a browser or simulating clicks. The central tests describe domain behavior, while adapter tests focus on translation: arguments into commands, commands into calls, and results into the proper external representation.
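A domain test of this kind might look like the sketch below. It is deliberately simplified and self-contained: `FakeFileStore`, `FakeValidator`, and the `exists` method are hypothetical stand-ins, and the existence checks inside `execute_validate` substitute for whatever `load_request` would really do.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ValidateDataset:
    input_path: Path
    rules_path: Path


@dataclass(frozen=True)
class ValidationPassed:
    rows_checked: int
    rules_checked: int


@dataclass(frozen=True)
class InvalidRequest:
    code: str
    message: str


class FakeFileStore:
    """In-memory stand-in for a file store (hypothetical interface)."""

    def __init__(self, files: dict[Path, str]) -> None:
        self._files = files

    def exists(self, path: Path) -> bool:
        return path in self._files


class FakeValidator:
    """Always passes; enough to exercise the execution path."""

    def validate(self, command: ValidateDataset) -> ValidationPassed:
        return ValidationPassed(rows_checked=0, rules_checked=0)


def execute_validate(command, files, validator):
    # Simplified: these existence checks stand in for load_request.
    if not files.exists(command.input_path):
        return InvalidRequest("input_not_found", f"{command.input_path} not found")
    if not files.exists(command.rules_path):
        return InvalidRequest("rules_not_found", f"{command.rules_path} not found")
    return validator.validate(command)


def test_missing_rules_is_reported_before_validation() -> None:
    files = FakeFileStore({Path("patients.csv"): "age\n41\n"})
    command = ValidateDataset(Path("patients.csv"), Path("clinical-rules.yaml"))
    result = execute_validate(command, files, FakeValidator())
    assert isinstance(result, InvalidRequest)
    assert result.code == "rules_not_found"
```

No parser, browser, or subprocess is involved, so the test describes domain behavior only. Adapter tests can then stay small and check translation.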

The Contract Also Covers Effects And Time

A type signature captures only part of an operation. Real commands read files, write reports, contact services, acquire locks, and consume resources. Their contract needs to explain those effects well enough that a caller can decide when and how to run them.

The validator offers a manageable example. Reading a dataset is an observable effect because the file can disappear, change during execution, or exceed available memory. Writing validation.json introduces overwrite behavior and the possibility of a partial file after interruption. Loading rules from a remote registry would add credentials, network availability, and a version chosen at a particular moment. Each detail changes what a human or agent needs to know before invocation.

Dry-run support becomes useful when it has domain meaning. For this command, a dry run might resolve the files, parse the rules, check permissions, and estimate the work while skipping the full dataset scan. The result should say which checks actually occurred. A flag that merely prints “would validate” creates the appearance of safety without giving the caller useful evidence.
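As a rough sketch of a dry run that reports its own evidence, the function below records which checks ran and which were skipped. `DryRunReport`, its field names, and the check labels are assumptions for illustration. It covers only file existence and read permission, not rule parsing or work estimation.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DryRunReport:
    # Names of the checks that actually ran, in order.
    checks_performed: tuple[str, ...]
    # Checks deliberately skipped in dry-run mode.
    checks_skipped: tuple[str, ...]
    # Problem codes found by the checks that ran.
    problems: tuple[str, ...]


def dry_run(input_path: Path, rules_path: Path) -> DryRunReport:
    performed: list[str] = []
    problems: list[str] = []
    for label, path in (("input", input_path), ("rules", rules_path)):
        performed.append(f"{label}_exists")
        if not path.is_file():
            problems.append(f"{label}_not_found")
            continue  # A missing file cannot be checked for readability.
        performed.append(f"{label}_readable")
        if not os.access(path, os.R_OK):
            problems.append(f"{label}_not_readable")
    return DryRunReport(tuple(performed), ("full_dataset_scan",), tuple(problems))
```

Because the report lists `checks_performed` explicitly, a caller can tell the difference between "nothing was wrong" and "nothing was checked".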

Idempotency deserves similar precision. Repeating a read-only validation against immutable inputs should produce the same domain result. Repeating a command that writes an output file may still be safe if replacement is atomic and documented. Repeating a command that registers results in an external system could create duplicates unless the interface accepts an idempotency key or assigns a stable run identifier. The public contract has to settle these questions.
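One common way to make the file-writing case safe to repeat is to write to a temporary file and rename it into place. The sketch below uses only the standard library. The function name is hypothetical, and the atomicity claim assumes a POSIX system where the temporary file and the destination share a filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_result_atomically(result: dict, output_path: Path) -> None:
    """Write JSON to a temp file beside the target, then rename it into place."""
    # Creating the temp file in the target directory keeps both on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(result, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites any existing destination in a single rename.
        os.replace(tmp_name, output_path)
    except BaseException:
        # Remove the temp file so an interrupted run leaves no partial output.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this approach a reader of `validation.json` sees either the previous complete file or the new complete file, which is the property the documentation would need to state.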

Permissions are part of the model as well. An operating agent may have authority to inspect a dataset and validate its schema while lacking authority to publish a report or overwrite an existing artifact. Splitting validation from publication gives the system a narrower permission boundary. A broad process command that reads, mutates, publishes, and sends notifications makes least-privilege execution much harder.

Time also enters through compatibility. Imagine learning that rules can come from a registry or inline configuration as well as a local path. Replacing rules_path outright would break every adapter. The core could instead introduce a RuleSource with local, registry, and inline variants while the CLI continues mapping --rules to the local variant. New syntax can expose the other sources, followed by a documented deprecation period if the original form eventually becomes inadequate.

That sequence suggests a practical lifecycle: prototype and observe, stabilize the semantic operation, publish a versioned machine contract, then preserve old mappings until an announced breaking release. Additive result fields are usually easier to absorb than renamed status variants. Changed defaults deserve the same care as renamed fields because behavior can drift while the command still parses. Contract tests should cover every public adapter.

Repeated use reveals whether the boundaries match the work. If callers routinely invoke three commands together, the system may be missing a higher-level operation. If every adapter supplies the same awkward combination of flags, the command may be exposing an implementation detail. The CLI becomes a design probe: a cheap way to test vocabulary and semantics before more interfaces depend on them.

The CLI Turns That Model Into An Executable Contract

The shell adapter gives the command model a compact textual form:

dataset-check validate patients.csv \
  --rules clinical-rules.yaml \
  --format json \
  --output validation.json

Several details matter more than the punctuation. The verb validate names the operation. The input file is an operand. Options identify the rules and representation. The command can be copied into a review comment, a build script, or a methods appendix. Its behavior can be exercised without reconstructing a visual session.

Mature command conventions also carry useful semantics. POSIX utility conventions distinguish options, option arguments, operands, standard input, standard output, standard error, and exit status. The GNU coding standards add expectations such as consistent long option names and support for --help and --version. These conventions reduce invention for tool authors and memory burden for users.

For the validator, standard output should carry the requested result. Standard error should carry diagnostics intended for the operator. Exit status should distinguish success from states that require different pipeline decisions. The exact mapping belongs in the public contract. For example, status 0 could mean validation passed, 1 could mean domain violations were found, and 2 could mean the request or environment prevented validation. A team may choose different values, but scripts and agents need a documented choice.

Structured output needs equal care. A --format json flag should produce a stable object rather than a JSON-wrapped version of prose:

{
  "schema_version": 1,
  "status": "failed",
  "rows_checked": 18422,
  "violations": [
    {
      "rule": "age_range",
      "row": 731,
      "column": "age",
      "observed": 241
    }
  ]
}

The field names and status variants become compatibility commitments. Human-readable text can improve over time; machine-readable output needs versioning discipline. Progress bars, warnings, and friendly explanations belong on standard error when standard output is reserved for data. This separation allows a person to watch progress while a pipeline receives clean JSON.

Exit status alone remains too coarse for an operating agent. Suppose clinical-rules.yaml has moved. The command exits with status 2, and standard error says rules file not found. An agent that interprets every nonzero status as a transient execution failure may repeat the same command indefinitely. Structured output can identify the required response:

{
  "schema_version": 1,
  "status": "invalid_request",
  "error": {
    "code": "rules_not_found",
    "retry": "after_input_change",
    "message": "Rules file clinical-rules.yaml was not found"
  }
}

The message remains useful to a person. The stable code lets software branch without parsing prose, and retry tells the agent that waiting and rerunning unchanged will accomplish nothing. A network timeout would belong to a different result category with different retry guidance. Semantic errors turn recovery into part of the interface contract.
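On the consuming side, an agent or script can branch on those fields instead of on prose. The sketch below is illustrative only. The decision labels are invented, and the `"passed"` status value is an assumption, since the article shows only the `"failed"` and `"invalid_request"` payloads.

```python
import json
from typing import Literal

Decision = Literal["continue", "report_violations", "fix_inputs_first", "give_up"]


def decide(stdout_text: str) -> Decision:
    """Choose the next step from the structured result, never from the message."""
    payload = json.loads(stdout_text)
    status = payload["status"]
    if status == "passed":  # assumed status value for a successful run
        return "continue"
    if status == "failed":
        return "report_violations"
    if status == "invalid_request":
        retry = payload["error"]["retry"]
        # Rerunning unchanged cannot help; inputs must change first, or never.
        return "fix_inputs_first" if retry == "after_input_change" else "give_up"
    # Unknown status: stop rather than loop on something not understood.
    return "give_up"
```

Given the `rules_not_found` payload above, this returns `"fix_inputs_first"`, so the caller does not fall into a retry loop.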

Discoverability can also be designed. Useful help text, shell completion, examples, suggestions for near-miss misspellings, and predictable subcommands make a CLI easier to learn. GUI retains an advantage for browsing an unfamiliar visual domain, while a careful CLI can reveal its operational vocabulary through --help and contextual errors.

Automation also needs a noninteractive path. Missing required input should produce a structured failure instead of opening a prompt that blocks indefinitely. Human-oriented prompts can live behind an explicit interactive mode, while flags and configuration carry ordinary automation inputs and environment variables carry secrets. This keeps the same command usable in a terminal, CI job, or agent loop without requiring each caller to reconstruct hidden session state.

By this point, the CLI has done more than provide terminal access. It has forced the team to decide how the operation is named, called, observed, automated, and evolved.

The GUI And Agent Tool Become Adapters Instead Of Rival Designs

Once the command model exists, a GUI can focus on its own strengths. The dataset validator can offer a file picker, rule presets, progress feedback, a filterable violations table, and a link from each violation to the relevant row. Pressing Run constructs ValidateDataset. The result view renders one of the same declared outcomes used by the CLI.

The shared model prevents a common architectural drift in which the GUI develops a private version of business logic. If the interface filters invalid rules, supplies defaults, or changes error meanings independently, behavior begins to depend on the entry point. Centralizing the operation makes those differences visible. Adapter-specific behavior still has a place, especially for interaction state and presentation, while domain semantics remain shared.

An operating agent needs another projection. Its tool definition might expose input_path, rules_path, and dry_run through a JSON Schema. The tool handler validates the request, constructs the command, and serializes ValidationResult. This is conceptually close to the CLI adapter, though a direct library or service call may be operationally preferable.
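A tool projection of this kind might be sketched as below. The envelope (`name`, `description`, `input_schema`) varies between agent platforms, so treat the key names as assumptions. The hand-written checks in `parse_tool_arguments` are a stand-in for a real JSON Schema validator.

```python
from pathlib import Path

# Hypothetical tool definition; the schema body is ordinary JSON Schema.
VALIDATE_TOOL = {
    "name": "validate_dataset",
    "description": "Validate a dataset file against a rules file.",
    "input_schema": {
        "type": "object",
        "properties": {
            "input_path": {"type": "string"},
            "rules_path": {"type": "string"},
            "dry_run": {"type": "boolean", "default": False},
        },
        "required": ["input_path", "rules_path"],
        "additionalProperties": False,
    },
}


def parse_tool_arguments(arguments: dict) -> dict:
    """Check arguments against the schema, then convert them to domain values."""
    schema = VALIDATE_TOOL["input_schema"]
    unknown = set(arguments) - set(schema["properties"])
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    missing = [name for name in schema["required"] if name not in arguments]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    for name in ("input_path", "rules_path"):
        if not isinstance(arguments[name], str):
            raise ValueError(f"{name} must be a string")
    dry_run = arguments.get("dry_run", False)
    if not isinstance(dry_run, bool):
        raise ValueError("dry_run must be a boolean")
    # These values are what a handler would pass on to ValidateDataset.
    return {
        "input_path": Path(arguments["input_path"]),
        "rules_path": Path(arguments["rules_path"]),
        "dry_run": dry_run,
    }
```

The handler does the same job as the shell parser: it translates an external representation into the command's fields and leaves validation semantics to the shared core.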

Current coding-agent interfaces show the value of these contracts. Claude Code supports noninteractive invocation, piped input, and JSON output in its official CLI reference. OpenAI’s Codex non-interactive mode separates progress on standard error from final output on standard output, supports JSONL event streams, and can constrain final output with a JSON Schema. Codex also supports tools through the Model Context Protocol, which gives models named operations and structured arguments.

These products include rich terminal interfaces, yet their automation surfaces reveal the more durable idea. An agent benefits when an operation has a stable name, explicit parameters, bounded permissions, structured results, and observable failure. Whether the transport is a subprocess, MCP call, HTTP request, or in-process function is an engineering choice around that contract.

Human builders gain from the same arrangement. A command can reproduce an agent failure locally. A captured JSON result can become a fixture. A dry-run mode can expose intended work before mutation. Logs can record the operation and its outcome without recording a sequence of cursor movements. The shared command model gives humans and agents a common object to discuss during review.

Stata Taught Me That Commands Can Be Work Products

My preference for this style predates coding agents. Stata made the idea concrete because its commands are both interactive actions and material for a durable analysis.

Disclaimer: While I no longer use Stata actively, it was my go-to analytical tool during my doctoral studies. Today, my data engineering, analytics, and statistical workflows are built predominantly on the Python ecosystem. Nonetheless, I still deeply appreciate Stata’s ergonomics and design choices.

A short session might include:

regress y x1 x2
predict yhat
summarize yhat

Each line performs an operation, but the sequence also explains the analytical path. It can be placed in a do-file, reviewed by a collaborator, rerun on updated data, and incorporated into a larger report. Stata’s documentation emphasizes that it can run interactively, from command scripts, or in batch mode, while logs can preserve commands and results. Its reproducible reporting tools extend the same command-driven workflow into Word, PDF, HTML, and Excel output.

Full reproducibility still depends on declared files, controlled environments, random seeds, package versions, and the removal of hidden manual edits. The command representation creates a place for that discipline to live. A reviewer can see the operation, inspect its order, and identify hidden dependencies.

The same property appears in shell pipelines. Dennis Ritchie’s account of the evolution of Unix describes the development of pipelined commands, where one program’s output becomes another program’s input. Composition works because programs accept and produce representations that other programs can use. The shell command is simultaneously an instruction and a record of how data moved.

GUI applications can preserve histories too. Photoshop records actions, notebooks retain cells, and analytical platforms can export workflows. Their quality varies by product. The design lesson is broader than interface category: operational history is valuable enough to become a product artifact. Command-oriented systems tend to encounter this requirement early because their interactions already have a textual representation.

That representation also compresses communication. git rebase -i HEAD~5 identifies an operation and its scope more precisely than a narration of menus and clicks. A validation command in a bug report gives the recipient something executable. The advantage comes from shared vocabulary and explicit parameters rather than brevity alone.

AI Changes The Economics Around The Interface

GUI implementation once imposed a large upfront cost on many small tools. A team had to build layouts, forms, state management, validation feedback, packaging, and platform-specific behavior before users could experience a coherent workflow. That cost encouraged teams either to commit to the GUI early or to leave the tool accessible only through code.

Coding agents and modern UI frameworks have reduced part of that implementation burden. They can generate forms, tables, settings screens, and ordinary state wiring quickly. I have seen this in my own package work: once the operations and data structures are clear, producing a plausible presentation layer takes less effort than it used to.

The harder design work often sits beneath the generated interface. Someone still has to decide whether validation is one operation or several, whether partial success is meaningful, how cancellation behaves, which defaults are safe, and what downstream consumers may rely on. Generated screens can conceal unresolved answers behind controls that look finished.

This changes where I want to spend scarce design attention. I would rather establish the operational model early, exercise it through a CLI, and add presentation layers once the semantics have survived real use. A GUI built over a coherent command model has less domain ambiguity to absorb. It can concentrate on interaction quality, accessibility, visual feedback, and the needs of the people using it.

The same sequence improves delegation to a coding agent. “Build a validation screen” leaves the agent to infer both domain behavior and presentation. “Build a screen that constructs this command and renders these result variants” supplies a bounded interface. The generated code still needs review, but the agent has fewer semantic decisions to invent.

CLI-First Has A Boundary

The command model fits operation-oriented software especially well: developer tools, data pipelines, build systems, deployment utilities, batch analysis, administrative workflows, and packages whose main behavior can be expressed as named transformations.

Visual domains place the boundary elsewhere. Cropping an image, laying out a circuit, rotating a three-dimensional model, and brushing points in a scatterplot depend on continuous spatial feedback. Commands may support automation around those tasks, while direct manipulation remains the primary working interface.

An API may also be the better first adapter. A high-throughput service should avoid process startup and text serialization on every request. Remote access needs authentication, authorization, rate limits, and transport-level behavior beyond the scope of a local CLI. An embedded library needs types and memory representations native to its host language. In these cases, the command model can still organize operations even when the shell interface arrives later or remains a diagnostic tool.

Accessibility requires similar care across both visual and textual interfaces. Keyboard navigation, screen-reader semantics, color choices, focus management, output structure, and cognitive load all depend on implementation and audience. The command model helps keep domain behavior consistent, while each adapter carries its own accessibility work.

Long-running interactive systems can also strain a command abstraction if the model treats every action as an isolated request. Editors, games, collaborative canvases, and monitoring consoles maintain rich state over time. They may need events, sessions, subscriptions, undo models, or synchronized documents. Forcing those interactions into a collection of stateless verbs can produce an awkward design.

These boundaries refine the heuristic. CLI-first works when a product has meaningful operations that benefit from explicit invocation, replay, and composition. The command model remains useful across a wider range of systems, but it should follow the domain instead of imposing shell-shaped behavior on every interaction.

The Preference Is Really About Shared Semantics

Return to the dataset package before its first interface. The team still needs to decide what validation means, which requests are valid, what results can occur, and how callers should respond. A file picker and a Run button will eventually help many people use the package. An API may support a service. An agent tool may let an automated workflow invoke it safely.

Designing the command model first gives those interfaces a shared semantic center. The CLI makes that center executable early, with few presentation concerns and strong support for inspection, scripts, and tests. It also leaves behind a representation of work that humans and agents can communicate, rerun, and review.

GUI became the dominant interface because direct manipulation solved the central human-computer interaction problem of its era. Human operators remain central today. They are now joined by human builders, coding agents, and operating agents, all meeting at the software boundary with different needs.

That expanded audience explains why CLI-first design keeps reappearing in my work. I am choosing explicit operations before presentation, stable results before polished rendering, and a shared contract that several interfaces can carry. The terminal is simply where that contract often becomes real first.

References

B. Shneiderman, Direct Manipulation: A Step Beyond Programming Languages (1983), IEEE Computer.
D. M. Ritchie, The Evolution of the Unix Time-Sharing System (1984), AT&T Bell Laboratories Technical Journal.
The Open Group, Utility Conventions (2018), POSIX.1-2017.
GNU Project, Standards for Command Line Interfaces, GNU Coding Standards.
StataCorp, Truly Reproducible Reporting.
Anthropic, Claude Code CLI Reference.
OpenAI, Codex Non-Interactive Mode.
OpenAI, Model Context Protocol.
