A TF-IDF vignette for designing around data flow, types, and contracts before code.
In my November 9, 2025 post, Functional Programming in Python, I argued that FP is less about adopting a pure functional language and more about adopting a different engineering discipline. The core habits were simple: think in expressions, prefer immutability, isolate side effects, and model your domain explicitly.
That line of argument tends to invite a practical question. If a team gives you a vaguely defined task on Monday morning, where does the work begin, and what does “FP-style” look like before the code exists?
This essay answers it with a concrete workflow. The example is a small TF-IDF vectorizer in Python. I chose it because it is simple enough to hold in your head while still exposing the engineering choices that matter: data flow, type boundaries, interface design, edge cases, and tests.
The code below is illustrative rather than production-ready. I am using the vectorizer as a compact engineering vignette: a modest but real task where the design can stay readable, testable, and hard to tangle.
The Real Work Starts Before The First Class Definition
Suppose I am an engineer on a small ML platform team who is asked to build a TF-IDF vectorizer for an internal document pipeline.
The request is not glamorous. A few teammates want an inspectable baseline before they reach for embeddings. They need something that can fit on one corpus, transform another, and plug into downstream experiments without a lot of hidden behavior. It should support fit, transform, and fit_transform. It should behave sensibly on empty documents and unseen tokens. It should be reusable in experiments and easy to test. No one has written a full design. No one has listed every edge case. The task is just specific enough that you can start coding immediately and just vague enough that you can create a mess quickly.
The tempting move is familiar: open a file, write a TfidfVectorizer class, add a few attributes such as vocabulary_ and idf_, then fill in methods one by one until the tests stop failing. Many Python codebases grow this way.
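A sketch of that tempting shape, with the vocabulary_ and idf_ attributes mentioned above; everything else here (tokenization by lowercase split, one common IDF convention) is an illustrative assumption, not a recommendation:

```python
import math
from collections import Counter


class TfidfVectorizer:
    """The tempting early shape: state first, behavior discovered later."""

    def __init__(self):
        self.vocabulary_ = {}  # token -> column index, filled in by fit
        self.idf_ = {}         # token -> IDF weight, filled in by fit

    def fit(self, documents):
        # Tokenization policy, vocabulary policy, and numerical conventions
        # all start life tangled inside this one method.
        for doc in documents:
            for token in doc.lower().split():
                self.vocabulary_.setdefault(token, len(self.vocabulary_))
        n_docs = len(documents)
        df = Counter(t for doc in documents for t in set(doc.lower().split()))
        # One common smoothing convention; chosen arbitrarily here.
        self.idf_ = {t: math.log(n_docs / (1 + df[t])) + 1 for t in self.vocabulary_}
        return self

    def transform(self, documents):
        vectors = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            vec = [0.0] * len(self.vocabulary_)
            for token, count in counts.items():
                if token in self.vocabulary_:
                    vec[self.vocabulary_[token]] = count * self.idf_[token]
            vectors.append(vec)
        return vectors
```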
There is nothing inherently wrong with a class. The problem is timing. If I start with this shape too early, the design pressure shifts toward “what methods does the class need?” instead of “what data is moving through the system?” That encourages state to accumulate before the domain is clear. Tokenization logic, vocabulary policy, numerical conventions, and edge-case behavior start leaking into instance state and half-defined helper methods. The code may still work, but it becomes harder to reason about because the structure was discovered accidentally.
FP-style work starts one step earlier. I want to know what the system is supposed to do, what data artifacts it produces, and which transformations connect them. Once those are explicit, the implementation language and outer API become much easier to choose.
A Small Requirement List Is More Valuable Than Premature Code
Before I define functions, I want a short list of needs and a few representative use cases.
For this vignette, the requirements are compact:
- fit a vocabulary and IDF statistics from a training corpus
- transform new documents using the fitted vocabulary
- keep transform-time behavior deterministic for unseen tokens
- handle empty documents without crashing
- expose a surface that is easy to test in pieces
Then I translate those needs into small concrete scenarios:
- fit on corpus A, then transform corpus B with the same vocabulary
- ignore tokens at transform time that were never seen during fit
- preserve repeated terms within a document so TF can change
- allow an empty string as a document without producing malformed state
- guarantee that identical input documents map to identical vectors under the same fitted model
The step sounds trivial, but it changes the next hour of work. The use cases constrain the design before the code starts improvising. For example, if unseen tokens must be ignored at transform time, then transform cannot quietly mutate the vocabulary. If empty documents are valid inputs, then the normalization logic must define what “empty” means instead of leaving that decision to whatever helper function gets written first.
That is one of the habits I associate with FP-style engineering in Python. You do not begin with mutable machinery and hope the right behavior emerges. You force the behavior to become legible first.
The order is useful, but it is not a waterfall. Once I sketch types, signatures, or tests, I often loop back and revise the requirements or redraw the pipeline.
The Problem Should Become A Pipeline Before It Becomes An API
Once the basic use cases are visible, I stop thinking about classes and start thinking about data flow.
A TF-IDF vectorizer is not mysterious. It is a sequence of transformations over text:
raw text → tokens → per-document term counts → vocabulary + document frequencies → IDF weights → TF-IDF vectors
The little diagram does more work than a half-written object model. It tells me what intermediate artifacts exist. It suggests where responsibilities should split. It also reveals which steps depend on corpus-level context and which are per-document transformations.
From there, the module shape becomes easier to sketch:
- tokenization
- term counting
- vocabulary construction
- document-frequency / IDF computation
- vector transformation
FP-style thinking helps most here. The primary abstraction is the movement of values through transformations, not the accumulation of methods around a mutable object. The system becomes easier to test because each stage can be inspected. It becomes easier to change because the interfaces between stages are narrower. If I later swap tokenization rules or introduce sparse storage, I know roughly where the change belongs.
It also reduces a common source of confusion in Python projects: mixing “fitted model state” with “operations that compute new values from that state.” Those are related, but they are not the same thing. A fitted vectorizer is one artifact in the pipeline. It is not the pipeline itself.
Types Turn Half-Decided Ideas Into Concrete Objects
After the data flow is clear, I start naming the domain objects. This is usually where the design becomes much more stable.
For a small vectorizer, I might start with something like this:
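A hedged sketch of those names as frozen dataclasses. Only the type names, min_df, and idf_weights come from the surrounding discussion; the exact field lists are my assumptions:

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VectorizerConfig:
    lowercase: bool = True  # assumed knob; normalization policy lives here
    min_df: int = 1         # minimum document frequency to keep a token


@dataclass(frozen=True)
class Vocabulary:
    # The token-index mapping as a named domain object, not a loose dict.
    token_to_index: Mapping[str, int]

    def __len__(self) -> int:
        return len(self.token_to_index)


@dataclass(frozen=True)
class DocumentFrequency:
    # Corpus-level statistics, distinct from per-document term counts.
    counts: Mapping[str, int]
    n_documents: int


@dataclass(frozen=True)
class FittedVectorizer:
    # The fitted state as a value produced by fit, not mutable attributes.
    config: VectorizerConfig
    vocabulary: Vocabulary
    idf_weights: Mapping[str, float]  # kept aligned with the vocabulary
```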
I do not need perfect types at this stage. I need useful pressure. Once these names exist, vague ideas become harder to smuggle through the design.
VectorizerConfig tells me configuration should be explicit, not hidden in scattered conditionals. Vocabulary tells me the token-index mapping is its own domain object, not a random dictionary passed around with no semantic label. DocumentFrequency distinguishes corpus statistics from per-document term counts. FittedVectorizer makes the fitted state concrete: it is a value produced by fit, not a cloud of attributes living wherever the class happens to mutate them.
Maybe Vocabulary needs to preserve ordering guarantees. Maybe idf_weights should not be a naked dictionary if the vocabulary and weights must stay aligned. Maybe min_df changes the meaning of fit enough that I want a separate normalization step. The point is that these design problems surface now, before I have buried them inside implementation details.
In Python, type hints and frozen dataclasses are not a theorem prover. They are still worth using because they force the domain to appear on the page. They make contracts visible to both the reader and the tooling.
Signatures Reveal Whether The Design Actually Composes
The next step is one of my strongest habits: define function signatures before writing bodies.
If the types are the nouns of the system, the function signatures are the verbs and prepositions. They show how the pieces connect.
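A sketch of the signatures I would write first, with bodies deliberately left as stubs. The function names count_terms and compute_document_frequency appear in the discussion below; tokenize, the parameter lists, and the simplified types are my assumptions:

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple


@dataclass(frozen=True)
class VectorizerConfig:
    lowercase: bool = True
    min_df: int = 1


class FittedVectorizer(NamedTuple):
    config: VectorizerConfig
    vocabulary: Dict[str, int]
    idf_weights: Dict[str, float]


def tokenize(document: str, config: VectorizerConfig) -> List[str]:
    ...  # splitting policy deferred: regexes or a custom splitter, decided later


def count_terms(tokens: List[str]) -> Dict[str, int]:
    ...  # per-document term frequencies


def compute_document_frequency(tokenized_corpus: List[List[str]]) -> Dict[str, int]:
    ...  # corpus-level statistics, one count per document containing the token


def fit(documents: List[str], config: VectorizerConfig) -> FittedVectorizer:
    ...  # produces a fitted artifact as a value


def transform(documents: List[str], fitted: FittedVectorizer) -> List[Dict[str, float]]:
    ...  # consumes the fitted artifact; never revises it
```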
Notice what I am delaying: implementation cleverness. I do not yet care whether tokenization uses regexes or a custom splitter. I do not care whether the output vectors are dense lists, sparse dictionaries, or NumPy arrays. I care whether the composition makes sense.
Several useful questions become obvious at the signature level.
Should fit accept raw strings or pre-tokenized documents? Should transform return a domain type such as SparseVector instead of a raw dictionary? Does compute_document_frequency operate on all tokens or only tokens retained in the fitted vocabulary? Should count_terms know anything about configuration, or should normalization have happened earlier?
These questions are good news. They are precisely the questions that become painful when discovered late.
I also like this stage because it keeps the implementation honest. A bad signature forces downstream awkwardness. A clean signature often means the code that follows can remain small. When the interfaces are explicit, composition becomes legible and testable instead of remaining a slogan.
Tests Should Clarify Behavior Before They Protect Code
Once I have types and signatures, I usually write a few unit tests or test-like examples before implementing much.
My view of TDD here is selective rather than devotional. What it gets right is forcing behavior and contract decisions early, while the design is still cheap to change. What it gets wrong, when used dogmatically, is nudging teams toward testing unstable internals or treating “write the test first” as a substitute for deciding the shape of the system.
For this vectorizer, the first scenarios I care about are:
- two identical documents should yield identical vectors under the same fitted model
- an unseen token at transform time should not mutate the fitted vocabulary
- repeated terms within a document should increase that term’s TF contribution
- rarer terms across the corpus should receive higher IDF weights than very common terms
- an empty document should produce a valid, mostly zero vector rather than an exception
- fitting on the same corpus twice should produce the same vocabulary ordering if determinism matters
Even a small sketch is useful:
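To keep the sketch runnable on its own, the example below pairs two of those tests with a minimal toy fit/transform. This is not the real implementation: the tokenization is a lowercase split and the IDF formula is just one common convention.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Mapping


@dataclass(frozen=True)
class FittedVectorizer:
    vocabulary: Mapping[str, int]
    idf_weights: Mapping[str, float]


def fit(documents: List[str]) -> FittedVectorizer:
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary: Dict[str, int] = {}
    for tokens in tokenized:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    df = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(documents)
    idf = {t: math.log(n / (1 + df[t])) + 1 for t in vocabulary}
    return FittedVectorizer(vocabulary=vocabulary, idf_weights=idf)


def transform(documents: List[str], fitted: FittedVectorizer) -> List[Dict[str, float]]:
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Unseen tokens are silently dropped: transform never grows the vocabulary.
        vectors.append({t: c * fitted.idf_weights[t]
                        for t, c in counts.items() if t in fitted.vocabulary})
    return vectors


# Identical inputs map to identical vectors under the same fitted model.
fitted = fit(["cats chase mice", "dogs chase cats"])
a, b = transform(["cats chase", "cats chase"], fitted)
assert a == b

# An unseen token must not mutate the fitted vocabulary.
before = dict(fitted.vocabulary)
transform(["unseen zebra"], fitted)
assert dict(fitted.vocabulary) == before
```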
Writing that test tells me something important: transform should be observational only with respect to fitted state. It consumes a FittedVectorizer; it does not revise one. That may sound obvious, but many hurried implementations violate that kind of boundary by accident.
I also want one test that makes the convenience surface prove its honesty:
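A self-contained sketch of that honesty test. The implementation here is a deliberately tiny stand-in (term counts only, no IDF) so the fit_transform contract itself stays in focus:

```python
from collections import Counter
from typing import Dict, List, Tuple

Fitted = Dict[str, int]  # token -> index; a stand-in for FittedVectorizer


def fit(documents: List[str]) -> Fitted:
    vocab: Fitted = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def transform(documents: List[str], fitted: Fitted) -> List[Dict[str, int]]:
    return [{t: c for t, c in Counter(d.lower().split()).items() if t in fitted}
            for d in documents]


def fit_transform(documents: List[str]) -> Tuple[Fitted, List[Dict[str, int]]]:
    # The honest version: no private policy, just fit once and transform once.
    fitted = fit(documents)
    return fitted, transform(documents, fitted)


docs = ["a b a", "b c"]
fitted, combined = fit_transform(docs)
# fit_transform must agree exactly with fit followed by transform.
assert combined == transform(docs, fit(docs))
```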
That kind of test does two useful things. It locks in the contract for fit_transform, and it catches a common design failure early: convenience helpers that slowly accumulate their own policy until they stop being thin wrappers and start becoming shadow implementations.
At this stage, tests behave like executable specifications. They pin down what the contracts mean in situations that matter. That is valuable even if the implementation changes shape several times.
By The Time Coding Starts, Most Of The Design Work Is Already Done
Now I implement.
The key difference is that implementation no longer carries the burden of discovering the system. It is assembling pieces whose roles are already visible.
I would usually build in this order:
- token normalization and tokenization
- per-document term counting
- corpus-level vocabulary and document-frequency computation
- IDF calculation
- document transformation
- outer API helpers such as fit_transform
At each step I still prefer small, mostly pure functions. If side effects are needed, I keep them at the edge. If caching or serialization becomes useful later, I want that choice to be explicit. If numerical conventions change, I want the changes localized to a few transformations rather than smeared across instance methods that all quietly edit shared state.
Composition earns its keep here. fit can be a thin orchestration function that tokenizes the corpus, builds a vocabulary, computes document frequency, then packages the result into FittedVectorizer. transform can reuse the same tokenization path and only depend on the fitted artifact it receives. One honest version of fit_transform is almost boring: compute fitted = fit(documents, config) once, then return fitted, transform(documents, fitted).
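Put together under the same toy assumptions as before (lowercase-split tokenization, one common IDF convention, dict-based vectors), the orchestration shape might look like this sketch:

```python
import math
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class FittedVectorizer(NamedTuple):
    vocabulary: Dict[str, int]
    idf_weights: Dict[str, float]


def tokenize(document: str) -> List[str]:
    return document.lower().split()  # placeholder policy; regexes could go here


def build_vocabulary(tokenized: List[List[str]]) -> Dict[str, int]:
    vocab: Dict[str, int] = {}
    for tokens in tokenized:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def compute_idf(tokenized: List[List[str]], vocab: Dict[str, int]) -> Dict[str, float]:
    df = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(tokenized)
    return {t: math.log(n / (1 + df[t])) + 1 for t in vocab}  # one common convention


def fit(documents: List[str]) -> FittedVectorizer:
    # Thin orchestration: tokenize, build vocabulary, compute IDF, package.
    tokenized = [tokenize(d) for d in documents]
    vocab = build_vocabulary(tokenized)
    return FittedVectorizer(vocabulary=vocab, idf_weights=compute_idf(tokenized, vocab))


def transform(documents: List[str], fitted: FittedVectorizer) -> List[Dict[str, float]]:
    # Reuses the same tokenization path; depends only on the fitted artifact.
    return [{t: c * fitted.idf_weights[t]
             for t, c in Counter(tokenize(d)).items() if t in fitted.vocabulary}
            for d in documents]


def fit_transform(documents: List[str]) -> Tuple[FittedVectorizer, List[Dict[str, float]]]:
    fitted = fit(documents)  # the almost boring honest version
    return fitted, transform(documents, fitted)
```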
That style is less glamorous than clever abstraction. It is also easier to maintain.
One practical note for Python teams: tooling helps reinforce the discipline. mypy catches mismatches between the interfaces you thought you designed and the ones your code is actually using. Pydantic can help validate boundary objects such as configuration or external inputs when the data enters the system. I would not build the whole vectorizer around Pydantic models, but I would gladly use it where runtime validation matters. Static checks and runtime validation solve different problems, and together they make this style much more comfortable in Python.
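Pydantic is the richer tool for boundary validation, but even the stdlib can enforce a contract at the edge. A minimal sketch of runtime validation on a config object, with field names assumed for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorizerConfig:
    lowercase: bool = True
    min_df: int = 1

    def __post_init__(self) -> None:
        # Runtime validation at the boundary: catches values that are the
        # right type but out of range, which static checks alone cannot see.
        if self.min_df < 1:
            raise ValueError(f"min_df must be >= 1, got {self.min_df}")
```

Construction either succeeds with a validated value or fails loudly at the boundary, so the rest of the pipeline never has to re-check the invariant.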
Legibility Pays Off When The System Grows
The next step after a vignette like this is usually less philosophical and more operational. You start asking whether the vectors should stay sparse, whether the fitted artifact should serialize cleanly between training and inference jobs, whether tokenization belongs in a shared boundary layer, and whether batching or caching changes the surface area of the module.
Those pressures are where an early design either pays off or starts to fray. If the boundaries are vague, fit-time and transform-time logic drift apart, convenience helpers quietly duplicate policy, and integration code starts mutating state because no one established a cleaner place to put the change. The system still works, but it becomes much harder to see what is stable and what is accidental.
That is why I like FP-style discipline in Python. It does not eliminate compromise. It makes later compromises easier to make intentionally. When the design is legible early, you can add sparse storage, serialization, caching, batching, or a class-based facade without losing track of what the system is actually doing.