Good NLP starts with reading the text: define the task, keep meaning-bearing words, then build simple, inspectable baselines.

Clinical notes can be tricky

Some Text Problems Punish Shallow Reading

Suppose you need to flag notes that describe active chest pain right now.

The task sounds easy until the notes arrive.

  • Reports chest pain x2d, worse on exertion.
  • Denies chest pain; shortness of breath improved.
  • Family history of CAD. No active chest pain today.
  • Chest pain improved after nitroglycerin.
  • Cannot rule out infection.
  • No known drug allergies.
  • SOB with exertion.
  • Family hx diabetes.

These fragments share vocabulary, but not meaning. Some describe active symptoms. Some are negated, historical, or about family history. Some express uncertainty. Some hide ordinary clinical meaning inside abbreviations. A model that treats them as ordinary word overlap will make confident mistakes fast.

That is the point: when small cues flip the label, the baseline starts before feature engineering. It starts with the target. What counts as positive? What counts as negative? Which mistakes matter operationally? Which phrases actually carry the distinction?

Clinical text makes a familiar set of NLP concerns unusually easy to see:

  • negation: denies chest pain, no active chest pain today
  • temporality: chest pain improved, chest pain yesterday
  • attribution: family history of diabetes
  • uncertainty: cannot rule out infection
  • abbreviations: SOB, hx, CP
  • templated language: repeated headers, signature lines, copied forward sections
  • de-identification artifacts: [**Name**], [**Hospital**], redacted dates
  • workflow cost: missing a true case and over-flagging a benign case do not cost the same thing

That is what makes clinical NLP such a useful training ground. The cues are local, the stakes are visible, and a sloppy baseline teaches the wrong lesson.

The same habits transfer well beyond medicine. Legal text, policy memos, scientific abstracts, adverse event reports, support tickets, and incident reviews often have the same structure: a few small phrases carry negation, scope, temporality, attribution, or uncertainty, while abbreviations and boilerplate try to drown them out. Clinical notes are the example I know best, not the only place this workflow matters.

I recently used this progression in an in-house workshop series, and the public materials are available in penn-nlp-workshop-public. This post is the deeper written version of the baseline part. It is meant for anyone who wants a better first pass on clinical text, whether or not they saw the workshop.

The path through the post is simple: observation first, cleaning second, sparse baselines third.

Observation Is The Text Version Of Exploratory Analysis

In many quantitative workflows, we do not begin by fitting the fanciest model we can afford. We start by looking at the data. We inspect distributions, missingness, label balance, leakage, and obvious artifacts before we commit to a modeling story.

Clinical NLP needs the same discipline, and so do many other text problems with dense local cues. Observation is the text version of exploratory analysis.

For the chest-pain example, observation means reading raw note fragments and deciding what the task actually is. If the target is “active chest pain now,” then these should not be treated alike:

  • Reports chest pain x2d, worse on exertion. -> positive
  • Denies chest pain; shortness of breath improved. -> negative
  • Chest pain improved after nitroglycerin. -> usually negative for “active now”
  • Family history of CAD. No active chest pain today. -> negative
  • Possible chest pain overnight, now resolved. -> likely negative, but worth discussing
  • Cannot rule out ACS; chest discomfort persists. -> ambiguous unless the label policy is explicit

This is where the baseline gets its first real advantage. Reading raw text before normalization tells you which language cues are carrying the target:

  • words that negate: no, not, without, denies
  • phrases that shift attribution: family history, mother had, past medical history
  • phrases that shift time: today, yesterday, resolved, prior
  • abbreviations that must be translated before counting: CP, SOB, hx
  • repeated note fragments that may overwhelm the vocabulary without helping the task

Observation is also where workflow enters the problem. In some screening settings, missing a real case of active chest pain is costlier than sending an extra note for review. In other domains, the tradeoff may be between escalating too many support tickets, over-flagging adverse events, or misrouting scientific records. A baseline is “good” only relative to the actual decision process it is meant to support.

Less glamorous than modeling, yes, but this is the work that makes later modeling interpretable.

A Good Baseline Starts Before The First Fit

Before a vectorizer is fitted or a model is compared, you can already tell whether a baseline is taking shape in a useful way.

At this stage, “good” does not mean high accuracy. It means the task has become concrete enough that you can explain it, audit it, and predict where it will fail.

In plain English, a good pre-model baseline usually has five properties:

  • the label policy is explicit enough that two careful readers would classify the same example similarly
  • the ambiguous cases are known rather than discovered accidentally after training
  • the raw text artifacts have been inspected instead of treated as generic noise
  • the cleaning policy has a reason behind it
  • the likely failure patterns are already visible in the notes

That gives you a practical evaluation checklist before modeling:

  1. Read a small but varied sample of raw texts from across labels, sources, or services.
  2. Write down the edge cases that keep changing the interpretation: negation, family history, temporality, uncertainty, copied boilerplate, abbreviations.
  3. List the words and phrases that must survive cleaning if the task is to remain visible.
  4. Decide what unit is being classified: whole note, note section, message, sentence, or extracted snippet. Later, when the notation says “document,” it means this chosen unit.
  5. Freeze the train, validation, and test split strategy before vocabulary building, and fit any data-driven vocabulary on the training split only so leakage does not enter through preprocessing.

In scikit-learn, this is the stage where you decide what counts as one row and what preprocessing happens before CountVectorizer or TfidfVectorizer ever sees the text.

If you cannot yet say what should count as positive, what should count as negative, what should remain ambiguous, and which textual cues carry those distinctions, you are not really ready to evaluate a model. You are still evaluating the task definition.
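Step 5 of the checklist can be sketched in a few lines of scikit-learn. This is a minimal sketch on toy snippets standing in for cleaned notes, not a full experiment:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy snippets standing in for cleaned notes; labels mark "active chest pain".
texts = [
    "reports chest pain today",
    "denies chest pain",
    "family history cad",
    "chest pain resolved",
]
labels = [1, 0, 0, 0]

# Freeze the split first...
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

# ...then fit the vocabulary on the training split only, so no test-set
# wording leaks into preprocessing.
vectorizer = CountVectorizer()
M_train = vectorizer.fit_transform(X_train)
M_test = vectorizer.transform(X_test)  # test texts reuse the frozen vocabulary
```

The order is the point: the split exists before any data-driven vocabulary does.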

Cleaning Should Preserve What Flips The Label

Cleaning is often described as if it were a neutral preparation step. In clinical text, it is already part of the model.

Take a raw fragment like this:

[**Name**] reports CP x2d; denies SOB. Family hx CAD.

One cleaned version might be:

<redacted> reports chest pain 2d denies shortness of breath family history cad

That line looks simple only because many choices have already been made.

You expanded CP to chest pain. You expanded SOB to shortness of breath. You decided that hx means history. You replaced a de-identification marker with a stable placeholder. You lowercased the text, removed punctuation, and collapsed repeated whitespace. Every one of those operations changes what the model can count.

Some cleaning decisions are usually helpful in this setting:

  • expand clinically meaningful abbreviations before feature extraction
  • standardize de-identification markers to a small set of placeholders
  • normalize whitespace and obvious formatting noise
  • lowercase when case is not part of the task
  • consider preserving certain phrases as units when they are known to carry meaning

Other choices need more caution:

  • removing punctuation before you know whether section boundaries matter
  • deleting repeated headers before checking whether they correlate with note type
  • stripping dates and time words when temporality matters
  • collapsing every abbreviation the same way across specialties
  • applying a generic stop-word list without reading what it removes

In settings like this, the default cleaning principle should be conservative: preserve any cue that could plausibly flip the label, then simplify only after inspection.
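A minimal cleaning sketch along these lines might look as follows. The abbreviation map and placeholder choice are illustrative assumptions for this example, not a clinical standard:

```python
import re

# Illustrative abbreviation map; a real one would be reviewed per specialty.
ABBREVIATIONS = {"cp": "chest pain", "sob": "shortness of breath", "hx": "history"}

def clean_note(text: str) -> str:
    # Standardize de-identification markers to one stable placeholder.
    text = re.sub(r"\[\*\*.*?\*\*\]", "<redacted>", text)
    text = text.lower()
    # Tokenize conservatively, keeping the placeholder intact as one token.
    tokens = re.findall(r"<redacted>|[a-z0-9]+", text)
    # Expand clinically meaningful abbreviations before feature extraction.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return " ".join(tokens)

cleaned = clean_note("[**Name**] reports CP x2d; denies SOB. Family hx CAD.")
# -> "<redacted> reports chest pain x2d denies shortness of breath family history cad"
```

Note that every branch of this function is one of the decisions discussed above, made explicit and testable.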

A Better Stop-Word Strategy Starts Conservative

The usual warning is that generic stop-word lists can be dangerous in medical text. That warning is correct, and the same caution applies in other domains where function words and short phrases carry the label. A good baseline needs a better replacement strategy.

Consider the sentence:

No active chest pain today.

A generic English stop-word list may decide that no and today are unimportant. The result is a representation that over-emphasizes active, chest, and pain while suppressing the words that tell you the symptom is absent now. The same problem appears in:

  • Denies chest pain
  • Without shortness of breath
  • Cannot rule out infection
  • Family history of diabetes
  • Chest pain resolved

In all of these, the clinically important difference is often carried by short function words or short phrases. Removing them does not “denoise” the note. It erases the distinction the task depends on.

My own recommendation for a first clinical baseline is simple:

  1. Start with no automatic stop-word removal.
  2. Expand abbreviations and normalize placeholders first.
  3. Protect meaning-bearing cues explicitly: no, not, without, denies, family, history, today, prior, resolved, rule out.
  4. Remove or collapse only those high-frequency artifacts that you have already inspected and judged irrelevant to the task.
  5. Compare model behavior with and without any custom stop-word policy before you keep it.

For the chest-pain task, a safer custom policy might look like this:

  • keep negation and temporality words
  • keep attribution words such as family and history
  • map cp to chest pain, sob to shortness of breath, hx to history
  • replace [**Name**] and similar markers with <redacted>
  • optionally collapse repeated administrative boilerplate or section headers into a small marker set if they dominate the corpus and are demonstrably unrelated to the label

That strategy does more work up front, but it aligns with the actual problem. In any domain where small function words or short phrases carry the decision boundary, a stop-word policy should be treated as a task-specific modeling choice, not borrowed from generic English preprocessing recipes.
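One way to implement such a policy in scikit-learn is to start from the generic English list and subtract the protected cues before passing it in. The protected set here is an illustrative choice, not a fixed clinical lexicon:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Protect negation, attribution, and temporality cues explicitly.
PROTECTED = {"no", "not", "without", "denies", "family", "history",
             "today", "prior", "resolved", "rule", "out"}
custom_stop_words = sorted(set(ENGLISH_STOP_WORDS) - PROTECTED)

vectorizer = TfidfVectorizer(stop_words=custom_stop_words)
vectorizer.fit(["no active chest pain today", "denies chest pain"])
# "no" and "today" now survive into the vocabulary instead of being dropped
```

Comparing this against `stop_words="english"` and against no stop-word removal at all, on the same split, is the cheapest way to see what the policy actually costs.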

Sparse Baselines Need A Little Notation

Before going deeper into sparse features, it helps to define a few symbols that appear repeatedly in NLP papers and libraries. Two pieces of terminology usually cause confusion here: what counts as a “document,” and where the vocabulary comes from.

Start with the plain-English picture.

In NLP notation, a document is just one unit of text that becomes one row of your feature matrix. It does not have to mean a full physical document. If you classify whole notes, then each note is a document. If you classify sentences, then each sentence is a document. If you classify extracted snippets around a symptom mention, then each snippet is a document.

The corpus is just the collection of those text units. In this post, the units are usually clinical notes or extracted snippets, but the same setup works for reports, tickets, abstracts, or messages.

Formally, let the corpus be

$$ D = \{d_1, d_2, \ldots, d_n\} $$

where each $d$ is one chosen text unit for the task.

The vocabulary is the list of terms that get tracked as features. A term may be a single word or a short phrase.

That list is not usually assumed in advance. For a first sparse baseline, the practical workflow is usually:

  1. choose the unit of classification
  2. freeze the train, validation, and test split
  3. apply the cleaning and abbreviation-expansion policy
  4. fit the vocabulary on the training texts only
  5. reuse that fitted vocabulary for validation and test texts

A small normalization dictionary is often worth defining by hand before this step. Mapping cp -> chest pain, sob -> shortness of breath, and hx -> history is very different from hand-writing the entire modeling vocabulary. In most first baselines, the normalization rules are partly hand-built, but the vocabulary itself is learned from the cleaned training corpus.

In practice, a first pass is often CountVectorizer(ngram_range=(1, 2)) or TfidfVectorizer(ngram_range=(1, 2)) fit on the training split, followed by LogisticRegression or LinearSVC so the learned weights stay easy to inspect.

For example, if your cleaned training snippets are:

  • reports chest pain today
  • denies chest pain
  • family history cad

then a simple unigram vocabulary learned from those training texts might contain

  • reports
  • chest
  • pain
  • today
  • denies
  • family
  • history
  • cad

If you allow short phrases, that vocabulary might also include terms such as chest pain and family history.

A useful rule of thumb is to let the training data define most of the vocabulary, while you define only the normalization map and a small number of must-keep phrases. I would import a fixed external vocabulary only when the task already depends on a controlled lexicon or ontology and you are willing to miss informal phrasing that never made it into that list.
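Fitting a vectorizer on the three snippets above recovers exactly that learned vocabulary, and allowing bigrams adds the phrase terms:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_snippets = [
    "reports chest pain today",
    "denies chest pain",
    "family history cad",
]

# Unigram vocabulary learned from the training texts only.
unigrams = CountVectorizer().fit(train_snippets)
print(sorted(unigrams.vocabulary_))
# ['cad', 'chest', 'denies', 'family', 'history', 'pain', 'reports', 'today']

# Allowing short phrases adds terms such as "chest pain" and "family history".
phrases = CountVectorizer(ngram_range=(1, 2)).fit(train_snippets)
```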

Formally, let the vocabulary be

$$ V = \{t_1, t_2, \ldots, t_p\} $$

where each term $t$ is one element of that tracked vocabulary.

Now think about the simplest measurable quantity: how many times does a term appear in a text?

Formally, the raw count of term $t$ in document $d$ is

$$ c(t, d) $$

and the number of documents, meaning text units, containing that term is called the document frequency:

$$ \text{df}(t) = |\{d \in D : c(t, d) > 0\}| $$

Once you choose a vocabulary and a weighting scheme, each document, meaning each text unit, turns into a feature vector.

Formally, the feature vector for document $d$ is usually written as

$$ x_d = (x_{d,1}, x_{d,2}, \ldots, x_{d,p})^T $$

where each component corresponds to one vocabulary term.

One more quantity shows up repeatedly in the next sections. In plain English, term frequency asks: how prominent is this term inside this document?

The most direct translation to math is a within-document proportion:

$$ \text{tf}(t, d) = \frac{c(t, d)}{\sum_{t' \in V} c(t', d)} $$

This turns raw counts into within-document proportions. Different libraries make slightly different choices here, but the core idea stays the same: how prominent is this term inside this document?
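These quantities are easy to compute by hand, which is a good way to pin the notation down. A small sketch on a toy corpus of three text units:

```python
from collections import Counter

# Toy corpus: each string is one document, meaning one chosen text unit.
corpus = [
    "reports chest pain today",
    "denies chest pain",
    "family history cad",
]
docs = [Counter(d.split()) for d in corpus]

def c(t, d):
    """Raw count of term t in document d."""
    return docs[d][t]

def df(t):
    """Document frequency: number of documents containing t."""
    return sum(1 for d in docs if d[t] > 0)

def tf(t, d):
    """Term frequency as a within-document proportion."""
    return c(t, d) / sum(docs[d].values())

print(df("chest"))    # 2 of the 3 documents contain "chest"
print(tf("pain", 0))  # 1 of the 4 tokens in document 0 is "pain" -> 0.25
```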

When inverse-document-frequency weighting appears later, I will write it as $\text{idf}(t)$. In plain English, that quantity is meant to reward terms that are relatively rare across the corpus. I will first describe the simple ratio behind it, then show the log-transformed and smoothed form that is used more often in practice.

These symbols are enough for the rest of the discussion.

Why Count-Based Features Still Earn Their Place

Sparse count-based features can look old-fashioned next to embeddings and transformers. That is a poor reason to skip them.

They earn their place for at least four reasons.

First, they are inspectable. You can see which words and phrases exist in the vocabulary, which terms receive high weights, and which terms appear in model errors.

Second, they are fast. That matters when you are iterating on label definitions, cleaning policies, and extracted snippets rather than only on model architecture.

Third, they preserve lexical visibility. If the task depends on denies, family history, or resolved, sparse features make it obvious whether those cues survived preprocessing at all.

Fourth, they play the role that descriptive analysis often plays in other quantitative work. They tell you whether the problem is driven by clear local language, by label noise, by boilerplate, or by cross-sentence reasoning that the baseline cannot capture.

A good baseline sets a lower bound on performance and makes the problem legible.

Bag-Of-Words Shows The Lexical Surface

Start with the simplest case. Let the vocabulary contain only single words. Then the bag-of-words representation is just the count vector

$$ x_d^{\text{bow}} = (c(t_1, d), c(t_2, d), \ldots, c(t_p, d))^T $$

If the vocabulary is

$$ V = (\text{reports}, \text{denies}, \text{chest}, \text{pain}, \text{history}, \text{today}) $$

then the note

Denies chest pain today.

might map to

$$ x_d^{\text{bow}} = (0, 1, 1, 1, 0, 1)^T $$

This is already useful. It surfaces whether words associated with negation, attribution, or timing are even visible to the model. For an early baseline, that visibility matters more than elegance.

Bag-of-words also teaches you where the problem becomes blurry. The following notes still overlap heavily in single-word space:

  • Active chest pain today
  • Chest pain improved today
  • Family history of chest pain

All three contain chest and pain. A single-word baseline can distinguish some of them if other words survive cleaning, but it weakens the local phrase structure that clinicians actually read.

That is why bag-of-words is a good first baseline and an incomplete one. It exposes the lexical surface of the task. It does not preserve much local composition.

Short Phrases Repair Local Meaning

The next move is to let the vocabulary include short phrases in addition to, or instead of, single words. In NLP terms, these consecutive sequences are called n-grams.

For example, the phrase vocabulary might include:

  • chest pain
  • denies chest pain
  • family history
  • rule out infection
  • shortness of breath

This matters because many clinically meaningful distinctions are local.

Denies chest pain carries different evidence from the separate words denies, chest, and pain. Family history is a stronger attribution cue than either word alone. Cannot rule out infection expresses uncertainty in a way that isolated single words do not.

In notation, nothing fundamental changes. The vocabulary $V$ simply contains phrase-level terms as well. The count function $c(t, d)$ then counts phrases when $t$ is a phrase.

For the chest-pain task, short phrases repair some of what bag-of-words loses:

  • no active helps preserve negation
  • chest pain keeps the symptom concept together
  • family history preserves attribution
  • pain improved preserves a local temporal cue

This is often enough to improve a baseline substantially. It is still local. If the real cue depends on longer reasoning across sentences, sparse phrases will struggle:

  • Chest pain yesterday. No chest pain today.
  • Mother had diabetes. Patient denies diabetes.
  • Rule out ACS in the differential, but current pain resolved after treatment.

The implementation can be perfectly sound and still miss these cases because the representation is too local in scope.

TF-IDF Reweights The Same Vocabulary

TF-IDF is often introduced as though it were a separate kind of text model. A cleaner way to think about it is as a weighting scheme applied to a vocabulary you have already chosen.

Start with the plain-English goal. We want a term to matter more when it is prominent in this note and less when it appears in almost every note in the corpus.

The within-note part is term frequency. The most direct mathematical translation is the proportion of all tracked terms in $d$ that belong to $t$:

$$ \text{tf}(t, d) = \frac{c(t, d)}{\sum_{t' \in V} c(t', d)} $$

This gives a within-document weight. Terms that occur more often in the note receive larger values.

Now move to the across-corpus part. In plain English, a term should get less extra credit if it appears in nearly every document. A direct translation of that idea is the rarity ratio

$$ \frac{|D|}{\text{df}(t)} $$

which grows when a term appears in fewer documents.

That direct ratio is useful for intuition, but it is rough in practice. It can grow too aggressively for very rare terms, and it behaves awkwardly at boundary cases. That is why most implementations refine it with smoothing and a log transform:

$$ \text{idf}(t) = \log\left(\frac{|D| + 1}{\text{df}(t) + 1}\right) + 1 $$

The $+1$ terms smooth the ratio so the extremes are less brittle. The logarithm compresses the gap between rare and very rare terms. The final $+1$ keeps the weight from collapsing to zero in the common formulation used by many libraries.

The TF-IDF weight is then

$$ x_{d,t}^{\text{tfidf}} = \text{tf}(t, d) \cdot \text{idf}(t) $$

So the final intuition is straightforward: a term matters more when it is prominent in this note and relatively uncommon across the corpus.

For clinical notes, and for other corpora with repeated boilerplate, this can help when generic language dominates the corpus. Terms like patient, follow up, or repeated template fragments may appear so often that raw counts overstate their value. Terms like nitroglycerin, radiating, or exertion may receive more emphasis if they are rarer and task-relevant.

Used well, TF-IDF can make the baseline less sensitive to bland note boilerplate. Used carelessly, it can also boost the wrong things.

TF-IDF Can Sit On Top Of Words Or Phrases

One confusion shows up often in introductory NLP discussions. People compare “bag-of-words” and “TF-IDF” as if they were parallel feature families. That framing hides an important design distinction.

There are really two separate choices:

  1. What goes into the vocabulary?
  2. How are the vocabulary terms weighted?

The vocabulary may contain only single words. It may contain short phrases. It may contain both. TF-IDF can be applied in all of those settings.

That means all of the following are valid:

  • unigram counts
  • unigram TF-IDF
  • bigram counts
  • bigram TF-IDF
  • mixed word-and-phrase TF-IDF

This matters in practice. If short phrases such as family history or denies chest pain are clinically important, the useful design move may be to add phrase features and then apply TF-IDF, instead of treating phrase choice and TF-IDF as competing options.

Vocabulary design and weighting should be treated as separate decisions.

TF-IDF Fails When Its Assumptions Fail

TF-IDF is often useful, but it carries assumptions. Clinical text makes the failure modes easy to see, and the same issues appear in many other domains with boilerplate, domain drift, or long-range context.

One assumption is that rarer terms are usually more informative. Sometimes that is true. Sometimes it is badly misleading. In a cardiology-heavy corpus, chest pain may be common precisely because it is central to the task. A low inverse-document-frequency weight does not make it clinically unimportant.

Another assumption is that repeated mentions inside a note deserve extra emphasis. Templates can break that logic. If a copied section repeats a term several times, TF-IDF may reward repetition that reflects documentation style rather than patient state.

A third assumption is that corpus-wide statistics are stable enough to be trustworthy. In small institutional corpora, across-specialty corpora, or drifting note collections, document frequency can change for reasons that have little to do with the classification target. A term that looks rare and decisive in one service line may be routine in another.

A fourth assumption is that local lexical evidence is enough. Clinical text often asks for more. Consider:

  • Cannot rule out infection
  • Chest pain yesterday. No chest pain today.
  • Family history of CAD. Patient denies chest pain.

The first requires reading uncertainty correctly. The second requires temporality across clauses. The third requires attribution and local negation together. TF-IDF may weight infection, chest pain, or CAD strongly while still missing the clinical interpretation that matters.

There is also a simpler failure mode. Some words are common and still crucial. No, denies, today, prior, and history may all appear frequently enough to receive modest inverse-document-frequency weights. That does not reduce their importance for the task. It only means that corpus rarity is not the same thing as clinical importance.

TF-IDF works best when it is treated as a useful reweighting heuristic. It does not guarantee that the most clinically important terms will float to the top automatically.

A Good Baseline Is A Reusable Decision Process

By the time you fit the first model, most of the baseline has already been decided.

A good workflow usually looks like this:

  1. Freeze the train, validation, and test split before fitting the vocabulary or inverse-document-frequency statistics.
  2. Compare a small number of representations on the same split: single-word counts, short-phrase counts, and one or two TF-IDF variants.
  3. Fit simple linear or count-based models that make feature inspection easy.
  4. Read the highest-weighted positive and negative features.
  5. Read false positives and false negatives in raw text, grouped by pattern.
  6. Decide whether the next move should be better cleaning, better labels, better phrase features, or a more expressive model.

A decent benchmark is useful, but the deeper value of a baseline is that it exposes what the task is made of.

That is also why I would keep threshold tuning and workflow cost near the end of this process rather than at the very beginning. Those decisions matter, especially in high-stakes settings, but they matter more after you understand what the representation is actually capturing.

In an earlier essay, Why NLP Still Matters in the Age of AI Agents, I argued that language remains part of the system rather than a thin wrapper around it. A good baseline is where that claim becomes operational. It shows which cues survived cleaning, which phrases deserve to stay intact, which weights help, and which errors are asking for better labels rather than a larger model.

Clinical text is just the running example. The broader takeaway is this: when small cues flip the label, the first job of a baseline is to make those cues visible. If it cannot do that, a higher score will not teach you much.