We disentangle population-level risk composition from model-specific performance when subgroups enter with different risk profiles.
(Core idea of the adjusted TPR; some image components generated by NanoBanana 2)
In part 1, we saw that a raw true-positive-rate gap can be a mirror of baseline risk differences, not just model behavior. In part 2, we make that distinction operational and show how to measure fairness once those risk differences are held constant.
In clinical predictive modeling, fairness questions often arrive as a single number:
“Are the TPRs similar across groups at the threshold we use?”
That is a reasonable first question. In many governance meetings, it is also the only one that gets asked.
The problem is that a raw TPR comparison bundles together two different things:
- how the model behaves at a given risk level, and
- how risk is distributed in the population before the model ever makes a decision.
From part 1, the key insight was that groups can have different risk distributions before a model is even applied. If that happens, then even a model that is equally calibrated and equally separable by risk can still produce unequal TPRs at a fixed threshold.
Here is a realistic version of that problem. Suppose a health system uses a 6-month mortality model to suggest palliative care review for patients admitted with advanced cancer, heart failure, or COPD. One subgroup contains more patients who arrive after repeated hospitalizations, active weight loss, and functional decline, so many of their predicted risks sit around 0.30 to 0.50. Another subgroup contains more patients who are still clinically fragile, but present earlier in their disease course, with many scores around 0.10 to 0.25. Even if the model treats two patients with the same true risk similarly, the raw TPR can still be higher in the first subgroup simply because more eventual deaths are concentrated near or above the intervention threshold.
So the practical follow-up is: what should fairness evaluation compare?
Not raw performance in each group.
Conditional performance in a shared risk world.
This post introduces adjusted TPR (aTPR) as a concrete way to do that.
Why the raw TPR is an entangled quantity
Recall the subgroup TPR definition:
$$ \text{TPR}_s(\tau)=\Pr\big(g(X)>\tau\mid Y=1,S=s\big) $$with decision rule $\tau$ and subgroup indicator $S\in\{1,2,\ldots\}$.
On paper, this looks straightforward: among the patients who truly experience the outcome, what fraction got scores above the operational threshold?
But in a risk-prediction setting, that quantity is already mixing over the entire subgroup risk profile. A sepsis alert model, a readmission model, and a six-month mortality model all face the same issue: the subgroup TPR is not just about who gets flagged at a given level of need, but also about where the subgroup’s patients sit on the risk spectrum to begin with.
Take a sepsis example. Imagine one group includes many patients transferred from skilled nursing facilities with hypotension, delirium, and abnormal labs already present in the first hour. Another group includes more ambulatory patients who arrive earlier, with subtler signs that only evolve later. If the hospital audits TPR at the initial alert threshold, the first group may look better served even if the alert is equally well calibrated in both groups, because the initial risk distribution is shifted upward before the model does anything.
For the purposes of this discussion, rewrite the subgroup TPR in terms of calibrated risk $r_s = r_s(g(X)) = \Pr(Y=1\mid g(X), S=s)$:
$$ \text{TPR}_s(\tau)= \frac{\int_{\tau}^{1} r\, f_s(r) \,dr}{\int_{0}^{1} r\, f_s(r) \,dr} $$where $f_s(r)$ is the density of calibrated risk in subgroup $s$. This decomposition makes the source of the problem explicit:
- The model-level decision rule (the threshold $\tau$).
- The subgroup risk distribution $f_s(r)$.
If the model behaves the same for equal risk in each subgroup but the $f_s$ differ, the ratio above changes anyway.
That is the whole point in one line: raw TPR is not a pure readout of model behavior.
For a clinical picture, suppose the decision threshold is $\tau = 0.30$ for an inpatient deterioration model that triggers rapid-response review. Group A includes many patients with advanced malignancy, severe frailty, and frequent prior admissions, so their calibrated risks often sit around 0.35 to 0.50. Group B includes more patients admitted earlier in the course of chronic disease, with many risks around 0.05 to 0.20 and only a small tail above 0.30. Even if the score-to-risk relationship is equally valid in both groups, Group A can easily end up with a higher raw TPR because more of its positive cases live in the part of the risk distribution that lies above the threshold.
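That picture can be simulated directly. The sketch below assumes Beta-distributed calibrated risks with hypothetical parameters chosen to mimic the two groups; outcomes are drawn from the risks themselves, so the score-to-risk relationship is identical in both groups by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def raw_tpr(risks, rng, tau=0.30):
    # Draw outcomes from the calibrated risks themselves, so the
    # score-to-risk relationship is identical in both groups.
    y = rng.random(risks.size) < risks
    return np.mean(risks[y] > tau)

# Group A: risks concentrated around 0.3-0.5; Group B: around 0.05-0.25.
risks_a = rng.beta(8, 12, size=200_000)   # mean 0.40
risks_b = rng.beta(2, 14, size=200_000)   # mean 0.125

tpr_a, tpr_b = raw_tpr(risks_a, rng), raw_tpr(risks_b, rng)
print(f"raw TPR A: {tpr_a:.2f}, raw TPR B: {tpr_b:.2f}")
```

Despite identical conditional behavior, the raw TPR gap between the two simulated groups is large, driven entirely by where the risk mass sits relative to the threshold.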
So:
- same model behavior + different $f_s$ can still give different TPRs;
- raw TPR does not isolate model fairness.
That is the confounding pattern we want to remove.
Can we do better?
Think of a counterfactual evaluation:
If subgroup s had the same risk-score distribution as a reference population, what would its TPR be?
Choose one reference subgroup (often the group with the largest sample or policy relevance, call it $S=1$). Let $f_{\text{ref}}(r)$ be that reference density.
This is the same basic logic as standardization in epidemiology. We are not pretending the subgroup actually has a different patient mix. We are asking what the subgroup’s TPR would look like if we evaluated it on the same risk profile as the reference population.
Imagine a review committee comparing readmission performance across two insurance groups. One group includes many medically complex patients discharged to unstable housing or fragmented outpatient follow-up, so the risk distribution is shifted upward. The other includes more patients with reliable follow-up and lower near-term risk. aTPR asks the committee to pause and compare the model under a common risk distribution before concluding that the higher-TPR group received a fairer model.
Define the subgroup-specific importance weight:
$$ w_s(r)=\frac{f_{\text{ref}}(r)}{f_s(r)} $$Applying this weight to the TPR gives the “adjusted” TPR, or aTPR for short.
$$ \mathrm{aTPR}_s(\tau)= \frac{E\left[\mathbb{1}(r>\tau)\,r\,w_s(r)\right]} {E\left[r\,w_s(r)\right]}. $$This expression is almost identical to the TPR formula above, but it is evaluated under a reweighted distribution in which subgroup $s$ is transported onto the reference risk profile.
Intuitively this means that the weight $w_s(r)$ upweights risk values that are underrepresented in subgroup $s$ relative to the reference group, and downweights those that are overrepresented. After that reweighting, the TPR is being computed in a shared risk landscape rather than two different ones.
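A minimal discrete sketch makes the mechanics concrete. The risk grid and subgroup masses below are hypothetical; the point is that with a shared score-to-risk mapping, reweighting subgroup $s$ by $w_s(r)=f_{\text{ref}}(r)/f_s(r)$ makes its aTPR coincide with the reference TPR.

```python
import numpy as np

tau = 0.30
r = np.array([0.05, 0.15, 0.25, 0.35, 0.45])        # calibrated risk grid

# Hypothetical probability masses of calibrated risk in each subgroup.
f_ref = np.array([0.10, 0.15, 0.20, 0.30, 0.25])    # reference group
f_s   = np.array([0.40, 0.30, 0.15, 0.10, 0.05])    # subgroup s, shifted low

def tpr(r, f, tau):
    # TPR_s(tau) = sum_{r > tau} r f(r) / sum_r r f(r)
    above = r > tau
    return (r * f)[above].sum() / (r * f).sum()

def atpr(r, f, w, tau):
    # same ratio, with each risk level reweighted by w_s(r)
    above = r > tau
    return (r * f * w)[above].sum() / (r * f * w).sum()

w = f_ref / f_s                                     # importance weights
print(f"raw TPR ref: {tpr(r, f_ref, tau):.3f}")
print(f"raw TPR s  : {tpr(r, f_s, tau):.3f}")
print(f"aTPR s     : {atpr(r, f_s, w, tau):.3f}")   # equals the reference TPR
```

Because the conditional behavior is identical by construction, the adjusted gap collapses to zero even though the raw gap is large.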
That is why I believe aTPR is useful in practice. It answers a more targeted question:
If two subgroups were compared under the same distribution of underlying risk, would their TPRs still look different?
A gap on this scale is therefore
$$ \Delta^{\text{(adj)}}=\mathrm{aTPR}_2(\tau)-\mathrm{aTPR}_1(\tau). $$This means
- If this gap is near zero, observed raw inequality was likely driven by population risk composition.
- If this gap remains material, remaining inequality is more plausibly model-behavioral.
In that sense, aTPR is not anti-fairness or pro-fairness by itself. It is a deconfounding step.
For example, imagine a 30-day readmission model used by discharge planning to trigger transitional-care calls. One demographic group has more patients with dialysis dependence, prior admissions, and medication complexity, while the other has more patients discharged after isolated, lower-acuity medical events. Suppose the raw TPR gap is 7 percentage points across those groups, but after risk standardization the gap falls to 1 percentage point. That does not prove the system is fair in every sense, but it does tell you something important: most of what looked like unequal opportunity in the raw metric was actually tied to different underlying risk composition.
aTPR estimation steps
The derivation above suggests a simple implementation recipe. In practice, I think it helps to frame it as a four-step workflow:
- Get risk on a trustworthy scale.
- Learn how subgroup risk distributions differ.
- Reweight one subgroup onto the reference risk profile.
- Recompute the TPR under that shared profile.
Does this sound too abstract? Let’s keep one concrete picture in mind: a mortality model with a threshold for triggering goals-of-care review. A hospital quality team notices that Spanish-speaking patients have a lower raw TPR than English-speaking patients and wants to know whether the model is actually handling equal-risk patients differently, or whether the two groups simply enter the hospital with different concentrations of severity, cancer burden, and prior utilization. We are not changing the model; we are statistically changing the evaluation population so that the subgroup comparison is made on equal footing.
Step 1: Calibrate subgroup-specific risk
For each subgroup and score value, estimate calibrated risk:
$$ \hat r_{s}(g)=\Pr(Y=1\mid g(X)=g, S=s). $$Without this step, importance weighting can amplify any existing miscalibration.
This matters clinically. If a score of 0.40 means a true 40% event probability in one subgroup but only 25% in another, then any fairness analysis built on that score scale is already standing on unstable ground. In a triage workflow, that is the difference between “this patient truly belongs near the top of the callback list” and “this patient is being treated as much sicker than they really are.”
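As a sketch of Step 1, here is a simple quantile-binned recalibration, a crude stand-in for isotonic or Platt-style methods; the simulated score and risk distributions are hypothetical.

```python
import numpy as np

def calibrate_binned(scores, y, n_bins=10):
    """Map each raw score to the observed event rate in its score
    quantile bin: a crude subgroup-specific calibration, standing in
    for isotonic or Platt-style methods."""
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    bins = np.searchsorted(edges, scores, side="right") - 1
    rates = np.array([y[bins == b].mean() for b in range(n_bins)])
    return rates[bins]

rng = np.random.default_rng(1)
true_risk = rng.beta(2, 6, size=50_000)      # hypothetical true risks
scores = true_risk ** 0.5                    # systematically miscalibrated score
y = rng.random(true_risk.size) < true_risk   # outcomes follow the true risks
r_hat = calibrate_binned(scores, y)          # calibrated risk estimates
```

On real data this would be fit separately within each subgroup, since the point of Step 1 is a subgroup-specific risk scale $\hat r_s(g)$.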
Step 2: Estimate density ratio
Estimate $w_s(r)$ from the risk score distributions. A stable route is to model
$$ \log\frac{\hat f_{\text{ref}}(r)}{\hat f_s(r)} $$directly (for example using density ratio methods or logistic odds style estimation) rather than fitting both marginals separately.
Conceptually, this is the part that asks: where is subgroup $s$ underrepresented or overrepresented relative to the reference population along the risk axis?
In a real dashboard, you would often see this visually as one group piling up in the low-risk bins while another has many more patients in the moderate-risk range where the intervention threshold sits. The density ratio is just the formal way of correcting for that mismatch.
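One concrete route for Step 2 is the classifier trick for density ratios: fit a probabilistic classifier that distinguishes reference from subgroup samples along the risk axis, and convert its class-prior-corrected odds into $\hat f_{\text{ref}}(r)/\hat f_s(r)$. The feature basis and the Beta-distributed toy data below are assumptions for this sketch, not the paper's method.

```python
import numpy as np

def density_ratio_logistic(r_ref, r_s, n_iter=25):
    """Estimate w_s(r) = f_ref(r) / f_s(r) via logistic odds: label
    reference samples 1 and subgroup samples 0, fit a logistic model on
    the (assumed) basis [1, log r, log(1-r)], and correct for class sizes."""
    r = np.concatenate([r_ref, r_s])
    z = np.concatenate([np.ones(r_ref.size), np.zeros(r_s.size)])

    def features(r):
        return np.column_stack([np.ones_like(r), np.log(r), np.log1p(-r)])

    X, beta = features(r), np.zeros(3)
    for _ in range(n_iter):  # Newton / IRLS updates
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (z - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(3)
        beta = beta + np.linalg.solve(hess, grad)

    def w(r_new):
        logit = np.clip(features(r_new) @ beta, -30, 30)
        return np.exp(logit) * (r_s.size / r_ref.size)  # prior-corrected odds
    return w

rng = np.random.default_rng(3)
r_ref = rng.beta(6, 10, size=100_000)   # hypothetical reference risks
r_s = rng.beta(3, 9, size=100_000)      # hypothetical subgroup risks
w_hat = density_ratio_logistic(r_ref, r_s)(r_s)
```

A quick check on such an estimate: reweighting the subgroup by `w_hat` should approximately recover reference moments, e.g. `np.average(r_s, weights=w_hat)` should land near `r_ref.mean()`.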
Step 3: Compute weighted TPR at chosen threshold
Using sample index $i$ in subgroup $s$,
$$ \widehat{\mathrm{aTPR}}_s(\tau)= \frac{\sum_i \mathbf{1}(\hat r_i>\tau)\,\hat r_i\,\hat w_i} {\sum_i \hat r_i\,\hat w_i}. $$This is an importance-weighted average of true-positive risk mass above threshold, normalized by weighted overall positive mass.
If you like sanity checks, this step should reduce to the ordinary subgroup TPR when the subgroup already has the same risk distribution as the reference group, because then $w_s(r)\approx 1$ across the support.
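The estimator and that sanity check fit in a few lines; the simulated risks below are hypothetical, with outcomes drawn to be consistent with calibration.

```python
import numpy as np

def atpr_hat(r_hat, w_hat, tau):
    """Plug-in aTPR: importance-weighted positive risk mass above tau,
    normalized by the weighted total positive risk mass."""
    above = r_hat > tau
    return np.sum(r_hat[above] * w_hat[above]) / np.sum(r_hat * w_hat)

rng = np.random.default_rng(2)
tau = 0.30
r = rng.beta(3, 9, size=200_000)   # hypothetical calibrated risks
y = rng.random(r.size) < r         # outcomes consistent with calibration

# Sanity check from the text: with w = 1 (subgroup already matches the
# reference), the estimator should match the ordinary empirical subgroup
# TPR up to sampling noise.
plug_in = atpr_hat(r, np.ones_like(r), tau)
empirical = np.mean(r[y] > tau)
print(plug_in, empirical)
```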
Step 4: Compare and interpret
Report both:
- Raw TPR gap
- Adjusted gap $\Delta^{\text{adj}}$
and discuss the direction of movement:
- gap shrinks after adjustment -> composition effect dominates;
- gap persists or widens -> model-induced disparity is more likely.
That side-by-side reporting is important. The adjusted metric should usually complement the raw metric, not replace it. Decision-makers need to see both the operational disparity and the risk-standardized disparity.
For instance, an ICU transfer model may still create a real operational burden if one subgroup has a lower raw TPR, even when aTPR suggests most of that gap comes from risk composition. The oversight team still needs both views: what patients and clinicians experienced, and what the model appears to be doing after risk is equalized.
How to interpret “what remains” after adjustment
This part is critical in governance discussions, because teams often over-interpret a single number.
Suppose a mortality model under the same clinical threshold shows:
- raw TPR(A) > raw TPR(B) by 6 pp,
- aTPR(A) $\approx$ aTPR(B).
The evidence suggests that the raw gap is mostly risk-composition driven. In plain language: the model may not be treating equal-risk patients differently; it may be seeing two populations with different concentrations of risk above and below the operational threshold.
Picture a cancer center that uses the model to prompt early supportive-care outreach. Group A includes more patients referred late, after multiple lines of therapy and worsening functional status. Group B includes more patients connected earlier through subspecialty clinics, with more low-to-moderate predicted risks at admission. A higher raw TPR in Group A would be unsurprising even if equal-risk patients are handled similarly.
In that setting, intervening with different subgroup thresholds to equalize raw TPR may risk over-correcting for a disparity that was not primarily caused by conditional model behavior.
Now consider a different example. Suppose an ED triage model for short-term deterioration shows a raw TPR gap of 5 percentage points, and after adjustment the gap is still 4.5 points. That is a very different signal. Once the groups are compared under the same risk profile, the disparity is still there. At that point, it becomes much harder to explain the gap as “just population composition.”
One realistic scenario would be a model that relies heavily on documentation patterns, prior utilization, or lab availability. If patients in one subgroup tend to have delayed labs entered, fewer prior encounters in the system, or different charting patterns at arrival, the model may continue to miss them even after risk standardization. That is closer to the kind of fairness concern people usually mean when they suspect model-driven disparity.
In such cases, where adjustment shifts the gap only slightly and the aTPR difference remains large, there is a cleaner fairness signal in conditional model behavior, at least with respect to risk-adjusted TPR.
So this framework does not say “no one should care about fairness” or “everyone is equally treated.” It says
Separate what is structural in the population from what is structural in the model.
That separation usually changes what intervention is appropriate.
The broader lesson is that aTPR helps you decide what kind of problem you are looking at:
- a population-composition problem,
- a model-behavior problem, or
- some mixture of both.
Generalization
The same structure generalizes to many metrics that are expectation-like functions of calibrated risk:
$$ M_s = E_{F_s}[\phi_s(r_s)] $$To compare subgroups under standardized risk, compute
$$ M_s^{\text{adj}} = E_{F_{\text{ref}}}[\phi_s(r_s)] =E_{F_s}[\phi_s(r_s) w_s(r_s)]. $$That means you can construct adjusted versions of related metrics where analogous decomposition issues arise:
- false-positive behavior analogs,
- negative predictive metrics,
- precision-oriented comparisons under distribution shift,
- and potentially metrics beyond classification settings.
aTPR is therefore best viewed as one concrete instance of a more general distribution-aware fairness normalization framework.
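Under those definitions, the generalization can be written as one reusable helper. The `atpr` and `afpr` functions below are illustrative ratio-form instances; `afpr` is my name for a false-positive analog (negative risk mass above threshold), not terminology from the source.

```python
import numpy as np

def adjusted_metric(phi, r, w):
    """M_s^adj = E_{F_s}[phi(r) w(r)]: a reweighted expectation of any
    function of calibrated risk."""
    return np.mean(phi(r) * w)

def atpr(r, w, tau):
    # adjusted TPR as a ratio of two adjusted expectations
    return (adjusted_metric(lambda r: r * (r > tau), r, w)
            / adjusted_metric(lambda r: r, r, w))

def afpr(r, w, tau):
    # illustrative false-positive analog: negative risk mass above tau
    return (adjusted_metric(lambda r: (1 - r) * (r > tau), r, w)
            / adjusted_metric(lambda r: 1 - r, r, w))

r = np.array([0.1, 0.2, 0.4, 0.6])   # toy calibrated risks
w = np.ones_like(r)                  # unit weights: identical distributions
print(atpr(r, w, tau=0.3), afpr(r, w, tau=0.3))
```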
Healthcare makes the motivation easy to see because baseline risk heterogeneity is everywhere: age structure differs, comorbidity burden differs, referral patterns differ, access to care differs, and disease prevalence differs. A safety-net hospital and a well-resourced academic referral center can feed very different patient mixes into the same model. So can a cardiology clinic and a general medicine ward. The same logic applies anywhere a fairness metric is aggregating over populations with different underlying score or risk distributions.
Assumptions, and where things can go wrong
Any adjusted estimate depends on assumptions. In practice, the failure modes are usually not philosophical. They are technical and operational.
Calibration quality
If the calibrated risk estimates are poor, both raw and adjusted metrics are unreliable, and aTPR is not immune to this.
This is especially relevant for clinical models transported across hospitals or service lines, where calibration drift is common. A readmission model calibrated in a tertiary center may look very different when moved to a community hospital with shorter stays, different discharge support, and different coding patterns.
Support overlap
When $f_s(r)$ and $f_{\text{ref}}(r)$ barely overlap, density ratios explode and effective sample size collapses. Trimming, smoothing, and overlap diagnostics should be part of any report.
If one subgroup has almost no patients in the high-risk range that is common in the reference group, then the reweighted estimate can become unstable very quickly. For example, if you compare a general obstetric population against an oncology inpatient population using the same reference distribution, the tails may barely overlap and the weights will behave badly.
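Two of those diagnostics are easy to compute: the Kish effective sample size, and a quantile cap on the weights. The heavy-tailed toy weights below are hypothetical.

```python
import numpy as np

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum w^2.
    Collapses toward 1 when a few extreme weights dominate."""
    return w.sum() ** 2 / np.square(w).sum()

def trim_weights(w, q=0.99):
    """Cap weights at their q-th quantile, a common stabilization step."""
    return np.minimum(w, np.quantile(w, q))

rng = np.random.default_rng(4)
w = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)  # heavy-tailed toy weights
print(effective_sample_size(w), effective_sample_size(trim_weights(w)))
```

When the effective sample size is a small fraction of the nominal one, the aTPR estimate should be reported with wide uncertainty or the comparison restricted to the region of common support.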
Reference choice
aTPR is relative by design. You must justify the choice of reference group or reference risk distribution:
- policy convenience,
- sample size and stability,
- representation goals.
There is no universally correct reference. The right choice depends on the decision context and should be stated explicitly rather than treated as a hidden default.
Threshold dependence
aTPR is still threshold-specific. Fairness signals can move with $\tau$. A full governance view usually inspects a threshold band or operating curve rather than a single number.
Trade-off awareness
Equalizing adjusted TPR is one objective among many. A model may improve aTPR parity while increasing false-positive burden in one subgroup. That is not a contradiction; it is a reminder that fairness is multidimensional.
This matters clinically because interventions are rarely free. A lower threshold may improve capture of true cases while also increasing alert fatigue, unnecessary workups, or burdensome outreach in one population. In practice that could mean more unnecessary rapid-response evaluations, more low-yield social work calls, or more palliative care consults triggered for patients who were never close to the intended risk zone.
A practical interpretation pattern for review meetings
In real review workflows, this framework may be most useful when communicated as a short sequence:
- Compute raw TPR gap.
- Compute aTPR gap.
- Read the decomposition:
- Raw gap ≈ 0, aTPR gap large -> likely model behavior issue.
- Raw gap large, aTPR gap small -> likely composition issue.
- Both large -> mixed problem, likely deserving deeper model diagnostics.
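That reading can be encoded as a tiny helper for reports; the `tol` threshold is an illustrative materiality cutoff, not a clinical standard.

```python
def gap_triage(raw_gap, adj_gap, tol=0.02):
    """Classify a (raw TPR gap, aTPR gap) pair into the triage reading.
    `tol` is a hypothetical materiality threshold, chosen per context."""
    raw_big, adj_big = abs(raw_gap) > tol, abs(adj_gap) > tol
    if raw_big and adj_big:
        return "mixed problem: deeper model diagnostics"
    if raw_big:
        return "composition issue"
    if adj_big:
        return "model behavior issue"
    return "no material gap"

print(gap_triage(raw_gap=0.07, adj_gap=0.01))  # composition-dominated example
```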
This gives fairness teams a clean triage structure before committing to retraining, threshold redesign, or subgroup-specific policy.
If I were presenting this in a hospital algorithm oversight meeting, I would make the message even more explicit: raw disparity tells you what operations are experiencing; adjusted disparity tells you how much of that disparity still looks like model behavior after equalizing risk composition.
One key takeaway
Raw metrics are easy to compute. Fairness-aware metrics are easy to misinterpret.
aTPR is a practical correction for one specific but common failure mode: asking a conditional question (are equal-risk people treated equally?) with an unconditional statistic (raw TPR). Once risk distributions are standardized, you can better identify where the model itself is at fault, and where group-level risk context is the explanatory factor.
That shift is the core message of this part: make fairness evaluation ask the right counterfactual.
Acknowledgment & Disclosure
This article builds on the mechanism described in:
Hegarty, S. E., Linn, K. A., Zhang, H., Teeple, S., Albert, P. S., Parikh, R. B., … & Chen, J. (2025). Assessing Algorithm Fairness Requires Adjustment for Risk Distribution Differences: Re-considering the Equal Opportunity Criterion. medRxiv.
I am collaborating with some of the authors on related research projects. The interpretations and broader perspectives presented here are my own.