The Score Is Not Enough: Building Reviewable Risk-Adjustment Software

Risk-adjustment software earns trust when it preserves enough of the scoring path for another person, or another tool, to inspect it.

Figure: clinical risk-adjustment process steps

A Risk Score Is The End Of A Longer Path

A risk-adjustment run rarely ends with the person who wrote the scoring command.

An analytics team receives a subject file and a diagnosis file for a new reporting cycle. The data may come from claims, eligibility records, internal extracts, or a prepared research dataset. Someone runs the scoring workflow and produces a table of subject-level risk scores. The file is useful. It may feed a quality model, a review queue, a payment analysis, a monitoring dashboard, or an internal validation package.

In this context, risk adjustment means using clinical, demographic, and eligibility information to account for differences in expected cost, utilization, outcomes, or reporting risk across patients or populations.

The workflow sounds simple when it is described by its endpoint. In practice, clinical risk scoring usually has a repeated shape. Subject records carry demographics and eligibility-like fields. Diagnosis records carry ICD codes and enough identifiers to attach those codes to people, encounters, claims, or sources. The scoring logic maps diagnosis codes into binary indicators such as HCCs, RxHCCs, ESRD categories, or comorbidity flags. Model-specific rules may apply hierarchies, demographic predictors, disease interactions, payment-status terms, or edits. A coefficient table then turns those predictors into risk-adjustment factors, comorbidity measures, or subject-level scores.

That pattern recurs across model families even when the details differ. The field names change, the mappings change, the coefficients change, and the policy context changes, but the computational path often has the same outline: normalize inputs, validate records, map diagnoses, generate predictors, apply rules, multiply by weights, produce scores, and preserve enough context to explain what happened.
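The outline above can be sketched in a few lines of Python. This is a toy illustration only, not the risk-compose implementation: the category names, the hierarchy rule, and the weights are made up for the example and do not correspond to any CMS or AHRQ table.

```python
from collections import defaultdict

# Illustrative tables only: these category names, hierarchy rules, and weights
# are invented for the sketch and are not real CMS or AHRQ values.
CODE_TO_CATEGORY = {"E119": "CAT_DIABETES", "E1165": "CAT_DIABETES_COMPLICATED"}
HIERARCHY = {"CAT_DIABETES_COMPLICATED": ("CAT_DIABETES",)}  # higher suppresses lower
WEIGHTS = {"CAT_DIABETES": 0.10, "CAT_DIABETES_COMPLICATED": 0.30}


def normalize(code: str) -> str:
    """Uppercase the code and drop dots and surrounding whitespace."""
    return code.strip().upper().replace(".", "")


def score(diagnoses: list[tuple[str, str]]) -> dict[str, float]:
    """Map (subject_id, code) rows to categories, apply hierarchies, sum weights."""
    categories: dict[str, set[str]] = defaultdict(set)
    for subject_id, raw_code in diagnoses:
        category = CODE_TO_CATEGORY.get(normalize(raw_code))
        if category is not None:  # unmapped codes silently contribute nothing here
            categories[subject_id].add(category)

    scores: dict[str, float] = {}
    for subject_id, found in categories.items():
        suppressed = {low for high in found for low in HIERARCHY.get(high, ())}
        kept = found - suppressed
        scores[subject_id] = round(sum(WEIGHTS[c] for c in kept), 3)
    return scores


example_scores = score(
    [("S1", "e11.9"), ("S1", "E11.65"), ("S2", "E11.9"), ("S2", "Z00.00")]
)
```

Note what this sketch throws away: the unmapped code, the suppressed category, and the per-category contributions all vanish, leaving only a number per subject. The rest of this article is about keeping them.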

Then the first review questions arrive:

Which model version produced this score? Which diagnoses mapped to categories? Which codes were rejected, ignored, or left unmapped? Which hierarchy rules removed predictors that appeared earlier in the workflow? Which interaction terms were added? Were there validation warnings about missing demographic fields, invalid flags, duplicate subject identifiers, or diagnosis rows attached to unknown subjects? If a stakeholder asks about one subject next month, can the team reconstruct the path from input records to final score?

Those questions are ordinary. They arise long before a scandal, a failed audit, or a dramatic production incident. They appear whenever risk-adjustment output enters a setting where another person has to review, explain, reproduce, or defend the result.

That is why a score table is only one artifact in a larger workflow. The score is the visible endpoint. The work before it includes input normalization, model selection, diagnosis mapping, validation, hierarchy logic, interaction logic, coefficient application, rounding, output formatting, and provenance tracking.

Fragmented Tools Make Similar Work Feel Different

The frustrating part is that similar operations often live behind very different software surfaces.

An official tool may be authoritative for a model release while still being awkward to call from a typed Python pipeline. An institutional workflow can be reliable inside one analytics shop and hard to reuse outside its local database, SAS environment, naming conventions, and review process. A notebook can be excellent for exploration and weak as a repeatable scoring surface. An open-source package may cover one model family, one language, or one output shape without also carrying batch scoring, review artifacts, dataframe adapters, and human-facing inspection tools.

The result is a lot of reintegration work around a familiar core. Teams glue together mapping files, scoring scripts, validation checks, review tables, command-line jobs, notebooks, and dashboard views. Each piece may be reasonable. The friction appears when the pieces have to agree about model version, input schema, validation behavior, artifact names, and what evidence remains after the run.

The gap is a reusable scoring surface that keeps model selection, validation behavior, intermediate artifacts, and review evidence in the same path.

That gap motivated risk-compose, a typed Python package for deterministic risk-adjustment scoring from subject and diagnosis data. The package is independent of CMS and AHRQ, and its output still requires independent validation.

The package is a useful case study because it makes one design pressure explicit: in healthcare analytics, usefulness and reviewability are tightly coupled. A tool has limited value in a review-heavy environment when it produces the correct-looking final table while hiding the path to that table. Exposing contracts, intermediate artifacts, validation behavior, and data provenance gives the team something more durable than a number.

The Engineering Discipline Is Part Of The Package

I wanted risk-compose to be expandable across clinical risk-adjustment metrics without turning every new model into another one-off script. That goal pushed the implementation toward ordinary software engineering practices that matter more when the output is reviewed by other people.

The domain contract has to be explicit. Subjects, diagnoses, scoring options, validation issues, table artifacts, predictor artifacts, score artifacts, and explanation bundles are named concepts in the package rather than incidental dictionaries moving through a script. Once those records exist, the rest of the design has a clearer place to attach behavior.

The deterministic scoring path also has to stay separate from its surfaces. Diagnosis mapping, validation, predictor generation, coefficient application, artifact export, and interface code belong in separate parts of the package rather than one large procedure. The scoring core should be usable without a browser UI. A CLI should call the same scoring path as the Python API. A terminal review interface and a Streamlit interface should review exported scoring outputs rather than inventing their own scoring rules.

Operational differences need the same treatment. Validation needs onboarding and production modes. Model data needs versioned provenance. Dataframe adapters should be optional because pandas, polars, and PySpark reflect different institutional constraints. Tests should verify behavior through public records and artifact tables, rather than only through private helper functions.

These choices decide whether the package can grow from CMS-HCC to ESRD, RxHCC, Elixhauser, and future risk-adjustment metrics without making each addition feel like a fresh application. A broader metric set still requires model-specific mappings, coefficients, rules, documentation, and validation. The shared package shape should absorb the repeated workflow: records in, model artifacts selected, predictors generated, scores computed, validation preserved, review artifacts exported.

Reviewability Is A Software Requirement

Risk-adjustment workflows sit in an uncomfortable part of the software world. They are often deterministic rather than statistical in the modeling sense. Given the same inputs, model version, and options, the output should be reproducible. At the same time, they are rarely trivial scripts. The domain rules are specific. The input data can be messy. The outputs can influence reporting, payment-adjacent analysis, model development, quality measurement, or retrospective review.

That combination changes the standard for a useful package. A convenient function that returns a numeric score is only the starting point. The function boundary needs to state what the software expects, which options control the run, what artifacts are returned, and how validation issues are represented.

In risk-compose, the central public API is organized around typed records and a typed scoring request:

from datetime import date

from risk_compose import DiagnosisRecord, ScoringRequest, SubjectRecord, score_subjects

request = ScoringRequest(
  subjects=(
    SubjectRecord(
      subject_id="B1",
      date_of_birth=date(1953, 5, 1),
      sex=2,
      original_reason_entitlement_code=0,
    ),
  ),
  diagnoses=(DiagnosisRecord(subject_id="B1", icd10_code="E119"),),
)

result = score_subjects(request)

This example is included for the shape of the boundary rather than as a full tutorial. Subjects and diagnoses are named domain records rather than anonymous dictionaries passed into an opaque scoring routine. Run-level behavior is carried by scoring options. The result is a bundle of predictor artifacts, score artifacts, validation issues, and model metadata rather than a lone floating-point value.

That shape matters before production data arrives. A team can review the required subject fields. It can decide how diagnosis rows should preserve claim identifiers, service dates, diagnosis sequence, source labels, or present-on-admission indicators. It can see which model version will be used by default and which options change the run. It can write tests against stable artifact tables instead of scraping incidental output.

Reviewability starts at the interface. If the interface is a loose collection of files, flags, and return values, then the review burden shifts into convention and memory. If the interface names the domain objects and artifacts directly, the software gives both humans and tests something firmer to inspect.

Types Make Domain Assumptions Harder To Hide

Python types, by themselves, make only a modest safety claim for healthcare software. A type annotation cannot decide whether a model is appropriate for a contract, whether upstream data was extracted correctly, or whether a score should be used for a particular regulatory purpose. Treating types as a safety guarantee would be another form of over-claiming.

The more modest claim is still valuable: explicit types make assumptions harder to smuggle past the reader.

Consider the difference between a row-shaped dictionary and a SubjectRecord. A dictionary can contain dob, birth_date, DOB, or no date of birth at all. It can contain a sex code whose meaning is known only to the upstream extract. It can include model-specific fields under project-specific names. Some of that flexibility is convenient during early exploration and expensive during review because the domain contract is implicit.

A SubjectRecord makes the package say what it expects:

a stable subject_id
a date_of_birth
a sex value
an original reason entitlement code where relevant
model-specific eligibility, Medicaid, ESRD, dual status, or institutional flags where the model requires them

A DiagnosisRecord does the same for diagnosis observations. At minimum it ties an ICD-10 code to a subject. Optional fields such as service date, claim ID, source, diagnosis sequence, and present-on-admission indicator remain visible rather than disappearing into an untyped row.

The same pattern applies to ScoringOptions, ScoringRequest, ValidationIssue, and TableArtifact. Options such as model version, MCE edits, strict validation, score contribution output, and rounding are run-level choices rather than ad hoc flags scattered through code. Validation issues have severity, code, message, subject, and field context. Table artifacts have names, columns, and rows.
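As a rough sketch of what "named concepts" can look like in Python, frozen dataclasses are one common choice. The class and field names below are hypothetical stand-ins chosen for this example; they are not the actual risk-compose definitions, which should be read from the package itself.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"


@dataclass(frozen=True)
class SubjectSketch:
    """Hypothetical subject record: required fields are explicit, not dictionary keys."""
    subject_id: str
    date_of_birth: date
    sex: int


@dataclass(frozen=True)
class IssueSketch:
    """Hypothetical validation issue with the context a reviewer needs for triage."""
    severity: Severity
    code: str
    message: str
    subject_id: Optional[str] = None
    field_name: Optional[str] = None


@dataclass(frozen=True)
class TableSketch:
    """Hypothetical table artifact: a name, ordered columns, and rows of equal width."""
    name: str
    columns: tuple[str, ...]
    rows: tuple[tuple[object, ...], ...]

    def __post_init__(self) -> None:
        for row in self.rows:
            if len(row) != len(self.columns):
                raise ValueError(f"{self.name}: row width does not match columns")
```

Freezing the records is a design choice rather than a requirement: it keeps a scoring run from mutating its own inputs, which makes reruns easier to compare.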

This is ordinary software design, which is exactly what high-stakes analytical code often needs. When the domain objects are explicit, a team can distinguish several kinds of problems that otherwise blur together:

missing data in the input feed
invalid data according to the package’s validation rules
model-specific required fields that were not supplied
developer mistakes in how the API was called
disagreements about whether the upstream data contract is adequate

Those are different problems. They require different responses. A missing date of birth in an onboarding feed calls for a different response than a code defect in the scoring core. An unknown diagnosis subject ID should surface as a validation issue rather than disappearing into a lower score. A model-specific field requirement should be visible at the run boundary before it appears as an unexplained downstream artifact.

Typed records leave those decisions with the team while making the software state where they enter.
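A minimal sketch of that idea, assuming issues are plain dictionaries and using invented issue codes (the real package's codes and record types may differ):

```python
from collections import Counter


def find_issues(
    subject_ids: list[str], diagnosis_rows: list[tuple[str, str]]
) -> list[dict[str, str]]:
    """Return structured issues instead of silently dropping problem rows."""
    issues: list[dict[str, str]] = []
    counts = Counter(subject_ids)
    for subject_id, seen in counts.items():
        if seen > 1:
            issues.append(
                {"code": "DUPLICATE_SUBJECT_ID", "subject_id": subject_id, "field": "subject_id"}
            )
    for subject_id, icd10_code in diagnosis_rows:
        if subject_id not in counts:
            # The diagnosis row points at a subject that was never supplied.
            issues.append(
                {"code": "UNKNOWN_DIAGNOSIS_SUBJECT", "subject_id": subject_id, "field": "subject_id"}
            )
        if not icd10_code.strip():
            issues.append(
                {"code": "MISSING_ICD10_CODE", "subject_id": subject_id, "field": "icd10_code"}
            )
    return issues


issues = find_issues(
    ["S1", "S2", "S2"],
    [("S1", "E119"), ("S9", "E119"), ("S2", " ")],
)
```

Each problem becomes a row a reviewer can filter on, rather than a subject whose score is quietly lower than it should be.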

Deterministic Artifacts Support Review

The most important review move in a scoring workflow is the ability to walk backward.

Start with a subject score. The reviewer wants to know which predictors contributed to it. For a diagnosis-derived predictor, the next question is which diagnosis codes mapped to the relevant category. If a category is absent, the reviewer may need to know whether the code was invalid, unmapped, rejected by an edit, suppressed by hierarchy logic, or never present in the input. If an interaction term appears, the reviewer needs to know which lower-level predictors made it eligible. If a score changed across runs, the team needs to know whether the inputs changed, the model version changed, or validation behavior changed.

That backward path is hard to reconstruct if the software only emits a final subject-level table. It is easier when the workflow emits stable intermediate artifacts:

subject_predictors.csv
subject_scores.csv
diagnosis_mappings.csv
score_contributions.csv
validation_issues.csv

The names are plain because the artifacts should be boring. A reviewer should be able to ask basic questions about a score before reverse engineering a clever internal object model. The predictor table shows generated demographic, diagnosis, hierarchy, interaction, and model-specific predictors. The score table shows totals by subject and score family. Diagnosis mappings preserve how diagnosis records relate to categories. Score contributions expose factor-level contributions to final scores. Validation issues keep the data problems found in non-strict runs in a shape that can be inspected or exported.
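Writing such tables needs nothing beyond the standard library. The helper below is a hypothetical sketch, not the package's exporter, and the column names in the usage line are assumptions made for the example.

```python
import csv
from pathlib import Path


def write_table(
    output_dir: Path, name: str, columns: list[str], rows: list[dict[str, object]]
) -> Path:
    """Write one artifact as <name>.csv with a fixed column order."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{name}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()  # the header is written even when there are no rows
        writer.writerows(rows)
    return path
```

A call such as `write_table(Path("out"), "validation_issues", ["severity", "code", "message", "subject_id", "field"], [])` still produces a file with a header row. That lets a reviewer tell "the run found no issues" apart from "the run never wrote the artifact".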

Subject-level explanation bundles extend the same idea. A one-subject review can include a subject summary, predictors, diagnosis mappings, hierarchy effects, interaction details, score contributions, subject scores, and RAF totals. The bundle gives the reviewer lineage through the workflow while leaving broader interpretation questions where they belong.

That distinction matters. In machine learning, “explainability” often carries a broad set of claims about model interpretation. Risk-adjustment scoring has a different review need. Much of the core logic is deterministic and rule-governed. The practical question is whether the software preserves the sequence of artifacts that produced the output. If the hierarchy removed a predictor, show it. If an interaction added a contribution, show it. If a diagnosis mapped to a category, show it. If a validation issue was found but non-strict mode allowed the run to continue, show it.
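"If the hierarchy removed a predictor, show it" can be made concrete with a small sketch. The function and category names here are hypothetical, and the hierarchy table is a toy; real model hierarchies are defined by the model's own documentation.

```python
def apply_hierarchy(
    found: set[str], hierarchy: dict[str, tuple[str, ...]]
) -> tuple[set[str], list[dict[str, str]]]:
    """Return the surviving categories plus a record of every suppression."""
    effects: list[dict[str, str]] = []
    for high in sorted(found):  # sorted so the effects table is deterministic
        for low in hierarchy.get(high, ()):
            if low in found:
                effects.append({"suppressed": low, "suppressed_by": high})
    kept = found - {effect["suppressed"] for effect in effects}
    return kept, effects


kept, effects = apply_hierarchy(
    {"CAT_A_SEVERE", "CAT_A_MILD", "CAT_B"},
    {"CAT_A_SEVERE": ("CAT_A_MILD",)},
)
```

The second return value is the point: the suppression is emitted as data that can be exported next to the predictors, instead of being a side effect that only the final score reflects.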

Stable artifacts also change how teams test software. Instead of testing only that a final score equals an expected number, tests can assert that the predictor schema is stable, diagnosis mappings are emitted, score contributions are present when requested, and validation issues use structured codes. That is useful for package development and for downstream institutions that need to build their own acceptance checks around a scoring workflow.
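An acceptance check of that kind might look like the sketch below. The expected column lists are assumptions invented for this example; an institution would pin them to whatever schema its chosen package version actually documents.

```python
import csv
import io

# Hypothetical expected schemas; not taken from the risk-compose documentation.
EXPECTED_COLUMNS = {
    "subject_scores": ("subject_id", "score_name", "score"),
    "validation_issues": ("severity", "code", "message", "subject_id", "field"),
}


def schema_problems(name: str, csv_text: str) -> list[str]:
    """Compare an artifact's header row with the expected columns for that table."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    expected = EXPECTED_COLUMNS[name]
    problems = [f"missing column: {c}" for c in expected if c not in header]
    problems += [f"unexpected column: {c}" for c in header if c not in expected]
    return problems
```

Returning a list of problems, rather than raising on the first one, lets the same check feed either a test assertion or a review report.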

When a stakeholder questions a subject’s score, the best time to generate the review materials was during the original run. Reconstructing them later from partial logs, remembered flags, and stale input files is much weaker.

Validation Needs Different Behavior Across Phases

Data work has at least two phases with different failure behavior.

During onboarding, a team often wants to learn about the shape of a new feed. It may expect missing fields, invalid flags, duplicate identifiers, malformed dates, unknown diagnosis subject IDs, or diagnosis rows that need upstream cleanup. Stopping at the first blocking issue can slow that work down because the team needs a fuller inventory of problems.

In production, the priority changes. The workflow should fail when blocking issues make the output unsafe to use. A score file that proceeds through serious data defects can create false confidence. The team may prefer no output to an output table that looks complete but rests on invalid assumptions.

That is why risk-compose supports non-strict and strict validation modes. Non-strict workflows emit validation artifacts so the team can inspect multiple issues in one run. Strict workflows fail on blocking validation issues, making them better suited for production scoring jobs.

The distinction is easier to see with concrete examples. During feed onboarding, a team may want to know that several subjects are missing date of birth, some sex values are invalid, one original reason entitlement code is outside the expected range, a diagnosis row references an unknown subject ID, some diagnosis rows have missing ICD-10 codes, and one subject ID appears twice. Seeing all of those issues together helps the team fix the upstream extract.

In a production scoring job, those same defects may be unacceptable. If the model requires date of birth, sex, entitlement status, or a model-specific flag, then continuing with a polished score table can be misleading. The validation mode should match the operating phase.
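One way to express the two modes is a single enforcement step that receives the full issue list and a flag. This is a sketch under assumed names (`enforce`, `BlockingValidationError`, and the issue codes are hypothetical), not a description of how risk-compose implements its modes.

```python
class BlockingValidationError(Exception):
    """Raised in strict mode; carries the blocking issues so the caller can log them."""

    def __init__(self, blocking: list[dict[str, str]]) -> None:
        super().__init__(f"{len(blocking)} blocking validation issue(s)")
        self.blocking = blocking


def enforce(issues: list[dict[str, str]], strict: bool) -> list[dict[str, str]]:
    """Fail on blocking issues in strict mode; otherwise return every issue."""
    blocking = [issue for issue in issues if issue["severity"] == "error"]
    if strict and blocking:
        raise BlockingValidationError(blocking)
    return issues  # non-strict: hand back the full inventory for export


found = [
    {"severity": "error", "code": "MISSING_DOB", "subject_id": "S1"},
    {"severity": "warning", "code": "UNMAPPED_CODE", "subject_id": "S2"},
]
```

Validation itself runs the same way in both phases. Only the reaction differs, which keeps onboarding reports and production failures consistent with each other.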

This design also helps teams avoid a common trap in analytical software: treating validation as either a hard gate everywhere or a warning system everywhere. Both extremes are clumsy. Hard failure everywhere makes early data integration frustrating. Warnings everywhere make production enforcement too easy to ignore. A reviewable package should support both behaviors and make the choice explicit at the run boundary.

The output artifact matters here as much as the failure behavior. In non-strict mode, validation issues should be exported with enough structure to support triage: severity, code, message, subject ID, and field name where applicable. A team that needs to assign cleanup work, compare feed quality across runs, or document why a file was rejected needs more than a prose warning printed to a terminal.
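With structured issues, triage becomes a small aggregation. The sketch below assumes issues are dictionaries with `severity` and `code` keys, as in a CSV export read back with `csv.DictReader`; the codes shown in the tests are invented.

```python
from collections import Counter


def triage(issues: list[dict[str, str]]) -> list[dict[str, object]]:
    """Count issues per (severity, code), most frequent first, ties in sorted order."""
    counts = Counter((issue["severity"], issue["code"]) for issue in issues)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"severity": severity, "code": code, "count": count}
        for (severity, code), count in ordered
    ]
```

The same summary computed on two runs gives a simple way to compare feed quality over time.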

Every Surface, One Scoring Path

Risk-adjustment scoring happens across several environments.

Some users want a Python API because scoring is part of a larger analytical pipeline. Some want a command-line interface because scoring runs inside batch jobs, shell scripts, CI checks, or scheduled workflows. Some want a terminal review interface because they live close to the data and want fast inspection without a browser. Some want a browser-based interface because the review audience is broader than the engineering team.

risk-compose exposes those surfaces through the Python package, command-line workflows, a terminal interface, and a Streamlit/browser interface. Optional dataframe adapters keep pandas, polars, and PySpark support out of the default install path. That matters because dataframe engines carry real dependency weight and often reflect institutional constraints. A team using PySpark in a distributed environment has different needs from a team scoring a small CSV extract with pandas.
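A common way to keep a dataframe engine optional is to import it only inside the adapter that needs it. The sketch below shows that pattern for pandas; the function names are hypothetical and this is not the package's adapter code.

```python
from importlib.util import find_spec


def available_adapters() -> list[str]:
    """Report which optional dataframe engines are installed, without importing them."""
    return [name for name in ("pandas", "polars", "pyspark") if find_spec(name) is not None]


def to_records(frame: object) -> list[dict]:
    """Convert a pandas DataFrame to row dictionaries, importing pandas only on use."""
    try:
        import pandas as pd  # deferred so the core install does not require pandas
    except ImportError as exc:
        raise RuntimeError("pandas adapter requested but pandas is not installed") from exc
    if not isinstance(frame, pd.DataFrame):
        raise TypeError("expected a pandas DataFrame")
    return frame.to_dict(orient="records")
```

Users who never call the adapter never pay for the dependency, and those who call it without the engine installed get an explicit error rather than a failure at package import time.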

The command-line workflow keeps the batch case simple:

risk-compose score \
  --subjects subjects.csv \
  --diagnoses diagnoses.csv \
  --output-dir out/score \
  --model-version cms_hcc_v28_2026 \
  --strict

The same design principle applies here as in the Python API. The command should write the review artifacts that make the run inspectable along with the final score table: predictors, scores, diagnosis mappings, score contributions, and validation issues where relevant.

The TUI and GUI are review surfaces over exported artifacts and scoring outputs rather than separate scoring concepts. That distinction is important. If every interface reimplemented its own scoring logic, the package would become harder to validate. A better design keeps the scoring core deterministic and lets different interfaces serve different human workflows.
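The "one scoring path" principle can be shown with a toy command-line wrapper. Everything here is a stand-in: `score_rows` is a placeholder core, and the `toy-score` program name and its arguments are invented for the sketch rather than taken from the risk-compose CLI.

```python
import argparse


def score_rows(codes: list[str], strict: bool) -> dict[str, object]:
    """Stand-in scoring core; a real core would map codes and apply weights."""
    malformed = [code for code in codes if not code.isalnum()]
    if strict and malformed:
        raise ValueError(f"blocking issues: {malformed}")
    return {"n_codes": len(codes), "issues": malformed}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="toy-score")
    parser.add_argument("codes", nargs="*", help="diagnosis codes to score")
    parser.add_argument("--strict", action="store_true", help="fail on blocking issues")
    return parser


def main(argv: list[str]) -> dict[str, object]:
    """CLI entry point: parse arguments, then call the same function the API exposes."""
    args = build_parser().parse_args(argv)
    return score_rows(args.codes, strict=args.strict)
```

Because `main` only translates arguments, a test can assert that the CLI and the direct call return identical results for the same inputs.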

The same artifact discipline also helps when a coding agent participates in the workflow. An agent can call the typed API, run the CLI, inspect CSV artifacts, summarize validation issues, or draft tests against stable table names. The supervision point is that the human reviewer does not have to accept the agent’s summary as the evidence. The same exported artifacts remain available in the terminal, browser, or downstream review process.

This is also where dependency discipline enters the design. Analytical packages often grow by adding convenient surfaces until every user pays for every dependency. That creates friction in production environments, especially where deployment images, security review, or Python compatibility constraints matter. Optional dataframe adapters and separate review surfaces make the package easier to fit into different workflows without pretending that every user has the same environment.

The practical consequence is simple: install and use the surface that matches the job. A pipeline may only need the Python API and CSV artifacts. A data validation analyst may want the CLI plus exported validation tables. A reviewer may prefer the terminal or browser interface. The scoring core should remain the common source of behavior.

Provenance Helps, But Authority Still Lives Elsewhere

Packaged runtime data is convenient. It also creates a responsibility to be clear about where the package’s authority ends.

risk-compose includes curated runtime tables derived from public CMS and AHRQ materials. That improves reproducibility because a user can install a package version and run against packaged model artifacts rather than manually assembling every table for every environment. The exact supported model list belongs in the package documentation and release notes; the design point here is that tests and examples can refer to named model artifacts instead of loose local files.

Packaged artifacts leave official authority elsewhere. The PyPI project description states that risk-compose is independent and not affiliated with, endorsed by, or associated with CMS or AHRQ. Users still need to review upstream terms, official documentation, regulatory guidance, contract-specific rules, and their own validation requirements.

That boundary is part of responsible software design. Provenance is more than a decorative disclaimer. It tells the user what kind of trust the package is asking for.

The package can aim to make accidental variation less likely by preserving versioned runtime artifacts, typed inputs, deterministic outputs, validation issues, review bundles, and an inspectable scoring path. Decisions about regulated use, official model interpretation, and independent validation still live with the organization using the software.
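One lightweight way to make a run's provenance checkable is to record a content hash of each runtime table alongside the model version and options. The sketch below is an assumption about how that could be done with the standard library; the function names, the manifest keys, and the `toy_model_v1` label are hypothetical.

```python
import hashlib
import json


def table_fingerprint(rows: list[tuple[str, str]]) -> str:
    """Hash a table's rows in sorted order so row order does not change the result."""
    canonical = json.dumps(sorted(rows), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def run_manifest(
    model_version: str, mapping_rows: list[tuple[str, str]], options: dict[str, object]
) -> dict[str, object]:
    """Collect what is needed to tell whether two runs used the same model inputs."""
    return {
        "model_version": model_version,
        "mapping_sha256": table_fingerprint(mapping_rows),
        "options": dict(sorted(options.items())),
    }
```

Comparing two manifests answers a narrow question: did the tables, the version label, or the options differ between runs? It does not say whether the tables are the right ones, which remains a validation decision for the organization.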

Good tooling reduces avoidable uncertainty while making remaining uncertainty visible. That is the right posture for healthcare analytics software. The package should help users reproduce and review a run rather than invite them to outsource judgment to the package.

The Useful Package Preserves The Path

The opening scenario is ordinary because review is ordinary.

A score lands in a report. A reviewer asks why it changed. A subject-level value looks surprising. A validation analyst wants to know whether missing fields were concentrated in one source file. An engineer wants to confirm that a model version was pinned. A stakeholder asks whether the same run can be reproduced after the next package release.

The useful package is the one that expects those questions.

In risk-adjustment workflows, the engineering details are the capability. A final score is useful because it can be connected back to the inputs, mappings, rules, coefficients, options, and validation behavior that produced it. Typed records name what the software expects. Deterministic artifacts preserve the backward walk. Structured validation issues keep data problems visible rather than absorbed into a quieter output. When that path is preserved, another person can inspect the run without relying on the original developer’s memory.

That is the standard I wanted risk-compose to meet. It is one implementation of a broader engineering discipline rather than a claim to replace official software or guidance. If software is going to participate in healthcare analytics work where outputs are reviewed, questioned, reproduced, and bounded by policy, then reviewability should be part of the design from the start.

The score is one artifact. The workflow is what has to hold up.
