One week ago, I wrote about why I’m learning F# — a language I may never use professionally, but one I believe will change how I think in the languages I do use. The thesis was simple: F# might be the sweet spot for researchers who need readable, concise, and safe code without paying Rust’s memory-management tax or Haskell’s conceptual overhead.

That was speculation, and only weakly educated speculation at that. This is what happened when I tried to prove it.

F# and Statistics


Finding the Right First Project

I needed something that would test F# in my actual domain — numerical and statistical programming — without being so ambitious that the language learning would drown in project complexity.

The criteria:

  • Meaningful: Not a toy example. Something I’d genuinely want in my toolkit.
  • Scoped: Completable in a focused week, not a months-long odyssey.
  • Familiar territory: Constrained optimization, likelihood-based inference, matrix operations — problems I’ve solved before in Python and Rust.

My background includes (but is not limited to) several projects involving optimization and penalized regression. That pointed toward a specific class of problems: maximum likelihood estimation with side constraints.

Then I came across a publication from last year that fit perfectly.


The Paper: Constrained Maximum Likelihood Estimation

The method comes from Cao et al. (2024):

A constrained maximum likelihood approach to developing well-calibrated models for predicting binary outcomes.
Lifetime Data Analysis, 30, 624–648.

The paper addresses a problem I’ve encountered repeatedly in population health research:

You have a well-calibrated base model $\phi(X)$ for predicting some outcome. You want to improve it by adding a new predictor $Z$. But your training data comes from a different population than your target population — and if you just fit a standard logistic regression on $(X, Z)$, the resulting model will be miscalibrated where it matters most.

This is covariate shift, and it’s endemic to healthcare analytics. Your training data is whoever showed up at a particular clinic; your target population is the broader population you actually want to serve.

What Goes Wrong Without Constraints

Standard maximum likelihood assumes your training distribution matches your target distribution:

$$ P_S(X, Z, Y) \approx P(X, Z, Y) $$

When this assumption fails — different covariate distributions, different outcome prevalence, or both — the fitted model can dramatically misestimate risk, especially in the tails. A model that looks good on average can be dangerously wrong for high-risk patients.

The c-MLE Solution

The key insight is to constrain the optimization. Instead of just maximizing likelihood, you maximize likelihood subject to calibration constraints derived from a trusted base model.

Concretely, you partition the risk spectrum into intervals based on the base model $\phi(X)$, then enforce that your new model’s predictions agree with the base model’s calibration on average within each interval:

$$ (1 - \delta)\,\pi_r \le \mathbb{E}\Big[g_\theta(X, Z) \mid \phi(X) \in I_r\Big] \le (1 + \delta)\,\pi_r $$

where $\pi_r$ is the true average risk in interval $I_r$ and $\delta$ is a tolerance parameter.

This forces the new model to inherit the base model’s calibration in the target population while still benefiting from the additional predictor $Z$.
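
Expressed in code, each interval contributes two signed quantities that must be non-positive. A minimal sketch (interval_violations is a hypothetical helper; e_r denotes the conditional expectation on the left-hand side above):

// Hypothetical helper: the two signed violations for one interval.
// e_r is E[g_theta(X, Z) | phi(X) in I_r]; both values are <= 0 when the
// calibration constraint holds.
let interval_violations (delta: float) (pi_r: float) (e_r: float) =
  [ e_r - (1.0 + delta) * pi_r     // upper bound, positive when violated
    (1.0 - delta) * pi_r - e_r ]   // lower bound, positive when violated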


How c-MLE Works

The implementation requires three interlocking pieces.

1. Joint Likelihood

We model both the outcome $Y$ and the new predictor $Z$ given $X$:

$$ \mathcal{L}(\theta, \tau, \sigma) = \sum_i \Big[\log P_\theta(Y_i \mid X_i, Z_i) + \log f_{\tau,\sigma}(Z_i \mid X_i)\Big] $$

The first term is standard logistic regression. The second term models the conditional distribution of $Z \mid X$ — necessary because we need to compute expectations over $Z$ when evaluating the calibration constraints.
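
As a sketch, one observation's contribution looks like this (Parameters is the domain record shown later under Implementation Architecture; log_density_z is a hypothetical name for the Z | X log-density described next):

// Per-observation joint log-likelihood: Bernoulli term + Z | X term.
let log_likelihood_i (p: Parameters) (xi: Vector<float>) (zi: float) (yi: float) =
  let g = Fitting.predict_risk p xi zi                    // P_theta(Y = 1 | X, Z)
  let bernoulli = yi * log g + (1.0 - yi) * log (1.0 - g)
  bernoulli + log_density_z p.density xi zi               // log f_{tau,sigma}(Z | X)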

2. Conditional Model for $Z \mid X$

Following the paper, I model:

$$ \log(Z) \sim \mathcal{N}(\tau^\top [1, X], \sigma^2), \quad \text{truncated to } (-\infty, 0] $$

This places $Z \in (0, 1]$ — appropriate for the paper’s application (breast density) and many other bounded predictors.
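
The log-density has three parts: the normal log-pdf of log Z, a renormalization by the truncated mass, and the Jacobian of the log transform. A sketch using MathNet's Normal, with mu = tau^T [1, x] computed beforehand:

// log f(z | x) for the truncated log-normal.
let log_density_z_sketch (mu: float) (sigma: float) (z: float) =
  let logZ = Math.Log z
  let log_pdf = Normal.PDFLn(mu, sigma, logZ)           // normal log-pdf of log z
  let log_mass = Math.Log(Normal.CDF(mu, sigma, 0.0))   // mass of the kept region, log z <= 0
  log_pdf - log_mass - logZ                             // minus log z, the Jacobian of z = exp(log z)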

3. Penalty-Based Constraint Enforcement

Rather than solving a constrained optimization problem directly, we convert it to an unconstrained problem via penalty:

$$ J(\theta) = -\mathcal{L}(\theta) + \rho \sum_r \max(0, \text{violation}_r)^2 $$

This is less elegant than Lagrangian methods but far simpler to implement — and simplicity matters when the goal is learning a new language.
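
As a sketch, the penalized objective is just a fold over the violations (neg_log_likelihood and constraint_violations are hypothetical names for functions the Fitting and Constraints modules would provide):

// rho is the penalty weight from the formula above.
let penalized_objective (rho: float) theta =
  let penalty =
    constraint_violations theta                 // signed g_r values, <= 0 when satisfied
    |> List.sumBy (fun g -> (max 0.0 g) ** 2.0)
  neg_log_likelihood theta + rho * penalty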


Implementation Architecture

The F# implementation follows a “functional core, imperative shell” pattern:

| Module | Responsibility | Purity |
| --- | --- | --- |
| MathHelpers | Sigmoid, truncated log-normal density, Monte Carlo sampling | Pure |
| Fitting | Joint log-likelihood, risk prediction | Pure |
| Constraints | Calibration constraint evaluation | Pure |
| ParameterOps | Vector ↔ record conversion for optimization | Pure |
| Optimization | BFGS wrapper, numerical gradients | Impure |

The pure core contains all the statistical logic. The impure shell handles optimization iteration, logging, and I/O. This separation isn’t just aesthetic — it makes the statistical code testable and the optimizer swappable.

Domain Types as Documentation

The type definitions directly mirror the statistical structure:

type Parameters = {
  beta_0: float
  beta_x: Vector<float>
  beta_z: float
  density: ConditionalDensity
}

type CalibrationTarget = {
  intervals: Interval list
  expected_risks: float list
  tolerance: float
  x_distribution: (Vector<float> * float) list
}

Anyone familiar with the paper can read these types and understand the implementation’s structure. The types aren’t incidental — they are the specification.
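
The Interval type isn't shown above, but its shape is implied by how it's used. A plausible definition, with contains as a function-valued member so it can be passed directly to List.filter and Array.filter later (my guess at the shape, not a verbatim excerpt):

type Interval =
  { lower: float
    upper: float }
  // function-valued property, so interval.contains is usable first-class
  member this.contains = fun r -> this.lower < r && r <= this.upper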


Key Implementation Patterns

A few excerpts that illustrate what F# brings to this problem.

Railway-Oriented Error Handling

The main API uses a result computation expression:

let fit_constrained_model
  (data: TrainingData)
  (calib: CalibrationTarget)
  (base_model: Vector<float> -> float)
  : Result<Parameters, string> =

  result {
    let! density =
      Fitting.fit_conditional_density data.x data.z

    let initial = Optimization.initialize_parameters data density

    let! final_params =
      Optimization.optimize data calib base_model initial

    return final_params
  }

Each step can fail. The let! syntax short-circuits on error, threading failures through without explicit checking. Compare to Python:

def fit_constrained_model(data, calib, base_model):
  density = fit_conditional_density(data.x, data.z)
  if density is None:
    return None
  
  initial = initialize_parameters(data, density)
  
  final_params = optimize(data, calib, base_model, initial)
  if final_params is None:
    return None
  
  return final_params

The Python version is readable, but the error handling is manual and easy to forget. The F# version makes failure paths explicit in the type system — if a function returns Result<'a, 'e>, you must handle the error case.
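
At the call site, the compiler won't let you forget either branch. A minimal usage sketch:

// Pattern matching forces both outcomes to be handled explicitly.
match fit_constrained_model data calib base_model with
| Ok fitted -> printfn "Fitted Beta_Z: %.4f" fitted.beta_z
| Error msg -> eprintfn "Fit failed: %s" msg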

Pipelines That Read Like Math

Constraint evaluation flows naturally:

let compute_expected_risk_in_interval params' interval x_dist base_model rng =
  x_dist
  |> List.filter (fun (x, _) -> interval.contains (base_model x))
  |> function
     | [] -> 0.0
     | items ->
       items
       |> List.map (fun (xi, weight) ->
         let z_samples =
           MathHelpers.sample_truncated params'.density xi 100 rng
         let avg_risk =
           z_samples.ToArray()
           |> Array.averageBy (fun zi ->
             Fitting.predict_risk params' xi zi)
         avg_risk * weight, weight)
       |> List.reduce (fun (r1, w1) (r2, w2) -> (r1 + r2, w1 + w2))
       |> fun (total_risk, total_weight) ->
         if total_weight = 0.0 then 0.0
         else total_risk / total_weight

The data flows top to bottom. Each transformation is explicit. The structure mirrors the mathematical operation: filter to the relevant interval, compute weighted average risk via Monte Carlo, aggregate.

Tail Recursion for Rejection Sampling

Sampling from a truncated distribution requires rejection sampling — repeatedly draw until you get a valid sample. In Python, this would be a while loop with mutable state. In F#, tail recursion is idiomatic:

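// Assumed bindings from the surrounding module (not shown in this excerpt):
// norm_dist is the fitted Normal(mu, sigma) for log Z given X, and clamp_z
// keeps z inside (0, 1].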
let rec sample_one attempts =
  if attempts >= 1000 then
    1e-10  // fallback in degenerate cases
  else
    let y = norm_dist.Sample()
    if y <= 0.0 then
      y
      |> Math.Exp
      |> clamp_z
    else
      sample_one (attempts + 1)

The recursive structure expresses the algorithm directly: sample, check, retry if needed. F# optimizes tail recursion to a loop, so there’s no stack overhead — you get the clarity of recursion with the performance of iteration.


Making It Actually Work: Iterative Improvements

The first version compiled and ran. It also produced nonsense.

This is the reality of numerical programming: a correct-looking implementation can fail silently when the optimizer fights against Monte Carlo noise, penalty weights are miscalibrated, or numerical edge cases create cliffs in the objective landscape.

Here’s what I fixed — and why it mattered.

Problem 1: Monte Carlo Variance Was Destabilizing the Optimizer

The original implementation sampled fresh random numbers every time the constraint function was evaluated. This meant the objective function itself was stochastic — BFGS would compute a gradient, take a step, and find the objective had changed not because of the step but because of sampling noise.

Solution: Common Random Numbers (CRN)

I pre-generate and freeze the noise vectors in an IntegrationContext:

type IntegrationContext = {
  noise_vectors: Vector<float> list
  n_integration_samples: int
}

let create_integration_context target_x n_samples seed =
  let rng = Random(seed)
  let noise =
    target_x
    |> List.map (fun _ ->
      let samples = Array.init n_samples (fun _ -> rng.NextDouble())
      Vector.Build.DenseOfArray samples)
  { noise_vectors = noise; n_integration_samples = n_samples }

Now the same uniform noise is reused across all optimization iterations. The objective function becomes deterministic for a given parameter vector, and the optimizer can make consistent progress.
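
Creating the context once per fit, with a fixed seed, makes every objective evaluation reproducible. A usage sketch with illustrative argument values:

// 100 integration samples per target covariate row, fixed seed for reproducibility.
let ctx = create_integration_context target_x 100 42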

Problem 2: Rejection Sampling Was Inefficient and Non-Deterministic

The original sampler used rejection sampling with a while loop. Combined with the variance problem above, this created unpredictable behavior.

Solution: Inverse Transform Sampling

For a truncated normal, you can sample directly via the inverse CDF:

let sample_truncated_deterministic density (x: Vector<float>) (u_noise: Vector<float>) =
  let x_aug = Vector<float>.Build.DenseOfArray(Array.append [| 1.0 |] (x.ToArray()))
  let mu = density.tau_mean * x_aug        // tau^T [1; x]
  let sigma = density.sigma                // assumes the fitted density carries sigma
  let p_max = Normal.CDF(mu, sigma, 0.0)   // CDF mass below the truncation point log z = 0

  u_noise.Map (fun u ->
    let p_scaled = u * p_max               // rescale U(0,1) into the truncated region
    let log_z = Normal.InvCDF(mu, sigma, max p_scaled 1e-12)
    Math.Exp(log_z) |> max 1e-12 |> min (1.0 - 1e-12))

This maps pre-generated uniform noise to the truncated distribution deterministically. No rejection, no variance, no surprises.
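
Wiring the two pieces together, each target covariate row gets its own frozen noise vector (a sketch; names follow the snippets above):

// One deterministic vector of Z samples per target x, reusing the frozen noise.
let z_samples_per_x =
  List.map2
    (fun x u -> sample_truncated_deterministic density x u)
    target_x ctx.noise_vectors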

Problem 3: Fixed Penalty Weight Couldn’t Balance Likelihood and Constraints

This problem requires some context. True c-MLE is a constrained optimization problem:

$$ \max_\theta \mathcal{L}(\theta) \quad \text{subject to} \quad g_r(\theta) \le 0, \quad r = 1, \dots, R $$

where $g_r(\theta)$ encodes the calibration constraints (upper and lower bounds on expected risk per interval).

The Theoretically Correct Approach: KKT Conditions

The Karush-Kuhn-Tucker (KKT) conditions characterize the solution. At the optimum $\theta^*$, there exist multipliers $\lambda_r \ge 0$ such that:

$$ \nabla \mathcal{L}(\theta^*) = \sum_r \lambda_r \nabla g_r(\theta^*) $$

with complementary slackness: $\lambda_r g_r(\theta^*) = 0$ for each $r$.

Intuitively, the gradient of the objective is a linear combination of constraint gradients — you can’t improve the objective without violating some binding constraint.

Why This Requires SQP

Solving the KKT system directly is intractable for nonlinear problems. The standard approach is Sequential Quadratic Programming (SQP), which iteratively:

  1. Approximates the Lagrangian Hessian
  2. Linearizes constraints around the current point
  3. Solves a Quadratic Programming (QP) subproblem to get a search direction
  4. Updates both parameters $\theta$ and multipliers $\lambda$

SQP is the workhorse of nonlinear constrained optimization — it’s what you’d use in MATLAB’s fmincon, Python’s scipy.optimize.minimize(..., method='SLSQP'), or dedicated solvers like IPOPT and SNOPT.
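
Concretely, writing $f = -\mathcal{L}$ for the objective to minimize, the QP subproblem at iterate $\theta_k$ takes the standard form:

$$ \min_{d} \; \tfrac{1}{2}\, d^\top B_k\, d + \nabla f(\theta_k)^\top d \quad \text{subject to} \quad g_r(\theta_k) + \nabla g_r(\theta_k)^\top d \le 0, \quad r = 1, \dots, R $$

where $B_k$ approximates the Lagrangian Hessian and $d$ is the search direction.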

MathNet Doesn’t Have SQP

MathNet.Numerics provides excellent unconstrained optimizers — BFGS, conjugate gradient, Newton-Raphson — but no constrained solvers. This is a common gap in numerical libraries outside the optimization-specialist ecosystem.

So I needed an alternative.

Penalty Methods: The Classic Workaround

The textbook solution is to convert the constrained problem to an unconstrained one:

$$ \min_\theta \Big[ -\mathcal{L}(\theta) + \rho \sum_r \max(0, g_r(\theta))^2 \Big] $$

The penalty term $\rho \sum_r \max(0, g_r)^2$ adds a cost for constraint violations. As $\rho \to \infty$, the penalized solution approaches the true constrained solution.

The catch: a fixed $\rho$ is a blunt instrument. Too small and constraints are ignored. Too large and the optimizer fixates on constraints before finding a good likelihood region — or worse, the problem becomes ill-conditioned and BFGS fails to converge.

Solution: Penalty Scheduling

Instead of picking a single $\rho$, I start small and increase geometrically:

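// Assumed bindings (defined elsewhere in the module): run_bfgs performs one
// unconstrained BFGS solve at a fixed rho, and max_penalty_weight caps the
// schedule. The run shown later uses penalty_multiplier = 5.0 and
// violation_tolerance = 1e-4.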
let rec schedule_loop current_params rho iteration =
  if rho > max_penalty_weight then current_params
  else
    let next_params = run_bfgs current_params rho
    let _, max_viol = Constraints.calculate_violations next_params ...
    
    if max_viol < violation_tolerance then next_params
    else schedule_loop next_params (rho * penalty_multiplier) (iteration + 1)

The optimizer starts with $\rho = 0.1$, finds a good likelihood region, then progressively tightens constraints by multiplying $\rho$ by 5 each iteration until violations fall below tolerance.

This is a simplified form of the augmented Lagrangian method (also called the method of multipliers). A full implementation would also update Lagrange multiplier estimates between iterations, but even this basic penalty schedule works far better than a fixed weight — and it’s implementable with any unconstrained optimizer.

Problem 4: Numerical Cliffs in the Objective

The original implementation returned -1e10 whenever a density evaluation produced an invalid value. That hard cutoff creates a cliff in the objective landscape: the optimizer can't compute meaningful gradients near the cliff, so it gets stuck or diverges.

Solution: Soft Floors

Instead of hard cutoffs, I use soft floors that create steep slopes rather than cliffs:

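// Clamp to a tiny positive value instead of returning -1e10: the log stays
// finite, and the gradient, while steep, still exists.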
let safe_pdf = max 1e-100 pdf_val
let safe_cdf = max 1e-100 cdf_upper
Math.Log safe_pdf - Math.Log safe_cdf - Math.Log z_clamped

The gradient is enormous in bad regions, but it exists — the optimizer can escape.


Watching It Converge

Running the example produces this output:

[INFO] Fitted Z|X density: sigma = 0.1718
[INFO] Schedule Iteration 1: rho = 0.1
[INFO]   -> Max Constraint Violation: 0.162976
[INFO] Schedule Iteration 2: rho = 0.5
[INFO]   -> Max Constraint Violation: 0.105362
[INFO] Schedule Iteration 3: rho = 2.5
[INFO]   -> Max Constraint Violation: 0.042357
[INFO] Schedule Iteration 4: rho = 12.5
[INFO]   -> Max Constraint Violation: 0.013667
[INFO] Schedule Iteration 5: rho = 62.5
[INFO]   -> Max Constraint Violation: 0.004026
[INFO] Schedule Iteration 6: rho = 312.5
[INFO]   -> Max Constraint Violation: 0.000806
[INFO] Schedule Iteration 7: rho = 1562.5
[INFO]   -> Max Constraint Violation: 0.000164
[INFO] Schedule Iteration 8: rho = 7812.5
[INFO]   -> Max Constraint Violation: 0.000000
[INFO] Constraints satisfied.
Success! Beta_Z: 0.9778 (True ~1.0)
Density Sigma: 0.3661

The story is visible in the numbers:

  1. Initial fit: The density model estimates $\sigma \approx 0.17$ from the data via OLS on $\log Z$.

  2. Early iterations ($\rho = 0.1$ to $2.5$): The optimizer focuses primarily on likelihood, with constraints as a gentle nudge. Violations drop from 16% to 4%.

  3. Middle iterations ($\rho = 12.5$ to $312.5$): Constraints become dominant. The optimizer sacrifices some likelihood fit to satisfy calibration bounds. Violations drop to 0.08%.

  4. Final iterations ($\rho = 1562.5$ to $7812.5$): Fine-tuning. Violations drop below the $10^{-4}$ tolerance and the loop terminates.

  5. Result: $\hat{\beta}_z = 0.9778$ against a true value of $1.0$ — a 2.2% relative error. The model is both well-fitted and calibrated.

This is exactly what the penalty scheduling approach is designed to produce: find a good solution first, then enforce constraints progressively rather than all at once.


Lessons Learned

F# Let Me Focus on the Logic

This was the main thesis, and it held up. Implementing c-MLE, I spent my cognitive budget on:

  • Likelihood functions and their gradients
  • Constraint geometry and penalty formulation
  • Domain modeling — how parameters, data, and targets relate

I did not spend time on:

  • Defensive None checks scattered through the codebase
  • Debugging silent NaN propagation from type mismatches
  • Tracing runtime errors back to incorrect assumptions

In Python, I think about logic and defensive coding simultaneously. In Rust, I think about logic and ownership simultaneously. In F# — for this class of problem — I just thought about the logic.

The Ecosystem Gap Was Smaller Than I Feared

I expected to miss NumPy and SciPy constantly. I didn’t.

MathNet.Numerics covered everything c-MLE needed: dense matrices, linear solvers, probability distributions, and BFGS optimization. The API is clean, the documentation is adequate, and performance was never a bottleneck for this problem.

This won’t hold universally. If I needed GPU acceleration, deep learning primitives, or the full scipy.stats menagerie, I’d be reaching for Python. But for algorithm-focused work at moderate scale — implementing a paper, prototyping a method, building a simulation — the gap is narrower than I expected.

The Right Tool Depends on What You’re Building

Here’s the distinction that crystallized for me:

Python’s ecosystem is unbeatable when you’re calling established methods. Need logistic regression with L2 regularization? LogisticRegression(penalty='l2').fit(X, y) and you’re done. The ecosystem does the work.

But when you’re implementing novel methods — translating a recent paper into working code, extending an algorithm, building something that doesn’t exist in a library yet — you’re not composing APIs. You’re writing the logic those APIs would eventually wrap.

That’s where F# earns its keep.

Python’s ecosystem is the paved road. When you’re going where the roads don’t reach yet, you need different equipment.

Types Caught Real Bugs

Two specific examples:

  1. I initially had CalibrationTarget.expected_risks as float[] instead of float list. The type mismatch when I tried to use List.indexed forced me to think about whether I wanted array semantics or list semantics — and I realized list was correct for the access patterns I needed.

  2. The Parameters record made it impossible to accidentally pass density parameters where logistic parameters were expected. In Python, both would be np.ndarray, and the bug would surface as wrong results, not a type error.

These aren’t dramatic saves. They’re small friction points that, in Python, would have become debugging sessions later.

Ionide Is Genuinely Excellent

The tooling exceeded expectations. Type inference feedback is near-instant. Errors are precise. Refactoring works reliably. The REPL integration lets you test functions interactively without the ceremony of a full test harness.

Coming from rust-analyzer (which is excellent but must reason about lifetimes and borrow checking), Ionide feels faster — not because it’s better engineered, but because F#’s type system is simpler. The tooling benefits from the language’s design.

Imperative First, Functional Later

My actual workflow wasn’t “think functionally from the start.” It was:

  1. Write something that works, often imperatively
  2. Notice the shape of the computation
  3. Refactor toward pipelines or recursion when the pattern clarifies

For example, my first version of computing expected risks per calibration interval looked like this:

let expected_risks =
  intervals
  |> List.map (fun interval ->
    let mutable s = 0.0
    let mutable c = 0.0
    for i in 0 .. n_target - 1 do
      let r = base_risks.[i]
      if interval.contains r then
        s <- s + r
        c <- c + 1.0
    if c = 0.0 then 0.0 else s / c
  )

It worked. Once I saw the pattern — filter, then average — the functional version was obvious:

let expected_risks =
  intervals
  |> List.map (fun interval ->
    base_risks
    |> Array.filter interval.contains
    |> function
       | [||] -> 0.0
       | arr  -> Array.average arr
  )

The same progression happened with rejection sampling: a while-not-done loop became tail recursion once I understood the algorithm well enough to express it declaratively.

F# accommodates this gracefully. mutable exists. for loops exist. You can write Python-shaped code, get it working, then incrementally tighten it. The language doesn’t demand purity — it just makes purity available when you’re ready.

This matters for learning. I wasn’t fighting the language while also figuring out the algorithm. I could fall back on familiar patterns, then level up the code once the logic was solid.


What This Project Didn’t Test

Honesty requires noting the limitations of a single project:

  • Scale: This is ~550 lines. I don’t know how F# feels at 10,000 lines.
  • Interop: I didn’t need to call Python or C libraries. The MathNet.Numerics dependency was sufficient.
  • Concurrency: The implementation is single-threaded. F#’s async and parallelism story remains untested for me.
  • Discriminated unions: The problem didn’t call for complex state modeling. I used records and functions, not DUs.

A Note on Rigor

The penalty scheduling approach works well for exploration and practical use, but it’s an approximation. For work requiring strong theoretical guarantees — a submission to the Journal of Statistical Software, say, or a methods paper where you need to prove KKT conditions are satisfied to numerical tolerance — you’d want a proper SQP solver.

This opens two future directions:

  1. Interop with external solvers: Call IPOPT or SNOPT via P/Invoke or a subprocess. F# on .NET makes native interop feasible, though not trivial.

  2. Implement SQP in F#: A more ambitious project — but F#’s expressiveness for numerical code makes it less daunting than it would be in many languages. The QP subproblem could use an active-set method; the outer loop is just the schedule I already have, plus multiplier updates.

Neither was necessary for this project’s goal: learning F# through a meaningful implementation. But they’re natural next steps if c-MLE becomes a tool I use seriously.

Future projects will probe these edges. For now, I’ve validated that F# works for the core of what I do: implementing statistical methods.


Wrapping Up

c-MLE is now implemented, tested against synthetic data, and ready to apply to real problems. More importantly, I have evidence for the claim I made last week:

F# gives me executable mathematics — code that reflects the underlying reasoning, backed by a type system that eliminates entire categories of runtime bugs.

This wasn’t a toy example chosen to flatter the language. It was a 2024 paper, with real mathematical complexity, implemented from scratch. And F# made it easier than Python would have — not because Python couldn’t do it, but because F# caught mistakes earlier, composed more cleanly, and let me focus on the statistics instead of the scaffolding.

If you work in numerical or statistical programming and you’ve felt the friction I described — None checks everywhere, runtime type errors, code that doesn’t look like the math it implements — F# might be worth an afternoon of exploration.

The implementation is available on GitHub: https://github.com/SaehwanPark/c-mle-fp


References

Cao, Y., Ma, W., Zhao, G., McCarthy, A. M., & Chen, J. (2024). A constrained maximum likelihood approach to developing well-calibrated models for predicting binary outcomes. Lifetime Data Analysis, 30, 624–648. https://doi.org/10.1007/s10985-024-09628-9