How a lucky random number sequence hid a huge error until I tested on a different platform — and what it taught me about production numerical code

A few days ago I wrote about implementing constrained maximum likelihood estimation (c-MLE) in F# as my first substantial project in the language. The code worked. Tests passed. The optimizer converged to reasonable parameter estimates with violations driven to machine precision.

Then, on a whim, I tested it on my M1 Max.

Same code. Same seed. Same data. The estimated parameter was off by 38%.

This is the story of how I debugged a silent, architecture-dependent failure — and what I learned about cross-platform validation, performance optimization, and the hidden assumptions in scientific computing.

[Image: Overfitting vs. Underfitting vs. Optimal]


The Discovery

I had validated the implementation on two x86-64 dev systems:

  • AMD Ryzen 6800H running Fedora 43 under WSL2 on Windows 11
  • AMD Ryzen 7940HS running native Ubuntu 25.04

Both produced identical results:

Success! Beta_Z: 0.9778 (True ~1.0)
Density Sigma: 0.3661
Max Constraint Violation: 0.000000

A 2.2% relative error against the true value of 1.0 seemed reasonable given the complexity of the constrained optimization. The constraint violations were essentially zero. The optimizer converged smoothly over 8 iterations.

Everything looked perfect.

Then I ran the same code on my Mac Studio (M1 Max, macOS):

Success! Beta_Z: 0.6198 (True ~1.0)
Density Sigma: 0.3630
Max Constraint Violation: 0.000034

38% relative error. Not a small numerical discrepancy — a fundamentally different solution.

And yet: the optimizer still reported success. Constraint violations were still near zero. The code ran without errors or warnings.

This is the worst kind of bug. It’s silent. It looks correct. It passes all local validation. And it would fail catastrophically in production.


The Pattern

I ran the code on more systems to understand the pattern:

Platform     Architecture  OS            Beta_Z   Error
AMD 6800H    x86-64        WSL2/Fedora   0.9778   2.2%
AMD 7940HS   x86-64        Ubuntu 25.04  0.9778   2.2%
M1 Max       ARM64         macOS         0.6198   38%

The discrepancy wasn’t OS-specific, virtualization-specific, or vendor-specific.

It was architecture-specific.

All x86-64 systems produced identical results. The ARM64 system produced a different result. This held across Windows (WSL2), native Linux, and macOS.


Root Cause: Architecture-Dependent Randomness

The original implementation used .NET’s System.Random for Monte Carlo integration:

let create_integration_context target_x n_samples seed =
    let rng = Random(seed)
    let noise =
        target_x
        |> List.map (fun _ ->
            let samples = Array.init n_samples (fun _ -> rng.NextDouble())
            Vector.Build.DenseOfArray samples)
    { noise_vectors = noise; n_integration_samples = n_samples }

I passed seed = 42 explicitly, expecting reproducibility. And I got reproducibility — within each architecture.

But System.Random(42) produces completely different sequences on x86-64 vs ARM64.

Why This Happens

.NET’s System.Random implementation uses different algorithms depending on the target architecture:

  • x86-64 runtime: Subtractive generator (Knuth’s Algorithm M, legacy)
  • ARM64 runtime (.NET 6+): xoshiro256** (optimized for ARM)

Microsoft introduced xoshiro256** on ARM64 for performance reasons — it’s faster on ARM’s instruction set and takes advantage of ARM-specific features. But this optimization breaks cross-architecture reproducibility.

Same seed → different random sequences → optimizer explores different parameter space → converges to different local minima.
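
You can see this for yourself without running the optimizer at all. Here is a minimal sketch (not part of the project code) that prints the first few draws from a seeded System.Random; run it on an x86-64 and an ARM64 machine and diff the output:

open System

let rng = Random(42)

// Print the first five draws; compare the output between an x86-64 machine
// and an ARM64 machine.
Array.init 5 (fun _ -> rng.NextDouble())
|> Array.iteri (fun i v -> printfn "draw %d: %.17f" i v)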

The Optimization Divergence

The constraint optimization uses Monte Carlo sampling to evaluate expected risk within calibration intervals. Random noise directly influences which parameter regions the optimizer explores.
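
To make the mechanism concrete, here is a stripped-down sketch of the Monte Carlo step (illustrative names, not the project's actual MathHelpers code): the expected risk for one subject is the average of sigmoid(eta) over the sampled latent values, so a different noise sequence hands the optimizer a different objective surface for the same parameters.

// Simplified sketch: expected risk as a Monte Carlo average over latent samples z.
let sigmoid x = 1.0 / (1.0 + exp (-x))

let expected_risk (x_logit: float) (beta_z: float) (z_samples: float[]) =
    z_samples |> Array.averageBy (fun z -> sigmoid (x_logit + beta_z * z))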

On x86-64 systems (lucky noise sequence):

Iteration 1:  Violation = 0.163
Iteration 4:  Violation = 0.013, Beta_Z ≈ 0.85
Iteration 8:  Violation = 0.000, Beta_Z = 0.978  ✓

On ARM64 system (unlucky noise sequence):

Iteration 1:  Violation = 0.163
Iteration 4:  Violation = 0.013, Beta_Z ≈ 0.65
Iteration 8:  Violation = 0.000, Beta_Z = 0.620  ✗

Both converged. Both satisfied constraints. Both looked successful.

But ARM64 found a statistically plausible but wrong solution: the likelihood was reasonable, the constraints were satisfied, but the estimate was off by 38%.


Why This Matters for Production

This issue has critical implications for modern cloud deployments, where ARM instances offer significant cost advantages:

ARM adoption in production:

  • AWS Graviton3/4 (up to 40% cheaper than x86)
  • Azure Ampere Altra
  • Google Cloud Tau T2A
  • Oracle Cloud Ampere A1
  • Essentially all edge and mobile deployments

The deployment scenario:

Developer laptop (x86-64):     Beta_Z = 0.978  ✓
CI/CD (x86-64 runners):        Beta_Z = 0.978  ✓ Tests pass
Production (ARM for cost):     Beta_Z = 0.620  ✗ Silent failure

Your tests pass on x86-64. Deployment succeeds without errors. The application runs fine. But the model produces incorrect predictions in production because it was deployed to ARM instances.

This is a silent, catastrophic failure that would be extremely difficult to debug. The logs would show successful optimization. The constraint violations would be zero. There would be no stack trace, no warning, no obvious signal that anything was wrong — just systematically incorrect predictions.


The Fix: Architecture-Independent Randomness

The solution is to use a random number generator that produces identical sequences across all architectures:

open MathNet.Numerics.Random

let create_integration_context target_x n_samples seed =
    let rng = MersenneTwister(seed)  // Architecture-independent
    let noise =
        target_x
        |> List.map (fun _ ->
            let samples = Array.init n_samples (fun _ -> rng.NextDouble())
            Vector.Build.DenseOfArray samples)
    { noise_vectors = noise; n_integration_samples = n_samples }

MersenneTwister is implemented in pure managed code with no architecture-specific optimizations:

  • Same algorithm on x86-64
  • Same algorithm on ARM64
  • Same algorithm on any architecture .NET supports
  • No SIMD vectorization that might differ across ISAs
  • No native code branches based on CPU features
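
To keep this from regressing, a golden-sequence check belongs in the test suite: record the reference draws once on one platform, commit them, and assert equality everywhere else. A minimal sketch (the file path and one-float-per-line format are assumptions):

open System.IO
open MathNet.Numerics.Random

// Compare fresh draws against golden values recorded on a reference platform.
let check_golden_sequence (goldenPath: string) (seed: int) =
    let golden = File.ReadAllLines goldenPath |> Array.map float
    let rng = MersenneTwister(seed)
    let draws = Array.init golden.Length (fun _ -> rng.NextDouble())
    Array.forall2 (fun (g: float) d -> abs (g - d) < 1e-15) golden draws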

Result across all platforms:

AMD 6800H (x86-64):   Beta_Z = 1.0216  ✓
AMD 7940HS (x86-64):  Beta_Z = 1.0216  ✓
M1 Max (ARM64):       Beta_Z = 1.0216  ✓

Identical results. 2.1% relative error. Reproducible across architectures.

It’s slightly slower than System.Random on ARM64 (no xoshiro256** optimization), but correctness trumps speed for scientific computing.


Bonus Discovery: Vectorization and Performance

While debugging the reproducibility issue, I also refactored from list-based processing to matrix-based operations. The performance impact was dramatic — but asymmetrically so.

Performance Results

Platform          Original (List)  Vectorized (Matrix)  Speedup
AMD 6800H (WSL2)  52s              28s                  1.86×
M1 Max (macOS)    46s              6s                   7.67×

The x86-64 improvement was respectable. The ARM64 improvement was transformative.

Why ARM64 Gains Were So Dramatic

The list-based implementation had inherent serialization:

// Original: row-by-row processing
let risks =
    noise_vectors
    |> List.map (fun noise_vec ->
        let z_samples = sample_truncated_deterministic density x noise_vec
        z_samples.Map (fun z -> sigmoid(...))
    )

Bottlenecks:

  • List allocations cannot be vectorized (cons cells processed sequentially)
  • Each vector operation too small to engage hardware acceleration (~100 elements)
  • List.map creates intermediate allocations
  • Operations below threshold for optimized BLAS routines

The M1 Max’s architectural advantages barely helped:

  • ✅ Unified memory: Slight benefit (fewer cache misses)
  • ❌ AMX matrix extensions: Not engaged (no matrices, only small vectors)
  • ❌ 400 GB/s memory bandwidth: Underutilized (not streaming large data)

Result: M1 Max only 13% faster than x86-64 despite hardware superiority.

Vectorization Unleashes Hardware

The matrix-based implementation processes the entire dataset at once:

// Vectorized: entire dataset in single matrix operations
let z_samples_mat =
    MathHelpers.sample_truncated_matrix params'.density x_aug context.noise_matrix

let risk_mat = Matrix.Build.Dense(rows, cols, fun r c ->
    let eta = x_logits.[r] + (params'.beta_z * z_samples_mat.[r, c])
    MathHelpers.sigmoid eta
)

Matrix operations (500 × 100) are perfectly sized for modern hardware:

AMX (Apple Matrix Extensions):

  • AMD (AVX2): 4 FP64 operations/cycle
  • M1 (AMX): 32 FP64 operations/cycle (8× advantage)

Unified Memory Bandwidth:

  • AMD 6800H (WSL2): ~50 GB/s (DDR5 + virtualization overhead)
  • M1 Max: 400 GB/s (LPDDR5 unified, no virtualization)

Accelerate Framework:

MathNet.Numerics on macOS delegates to Apple’s optimized BLAS. For matrix operations, this means:

  • Hand-tuned assembly for M1
  • Full utilization of AMX instructions
  • Memory access patterns optimized for unified memory

The 7.67× speedup on ARM64 reflects how well-matched the problem became to the hardware once vectorized.


The Statistical Insight: Overfitting Constraints

While debugging, I noticed something subtle. Both the x86-64 and ARM64 versions converged to “violation ≈ 0” — but one found the right solution and one didn’t.

The original implementation drove violations essentially to zero (a tolerance of $10^{-8}$, far below any statistically meaningful level). The optimizer was overfitting the constraints rather than finding the true maximum likelihood solution subject to realistic constraint satisfaction.

The Optimization Problem Structure

The objective function has two competing components:

$$ J(\theta) = \underbrace{-\log \mathcal{L}(\theta)}_{\text{negative log-likelihood}} + \underbrace{\rho \sum_r \max(0, g_r(\theta))^2}_{\text{constraint penalty}} $$

We’re trying to minimize this. That means:

  • Likelihood term wants to fit the data well (find $\hat{\beta}_z \approx 1.0$)
  • Penalty term wants to satisfy calibration constraints

These objectives can conflict. The likelihood might prefer parameters that violate constraints; the constraints might prefer parameters that reduce likelihood.
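
In code, the penalized objective is just the sum of those two pieces. A schematic sketch (function names are illustrative; the real implementation evaluates the penalty from the Monte Carlo risk estimates):

// rho weights the constraint penalty against the negative log-likelihood.
let penalized_objective
        (rho: float)
        (neg_log_likelihood: float[] -> float)
        (constraint_violations: float[] -> float[])
        (theta: float[]) =
    let penalty =
        constraint_violations theta
        |> Array.sumBy (fun g -> (max 0.0 g) ** 2.0)
    neg_log_likelihood theta + rho * penalty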

Why the Original Implementation Was Fragile

The optimizer starts from $\beta_z = 0.0$, while the true value is 1.0, so it needs to correct this bias incrementally while maintaining constraint satisfaction.

On x86-64 (lucky PRNG sequence):

The specific random noise created a constraint landscape where the optimizer could move from $\beta_z = 0.0 \to 0.85 \to 0.978$ while keeping constraints feasible at each step. The path looked like:

Iteration 1: β_z ≈ 0.2, likelihood improves, violations = 0.16
Iteration 4: β_z ≈ 0.85, likelihood good, violations = 0.01
Iteration 8: β_z = 0.978, likelihood optimal, violations ≈ 0

The optimizer could balance both objectives naturally because the noise happened to create a smooth path through the objective landscape.

On ARM64 (different PRNG, same seed):

Different random noise → different constraint landscape. The optimizer tried the same strategy but encountered a different optimization surface:

Iteration 1: β_z ≈ 0.2, likelihood improves, violations = 0.16
Iteration 4: β_z ≈ 0.65, violations dropping faster than likelihood improving
Iteration 8: β_z = 0.620, violations ≈ 0, but likelihood suboptimal

The optimizer got trapped. To reduce violations quickly, it chose a path that satisfied constraints at the expense of likelihood fit. Once violations were near zero, the penalty term dominated — any move that would improve $\beta_z$ toward 1.0 would temporarily violate constraints, increasing the objective. The optimizer converged to a locally optimal but globally poor solution.

Both reported “success.” Both achieved near-zero violations. But one found the right parameter, one didn’t.

Why This Is Overfitting

The problem isn’t that the optimizer failed — it’s that we asked it to optimize the wrong objective.

Driving violations to machine precision ($\sim 10^{-15}$) is methodologically incoherent. The calibration constraints reflect statistical uncertainty in the target population. We don’t actually know the true expected risk to 15 decimal places.

By demanding exact constraint satisfaction, we forced the optimizer to prioritize numerical precision over statistical validity. It found parameters that satisfy constraints perfectly but don’t reflect the true data-generating process.

This is overfitting in a different guise: not overfitting to data, but overfitting to numerical precision of constraints.

The fix: Two changes to the optimization strategy.

1. Realistic Tolerance

Instead of driving violations to zero:

let violation_tolerance = 1e-8  // Effectively zero (WRONG)

I adopted a tolerance that reflects actual calibration uncertainty:

let violation_tolerance = 0.05  // 5% of target risk (CORRECT)

This tells the optimizer: “Get close enough to satisfy statistical calibration. Don’t overfit numerical precision.”
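
In the optimizer loop, this only changes the stopping rule: constraints count as satisfied once every violation is within the calibration tolerance, not when they hit machine zero. A sketch (illustrative names, not the project's API):

// Constraints are "satisfied" once every violation is within calibration
// uncertainty, e.g. constraints_satisfied violation_tolerance current_violations.
let constraints_satisfied (tolerance: float) (violations: float seq) =
    violations |> Seq.forall (fun v -> v <= tolerance)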

2. Warm Start Strategy

More importantly, I changed the optimization path to avoid the fragile balancing act.

Old approach: Start from zero, try to balance likelihood and constraints simultaneously.

New approach: Decompose into two sequential phases.

Phase 1 — Unconstrained MLE:

// Maximize likelihood ignoring constraints
// This finds the region of parameter space with good data fit
let warm_start = fit_unconstrained_mle data density
// Result: beta_z ≈ 1.68 (upward biased, violates constraints)

Phase 2 — Constrained refinement:

// Starting from warm_start, apply constraints
// Penalty pulls parameters back toward feasible region
let final = optimize_with_constraints data calib base_model warm_start
// Result: beta_z ≈ 1.02 (correct, satisfies constraints)

This creates a robust optimization path that doesn’t depend on lucky noise:

  1. First, find the likelihood maximum (no constraints) → $\beta_z \approx 1.68$
  2. Then, pull back toward constraint satisfaction → $\beta_z \approx 1.02$

The warm start ensures we explore the high-likelihood region before the penalty term can trap us in a poor local minimum. The likelihood gradient is strong enough to guide us near the true solution, then we apply constraints to satisfy calibration.
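
The two phases compose into a small driver. Here is a sketch with the phase functions passed in as parameters so it stands alone (the real code calls fit_unconstrained_mle and optimize_with_constraints directly, as above):

// Phase 1 finds the high-likelihood region; Phase 2 pulls it back into the
// feasible region defined by the calibration constraints.
let fit_with_warm_start fit_unconstrained refine_constrained data density calib base_model =
    let warm_start = fit_unconstrained data density
    refine_constrained data calib base_model warm_start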

Result after fixes:

AMD 6800H (x86-64):   Beta_Z = 1.0216 (2.1% error)
M1 Max (ARM64):       Beta_Z = 1.0216 (2.1% error)

Identical results. Better accuracy. Faster convergence. And crucially: robust to different PRNG sequences.

The original implementation only worked because it got lucky with the x86-64 random noise. The new implementation works because it structures the optimization to avoid the pathology entirely.


Production Readiness: What I Learned

Lesson 1: Cross-Architecture Validation Is Not Optional

In 2025, with ARM’s rapid adoption in cloud computing (cost) and development (Apple Silicon), any scientific computing code must be tested on both x86-64 and ARM64.

Recommended CI/CD setup:

jobs:
  test-x86:
    runs-on: ubuntu-latest
    steps:
      - run: dotnet test
      - run: dotnet run --configuration Release
  
  test-arm:
    runs-on: macos-14  # M1 runners
    steps:
      - run: dotnet test
      - run: dotnet run --configuration Release
  
  validate-reproducibility:
    needs: [test-x86, test-arm]
    runs-on: ubuntu-latest
    steps:
      - name: Compare results across architectures
        run: |
          # Assert outputs are identical within tolerance
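
The comparison step itself can be a tiny script that reads the estimate each job wrote out and fails the build on a mismatch. A sketch in F# (file names and tolerance are assumptions):

open System.IO

// Fail the build if the two architectures disagree beyond a small tolerance.
let compare_estimates (pathX86: string) (pathArm: string) (tolerance: float) =
    let betaX86 = File.ReadAllText pathX86 |> float
    let betaArm = File.ReadAllText pathArm |> float
    if abs (betaX86 - betaArm) > tolerance then
        failwithf "Cross-architecture mismatch: %f vs %f" betaX86 betaArm

compare_estimates "results-x86/beta_z.txt" "results-arm/beta_z.txt" 1e-9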

What works on x86 may silently fail on ARM, and vice versa.

Lesson 2: Beware Architecture-Specific Optimizations

System.Random isn’t malicious — Microsoft optimized it for ARM64 performance. But that optimization broke reproducibility.

Other examples to watch:

  • SIMD intrinsics with different implementations per architecture
  • Native library calls that behave differently (BLAS, LAPACK)
  • Floating-point operations with architecture-dependent rounding
  • Parallel scheduling that depends on CPU topology

When reproducibility matters, prefer pure managed implementations or explicitly validate across architectures.

Lesson 3: Vectorization Benefits Scale with Hardware

The 1.86× speedup on x86-64 was nice. The 7.67× speedup on ARM64 was game-changing.

Modern hardware — especially ARM64 with unified memory and matrix extensions — strongly rewards vectorized operations. List-based functional code is elegant, but it leaves performance on the table.

F# made the refactoring straightforward: the type system ensured correctness, and the expressiveness let me preserve clarity while moving to matrix operations.

Lesson 4: Statistical Rigor Requires Realistic Tolerances

Driving constraint violations to machine precision isn’t rigor — it’s overfitting. The optimizer should satisfy constraints statistically, not numerically.

In hindsight, this is obvious. Calibration constraints reflect uncertainty in the target population. Pretending we need 15-digit precision is methodologically incoherent.

But it’s easy to miss when the optimizer reports “success” and violations show zero.


The Final Implementation

The corrected implementation:

  • Architecture-independent: MersenneTwister for reproducible randomness
  • Vectorized: Matrix operations that leverage modern hardware
  • Statistically sound: Realistic tolerance + warm start strategy
  • Production-ready: Validated across x86-64 and ARM64

Performance across platforms:

AMD 6800H (x86-64, WSL2):  28 seconds
AMD 7940HS (x86-64, Ubuntu): 27 seconds
M1 Max (ARM64, macOS):     6 seconds

Accuracy across platforms:

Beta_Z estimate: 1.0216
True value:      1.0000
Relative error:  2.1%
Constraint violations: < 5%

Identical results. Fast. Correct.


What This Cost Me

Time debugging: ~8 hours spread over two days.

What I learned:

  • How .NET runtime varies across architectures
  • When hardware acceleration actually matters
  • Why cross-platform validation is mandatory
  • The difference between numerical and statistical rigor

Was it worth it? Absolutely.

This is the kind of bug that would have shipped to production. It would have passed all tests. It would have run without errors. And it would have produced incorrect predictions on ARM instances — the most cost-effective deployment target.

Catching it during development, in a learning project where the stakes are low, is far better than discovering it in production where the stakes are high.


Takeaways for Practitioners

If you write numerical or scientific code:

  1. Test on multiple architectures — especially if you deploy to ARM-based cloud infrastructure.

  2. Prefer architecture-independent RNGs — MersenneTwister, PCG, or explicit PRNG implementations that don’t vary by platform.

  3. Validate reproducibility explicitly — generate golden sequences on one platform, verify they match on others.

  4. Vectorize when it matters — list-based functional code is elegant, but modern hardware rewards matrix operations. F#’s expressiveness makes both accessible.

  5. Match numerical precision to statistical uncertainty — driving violations to machine precision isn’t rigor, it’s overfitting.

  6. Make debugging part of the learning — the mistakes you catch in side projects are lessons you don’t have to learn in production.


Code Availability

Both versions (original and vectorized) are available in the GitHub repository with reproduction scripts for cross-platform validation.

Run dotnet run on x86-64 and ARM64 to verify identical results.

