How a lucky random number sequence hid a huge error until I tested on a different platform — and what it taught me about production numerical code
A few days ago I wrote about implementing c-MLE (constrained maximum likelihood estimation) in F# as my first substantial project in the language. The code worked. Tests passed. The optimizer converged to reasonable parameter estimates, with constraint violations driven to machine precision.
Then, on a whim, I tested it on my M1 Max.
Same code. Same seed. Same data. The estimated parameter was off by 38%.
This is the story of how I debugged a silent, architecture-dependent failure — and what I learned about cross-platform validation, performance optimization, and the hidden assumptions in scientific computing.

The Discovery
I had validated the implementation on two x86-64 dev systems:
- AMD Ryzen 6800H running Fedora 43 under WSL2 on Windows 11
- AMD Ryzen 7940HS running native Ubuntu 25.04
Both produced identical results:
| Beta_Z | True value | Relative error |
|---|---|---|
| 0.9778 | 1.0 | 2.2% |
A 2.2% relative error against the true value of 1.0 seemed reasonable given the complexity of the constrained optimization. The constraint violations were essentially zero. The optimizer converged smoothly over 8 iterations.
Everything looked perfect.
Then I ran the same code on my Mac Studio (M1 Max, macOS):
| Beta_Z | True value | Relative error |
|---|---|---|
| 0.6198 | 1.0 | 38% |
38% relative error. Not a small numerical discrepancy — a fundamentally different solution.
And yet: the optimizer still reported success. Constraint violations were still near zero. The code ran without errors or warnings.
This is the worst kind of bug. It’s silent. It looks correct. It passes all local validation. And it would fail catastrophically in production.
The Pattern
I ran the code on more systems to understand the pattern:
| Platform | Architecture | OS | Beta_Z | Error |
|---|---|---|---|---|
| AMD 6800H | x86-64 | WSL2/Fedora | 0.9778 | 2.2% |
| AMD 7940HS | x86-64 | Ubuntu 25.04 | 0.9778 | 2.2% |
| M1 Max | ARM64 | macOS | 0.6198 | 38% |
The discrepancy wasn’t OS-specific, virtualization-specific, or vendor-specific.
It was architecture-specific.
All x86-64 systems produced identical results. The ARM64 system produced a different result. This held across Windows (WSL2), native Linux, and macOS.
Root Cause: Architecture-Dependent Randomness
The original implementation used .NET’s System.Random for Monte Carlo integration:
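Something like the following, reconstructed as a minimal sketch (the helper name is mine, not the project's code):

```fsharp
// Minimal reconstruction sketch: a single seeded System.Random instance
// supplies every Monte Carlo draw. (makeNoise is a placeholder name.)
let makeNoise (seed: int) (n: int) : float[] =
    let rng = System.Random(seed)            // seed = 42 in the runs described here
    Array.init n (fun _ -> rng.NextDouble())

// Every downstream Monte Carlo estimate consumes this sequence, so the whole
// optimization path depends on which sequence the runtime produces for seed 42.
let noise = makeNoise 42 10_000
```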
I passed seed = 42 explicitly, expecting reproducibility. And I got reproducibility — within each architecture.
But System.Random(42) produces completely different sequences on x86-64 vs ARM64.
Why This Happens
.NET’s System.Random implementation uses different algorithms depending on the target architecture:
- x86-64 runtime: Subtractive generator (Knuth’s Algorithm M, legacy)
- ARM64 runtime (.NET 6+): xoshiro256** (optimized for ARM)
Microsoft introduced xoshiro256** on ARM64 for performance reasons — it’s faster on ARM’s instruction set and takes advantage of ARM-specific features. But this optimization breaks cross-architecture reproducibility.
Same seed → different random sequences → optimizer explores different parameter space → converges to different local minima.
The Optimization Divergence
The constraint optimization uses Monte Carlo sampling to evaluate expected risk within calibration intervals. Random noise directly influences which parameter regions the optimizer explores.
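As a rough illustration of where the draws enter (a stand-in sketch, not the paper's exact estimator; `riskModel`, the interval bounds, and `nSamples` are placeholders):

```fsharp
// Stand-in sketch: Monte Carlo estimate of expected risk over one calibration
// interval. riskModel, lower, upper, and nSamples are illustrative placeholders.
let expectedRiskInInterval
        (rng: System.Random)
        (riskModel: float -> float)   // maps a sampled point to a predicted risk
        (lower: float) (upper: float)
        (nSamples: int) =
    Array.init nSamples (fun _ -> lower + rng.NextDouble() * (upper - lower))
    |> Array.averageBy riskModel

// A different draw sequence shifts every such estimate, and with it the
// constraint surface the optimizer sees.
```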
On x86-64 systems (lucky noise sequence), the optimizer moved smoothly toward $\beta_z \approx 0.978$.
On the ARM64 system (unlucky noise sequence), the same seed produced different draws and the optimizer settled near $\beta_z \approx 0.620$.
Both converged. Both satisfied constraints. Both looked successful.
But ARM64 found a statistically plausible but wrong solution: the likelihood was reasonable, the constraints were satisfied, but the estimate was off by 38%.
Why This Matters for Production
This issue has critical implications for modern cloud deployments, where ARM instances offer significant cost advantages:
ARM adoption in production:
- AWS Graviton3/4 (up to 40% better price-performance than comparable x86 instances)
- Azure Ampere Altra
- Google Cloud Tau T2A
- Oracle Cloud Ampere A1
- Most edge and mobile deployments
The deployment scenario:
Your tests pass on x86-64. Deployment succeeds without errors. The application runs fine. But the model produces incorrect predictions in production because it deployed to ARM instances.
This is a silent, catastrophic failure that would be extremely difficult to debug. The logs would show successful optimization. The constraint violations would be zero. There would be no stack trace, no warning, no obvious signal that anything was wrong — just systematically incorrect predictions.
The Fix: Architecture-Independent Randomness
The solution is to use a random number generator that produces identical sequences across all architectures:
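Here is a minimal sketch of the swap, assuming MathNet.Numerics (already used elsewhere in the project) and a helper name of my own:

```fsharp
open MathNet.Numerics.Random

// MersenneTwister derives from System.Random, so it can be passed anywhere the
// code previously accepted a Random instance. (makeRng is a placeholder name.)
let makeRng (seed: int) : System.Random =
    MersenneTwister(seed) :> System.Random

let rng = makeRng 42
let draws = Array.init 100 (fun _ -> rng.NextDouble())
// The same seed now yields the same draws on x86-64 and ARM64.
```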
MersenneTwister is implemented in pure managed code with no architecture-specific optimizations:
- Same algorithm on x86-64
- Same algorithm on ARM64
- Same algorithm on any architecture .NET supports
- No SIMD vectorization that might differ across ISAs
- No native code branches based on CPU features
Result across all platforms:
Identical results. 2.1% relative error. Reproducible across architectures.
It’s slightly slower than System.Random on ARM64 (no xoshiro256** optimization), but correctness trumps speed for scientific computing.
Bonus Discovery: Vectorization and Performance
While debugging the reproducibility issue, I also refactored from list-based processing to matrix-based operations. The performance impact was dramatic — but asymmetrically so.
Performance Results
| Platform | Original (List) | Vectorized (Matrix) | Speedup |
|---|---|---|---|
| AMD 6800H (WSL2) | 52s | 28s | 1.86× |
| M1 Max (macOS) | 46s | 6s | 7.67× |
The x86-64 improvement was respectable. The ARM64 improvement was transformative.
Why ARM64 Gains Were So Dramatic
The list-based implementation had inherent serialization:
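Schematically, the hot path looked something like the sketch below (placeholder names, with a logistic likelihood standing in for the real model):

```fsharp
// Sketch of the list-based shape (placeholder names; not the original code).
// Each record is a small float list, processed one element at a time.
let linearPredictor (beta: float list) (x: float list) =
    List.map2 (*) beta x |> List.sum              // tiny dot product, ~100 elements

let negLogLikelihood (beta: float list) (data: (float list * float) list) =
    data
    |> List.map (fun (x, y) ->
        let eta = linearPredictor beta x          // one small operation per record
        let p = 1.0 / (1.0 + exp (-eta))          // logistic link, for illustration
        -(y * log p + (1.0 - y) * log (1.0 - p)))
    |> List.sum                                   // intermediate lists at every stage
```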
Bottlenecks:
- List allocations cannot be vectorized (cons cells are processed sequentially)
- Each vector operation is too small (~100 elements) to engage hardware acceleration
- `List.map` creates intermediate allocations
- Operations fall below the threshold for optimized BLAS routines
The M1 Max’s architectural advantages barely helped:
- ✅ Unified memory: Slight benefit (fewer cache misses)
- ❌ AMX matrix extensions: Not engaged (no matrices, only small vectors)
- ❌ 400 GB/s memory bandwidth: Underutilized (not streaming large data)
Result: M1 Max only 13% faster than x86-64 despite hardware superiority.
Vectorization Unleashes Hardware
The matrix-based implementation processes the entire dataset at once:
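A sketch of the vectorized shape using MathNet.Numerics dense types (again placeholder names, not the project's exact code):

```fsharp
open MathNet.Numerics.LinearAlgebra

// Sketch of the vectorized shape (placeholder names; not the original code).
// X is the full 500 x 100 design matrix and y the 500 outcomes, so one
// matrix-vector product replaces 500 tiny per-record dot products.
let negLogLikelihoodVec (beta: Vector<float>) (X: Matrix<float>) (y: Vector<float>) =
    let eta = X * beta                                   // (500 x 100) * (100) -> (500)
    let p   = eta.Map(fun e -> 1.0 / (1.0 + exp (-e)))   // elementwise logistic link
    let ll1 = y.PointwiseMultiply(p.PointwiseLog())
    let ll0 = y.SubtractFrom(1.0).PointwiseMultiply(p.SubtractFrom(1.0).PointwiseLog())
    -(ll1 + ll0).Sum()
```

Operations of this shape are large enough to reach the optimized BLAS paths that the per-record version never touched.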
Matrix operations (500 × 100) are perfectly sized for modern hardware:
Matrix extensions (AVX2 vs. Apple AMX):
- AMD (AVX2): 4 FP64 operations/cycle
- M1 (AMX): 32 FP64 operations/cycle (8× advantage)
Unified Memory Bandwidth:
- AMD 6800H (WSL2): ~50 GB/s (DDR5 + virtualization overhead)
- M1 Max: 400 GB/s (LPDDR5 unified, no virtualization)
Accelerate Framework:
MathNet.Numerics on macOS delegates to Apple’s optimized BLAS. For matrix operations, this means:
- Hand-tuned assembly for M1
- Full utilization of AMX instructions
- Memory access patterns optimized for unified memory
The 7.67× speedup on ARM64 reflects how well-matched the problem became to the hardware once vectorized.
The Statistical Insight: Overfitting Constraints
While debugging, I noticed something subtle. Both the x86-64 and ARM64 versions converged to “violation ≈ 0” — but one found the right solution and one didn’t.
The original implementation drove violations down to roughly $10^{-8}$, far tighter than the data can justify. The optimizer was overfitting the constraints rather than finding the true maximum likelihood solution subject to realistic constraint satisfaction.
The Optimization Problem Structure
The objective function has two competing components:
$$ J(\theta) = \underbrace{-\log \mathcal{L}(\theta)}_{\text{negative log-likelihood}} + \underbrace{\rho \sum_r \max(0, g_r(\theta))^2}_{\text{constraint penalty}} $$

We're trying to minimize this. That means:
- Likelihood term wants to fit the data well (find $\hat{\beta}_z \approx 1.0$)
- Penalty term wants to satisfy calibration constraints
These objectives can conflict. The likelihood might prefer parameters that violate constraints; the constraints might prefer parameters that reduce likelihood.
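In code form, the penalized objective is roughly the following (a sketch: `negLogLik` and `constraintViolations` stand in for the model-specific likelihood and calibration-constraint functions):

```fsharp
// Sketch of the penalized objective J(theta); negLogLik and constraintViolations
// are placeholders for the model-specific pieces.
let penalizedObjective
        (rho: float)                               // penalty weight
        (negLogLik: float[] -> float)              // -log L(theta)
        (constraintViolations: float[] -> float[]) // g_r(theta), one entry per constraint
        (theta: float[]) =
    let penalty =
        constraintViolations theta
        |> Array.sumBy (fun g -> let v = max 0.0 g in v * v)
    negLogLik theta + rho * penalty
```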
Why the Original Implementation Was Fragile
The optimizer starts at $\beta_z = 0.0$ while the true value is 1.0, so it needs to correct this bias incrementally while maintaining constraint satisfaction.
On x86-64 (lucky PRNG sequence):
The specific random noise created a constraint landscape where the optimizer could move from $\beta_z = 0.0 \to 0.85 \to 0.978$ while keeping constraints feasible at each step.
The optimizer could balance both objectives naturally because the noise happened to create a smooth path through the objective landscape.
On ARM64 (different PRNG, same seed):
Different random noise meant a different constraint landscape. The optimizer tried the same strategy but encountered a different optimization surface.
The optimizer got trapped. To reduce violations quickly, it chose a path that satisfied constraints at the expense of likelihood fit. Once violations were near zero, the penalty term dominated — any move that would improve $\beta_z$ toward 1.0 would temporarily violate constraints, increasing the objective. The optimizer converged to a locally optimal but globally poor solution.
Both reported “success.” Both achieved near-zero violations. But one found the right parameter, one didn’t.
Why This Is Overfitting
The problem isn’t that the optimizer failed — it’s that we asked it to optimize the wrong objective.
Driving violations to machine precision ($\sim 10^{-15}$) is methodologically incoherent. The calibration constraints reflect statistical uncertainty in the target population. We don’t actually know the true expected risk to 15 decimal places.
By demanding exact constraint satisfaction, we forced the optimizer to prioritize numerical precision over statistical validity. It found parameters that satisfy constraints perfectly but don’t reflect the true data-generating process.
This is overfitting in a different guise: not overfitting to data, but overfitting to numerical precision of constraints.
The fix: Two changes to the optimization strategy.
1. Realistic Tolerance
Instead of driving violations to zero, I adopted a convergence tolerance on the constraint violations that reflects the actual calibration uncertainty.
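As an illustration only (the post does not state the exact tolerance values, so the numbers below are hypothetical):

```fsharp
// Hypothetical values for illustration; the post does not give the exact tolerances.
let machineLevelTolerance = 1e-8   // old target: essentially numerical noise
let calibrationTolerance  = 1e-3   // new target: order of the calibration uncertainty

// Accept the solution once every violation is within the statistical tolerance.
let isCalibrated (violations: float[]) =
    violations |> Array.forall (fun g -> max 0.0 g <= calibrationTolerance)
```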
This tells the optimizer: “Get close enough to satisfy statistical calibration. Don’t overfit numerical precision.”
2. Warm Start Strategy
More importantly, I changed the optimization path to avoid the fragile balancing act.
Old approach: Start from zero, try to balance likelihood and constraints simultaneously.
New approach: Decompose into two sequential phases.
Phase 1 — Unconstrained MLE: maximize the likelihood alone, ignoring the calibration constraints.
Phase 2 — Constrained refinement: rerun the penalized optimization, warm-started from the Phase 1 estimate.
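A sketch of the two-phase structure (simplified interfaces; `minimize`, `negLogLik`, and `penalized` are placeholders for the real optimizer and objectives):

```fsharp
// Sketch of the warm-start strategy (placeholder functions, simplified signatures).
// minimize takes an objective and a starting point and returns the optimum found.
let warmStartFit
        (minimize: (float[] -> float) -> float[] -> float[])
        (negLogLik: float[] -> float)
        (penalized: float[] -> float)
        (theta0: float[]) =
    // Phase 1: unconstrained MLE, likelihood only, from the cold start.
    let thetaMle = minimize negLogLik theta0
    // Phase 2: constrained refinement, penalized objective, warm-started at the MLE.
    minimize penalized thetaMle
```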
This creates a robust optimization path that doesn’t depend on lucky noise:
- First, find the likelihood maximum (no constraints) → $\beta_z \approx 1.68$
- Then, pull back toward constraint satisfaction → $\beta_z \approx 1.02$
The warm start ensures we explore the high-likelihood region before the penalty term can trap us in a poor local minimum. The likelihood gradient is strong enough to guide us near the true solution, then we apply constraints to satisfy calibration.
Result after fixes:
Identical results. Better accuracy. Faster convergence. And crucially: robust to different PRNG sequences.
The original implementation only worked because it got lucky with the x86-64 random noise. The new implementation works because it structures the optimization to avoid the pathology entirely.
Production Readiness: What I Learned
Lesson 1: Cross-Architecture Validation Is Not Optional
In 2025, with ARM’s rapid adoption in cloud computing (cost) and development (Apple Silicon), any scientific computing code must be tested on both x86-64 and ARM64.
Recommended CI/CD setup:
- Run the full test suite on both an x86-64 runner and an ARM64 runner
- Compare key numerical outputs against committed golden values from a reference platform
What works on x86 may silently fail on ARM, and vice versa.
Lesson 2: Beware Architecture-Specific Optimizations
System.Random isn’t malicious — Microsoft optimized it for ARM64 performance. But that optimization broke reproducibility.
Other examples to watch:
- SIMD intrinsics with different implementations per architecture
- Native library calls that behave differently (BLAS, LAPACK)
- Floating-point operations with architecture-dependent rounding
- Parallel scheduling that depends on CPU topology
When reproducibility matters, prefer pure managed implementations or explicitly validate across architectures.
Lesson 3: Vectorization Benefits Scale with Hardware
The 1.86× speedup on x86-64 was nice. The 7.67× speedup on ARM64 was game-changing.
Modern hardware — especially ARM64 with unified memory and matrix extensions — strongly rewards vectorized operations. List-based functional code is elegant, but it leaves performance on the table.
F# made the refactoring straightforward: the type system ensured correctness, and the expressiveness let me preserve clarity while moving to matrix operations.
Lesson 4: Statistical Rigor Requires Realistic Tolerances
Driving constraint violations to machine precision isn’t rigor — it’s overfitting. The optimizer should satisfy constraints statistically, not numerically.
In hindsight, this is obvious. Calibration constraints reflect uncertainty in the target population. Pretending we need 15-digit precision is methodologically incoherent.
But it’s easy to miss when the optimizer reports “success” and violations show zero.
The Final Implementation
The corrected implementation:
- ✅ Architecture-independent: MersenneTwister for reproducible randomness
- ✅ Vectorized: Matrix operations that leverage modern hardware
- ✅ Statistically sound: Realistic tolerance + warm start strategy
- ✅ Production-ready: Validated across x86-64 and ARM64
The final implementation produces the same parameter estimate on all three platforms, at the vectorized runtimes reported in the performance table above.
Identical results. Fast. Correct.
What This Cost Me
Time debugging: ~8 hours spread over two days.
What I learned:
- How .NET runtime varies across architectures
- When hardware acceleration actually matters
- Why cross-platform validation is mandatory
- The difference between numerical and statistical rigor
Was it worth it? Absolutely.
This is the kind of bug that would have shipped to production. It would have passed all tests. It would have run without errors. And it would have produced incorrect predictions on ARM instances — the most cost-effective deployment target.
Catching it during development, in a learning project where the stakes are low, is far better than discovering it in production where the stakes are high.
Takeaways for Practitioners
If you write numerical or scientific code:
- Test on multiple architectures — especially if you deploy to ARM-based cloud infrastructure.
- Prefer architecture-independent RNGs — MersenneTwister, PCG, or explicit PRNG implementations that don't vary by platform.
- Validate reproducibility explicitly — generate golden sequences on one platform, verify they match on others (see the sketch below).
- Vectorize when it matters — list-based functional code is elegant, but modern hardware rewards matrix operations. F#'s expressiveness makes both accessible.
- Match numerical precision to statistical uncertainty — driving violations to machine precision isn't rigor, it's overfitting.
- Make debugging part of the learning — the mistakes you catch in side projects are lessons you don't have to learn in production.
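A minimal sketch of such a golden-sequence check, using MathNet's MersenneTwister as in the fix above (the names, sequence length, and tolerance are mine):

```fsharp
open MathNet.Numerics.Random

// Golden-sequence reproducibility check (sketch).
// Step 1: run printGolden once on a reference platform and record the output.
// Step 2: in CI on every architecture, assert matchesGolden against the recorded values.
let sequence (seed: int) (n: int) =
    let rng = MersenneTwister(seed)
    Array.init n (fun _ -> rng.NextDouble())

let printGolden () =
    sequence 42 8 |> Array.iter (printfn "%.17g")

let matchesGolden (golden: float[]) =
    let current = sequence 42 golden.Length
    Array.forall2 (fun (a: float) b -> abs (a - b) <= 1e-15) golden current
```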
Code Availability
Both versions (original and vectorized) are available in the GitHub repository with reproduction scripts for cross-platform validation.
Run `dotnet run` on x86-64 and ARM64 to verify identical results.
References
- Original implementation: My First F# Project: Implementing c-MLE
- Cao et al. (2024): Constrained Maximum Likelihood for Calibrated Risk Prediction
- .NET Random implementation: Microsoft Docs
- MersenneTwister: MathNet.Numerics Documentation