The practical question is where your hot path lives: in Python code, native libraries, or custom stateful logic.

The Rewrite Question Usually Starts Too Late
Suppose a data pipeline is slower than you want.
It reads a few million rows, computes some derived columns, joins a dimension table, aggregates by user segment, and then runs a small amount of business logic over events. The code is in Python. Someone on the team has been learning Rust. The suggestion comes up quickly: maybe this should be rewritten.
That is a reasonable instinct. Rust gives you native code, predictable memory behavior, no interpreter in the hot loop, and a good story for parallelism. I do not need convincing that Rust is fast. In a few small applications I have rewritten, Rust was multiple times faster than the Python version and sometimes close enough to C-like performance that the remaining difference stopped mattering for the product.
That part of the story is not controversial. Plain Python is on the slow side for tight loops. Rust is a performant systems language. If the benchmark is a Python for loop against compiled Rust, the result is usually predictable before the code is written.
Data science and data engineering complicate that simple comparison. In these domains, Python is often not doing the expensive work directly. NumPy arrays call into native kernels. Polars and DuckDB execute query plans in highly optimized engines. PyArrow moves data through Arrow memory formats. BLAS, SIMD kernels, thread pools, and columnar execution do a lot of the work that people casually attribute to “Python.”
That is what made the question interesting to me. If Python is merely orchestrating libraries written in C, C++, Rust, Fortran, or similarly low-level code, is Python still slow in the way we usually mean? Or does the answer depend on whether the expensive part of the program has escaped Python already?
So the first performance question is not really “Python or Rust?” It is more specific: which part of the program is actually doing the work?
If Python is only describing a query plan that Polars executes in native code, then the comparison is not Python bytecode against Rust machine code. It is a Python frontend to a native engine against some other native implementation. If NumPy owns the inner loop, then the expensive math is already outside ordinary Python. If the code is a row-by-row loop with parsing, branching, mutable dictionaries, and per-user state, then Python’s interpreter and object model are much more exposed.
That distinction is easy to say and surprisingly easy to forget. Many Python-versus-Rust performance articles flatten the problem into a language contest. A Python-to-Rust rewrite is not one performance move. It is several different moves wearing the same name.
For this benchmark, I found it more useful to separate three execution modes:
- native-backed query or library execution, where Python mostly builds a plan for engines like Polars, DuckDB, or Arrow-backed libraries
- vectorized or parallel numerical execution, where the comparison is about kernels, memory layout, and thread-level parallelism
- interpreter-bound custom logic, where ordinary Python loops, UDFs, state machines, parsing, and mutable objects sit in the hot path
I wanted to see what happened if the benchmark separated those cases instead of treating the repository language as the whole story.
The experiment archive lives in the dedicated repository. The short version is this: Python looked excellent when mature native libraries owned the work. Rust looked excellent when custom loop-heavy logic owned the work. The useful lesson was how quickly the answer changed when the shape of the hot path changed.
I Wanted A Benchmark That Did Not Cheat For Either Side
A bad benchmark for this question would compare a pure Python loop against optimized Rust and then declare Rust faster. That result would be true, but not very useful. Most serious Python data work does not use pure Python loops for dataframe aggregation or vectorized numerical transforms.
The opposite benchmark is also misleading. If Python calls an optimized C, C++, or Rust-backed library and Rust uses a less polished hand-written implementation, Python can win for reasons that have little to do with Python itself.
So I built the benchmark around execution patterns rather than around language labels.
The first case that came to mind was multidimensional array computation. This is the classic Python data-science success story: write array expressions in Python, let NumPy run the hot path in native code. It belongs to the vectorized numerical mode. If Rust easily beat NumPy there, that would say something. If it did not, that would also say something, especially because the Python implementation would not be ordinary Python in the performance-critical section.
The second case was dataframes. In modern Python data work, many pipelines are written as dataframe transformations rather than explicit loops. Groupby and join are good representatives because they are common, expensive enough to matter, and heavily optimized in mature engines. This belongs to the native-backed query mode. I used Python Polars for the Python side, Rust Polars for the closest library-to-library comparison, and native Rust implementations as an additional reference point.
Those two cases cover the most Python-friendly version of the story: Python as a high-level interface over optimized kernels and query engines. They do not cover the work that often hurts in production. Real pipelines also contain logic that is awkward to express as a vectorized operation: session boundaries, event-type branches, mutable per-user state, time gaps, score corrections, alert counters, and other pieces of business logic that accumulate around the elegant dataframe core.
That led to the third case: loop-heavy sessionization. It is still a data-engineering task, but the workload is no longer a clean aggregation. Each row changes state. The next decision depends on the previous event. This belongs to the interpreter-bound custom-logic mode.
The fourth extension was a streaming-style scoring task. I wanted something closer to event enrichment or online feature calculation: append-ordered events, a per-user state map, time decay, event-dependent score updates, and alert counting. This is not meant to be a Polars benchmark. It represents the kind of custom state machine that appears after the easy dataframe work is already done.
The resulting suite has five tasks:
| Workload | Python path | Rust path | Pattern represented |
|---|---|---|---|
| numeric kernel | NumPy | native Rust, Rust + Rayon | vectorized numerical transform over 50 million values |
| groupby | Python Polars | Rust Polars, native Rust | dataframe aggregation over 5 million fact rows |
| join | Python Polars | Rust Polars, native Rust | dataframe join and aggregation over fact and dimension tables |
| sessionization | Python CSV loop | native Rust loop | sorted event stream with per-session mutable state |
| streaming score | Python CSV loop | native Rust loop | append-ordered events with per-user state and alert logic |
The dataframe tasks were meant to be fair to Python. Python uses Polars lazy queries, projects only the needed columns, computes derived values, and lets the engine execute the plan. The Rust side includes Rust Polars and also hand-written native versions for groupby and join.
The loop-heavy tasks were meant to be fair to Rust. Sessionization and streaming-score are intentionally less dataframe-friendly. They include CSV parsing, branches on event type, mutable state, hash-map updates, time gaps, score decay, and alert counters. That is the kind of logic that often starts as ordinary Python because it is easy to write and easy to change, then becomes painful when it moves into the critical path.
The suite was run on five machine/software stacks: Apple M1 Max, Apple M3 Max, Ryzen 9 7940HS, Ryzen AI Max+ 395 on native Fedora, and the same Ryzen AI Max+ 395 class under WSL2. Each task/engine group had 20 timing rows per host, and the analysis uses medians rather than best-case times.
There are a few benchmark assumptions worth keeping in mind before looking at the results. Timings are elapsed wall-clock time, and CSV parsing is included in the measured tasks. Thread defaults for NumPy, Polars, and Rayon were not fully normalized across all environments. Row-oriented CSV parsing and columnar query execution also stress different parts of the stack. Those choices make the benchmark closer to a small data-engineering workflow than to a controlled microbenchmark, but they also mean the numbers should be read as workload-specific.
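For readers who want to reproduce the methodology, a minimal harness in this spirit might look like the sketch below. It is not the archive's actual harness, just the same shape: wall-clock timings, repeated runs, and medians rather than best-case times.

```python
import statistics
import time

def measure(task, repeats=20):
    # Elapsed wall-clock time per run; the median resists warm-up effects
    # and scheduling noise better than the best-case time does.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```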
The Polars Result Was A Frontend Story
For the two dataframe workloads, Python Polars was the fastest measured implementation on every host.

On the groupby task, the best Python Polars median was 0.0684 seconds on the Apple M3 Max. On the join task, the best Python Polars median was 0.1240 seconds on the same machine. Rust Polars and the native Rust groupby/join implementations were slower in these runs.
This is not Python beating Rust at Rust’s own game. It is a frontend result. Once the work is expressed as a high-level query and handed to a native engine, the language at the call site may stop being the limiting factor. The remaining differences can come from frontend maturity, defaults, build features, CSV scan behavior, thread-pool configuration, version differences, and plan details.
In this benchmark, the Python code for groupby is mostly plan construction. A minimal sketch of that shape, using the public Polars lazy API (the file and column names here are illustrative, not the archive's exact schema):
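```python
import polars as pl

# Lazy scan: this only builds a query plan. Nothing is read or computed yet.
plan = (
    pl.scan_csv("facts.csv")
    .select(["segment", "amount", "quantity"])  # projection pruning
    .with_columns((pl.col("amount") * pl.col("quantity")).alias("revenue"))
    .group_by("segment")
    .agg(pl.col("revenue").sum(), pl.len().alias("rows"))
)

# collect() hands the whole plan to the native engine in one step.
result = plan.collect()
```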
All of the expensive work happens inside Polars: CSV scanning, projection, expression evaluation, grouping, aggregation, and execution. The Python layer only describes the plan.
This is where the usual “Python is slow” sentence loses precision. Plain Python loops are slow for numerical and row-wise work. Python as an interface to a native query engine can be very fast. Those are different claims, and engineering decisions get worse when they are treated as the same claim.
The Rust Polars results should not be read as a permanent ranking of Polars frontends. A deeper Polars-specific study would need to pin versions, inspect plans more carefully, normalize thread counts, and isolate CSV scanning from query execution.
For the rewrite question, though, the lesson is already useful. If your Python pipeline spends most of its time in Polars, DuckDB, NumPy, PyArrow, or another optimized engine, a Rust rewrite may not attack the dominant cost. You may get more from changing the query plan, storage format, partitioning strategy, thread settings, or data layout than from changing the language around the engine.
This is also where “Python as orchestration” can become its own trap. Real pipelines are rarely pure query plans from beginning to end. A Polars section may be fast, then the code may convert to Python objects, call a Python UDF, loop over groups, run custom validation, or pass rows through a callback. Once that happens, the expensive part may have moved back into Python without looking obvious from the top-level script. Profiling is what keeps both stories honest: Python is not automatically the bottleneck, and using a native-backed library does not automatically remove Python from the hot path.
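A hedged illustration of that trap, using Polars (the column and the lambda are invented for this example): the first expression stays in the native engine, while the second quietly moves per-row work back into the interpreter.

```python
import polars as pl

df = pl.DataFrame({"amount": [10.0, 25.0, 7.5]})

# Executed by the native engine: the whole expression is one plan node.
fast = df.with_columns((pl.col("amount") * 1.2).alias("scored"))

# Executed by the interpreter: map_elements calls a Python function per row,
# so the hot path moves back into Python without changing the script's shape.
slow = df.with_columns(
    pl.col("amount")
    .map_elements(lambda x: x * 1.2, return_dtype=pl.Float64)
    .alias("scored")
)
```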
The Numeric Kernel Was Really A Parallelism Story
The numeric task computed a haversine-like distance transform over 50 million values, clipped the result, and calculated summary statistics. Python used NumPy. Rust had two versions: a native scalar implementation and a Rayon implementation.
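As a point of reference, here is roughly what the NumPy side of such a task can look like; the archive's exact formula, clipping bounds, and data generation may differ.

```python
import numpy as np

def haversine_like(lat1, lon1, lat2, lon2, radius=6371.0):
    # Every operation here is a NumPy ufunc: the per-element loops run in
    # native kernels, not in the Python interpreter.
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    a = (np.sin(dlat / 2) ** 2
         + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2))
         * np.sin(dlon / 2) ** 2)
    d = 2 * radius * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return d.mean(), d.std()
```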
Rust with Rayon won on every host, with a 2.65x to 4.63x speedup over the NumPy baseline.

That result looks like a simple Rust win until you compare it with native Rust without Rayon. The non-Rayon Rust version did not consistently beat NumPy. It was faster on one WSL2 run, roughly comparable on the native Fedora Ryzen AI Max+ 395 run, and slower on the Apple and Ryzen 9 7940HS runs.
That pattern matters. The advantage came primarily from the execution strategy: Rayon parallelized the computation across cores. NumPy was already running optimized native kernels. Scalar Rust by itself was not a magic replacement for that.
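A sketch of what the Rayon strategy looks like, assuming the `rayon` crate; the transform body is a stand-in for the benchmark's kernel, not a copy of it.

```rust
use rayon::prelude::*;

// Rayon splits the slice across cores; the per-element math is the same as
// in the scalar version, but every core runs a chunk of it.
fn transform_stats(values: &[f64]) -> (f64, f64) {
    let n = values.len() as f64;
    let (sum, sum_sq) = values
        .par_iter()
        .map(|&x| {
            // Stand-in for the haversine-like transform, clipped as in the benchmark.
            let a = (x / 2.0).sin().powi(2).clamp(0.0, 1.0);
            let d = 2.0 * 6371.0 * a.sqrt().asin();
            (d, d * d)
        })
        .reduce(|| (0.0, 0.0), |a, b| (a.0 + b.0, a.1 + b.1));
    let mean = sum / n;
    (mean, (sum_sq / n - mean * mean).max(0.0).sqrt())
}
```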
For numerical Python users, this is a familiar but important boundary. Execution strategy mattered more than the language label. If your workload is already vectorized, the first comparison should be between execution strategies: vectorized kernels, thread-level parallelism, memory layout, SIMD, or GPU execution, whether they come from NumPy, JAX, PyTorch, Numba, Polars, DuckDB, or Rust with Rayon.
For Rust-curious readers, the same result is a useful correction. Rust gives you the tools to write fast numerical code, but you still have to choose the right structure. A direct scalar translation of vectorized Python may disappoint. A parallel implementation that matches the hardware can be a real improvement.
The Loop-Heavy Tasks Exposed The Interpreter
The largest Rust advantages appeared when the benchmark moved away from library-shaped dataframe operations and into custom row-wise state.
The sessionization and streaming-score tasks both do small pieces of stateful work per record: parse a row, update state, branch on event type, and carry information forward to the next row. They are meant to resemble custom enrichment, fraud/risk scoring, online feature calculation, or event-driven ETL.
These tasks expose ordinary Python much more directly. A condensed sketch of the Python loop's shape follows; the field names, decay constant, and alert threshold are illustrative rather than the archive's exact values:
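```python
import csv
import math
from collections import defaultdict

DECAY_HALF_LIFE = 300.0  # illustrative decay constant, in seconds
ALERT_THRESHOLD = 10.0   # illustrative alert cutoff

def score_stream(path):
    # Per-user mutable state: (last timestamp, running score).
    state = {}
    alerts = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            user = row["user_id"]
            ts = float(row["ts"])
            last_ts, score = state.get(user, (ts, 0.0))
            # Time decay: older contributions count for less.
            score *= math.exp(-(ts - last_ts) / DECAY_HALF_LIFE)
            # Event-dependent update.
            if row["event"] == "purchase":
                score += float(row["amount"])
            elif row["event"] == "refund":
                score -= float(row["amount"])
            if score > ALERT_THRESHOLD:
                alerts[user] += 1
            state[user] = (ts, score)
    return alerts
```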
The Rust implementation has the same basic shape. The difference is that this work is compiled into the hot loop rather than dispatched through Python objects for every row.
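A matching sketch in Rust, assuming the `csv` crate and the same illustrative schema as the Python version:

```rust
use std::collections::HashMap;
use std::error::Error;

const DECAY_HALF_LIFE: f64 = 300.0; // illustrative decay constant, in seconds
const ALERT_THRESHOLD: f64 = 10.0;  // illustrative alert cutoff

fn score_stream(path: &str) -> Result<HashMap<String, u64>, Box<dyn Error>> {
    // Per-user mutable state: (last timestamp, running score).
    let mut state: HashMap<String, (f64, f64)> = HashMap::new();
    let mut alerts: HashMap<String, u64> = HashMap::new();
    let mut reader = csv::Reader::from_path(path)?;
    for record in reader.records() {
        let record = record?;
        let user = record.get(0).ok_or("missing user_id")?;
        let ts: f64 = record.get(1).ok_or("missing ts")?.parse()?;
        let event = record.get(2).ok_or("missing event")?;
        let amount: f64 = record.get(3).ok_or("missing amount")?.parse()?;
        let (last_ts, mut score) = *state.get(user).unwrap_or(&(ts, 0.0));
        // Same shape as the Python loop: decay, branch on event type, alert
        // check. The difference is that it is compiled into the hot path.
        score *= (-(ts - last_ts) / DECAY_HALF_LIFE).exp();
        match event {
            "purchase" => score += amount,
            "refund" => score -= amount,
            _ => {}
        }
        if score > ALERT_THRESHOLD {
            *alerts.entry(user.to_string()).or_insert(0) += 1;
        }
        state.insert(user.to_string(), (ts, score));
    }
    Ok(alerts)
}
```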
The snippets are illustrative; the full Python and Rust implementations are in the benchmark archive. I would not claim that the Rust versions are globally optimal, only that they are straightforward implementations of the same workload family. That matters because a benchmark can accidentally measure implementation quality as much as language choice.
The Python code is readable, flexible, and easy to adapt. It also runs once per row in Python. At five million rows, the convenience has a price.
Native Rust was 8.08x to 15.14x faster than the Python CSV loop for sessionization, depending on host. For streaming-score, Rust was 8.62x to 14.03x faster.

This is where Rust’s strengths match the workload. The hot path is custom logic. The state is explicit. The branches are unavoidable. The program does small amounts of work per record, many times, and cannot hide behind a vectorized kernel.
This kind of result can change architecture, but it should not be read as a production conversion rate. If a daily batch job spends two hours in custom Python event logic, a Rust prototype that is 10x faster on the isolated hot path is worth attention. The full production job may still spend time waiting on storage, moving data between systems, allocating memory, serializing records, or contending with other work on the same machine. If a streaming service is CPU-bound on per-user scoring, moving the hot loop into Rust may reduce machines, latency, or operational headroom. In those cases, the rewrite is not a matter of language taste. It attacks the part of the system where Python is actually doing the expensive work.
Hardware Changed The Details, Not The Pattern
The fastest machine depended on the task. The Apple M3 Max led the best Python Polars groupby, Python Polars join, and Rust Rayon numeric results. The native Fedora Ryzen AI Max+ 395 led the Rust native sessionization and streaming-score results.

The WSL2 run on the Ryzen AI Max+ 395 was close to native Fedora for the Rust loop workloads, but slower for Polars tasks and numeric parallel work. That suggests sensitivity to thread scheduling, filesystem behavior, CSV scanning paths, memory behavior, or runtime configuration.
This is another reason I do not like language-only claims. Once the benchmark becomes real enough to touch storage, threads, parsers, and operating systems, the neat ranking gets less neat. Hardware and software stack choices do not erase the broad pattern, but they can change the size of the win.
For production decisions, that means local measurement matters. If the candidate rewrite depends on a 2x improvement, you should be skeptical until it is measured in an environment close to the one that will run the job. If the candidate rewrite is attacking a Python loop and the prototype is already 10x faster, the decision has more room for ordinary engineering noise.
What I Would Do Differently On A Real Team
The practical takeaway is a diagnostic, not a preference for one language.
If the bottleneck is dataframe or query execution, I would first inspect the query plan, file format, projection pruning, partitioning, thread settings, and engine choice. A Python pipeline built around Polars, DuckDB, PyArrow, or another mature engine may already have the expensive work in native code. In that case, a Rust rewrite may mostly replace the wrapper.
If the bottleneck is numerical kernels, I would compare execution strategies before comparing languages. NumPy is a strong baseline. Rust with Rayon, Numba, JAX, PyTorch, SIMD-aware code, or GPU execution might win, but the winning factor may be vectorization, parallelism, memory layout, or accelerator use rather than Rust alone.
If the bottleneck is interpreter-bound custom logic, I would look much harder at Rust or another compiled path. Python is slow in specific ways: scalar loops, Python objects, per-row branches, string parsing, mutable dictionaries, dynamic dispatch, and state machines that update one record at a time. Sessionization, streaming state, parsing-heavy enrichment, fraud scoring, event normalization, and online feature generation are exactly the places where ordinary Python loops can become the cost center.
That is what the sessionization and streaming-score tasks showed. The Python versions were not slow because the file extension was .py. They were slow because millions of records passed through ordinary Python control flow. Rust helped because it changed the execution mode of the bottleneck.
The second axis is ecosystem maturity. Rust’s data ecosystem is growing quickly, and in some areas it is already excellent: parsers, command-line tools, services, streaming components, storage-adjacent code, concurrency-heavy systems, and libraries where predictable performance matters. I would not hesitate to consider Rust for those parts of a data platform.
Python’s advantage in machine learning and statistical computing is not just the number of packages on PyPI. It is the whole working loop around those packages. A data scientist can explore data in a notebook, fit a model with scikit-learn, inspect residuals with statsmodels, plot diagnostics, try a PyTorch baseline, export predictions, and send the result to a teammate who can usually run the same workflow. The APIs, tutorials, debugging habits, plotting tools, serialization formats, and deployment conventions are built around years of daily use.
That maturity matters when performance is only one part of the job. If a team needs mixed-effects models, survival analysis, experiment diagnostics, model monitoring, feature notebooks, or ad hoc visualization, the language with the faster inner loop may still make the surrounding work harder. Rust can be the right implementation language for a component without being the best center of gravity for the entire analytical workflow.
The Polars result is a useful reminder of this. Polars itself is rooted in Rust, but the Python-facing path was the fastest dataframe option in these measurements. That does not mean Python Polars is always faster than Rust Polars. It does mean that a language’s theoretical performance ceiling is not the same thing as the maturity of a particular library path. Defaults, feature flags, packaging, frontend behavior, CSV scanning, and common user workflows all matter.
That is why the hybrid option is often the most practical answer. A full rewrite changes hiring, debugging, packaging, deployment, and iteration speed. Moving one hot loop behind a stable interface is a smaller bet. It preserves Python where the ecosystem is productive and uses Rust where Rust changes the execution model of the bottleneck.
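As an illustration of that smaller bet, here is a minimal sketch of such a boundary using PyO3. The module name, function, and decay constant are hypothetical, and the signatures assume a recent PyO3 release.

```rust
use pyo3::prelude::*;

// Hypothetical hot loop exposed to Python: per-event scoring with time decay.
#[pyfunction]
fn score_events(timestamps: Vec<f64>, amounts: Vec<f64>) -> PyResult<f64> {
    let mut last_ts = f64::NAN;
    let mut score = 0.0;
    for (&ts, &amount) in timestamps.iter().zip(amounts.iter()) {
        if last_ts.is_finite() {
            score *= (-(ts - last_ts) / 300.0).exp(); // illustrative decay
        }
        score += amount;
        last_ts = ts;
    }
    Ok(score)
}

// Python keeps the surrounding workflow; only the bottleneck crosses the boundary.
#[pymodule]
fn hotpath(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(score_events, m)?)
}
```

From the Python side this is just `import hotpath` and a normal function call, which leaves notebooks, tests, and deployment conventions mostly unchanged.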
The Benchmark Does Not Settle The Question
Of course, this experiment has several limitations.
The data is synthetic. The runs measure elapsed time, not CPU utilization, memory bandwidth, allocation behavior, cache misses, or energy use. Each host contributed one suite run, even though each task/engine group had repeated timings. The benchmark compares concrete implementations, not all possible Python and Rust programs. A Python version using Numba, Cython, PyArrow, DuckDB, or a different Polars mode could change some results. A more tuned Rust implementation could change others.
Real pipelines add several sources of noise and delay that this benchmark does not model. They wait on object storage and databases. They run under memory pressure. They may hit garbage collection pauses, Python GIL interactions in threaded code, scheduler contention, logging overhead, network transfers, and conversion costs between Arrow, pandas, Polars, NumPy, JSON, and application objects. A loop benchmark that improves by 8x to 15x can still translate into a much smaller end-to-end win if that loop is only one slice of the production runtime.
The Polars comparison especially deserves restraint. Python Polars winning in these runs is real for this archive, but it should not be turned into a universal law about Rust Polars or dataframe execution. It is a prompt for better measurement and a reminder that ecosystem maturity shows up in performance too.
Still, the benchmark did answer the question that motivated it. A Python-to-Rust rewrite is several different moves wearing the same name.
If Python is dispatching work to a native library, the rewrite may mostly replace the wrapper. If Python is running the hot loop itself, the rewrite may replace the bottleneck. Those two situations feel similar when you look at the repository language. They are very different when you look at the CPU.
That is the take-home message I would use in practice: ask what is executing the expensive part of your program, and ask which ecosystem best supports the work around it. Keep Python where mature libraries and workflows dominate. Use Rust where the bottleneck is custom, stateful, low-level, service-like, or otherwise trapped in ordinary Python execution.
Do not choose a language first. Choose the execution model your bottleneck needs.