Picture this: You’re implementing a complex neural network attention mechanism, and what should be elegant mathematical operations have devolved into a maze of None indexing, cryptic axis parameters, and debugging sessions that last longer than your coffee stays warm. If you’ve been there, you’re not alone.

I recently read an article titled “I don’t like NumPy” that articulated some frustrations many of us have experienced when working with multi-dimensional matrices in Python. The author makes compelling points about the cognitive overhead of NumPy’s design choices, particularly when dealing with operations across multiple dimensions.

As someone who’s spent years working with various numerical computing frameworks, I want to share my thoughts on the issues that article raises, along with some context, alternatives, and potential solutions.


The NumPy Critique: Valid Pain Points

The author’s critique centers on several key issues that many NumPy users have encountered:

The Broadcasting Maze

NumPy’s broadcasting is elegant for simple cases but becomes bewildering for complex operations. The author illustrates this with a calculation involving three arrays of different dimensions:

# Mathematical expression: D_kn = (1/(L*M)) * Σ_{l,m} A_klm * B_ln * C_km
# What we wish we could write:
# D = mean_over_lm(A * B * C)

# What we actually have to write with NumPy broadcasting:
D = np.mean(
    np.mean(
        A[:,:,:,None] *      # Add dimension: (k,l,m) → (k,l,m,1)
        B[None,:,None,:] *   # Reshape: (l,n) → (1,l,1,n) 
        C[:,None,:,None],    # Reshape: (k,m) → (k,1,m,1)
    axis=1),  # Average over 'l' dimension
axis=1)       # Average over 'm' dimension (now at index 1)

Look at that code. Even with comments, it takes mental effort to verify it’s correct. The None insertions feel like incantations, and the axis parameters require you to track how dimensions shift after each operation. This isn’t elegant—it’s cognitive overhead that scales poorly.

The Linear Algebra Lottery

Simple operations become guessing games when you need to work with batches:

# This is clear and obvious
A = np.random.randn(3, 3)  # Single 3x3 matrix
x = np.random.randn(3)     # Single vector
y = np.linalg.solve(A, x)  # Solve A @ y = x for y

# But what about solving 100 different 3x3 systems?
A_batch = np.random.randn(100, 3, 3)  # 100 matrices
x_batch = np.random.randn(100, 3)     # 100 vectors

# Which one of these actually works?
y = np.linalg.solve(A_batch, x_batch)                  # Attempt 1
y = np.linalg.solve(A_batch, x_batch, axis=0)          # Attempt 2  
y = np.linalg.solve(A_batch, x_batch, axes=[[1,2], 1]) # Attempt 3
# ...after consulting the docs...
y = np.linalg.solve(A_batch, x_batch)  # (It was attempt 1, but how would you know?)

The documentation often leaves you guessing, and different functions handle batching inconsistently. At worst, what should be a natural extension of single-case operations becomes its own research project.
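One way to take the guesswork out, at least in this case, is to give the right-hand side an explicit trailing axis (so it is unambiguously a stack of column vectors) and then verify the vectorized call against a plain loop. A quick sketch, with illustrative shapes:

import numpy as np

A_batch = np.random.randn(100, 3, 3)   # 100 matrices
x_batch = np.random.randn(100, 3)      # 100 right-hand sides

# Unambiguous form: (..., 3, 3) solved against (..., 3, 1), then drop the last axis
y_vec = np.linalg.solve(A_batch, x_batch[..., None])[..., 0]

# Reference loop - slow, but impossible to misread
y_loop = np.stack([np.linalg.solve(A, x) for A, x in zip(A_batch, x_batch)])

assert np.allclose(y_vec, y_loop)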

The Indexing Roulette

Perhaps most insidiously, NumPy’s advanced indexing creates silent surprises that can lurk in your code for months:

# Set up a 4D array representing (batch, height, width, channels)
A = np.ones((10, 20, 30, 40))

# Select specific indices
i = np.array([1, 2, 3])      # height indices
j = np.array([[0], [1]])     # width indices (2D for broadcasting)

# These all do different things, and the differences are subtle:
B = A[:, i, j, :]     # Shape: (10, 2, 3, 40) - broadcasts i and j
C = A[:, :, i, j]     # Shape: (10, 20, 2, 3) - different broadcast
D = A[:, i, :, j]     # Shape: (2, 3, 10, 30) - dimensions reordered!
E = A[:, 1:4, j, :]   # Shape: (10, 3, 2, 1, 40) - extra dimension appears!

The last case is particularly evil—slicing and advanced indexing interact in ways that can introduce unexpected dimensions. These aren’t just academic curiosities; they’re bugs waiting to happen in production code.
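One defensive habit that helps: when you actually mean “every combination of these indices,” np.ix_ builds the open mesh explicitly and keeps the dimensions in the order you wrote them. A small sketch (note that np.ix_ wants 1-D index arrays):

import numpy as np

A = np.ones((10, 20, 30, 40))
ii = np.array([1, 2, 3])   # height indices
jj = np.array([0, 1])      # width indices, kept 1-D

G = A[np.ix_(np.arange(10), ii, np.arange(30), jj)]
print(G.shape)   # (10, 3, 30, 2) - no silent reordering, no surprise extra axes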

The Abstraction Breakdown

But the deepest problem is philosophical. NumPy breaks one of programming’s fundamental principles: you should be able to build complex operations from simple, reusable components.

Consider implementing attention mechanisms. The single-head version is clean:

def attention(X, W_q, W_k, W_v):
    """Clean, readable single-head attention"""
    d_k = W_q.shape[-1]
    
    # Linear transformations
    Q = X @ W_q  # Query: (seq_len, d_k)
    K = X @ W_k  # Key: (seq_len, d_k)  
    V = X @ W_v  # Value: (seq_len, d_v)
    
    # Attention computation
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len)
    attention_weights = softmax(scores, axis=-1)  # (seq_len, seq_len)
    
    return attention_weights @ V  # (seq_len, d_v)

Beautiful! The math maps directly to code. But now you need multi-head attention, and suddenly:

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """The same logic, but now unrecognizable"""
    seq_len, input_dim = X.shape
    num_heads, d_k = W_q.shape[0], W_q.shape[-1]
    
    # Can't reuse our clean attention function - must rewrite everything
    Q = np.einsum('si,hij->hsj', X, W_q)  # What happened to simple @?
    K = np.einsum('si,hik->hsk', X, W_k)  # Why do we need einsum wizardry?
    V = np.einsum('si,hiv->hsv', X, W_v)  # This used to be readable!
    
    # The core attention logic, now buried in tensor gymnastics
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    output = weights @ V
    
    # More gymnastics to get back to reasonable shape
    projected = np.einsum('hsv,hvd->hsd', output, W_o)
    return projected.transpose(1, 0, 2).reshape(seq_len, input_dim)

We’ve lost all the elegance. The clean mathematical operations are buried under dimension shuffling, and we can’t reuse our original attention function. This violates the DRY principle and creates maintenance headaches.
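One housekeeping note: both attention snippets call a softmax helper that isn’t shown. A minimal, numerically stable sketch matching the softmax(x, axis=...) signature used above could look like this:

import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the max for numerical stability."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)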


The Einstein Summation Secret Weapon

Before we explore alternatives to NumPy, let’s talk about a tool hiding in plain sight: np.einsum. Think of Einstein summation as NumPy’s secret language for expressing tensor operations clearly.

Why einsum Matters

Einstein summation notation eliminates much of the broadcasting confusion by making index relationships explicit:

# That confusing broadcasting example from earlier:
D_broadcasting = np.mean(np.mean(
    A[:,:,:,None] * B[None,:,None,:] * C[:,None,:,None], 
    axis=1), axis=1)

# With einsum - the intention is crystal clear:
D_einsum = np.einsum('klm,ln,km->kn', A, B, C) / (L * M)  # L, M: sizes of the summed dims
#                    ↑    ↑   ↑     ↑
#                    |    |   |     output indices
#                    |    |   C has indices k,m
#                    |    B has indices l,n  
#                    A has indices k,l,m

The einsum version explicitly shows which indices are summed over (l and m disappear from the output) and which are preserved (k and n remain). No mental gymnastics required.
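If you want to convince yourself the two spellings really agree, a quick check on small random arrays (sizes chosen arbitrarily here) settles it:

import numpy as np

K, L, M, N = 4, 5, 6, 7
A = np.random.randn(K, L, M)
B = np.random.randn(L, N)
C = np.random.randn(K, M)

D_broadcasting = np.mean(np.mean(
    A[:, :, :, None] * B[None, :, None, :] * C[:, None, :, None],
    axis=1), axis=1)
D_einsum = np.einsum('klm,ln,km->kn', A, B, C) / (L * M)

assert np.allclose(D_broadcasting, D_einsum)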

einsum Solves Indexing Confusion

Remember the attention mechanism that became unreadable with multiple heads?

def multi_head_attention_einsum(X, W_q, W_k, W_v, W_o):
    """Multi-head attention with clear einsum operations"""
    d_k = W_q.shape[-1]
    
    # Linear transformations - clear and explicit
    Q = np.einsum('si,hij->hsj', X, W_q)  # seq,input × head,input,dim → head,seq,dim
    K = np.einsum('si,hik->hsk', X, W_k)  # seq,input × head,input,dim → head,seq,dim
    V = np.einsum('si,hiv->hsv', X, W_v)  # seq,input × head,input,dim → head,seq,dim
    
    # Attention scores - matrix multiply per head
    scores = np.einsum('hsi,hti->hst', Q, K) / np.sqrt(d_k)  # head,seq,dim × head,seq,dim → head,seq,seq
    weights = softmax(scores, axis=-1)
    
    # Apply attention weights
    output = np.einsum('hst,htv->hsv', weights, V)  # head,seq,seq × head,seq,dim → head,seq,dim
    
    # Final projection and reshape
    projected = np.einsum('hsv,hvd->hsd', output, W_o)  # head,seq,dim × head,dim,out → head,seq,out
    return np.einsum('hsd->shd', projected).reshape(X.shape[0], -1)  # head,seq,out → seq,head*out

Each einsum operation explicitly states what it’s doing. While it requires learning the notation, once you do, complex tensor operations become much more readable than their broadcasting equivalents.
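If the notation is new to you, a handful of one-liners map familiar operations onto it (all standard NumPy, nothing exotic assumed):

import numpy as np

A = np.random.randn(4, 5)
B = np.random.randn(5, 6)
batch = np.random.randn(10, 4, 5)

np.einsum('ij,jk->ik', A, B)        # ordinary matrix multiply, same as A @ B
np.einsum('ij->ji', A)              # transpose
np.einsum('ii->', np.eye(4))        # trace: a repeated index sums the diagonal
np.einsum('bij,jk->bik', batch, B)  # batched matmul: 'b' is carried through untouched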


Learning from Other Languages: A World Beyond NumPy

The frustrations we’ve discussed aren’t inherent to array programming—they’re specific to NumPy’s design choices. Let’s see how other languages approach these challenges.

Julia: Multiple Dispatch and Clear Broadcasting

Julia was designed with numerical computing in mind, and it shows:

# Julia's explicit broadcasting with "." syntax - no guessing
A = rand(10, 20, 30)
B = rand(20, 40)  
C = rand(10, 30)

# The dot makes broadcasting explicit; reshape so the named sizes line up
using Statistics
D = dropdims(mean(reshape(A, 10, 20, 30, 1) .* reshape(B, 1, 20, 1, 40) .*
                  reshape(C, 10, 1, 30, 1); dims=(2, 3)); dims=(2, 3))

But Julia’s real superpower is that loops are fast, so you don’t need to vectorize everything:

# In Julia, this loop is as fast as vectorized operations
function batched_solve(A_batch, x_batch)
    n_batch = size(A_batch, 3)
    y = similar(x_batch)
    
    for i in 1:n_batch
        y[:, i] = A_batch[:, :, i] \ x_batch[:, i]  # Clean linear solve
    end
    
    return y
end

This eliminates much of the “vectorization pressure” that makes NumPy code cryptic.

MATLAB: Consistent Dimension Parameters

MATLAB takes a different approach with consistent dimension parameters:

% MATLAB's page functions handle batched operations clearly
A_batch = rand(3, 3, 100);  % 100 3x3 matrices
x_batch = rand(3, 100);     % 100 vectors

% Explicit batched solve - no guessing about syntax
y = pagemldivide(A_batch, x_batch);

% Or for element-wise reductions (assuming A, B, C have already been reshaped
% to broadcast-compatible sizes):
D = mean(mean(A .* B .* C, 2), 3);  % Clear, explicit dimension parameters

MATLAB’s page* family of functions makes batched linear algebra explicit and consistent.

R: Apply Family and Data Manipulation

R approaches array operations with its apply family of functions:

# R's apply family: keep dims 1 and 4 (k, n), average over dims 2 and 3 (l, m)
# (assumes the element-wise product has already been expanded to a k x l x m x n array)
D <- apply(A * B * C, c(1, 4), mean)

R also offers the tidyverse ecosystem, which completely changes the data manipulation paradigm:

# Tidyverse approach (assuming A, B, C are in appropriate format)
D <- data %>%
  group_by(k, n) %>%
  summarize(value = mean(a * b * c))

The clear, verb-based syntax makes operations more readable.


xarray: NumPy with Named Dimensions

Now we come to the most practical solution for Python users: xarray. Think of xarray as NumPy with a navigation system—instead of remembering which axis is which, you give dimensions meaningful names.

The Named Dimension Revolution

import xarray as xr

# Create arrays with meaningful dimension names
A = xr.DataArray(
    A_data, 
    dims=['batch', 'height', 'width'], 
    name='feature_maps'
)
B = xr.DataArray(
    B_data, 
    dims=['height', 'channels'], 
    name='weights'
)
C = xr.DataArray(
    C_data, 
    dims=['batch', 'width'], 
    name='biases'
)

# The calculation becomes self-documenting
D = (A * B * C).mean(dim=['height', 'width'])
#   └─ No more None indexing or axis juggling!

No more A[:,:,:,None] hieroglyphics. No more counting axes. The dimensions have names, and operations become self-documenting.

Batched Operations Made Clear

Remember the linear algebra lottery? xarray’s apply_ufunc makes batching explicit:

def batched_solve(A, x):
    """Solve Ax = b for batched matrices with clear semantics"""
    return xr.apply_ufunc(
        np.linalg.solve,
        A, x,
        input_core_dims=[['row', 'col'], ['row']],  # Which dims are the matrix/vector
        output_core_dims=[['col']],                 # What dims the output has
        vectorize=True                              # Apply to all other dimensions
    )

# Usage is now obvious:
A_batch = xr.DataArray(
    np.random.randn(100, 3, 3), 
    dims=['batch', 'row', 'col']
)
x_batch = xr.DataArray(
    np.random.randn(100, 3), 
    dims=['batch', 'row']
)

y = batched_solve(A_batch, x_batch)  # Returns array with dims ['batch', 'col']

The input_core_dims and output_core_dims explicitly specify which dimensions participate in the operation and which are batch dimensions. No more guessing.

Function Composition That Actually Works

Most importantly, xarray enables the abstraction that NumPy breaks. Remember our attention mechanism?

def attention_xarray(X, W_q, W_k, W_v):
    """Clean attention that works with named dimensions"""
    d_k = W_q.sizes['key_dim']
    
    # Linear transformations using named dimensions
    Q = xr.dot(X, W_q, dims=['feature'])  # Clear about which dim to contract
    K = xr.dot(X, W_k, dims=['feature'])
    V = xr.dot(X, W_v, dims=['feature'])
    
    # Give keys and values their own sequence name so the scores form a
    # (sequence x key_sequence) matrix instead of collapsing onto one axis
    K = K.rename({'sequence': 'key_sequence'})
    V = V.rename({'sequence': 'key_sequence'})
    
    # Attention computation: softmax over the key positions
    scores = xr.dot(Q, K, dims=['key_dim']) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max('key_sequence'))
    weights = weights / weights.sum('key_sequence')
    
    return xr.dot(weights, V, dims=['key_sequence'])

def multi_head_attention_xarray(X, W_q, W_k, W_v, W_o):
    """Multi-head attention that reuses the single-head function!"""
    # The weight arrays simply carry an extra 'head' dimension. Because every
    # operation above is written against dimension names, xarray broadcasts
    # 'head' straight through the single-head function - no rewriting needed.
    attention_output = attention_xarray(X, W_q, W_k, W_v)
    # attention_output now has 'head', 'sequence' and 'value_dim' dimensions
    
    # Final projection: contracting over value_dim and head is equivalent to
    # concatenating the heads and projecting with one large matrix
    return xr.dot(attention_output, W_o, dims=['value_dim', 'head'])

Notice something beautiful: we’re reusing our original attention_xarray function without modification. The abstraction works!
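To make that concrete, here is a sketch of how the inputs might be set up, with made-up sizes (16 tokens, 32 features, 4 heads, 8-dimensional keys and values); the dimension names are the only thing the functions actually care about:

import numpy as np
import xarray as xr

X   = xr.DataArray(np.random.randn(16, 32), dims=['sequence', 'feature'])
W_q = xr.DataArray(np.random.randn(4, 32, 8), dims=['head', 'feature', 'key_dim'])
W_k = xr.DataArray(np.random.randn(4, 32, 8), dims=['head', 'feature', 'key_dim'])
W_v = xr.DataArray(np.random.randn(4, 32, 8), dims=['head', 'feature', 'value_dim'])
W_o = xr.DataArray(np.random.randn(4, 8, 32), dims=['head', 'value_dim', 'feature'])

out = multi_head_attention_xarray(X, W_q, W_k, W_v, W_o)
print(out.sizes)   # sequence: 16, feature: 32 - the heads were handled by name, not position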


Practical Migration: You Don’t Need to Rewrite Everything

Here’s the key insight: you don’t need to throw away your entire NumPy codebase to benefit from better array programming. Smart migration is about identifying pain points and addressing them strategically.

The Hybrid Approach: NumPy + xarray

The most practical strategy is to use xarray for complex multi-dimensional operations while keeping NumPy for simple cases:

import numpy as np
import xarray as xr

def hybrid_data_processing_pipeline(raw_data):
    """Example of strategic NumPy + xarray usage"""
    
    # Simple preprocessing: stick with NumPy
    normalized = (raw_data - np.mean(raw_data)) / np.std(raw_data)
    
    # Complex multi-dimensional operations: switch to xarray
    data_xr = xr.DataArray(
        normalized,
        dims=['batch', 'height', 'width', 'channels'],
        coords={
            'batch': range(normalized.shape[0]),
            'height': range(normalized.shape[1]),
            'width': range(normalized.shape[2]),
            'channels': ['red', 'green', 'blue']
        }
    )
    
    # Complex spatial aggregations are clear with named dims
    spatial_features = data_xr.mean(dim=['height', 'width'])
    
    # Convert back to NumPy for final processing
    return spatial_features.values

# You can mix and match based on complexity
def process_attention_layers(embeddings, attention_weights):
    """Use xarray only where it adds clear value"""
    
    # Simple operations: NumPy is fine
    normalized_embeddings = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    
    # Complex broadcasting: switch to xarray
    emb_xr = xr.DataArray(
        normalized_embeddings,
        dims=['batch', 'key_seq', 'embedding']  # name the sequence axis to match the weights
    )
    weights_xr = xr.DataArray(
        attention_weights,
        dims=['batch', 'head', 'query_seq', 'key_seq']
    )
    
    # The complex operation becomes readable: contract over the key positions
    attended = xr.dot(weights_xr, emb_xr, dims=['key_seq'])
    
    return attended.values  # Back to NumPy for downstream code

Migration Strategy: Start Small

Phase 1: Identify Pain Points. Look for code with:

  • Multiple None indices for broadcasting
  • Complex axis parameter juggling
  • Comments explaining “which dimension is which”
  • Functions that can’t be easily reused for batched operations

Phase 2: Wrap, Don’t Replace. Create xarray wrappers for your most complex operations:

def complex_tensor_operation_numpy(A, B, C):
    """Original NumPy version - hard to understand"""
    return np.mean(np.mean(
        A[:,:,:,None] * B[None,:,None,:] * C[:,None,:,None], 
        axis=1), axis=1)

def complex_tensor_operation_clear(A, B, C):
    """xarray wrapper - self-documenting"""
    A_xr = xr.DataArray(A, dims=['k', 'l', 'm'])
    B_xr = xr.DataArray(B, dims=['l', 'n'])  
    C_xr = xr.DataArray(C, dims=['k', 'm'])
    
    result = (A_xr * B_xr * C_xr).mean(dim=['l', 'm'])
    return result.values  # Return NumPy array for compatibility
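
A quick sanity check during this kind of migration is to run both versions on random inputs with matching shapes (sizes below are arbitrary) and confirm they agree:

A = np.random.randn(4, 5, 6)   # dims (k, l, m)
B = np.random.randn(5, 7)      # dims (l, n)
C = np.random.randn(4, 6)      # dims (k, m)

assert np.allclose(
    complex_tensor_operation_numpy(A, B, C),
    complex_tensor_operation_clear(A, B, C)
)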

Phase 3: Expand Strategically. Gradually expand xarray usage to areas where it provides clear benefits, while keeping NumPy for simple operations.

Integration with Existing Libraries

Most scientific Python libraries work seamlessly with the hybrid approach:

import matplotlib.pyplot as plt
import pandas as pd
import xarray as xr
from sklearn.decomposition import PCA

def analysis_pipeline(data_array):
    """Seamless integration across the ecosystem"""
    
    # xarray for complex data manipulation
    data_xr = xr.DataArray(
        data_array,
        dims=['sample', 'time', 'channel', 'frequency']
    )
    
    # Complex aggregation with clear semantics
    time_averaged = data_xr.mean(dim='time')
    
    # Convert to NumPy for scikit-learn
    reshaped = time_averaged.stack(feature=['channel', 'frequency'])
    pca = PCA(n_components=10)
    components = pca.fit_transform(reshaped.values)
    
    # Convert to pandas for analysis
    df = pd.DataFrame(
        components, 
        columns=[f'PC{i+1}' for i in range(10)]
    )
    
    # Matplotlib works with everything
    plt.scatter(df['PC1'], df['PC2'])
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    
    return df

The key is using each tool for what it does best: xarray for complex array operations, NumPy for simple math, pandas for tabular data, and scikit-learn for machine learning.


Choosing Your Path Forward

So where does this leave us? I believe the answer depends on your situation (assuming you need to stay in the Python ecosystem rather than migrating to a different tech stack such as Julia):

Stick with NumPy When:

  • Your operations are primarily simple mathematical computations
  • Performance is critical and you’re working with well-optimized code
  • You’re working in a constrained environment where adding dependencies is difficult
  • Your team is deeply familiar with NumPy and the operations aren’t that complex

Consider xarray When:

  • You’re regularly dealing with 3+ dimensional arrays
  • You find yourself writing comments to explain which axis is which
  • You’re building functions that should work for both single cases and batched operations
  • You’re working with scientific data that has natural coordinate systems

The einsum Middle Ground:

  • You want to stay in NumPy but make complex operations clearer
  • You’re dealing with tensor operations that map naturally to summation notation
  • You need better performance than broadcasting but don’t want to add dependencies

“Red Flags” Checklist

  • More than two None indices in a single expression
  • Axis parameters whose meaning shifts after each operation
  • Lines you can’t easily explain to a colleague

The Future of Array Programming

The conversation sparked by “I don’t like NumPy” reflects a broader evolution in scientific computing. We’re moving away from the NumPy era where “vectorize everything” was the only path to performance, toward more expressive, composable systems.

Projects like JAX are bringing functional programming concepts to array operations. PyTorch and TensorFlow ship their own dedicated tensor types and operations (though they share some of the NumPy critiques discussed in this article). xarray is proving that named dimensions don’t have to sacrifice performance.

Even NumPy itself is evolving. The Array API Standard (NEP 47) aims to create consistency across array libraries, and I hope future versions will address some of the pain points we’ve discussed.


Conclusion: Code Should Match How You Think

The original critique of NumPy wasn’t really about NumPy—it was about the gap between mathematical thinking and code implementation. When you’re thinking “average this quantity over these dimensions,” you shouldn’t have to translate that into A[:,:,:,None] hieroglyphics.

The good news is that you have options. Whether it’s learning einsum for clearer tensor operations, adopting xarray for named dimensions, or exploring languages like Julia for fundamental improvements, you can write array code that matches how you think about the problem.

The best tool is the one that makes your intentions clear, your code maintainable, and your life easier. Sometimes that’s NumPy with careful broadcasting. Sometimes it’s xarray with named dimensions. Sometimes it’s einsum with explicit summation notation.

The key insight is this: you don’t have to accept confusing code as the price of performance. The array programming landscape is rich with alternatives, and the future promises even better tools. Until then, choose the approach that makes your code as clear as your mathematical thinking.