“It’s genuinely amazing that… these sorts of things can be extracted from a statistical analysis of a large body of text,” science fiction author Ted Chiang remarked in a 2023 Financial Times interview. But, in his view, that doesn’t make the tools intelligent. “Applied statistics is a far more precise descriptor, but no one wants to use that term, because it’s not as sexy.”

Chiang’s observation cuts to the heart of modern AI development: beneath the marketing hype and sensationalized headlines lies a sobering truth. These systems, however impressive their capabilities, remain fundamentally statistical models. And yes, that is exactly what we are going to explore in this article.

(This blog post is based on a talk I gave to in-house statisticians and analysts at Penn Medicine on April 15, 2025.)


1. Introduction

The explosion of large language models (LLMs) like ChatGPT, Claude, and their contemporaries has captivated public imagination, often blurring the line between technical reality and science fiction.

For readers trained in applied statistics, this presents both an opportunity and a challenge. Your familiarity with probabilistic modeling provides you with the perfect foundation to understand these systems, but the journey from traditional statistical models to modern LLMs requires bridging several conceptual gaps. This article aims to guide you through this journey, demonstrating how the statistical principles you already understand serve as the bedrock for even the most sophisticated text generation systems.

By tracing the evolution from Bayesian fundamentals to transformer-based architectures, we’ll illuminate how LLMs represent a natural—albeit dramatically scaled—extension of statistical modeling approaches. Rather than treating these systems as mysterious black boxes, we’ll demystify their inner workings through the lens of probability theory, helping you develop grounded, realistic expectations about their capabilities and limitations.


2. Foundations: Bayes’ Theorem

Our journey begins with Bayes’ theorem—the fundamental rule for updating probabilities given new evidence:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

This elegant formula allows us to reason about conditional probabilities by relating the probability of a hypothesis given observed evidence to the probability of that evidence given the hypothesis. In the context of text generation, we can reframe this as:

$$P(\text{next word}|\text{previous words}) = \frac{P(\text{previous words}|\text{next word})P(\text{next word})}{P(\text{previous words})}$$

This Bayesian perspective enables us to view language modeling as a conditional probability estimation problem, where we’re trying to predict the next element in a sequence given what has come before.

# Bayesian update for word probabilities
def bayes_update(prior_word_probs, likelihood_given_context, context_prob):
    posterior_probs = {}
    for word in prior_word_probs:
        # P(word|context) = P(context|word) * P(word) / P(context)
        posterior_probs[word] = (likelihood_given_context[word] 
                                * prior_word_probs[word]) / context_prob
    return posterior_probs
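
As a quick sanity check, here is a toy run of the update above. The candidate words, priors, and likelihoods are entirely made up for illustration:

# Toy example with hypothetical numbers for two candidate words
prior = {"cat": 0.6, "dog": 0.4}            # P(word)
likelihood = {"cat": 0.2, "dog": 0.05}      # P(context|word)

# P(context) = sum over words of P(context|word) * P(word) = 0.14
context_prob = sum(likelihood[w] * prior[w] for w in prior)

posterior = bayes_update(prior, likelihood, context_prob)
# posterior ≈ {"cat": 0.857, "dog": 0.143}: the evidence shifts mass toward "cat"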

This probabilistic foundation remains central to language models of all complexities, from simple n-gram models to transformer-based architectures with billions of parameters.


3. Naive Bayes and Text Classification

Moving from theory to application, we encounter Naive Bayes classifiers—one of the earliest successful applications of Bayesian reasoning to text processing. For text classification, Naive Bayes assumes conditional independence between features (words), which simplifies our calculation:

$$P(C|D) \propto P(C) \prod_{i=1}^n P(w_i|C)$$

Where $C$ is a class (e.g., spam or not spam), $D$ is a document, and $w_i$ are the words in the document.

import math
import re

SMOOTHING_CONSTANT = 1e-10  # fallback probability for words unseen in a class

def tokenize(text):
    # Simple lowercase word tokenizer; real systems use more careful tokenization
    return re.findall(r"[a-z']+", text.lower())

# Naive Bayes classifier for text
class NaiveBayesTextClassifier:
    def __init__(self):
        self.class_probs = {}  # P(C)
        self.word_given_class_probs = {}  # P(word|C)
    
    def train(self, documents, classes):
        # Calculate P(C) and P(word|C)
        for doc, doc_class in zip(documents, classes):
            # Update class probabilities
            if doc_class not in self.class_probs:
                self.class_probs[doc_class] = 0
                self.word_given_class_probs[doc_class] = {}
            self.class_probs[doc_class] += 1
            
            # Update word probabilities for this class
            for word in tokenize(doc):
                if word not in self.word_given_class_probs[doc_class]:
                    self.word_given_class_probs[doc_class][word] = 0
                self.word_given_class_probs[doc_class][word] += 1
        
        # Normalize probabilities
        total_docs = len(documents)
        for c in self.class_probs:
            self.class_probs[c] /= total_docs
            total_words = sum(self.word_given_class_probs[c].values())
            for word in self.word_given_class_probs[c]:
                self.word_given_class_probs[c][word] /= total_words
    
    def predict(self, document):
        scores = {}
        words = tokenize(document)
        
        for c in self.class_probs:
            # Start with log P(C)
            scores[c] = math.log(self.class_probs[c])
            
            # Add log P(word|C) for each word
            for word in words:
                if word in self.word_given_class_probs[c]:
                    scores[c] += math.log(self.word_given_class_probs[c][word])
                else:
                    # Handle unknown words with smoothing
                    scores[c] += math.log(SMOOTHING_CONSTANT)
        
        # Return class with highest score
        return max(scores, key=scores.get)
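
To make the classifier concrete, here is a minimal usage sketch on a tiny, invented spam/ham corpus (the documents and labels are hypothetical, and the tokenize helper and SMOOTHING_CONSTANT are the ones defined above):

# Hypothetical toy corpus, for illustration only
docs = [
    "win a free prize now",
    "limited offer win cash",
    "meeting agenda for monday",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesTextClassifier()
clf.train(docs, labels)

print(clf.predict("free cash prize"))        # expected: "spam"
print(clf.predict("monday meeting report"))  # expected: "ham"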

While effective for classification, Naive Bayes has a critical limitation for language modeling: it disregards word order and sequential dependencies. A model generating text based solely on word frequencies would produce incoherent outputs, as grammatical structure depends heavily on the ordering and relationships between words.


4. From Classification to Language Modeling

To address the limitations of bag-of-words approaches, we turn to n-gram models, which capture local sequential dependencies by modeling the probability of a word given its n-1 preceding words:

$$P(w_i|w_{i-n+1}^{i-1}) = \frac{\text{count}(w_{i-n+1}^{i})}{\text{count}(w_{i-n+1}^{i-1})}$$

This represents a significant shift toward genuine language modeling, as we’re now predicting sequences rather than isolated categories.

import random

# N-gram language model
class NGramLanguageModel:
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = {}
        self.context_counts = {}
        self.vocabulary = set()
    
    def train(self, corpus):
        # Pad with n-1 start tokens (so early words have full context) and add an end token
        processed_corpus = ["<START>"] * (self.n - 1) + corpus + ["<END>"]
        
        # Count n-grams and contexts
        for i in range(len(processed_corpus) - self.n + 1):
            ngram = tuple(processed_corpus[i:i+self.n])
            context = tuple(processed_corpus[i:i+self.n-1])
            
            if context not in self.context_counts:
                self.context_counts[context] = 0
            self.context_counts[context] += 1
            
            if ngram not in self.ngram_counts:
                self.ngram_counts[ngram] = 0
            self.ngram_counts[ngram] += 1
            
            self.vocabulary.add(processed_corpus[i+self.n-1])
    
    def probability(self, word, context):
        # Convert context to tuple if it's a list
        if isinstance(context, list):
            context = tuple(context)
        
        # Apply smoothing to handle unseen n-grams
        if context not in self.context_counts:
            return 1 / len(self.vocabulary)  # Uniform distribution
        
        ngram = context + (word,)
        count = self.ngram_counts.get(ngram, 0) + 1  # Add-one smoothing
        context_count = self.context_counts[context] + len(self.vocabulary)
        
        return count / context_count
    
    def generate(self, length=20):
        # Start with start token
        text = ["<START>"]
        
        # Generate words until we hit the end token or length limit
        while len(text) < length and text[-1] != "<END>":
            # Get context (last n-1 words)
            context = tuple(text[-(self.n-1):]) if len(text) >= self.n-1 else tuple(["<START>"] * (self.n-1 - len(text)) + text)
            
            # Calculate probabilities for each word in vocabulary
            probabilities = [self.probability(word, context) for word in self.vocabulary]
            
            # Sample next word based on probabilities
            next_word = random.choices(list(self.vocabulary), weights=probabilities)[0]
            text.append(next_word)
        
        # Remove special tokens and return
        return [word for word in text if word not in ["<START>", "<END>"]]
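
A quick usage sketch on a tiny, invented token list shows the model in action (real n-gram models are estimated on corpora with millions of tokens):

# Hypothetical toy corpus: a pre-tokenized list of words
corpus = "the cat sat on the mat the dog sat on the rug".split()

bigram_model = NGramLanguageModel(n=2)
bigram_model.train(corpus)

print(bigram_model.probability("sat", ["cat"]))     # smoothed estimate of P(sat | cat)
print(" ".join(bigram_model.generate(length=10)))   # a short sampled word sequence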

N-gram models can be framed as Markov chains where the next state (word) depends only on the previous n-1 states. While this approach captures more linguistic structure than Naive Bayes, it still struggles with long-range dependencies and data sparsity—as n increases, the number of possible n-grams grows exponentially, leading to insufficient data for reliable probability estimates.


5. Next Token Prediction

The fundamental task of language modeling can be formalized as next token prediction: given a sequence of tokens $w_1, w_2, ..., w_{t-1}$, predict the next token $w_t$. This frames text generation as a sequence of probabilistic predictions:

$$P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1})$$

This auto-regressive approach is at the heart of modern language models, which generate text one token at a time based on all previous tokens. The quality of a language model is typically evaluated using perplexity, which is derived from the cross-entropy loss:

$$\text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i|w_1,...,w_{i-1})}$$

Lower perplexity indicates a model that assigns higher probability to the correct next token, suggesting better predictive performance.

import math

# Next token prediction with log-likelihood calculation
def evaluate_language_model(model, test_corpus):
    log_likelihood = 0
    token_count = 0
    
    for document in test_corpus:
        tokens = tokenize(document)
        for i in range(1, len(tokens)):
            # Context is all previous tokens
            context = tokens[:i]
            next_token = tokens[i]
            
            # Get probability of next token given context
            prob = model.probability(next_token, context)
            
            # Add log probability
            log_likelihood += math.log2(prob)
            token_count += 1
    
    # Calculate average log-likelihood
    avg_log_likelihood = log_likelihood / token_count
    
    # Perplexity is 2^(-avg_log_likelihood)
    perplexity = 2 ** (-avg_log_likelihood)
    
    return {
        "avg_log_likelihood": avg_log_likelihood,
        "perplexity": perplexity
    }

While traditional n-gram models represent words as discrete symbols, modern approaches transition to continuous vector representations (embeddings), enabling more nuanced modeling of semantic relationships and mitigating the curse of dimensionality.
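
To make the idea of continuous representations concrete, here is a minimal sketch using invented 4-dimensional vectors; real embeddings are learned from data and have hundreds or thousands of dimensions, but the geometry works the same way:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: near 1 = similar direction, near 0 = unrelated
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings; the numbers are invented for illustration only
embeddings = {
    "cat": np.array([0.8, 0.1, 0.7, 0.2]),
    "dog": np.array([0.7, 0.2, 0.8, 0.1]),
    "car": np.array([0.1, 0.9, 0.1, 0.8]),
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: semantically related
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low: semantically distant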


6. Neural Language Models Evolution

The transition to neural language models marked a paradigm shift in text generation. For the purposes of our journey, it is enough to think of neural networks as stacks of nested linear models joined by nonlinear activations, with parameters fit by gradient descent-based optimizers.

Recurrent Neural Networks (RNNs) were the first major architecture to address the limitations of n-gram models by maintaining a hidden state that could, in theory, capture information from arbitrary distances in the sequence:

import numpy as np

# Simple RNN for language modeling
class SimpleRNN:
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        self.embedding = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.W_xh = np.random.randn(embedding_dim, hidden_dim) * 0.01
        self.W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.01
        self.W_hy = np.random.randn(hidden_dim, vocab_size) * 0.01
        self.b_h = np.zeros((1, hidden_dim))
        self.b_y = np.zeros((1, vocab_size))
        
    def forward(self, inputs, h_prev=None):
        # Initialize hidden state if not provided
        if h_prev is None:
            h_prev = np.zeros((1, self.W_hh.shape[0]))
        
        # Get word embeddings for input tokens
        x = self.embedding[inputs]
        
        # Initialize outputs
        h, y_pred, probs = {}, {}, {}
        h[-1] = h_prev
        
        # Forward pass
        for t in range(len(inputs)):
            # Update hidden state
            h[t] = np.tanh(np.dot(x[t], self.W_xh) + np.dot(h[t-1], self.W_hh) + self.b_h)
            
            # Compute output (unnormalized log probabilities)
            y_pred[t] = np.dot(h[t], self.W_hy) + self.b_y
            
            # Apply softmax to get probabilities
            probs[t] = np.exp(y_pred[t]) / np.sum(np.exp(y_pred[t]))
        
        return h, y_pred, probs

Despite their theoretical capacity for capturing long-range dependencies, RNNs suffered from the vanishing gradient problem, limiting their practical ability to model long sequences. Enter transformers, which revolutionized NLP by introducing the self-attention mechanism.

Transformers process entire sequences in parallel rather than sequentially, using attention mechanisms to model dependencies between tokens regardless of their distance from each other. From a statistical perspective, self-attention can be viewed as a dynamic, learned weighting system that estimates the conditional relevance of each token given all other tokens in the sequence.

Statistical Interpretation of Self-Attention

Self-attention computes three projections from each token embedding: queries (Q), keys (K), and values (V). For statisticians, these can be interpreted as:

  1. Queries (Q): Representations of the “current” token that express what information it’s looking for
  2. Keys (K): Representations of all tokens that indicate what information they contain
  3. Values (V): The actual information content of each token

The attention weights between tokens i and j can be understood as estimating a conditional probability distribution: how relevant is token j’s information given token i’s information needs?

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The softmax normalization ensures these weights sum to 1, creating a proper probability distribution. The scaling factor $\sqrt{d_k}$ stabilizes gradients, similar to variance normalization in statistical modeling.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    exps = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exps / np.sum(exps, axis=axis, keepdims=True)

# Simplified self-attention mechanism with statistical interpretation
def self_attention(query, key, value, mask=None):
    # Calculate attention scores - dot product similarity
    # Statistically: measuring compatibility between token i's query and token j's key
    attention_scores = np.dot(query, key.T) / np.sqrt(key.shape[1])
    
    # Apply mask (optional) - used in decoder to prevent attending to future tokens
    if mask is not None:
        attention_scores = attention_scores + mask * -1e9
    
    # Apply softmax to get attention weights
    # Statistically: creating a probability distribution over all tokens
    # for each position - P(relevance_j | token_i)
    attention_weights = softmax(attention_scores, axis=-1)
    
    # Calculate weighted sum
    # Statistically: expected value of values, weighted by relevance probabilities
    output = np.dot(attention_weights, value)
    
    return output, attention_weights

# Transformer encoder layer (schematic: MultiHeadAttention, FeedForwardNetwork,
# LayerNormalization, and Dropout stand for standard building blocks defined elsewhere)
class TransformerEncoderLayer:
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForwardNetwork(d_model, dff)
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)
    
    def forward(self, x, training, mask=None):
        # Multi-head attention
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)
        
        # Feed forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)
        
        return out2

From a statistical perspective, attention weights represent learned conditional probabilities—the relationship between each token and every other token in the sequence. This allows transformers to model complex dependencies without the sequential constraints of RNNs.


7. Modern Language Models

The evolution from basic transformer architectures to modern LLMs like GPT and Claude has been driven primarily by scale: larger models, more data, and more compute. However, the underlying statistical principles remain unchanged—at their core, these models are still performing next-token prediction based on learned patterns in their training data.

The training process of modern LLMs typically involves three stages:

  1. Pre-training: The model learns general language patterns through self-supervised learning on vast corpora.
  2. Supervised Fine-Tuning (SFT): The model is trained on high-quality examples of desired behaviors, using the same next-token objective as pre-training (see the sketch after this list).
  3. Reinforcement Learning from Human Feedback (RLHF): The model is further refined based on human preferences.
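
The first two stages both minimize a next-token cross-entropy loss; the sketch below shows that shared objective with toy tensors standing in for a real model's outputs (the shapes and values are illustrative). The third stage is sketched in the simplified RLHF pipeline that follows.

import torch
import torch.nn.functional as F

def next_token_loss(logits, targets):
    # logits:  (batch, seq_len, vocab_size) unnormalized scores from the model
    # targets: (batch, seq_len) token ids, already shifted one position to the right
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten all positions
        targets.reshape(-1),                  # flatten the corresponding next tokens
    )

# Toy shapes for illustration; real vocabularies are tens of thousands of tokens
batch, seq_len, vocab_size = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch, seq_len))
loss = next_token_loss(logits, targets)  # scalar; lower means better next-token prediction
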
import copy
import torch

# Hyperparameters for the sketch; the values are illustrative
NUM_EPOCHS = 3
KL_COEF = 0.1
CLIP_EPSILON = 0.2

# Simplified RLHF pipeline (assumes torch-style models exposing generate() and log_prob())
def train_with_rlhf(sft_model, reward_model, dataset):
    # Initialize policy model (π) as a copy of SFT model
    policy_model = copy.deepcopy(sft_model)
    optimizer = torch.optim.Adam(policy_model.parameters(), lr=1e-5)
    
    # Initialize reference model (πref) - frozen copy of SFT model
    reference_model = copy.deepcopy(sft_model)
    for param in reference_model.parameters():
        param.requires_grad_(False)
    
    # PPO optimization loop
    for epoch in range(NUM_EPOCHS):
        for batch in dataset:
            prompts = batch["prompts"]
            
            # Sample responses from current policy
            responses = policy_model.generate(prompts)
            
            # Evaluate rewards using reward model
            rewards = reward_model(prompts, responses)
            
            # Calculate log probabilities
            policy_log_probs = policy_model.log_prob(responses, prompts)
            ref_log_probs = reference_model.log_prob(responses, prompts)
            
            # Calculate KL penalty from reference model
            kl_penalty = policy_log_probs - ref_log_probs
            
            # Calculate PPO advantage (reward - KL penalty)
            advantage = rewards - KL_COEF * kl_penalty
            
            # Calculate PPO loss (simplified: the frozen reference model stands in for
            # the old sampling policy when forming the probability ratio)
            ratio = torch.exp(policy_log_probs - ref_log_probs)
            clipped_ratio = torch.clamp(ratio, 1-CLIP_EPSILON, 1+CLIP_EPSILON)
            ppo_loss = -torch.min(ratio * advantage, clipped_ratio * advantage).mean()
            
            # Update policy model
            optimizer.zero_grad()
            ppo_loss.backward()
            optimizer.step()
    
    return policy_model

The generation process in modern LLMs introduces its own statistical nuances, particularly in sampling methods:

  1. Temperature Sampling: Controls randomness by scaling logits before softmax:

    import numpy as np

    def softmax(x):
        # Numerically stable softmax over a 1-D array of logits
        exps = np.exp(x - np.max(x))
        return exps / np.sum(exps)

    # Temperature sampling
    def sample_with_temperature(logits, temperature=1.0):
        # Scale logits by temperature
        scaled_logits = logits / temperature
    
        # Apply softmax to get probabilities
        probs = softmax(scaled_logits)
    
        # Sample from distribution
        return np.random.choice(len(probs), p=probs)
    
  2. Top-k Sampling: Restricts sampling to the k most likely tokens:

    # Top-k sampling
    def sample_top_k(logits, k=40):
        # Get indices of top k values
        top_k_indices = np.argsort(logits)[-k:]
    
        # Send everything below the top k to -inf so those tokens
        # receive zero probability after the softmax
        masked_logits = np.full_like(logits, -np.inf)
        masked_logits[top_k_indices] = logits[top_k_indices]

        # Apply softmax to get probabilities over the top-k tokens only
        probs = softmax(masked_logits)
    
        # Sample from distribution
        return np.random.choice(len(probs), p=probs)
    
  3. Nucleus (Top-p) Sampling: Samples from the smallest set of tokens whose cumulative probability exceeds p:

    # Nucleus (top-p) sampling
    def sample_top_p(logits, p=0.9):
        # Sort logits in descending order
        sorted_indices = np.argsort(logits)[::-1]
        sorted_logits = logits[sorted_indices]
    
        # Calculate softmax probabilities
        sorted_probs = softmax(sorted_logits)
    
        # Calculate cumulative probabilities
        cumulative_probs = np.cumsum(sorted_probs)
    
        # Find cutoff index where cumulative probability exceeds p
        cutoff_idx = np.where(cumulative_probs > p)[0][0]
    
        # Send tokens outside the nucleus to -inf so they receive
        # zero probability after the softmax
        masked_logits = np.full_like(logits, -np.inf)
        keep = sorted_indices[:cutoff_idx + 1]
        masked_logits[keep] = logits[keep]

        # Apply softmax to get probabilities over the nucleus only
        probs = softmax(masked_logits)
    
        # Sample from distribution
        return np.random.choice(len(probs), p=probs)
    

These sampling methods provide statistical control over the trade-off between deterministic coherence and creative diversity in generated text.
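
As a small demonstration of that trade-off, the sketch below applies the samplers defined above to the same hypothetical five-token logits vector (it reuses the numpy import and softmax helper from the temperature-sampling snippet):

# Hypothetical logits over a 5-token vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])

print(softmax(logits / 0.5))   # low temperature: sharp, near-greedy distribution
print(softmax(logits / 1.5))   # high temperature: flatter, more diverse distribution

print(sample_with_temperature(logits, temperature=0.7))  # index of the sampled token
print(sample_top_k(logits, k=3))                         # sampled from the 3 most likely tokens
print(sample_top_p(logits, p=0.9))                       # sampled from the smallest set covering 90%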

That’s it. We have finally explored what is going on under the hood of modern LLMs. While the latest state-of-the-art models and products come with a few more innovations, the current generation of autoregressive generative AI is built on what we’ve covered so far.


8. Limitations and Challenges

Despite their impressive capabilities, modern LLMs face fundamental limitations rooted in their statistical nature:

  1. Context Window Constraints: LLMs can only process a finite number of tokens, limiting their ability to reference information outside this window.

  2. Static Knowledge Cutoff: These models have fixed knowledge from their training data and cannot learn dynamically during deployment.

  3. Statistical “Knowledge”: What appears as factual knowledge is actually statistical pattern recognition, not true understanding.

  4. Hallucination: The statistical nature of these models leads to confident generation of false information when patterns are extrapolated incorrectly. Common hallucination patterns include: (1) High confidence on rare/unseen combinations, (2) Plausible-sounding but false specifics, and (3) Contradiction with earlier generated content.

Evaluation remains a significant challenge. Traditional metrics each capture different aspects of model performance:

  1. Perplexity: The standard statistical measure of how well a probability model predicts a sample. Lower perplexity indicates better prediction of the next token, but correlates poorly with human judgments of overall quality.

  2. BLEU/ROUGE: These n-gram overlap metrics from machine translation can be interpreted as measuring the statistical similarity between generated and reference texts. While useful for specific tasks, they often fail to capture semantic correctness.

  3. Human-aligned metrics: Newer benchmarks like HELM (Holistic Evaluation of Language Models) attempt to systematically measure alignment with human preferences across dimensions like truthfulness, toxicity, and bias. These can be understood as estimating joint distributions across multiple evaluation criteria.

This diversity of metrics creates a fundamental tension in the development cycle, as optimization targets may not align with actual desired outcomes. From a statistical perspective, this highlights the challenge of defining appropriate loss functions when the true objective function (human satisfaction) is complex and not directly measurable.


9. Ongoing Research and Developments

Going forward, both academia and industry have identified several promising directions for addressing the known limitations of LLMs:

  1. Retrieval-Augmented Generation (RAG): Enhancing LLMs with external knowledge retrieval systems to improve factual accuracy and reduce hallucination (a minimal sketch follows this list).

  2. Tool-Use and Agentic Behavior: Enabling models to interact with external tools and APIs to overcome the limitations of a static knowledge base.

  3. Multi-Modal Models: Integrating text, vision, and potentially other modalities to build richer representations of the world.

  4. Improved Alignment Techniques: Moving beyond RLHF to more sophisticated approaches for aligning model behavior with human values and intentions.
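
To ground the RAG idea, here is a deliberately minimal sketch: documents are embedded, the most relevant ones are retrieved by cosine similarity, and the retrieved text is prepended to the prompt. The embed_fn and generate_fn callables are hypothetical placeholders for a real embedding model and a real LLM, not any particular library's API:

import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Rank documents by cosine similarity to the query embedding and keep the top k
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in top]

def rag_answer(question, docs, embed_fn, generate_fn, k=3):
    # Minimal retrieval-augmented generation loop with placeholder models
    doc_vecs = np.stack([embed_fn(d) for d in docs])  # embed the knowledge base
    context_docs = retrieve(embed_fn(question), doc_vecs, docs, k)
    prompt = "Answer using only the context below.\n\nContext:\n"
    prompt += "\n".join(context_docs) + "\n\nQuestion: " + question
    return generate_fn(prompt)  # the LLM now conditions on retrieved evidence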

From a statistical perspective, these advancements represent attempts to extend the model’s probabilistic framework beyond pure pattern recognition in text, incorporating structured knowledge and causal reasoning.


10. Conclusion

The journey from Bayes to ChatGPT reveals a fundamental continuity in statistical thinking. Modern LLMs, despite their enormous scale and complexity, remain fundamentally probabilistic models that capture statistical patterns in language. Their “intelligence” emerges not from true understanding but from sophisticated pattern recognition, applied at unprecedented scale.

For statisticians, this perspective offers both insight and opportunity. Your expertise in probabilistic modeling provides a solid foundation for understanding these systems, while the limitations of current approaches highlight the need for innovative statistical techniques that can bridge the gap between pattern recognition and genuine understanding.

As we continue to develop and deploy these powerful tools, maintaining this statistical grounding will be essential—not only for technical advancement but also for responsible governance and realistic public expectations. The next chapter in this statistical journey awaits your contributions.


Further Resources

Below is a short curated list of foundational literature for interested readers:

  • “Speech and Language Processing” by Jurafsky & Martin
  • “Attention Is All You Need” (Vaswani et al., 2017)
  • “Language Models are Few-Shot Learners” (Brown et al., 2020)