Introduction

DeepSeek has recently generated buzz across the AI community, especially around its R1 model, which has stirred both excitement and concern over data transparency and security. From my own experiments with inference-time scaling, good reasoning models depend heavily on the capabilities of their foundation model. In this respect, DeepSeek‑V3, the heart of R1, deserves careful review. It is a solid example of how modern LLM research builds cumulatively on past work. Rather than a radical departure, DeepSeek‑V3 is the product of incremental progress, integrating efficient attention mechanisms, advanced mixture‑of‑experts (MoE) designs, multi‑token prediction (MTP), and low‑precision training.

In this two‑part series, Part 1 focuses on DeepSeek‑V3’s technical innovations and their roots in prior research, while Part 2 will examine how the R1 model builds further on this foundation.

It is worth noting that, thanks to its transparent documentation and open-source release, understanding DeepSeek‑V3 offers a practical way to catch up with modern LLM developments. For those interested in diving deeper, a full APA‑style reference list with links is provided at the end.

Disclaimer: Despite efforts to remain fair and balanced, some residual bias may remain in these views.


1. Multi‑Head Latent Attention (MLA)

Traditional multi‑head attention (Vaswani et al., 2017) demands large key–value caches that can be inefficient for long sequences. DeepSeek‑V3’s MLA innovates by compressing these keys and values into low‑rank latent vectors, similar to techniques explored in Linformer (Wang et al., 2020) and Performer (Choromanski et al., 2020).

The model projects the attention inputs through a learned compression (down‑projection) matrix to obtain a compact latent representation, from which keys and values are reconstructed when needed. This latent can be viewed as a “summary” of the full attention data—capturing the essential features without storing every detail.

Below is a simplified diagram illustrating the process:

graph TD
    A[Input Token Embedding]
    B[Generate Key/Value Vectors]
    C[Apply Compression Matrix]
    D[Obtain Latent Representation]
    E[Use in Multi‑Head Attention]
    A --> B
    B --> C
    C --> D
    D --> E
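
To make the compression step concrete, here is a minimal numpy sketch of the low-rank down-projection and up-projection idea. The matrix names, the toy dimensions, and the single shared latent for keys and values are illustrative assumptions of mine, not DeepSeek‑V3’s exact parameterization.

import numpy as np

def mla_compress_kv(hidden_states, W_down_kv, W_up_k, W_up_v):
    # Down-project each token's hidden state to a small latent vector;
    # only this latent needs to be kept in the KV cache
    latent_kv = hidden_states @ W_down_kv          # [seq_len, d_latent]
    # Up-project the cached latent back to keys and values when attending
    keys = latent_kv @ W_up_k                      # [seq_len, d_kv]
    values = latent_kv @ W_up_v                    # [seq_len, d_kv]
    return latent_kv, keys, values

# Toy usage with made-up sizes: the cache stores a 10 x 8 latent instead of two 10 x 64 tensors
rng = np.random.default_rng(0)
d_model, d_latent, d_kv, seq_len = 64, 8, 64, 10
hidden = rng.standard_normal((seq_len, d_model))
W_down = rng.standard_normal((d_model, d_latent)) * 0.1
W_k = rng.standard_normal((d_latent, d_kv)) * 0.1
W_v = rng.standard_normal((d_latent, d_kv)) * 0.1
latent, K, V = mla_compress_kv(hidden, W_down, W_k, W_v)
print(latent.shape, K.shape, V.shape)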

2. Mixture‑of‑Experts (MoE) with Auxiliary‑Loss‑Free Load Balancing

MoE architectures were initially introduced by Shazeer et al. (2017) and refined in models like the Switch Transformer (Fedus et al., 2021). In a typical MoE, a gating network selects a small subset of expert networks for processing each input. Traditional approaches often add an auxiliary loss to encourage balanced expert usage. DeepSeek‑V3, however, avoids extra loss terms by dynamically adjusting bias values during routing.

The dynamic bias adjustment works as follows: after computing the affinity scores for each expert, a bias term (which is updated based on the load of each expert) is added to these scores before selecting the top‑K experts. This ensures a more natural balancing without penalizing the model with an additional loss.

Below is an illustrative (and simplified) sketch of dynamic bias adjustment in MoE routing:

import numpy as np

def route_token(token_representation, expert_centroids, bias_terms, K, update_rate=0.001):
    # Affinity score of the token for each expert (simple dot-product gating)
    affinity_scores = expert_centroids @ token_representation

    # Add the dynamic bias to each score; the bias influences routing only
    # and is not used to weight the experts' outputs
    biased_scores = affinity_scores + bias_terms

    # Select the top-K experts based on the biased scores
    selected_indices = np.argsort(biased_scores)[-K:]

    # Update bias terms based on load: experts that were just selected
    # (and so carry more load) get a lower bias, idle experts a higher one
    for idx in range(len(bias_terms)):
        if idx in selected_indices:
            bias_terms[idx] -= update_rate
        else:
            bias_terms[idx] += update_rate

    return selected_indices, bias_terms
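
As a hypothetical usage sketch (the expert count, dimensions, and update rate are invented for illustration), routing a stream of random tokens shows the bias terms nudging expert loads toward a more even split than fixed-zero biases would:

rng = np.random.default_rng(42)
n_experts, d_model = 8, 32
expert_centroids = rng.standard_normal((n_experts, d_model))
bias = np.zeros(n_experts)

counts = np.zeros(n_experts)
for _ in range(1000):
    token = rng.standard_normal(d_model)
    chosen, bias = route_token(token, expert_centroids, bias, K=2, update_rate=0.01)
    counts[chosen] += 1
print(counts)  # loads drift toward balance as biases compensate for popular experts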

Analogy: Imagine a restaurant kitchen where the head chef dynamically assigns orders based on each chef’s current load. Instead of enforcing strict quotas (auxiliary loss), the head chef adjusts assignments in real time so that no one is overburdened.


3. Multi‑Token Prediction (MTP) Objective

While conventional LLMs are trained to predict only the next token, DeepSeek‑V3 employs an MTP objective that trains the model to predict several future tokens at each position within a single forward pass. This approach builds on ideas from chain‑of‑thought prompting (Wei et al., 2022) and speculative decoding (Xia et al., 2023).

At each position, instead of supervising only the next token, the model produces a short chain of predictions for the following tokens while preserving the causal order. This helps the model “plan ahead” and densifies the training signal.

See the pseudocode below for an illustration of MTP:

def multi_token_prediction(model, input_sequence, prediction_depth):
    # Simplified sketch: in DeepSeek-V3 the extra depths are produced by
    # lightweight MTP modules that share the main model's trunk, so the
    # additional targets densify training rather than requiring separate
    # full forward passes; the loop below only illustrates the causal chaining.
    predictions = []
    current_input = list(input_sequence)

    for depth in range(prediction_depth):
        # Probability distribution over the vocabulary for the next position
        output_probs = model.forward(current_input)
        # Sample (or take the argmax of) the next token from that distribution
        next_token = sample_from_distribution(output_probs)
        predictions.append(next_token)
        # Append the predicted token so the next depth keeps the causal chain
        current_input = current_input + [next_token]

    return predictions

Analogy: Think of a chess player who calculates several moves in advance. Instead of just focusing on the next move, the player envisions a sequence of moves, leading to a more strategic and coherent game plan.


4. FP8 Mixed Precision Training

DeepSeek‑V3 leverages FP8 mixed precision training to reduce both memory usage and computational cost. Building on the work of Micikevicius et al. (2018) with FP16, DeepSeek‑V3 uses 8‑bit arithmetic for many operations while retaining higher precision (e.g., BF16) for sensitive parts like optimizer states.
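
As a rough illustration of the core idea, the numpy sketch below simulates per-block scaling: each block of values shares one higher-precision scaling factor so its largest magnitude fits within the narrow FP8 range. The block size, helper names, and the omission of mantissa rounding are simplifications of mine, not DeepSeek‑V3’s actual FP8 kernels.

import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in the E4M3 format

def fake_fp8_quantize(x, block_size=128):
    # Split the tensor into fixed-size blocks, each with its own scaling factor
    flat = x.astype(np.float32).ravel()
    pad = (-len(flat)) % block_size
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)

    # Per-block scale chosen so the block's largest magnitude fits the FP8 range
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-12) / FP8_E4M3_MAX

    # "Quantize": values are stored relative to their block's scale
    # (a real FP8 cast would also round the mantissa; omitted here)
    q = blocks / scales
    return q, scales, len(flat)

def fake_fp8_dequantize(q, scales, original_length):
    return (q * scales).ravel()[:original_length]

weights = np.random.randn(1000).astype(np.float32)
q, s, n = fake_fp8_quantize(weights)
restored = fake_fp8_dequantize(q, s, n)
print(np.max(np.abs(weights - restored)))  # tiny, since only the scaling (not FP8 rounding) is simulated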

Analogy: This is akin to using an efficient video compression algorithm: you reduce file size dramatically while keeping the key visual elements intact. Such a strategy is broadly applicable to any large‑scale training effort.


5. DualPipe Pipeline Parallelism and Extended Context via YaRN

Efficiently training large models requires minimizing idle time. DeepSeek‑V3 employs the DualPipe algorithm, which overlaps computation with inter‑GPU communication. This reduces “pipeline bubbles” and maximizes resource utilization.

Below is a diagram illustrating the DualPipe concept:

sequenceDiagram
    participant F as Forward Pass
    participant C as Communication
    participant B as Backward Pass
    F->>C: Send activations (overlap)
    C->>B: Receive activations (overlap)
    B-->>F: Start next batch processing
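
A minimal sketch of the overlap idea follows, assuming one compute thread and one background communication worker; the function names and the thread-pool “send” are stand-ins of mine, not DeepSeek‑V3’s actual scheduling or communication kernels.

from concurrent.futures import ThreadPoolExecutor
import time

def run_pipeline_stage(micro_batches, compute_fn, send_fn):
    # Overlap: while the current micro-batch is being computed, the previous
    # micro-batch's activations are already in flight to the next stage.
    pending = None
    with ThreadPoolExecutor(max_workers=1) as comm:
        for batch in micro_batches:
            activations = compute_fn(batch)              # compute this micro-batch
            if pending is not None:
                pending.result()                         # wait for the earlier send
            pending = comm.submit(send_fn, activations)  # send in the background
        if pending is not None:
            pending.result()

# Toy usage: sleeps stand in for GPU compute and inter-GPU transfer
run_pipeline_stage(range(4),
                   compute_fn=lambda b: time.sleep(0.01) or b,
                   send_fn=lambda a: time.sleep(0.01))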

Additionally, using a technique called YaRN, the model extends its context window in two stages—from 4K to 32K and then to 128K tokens. This is similar to extending the range of a telescope so that you can see further without changing the lens.
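
As a very rough sketch of the underlying mechanism, the code below shows plain position interpolation of rotary angles; YaRN itself rescales the frequencies non-uniformly and adjusts an attention temperature, so treat this (and its made-up dimensions) as intuition only, not the actual method.

import numpy as np

def rope_angles_with_extension(d_head, trained_ctx, target_ctx, base=10000.0):
    # Standard rotary frequencies for a head of dimension d_head
    inv_freq = 1.0 / (base ** (np.arange(0, d_head, 2) / d_head))
    # Stretch factor, e.g. 32768 / 4096 = 8 for the first extension stage
    scale = target_ctx / trained_ctx
    # Compress positions so target_ctx positions map back into the trained range
    positions = np.arange(target_ctx) / scale
    return np.outer(positions, inv_freq)  # angles fed to sin/cos of the rotary embedding

angles = rope_angles_with_extension(d_head=64, trained_ctx=4096, target_ctx=32768)
print(angles.shape)  # (32768, 32)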


6. Future Implications and Industry Applications

The innovations in DeepSeek‑V3 may have broader implications for AI, both individually and in combination. Here are some examples:

  • Efficient attention mechanisms (MLA) and low‑precision training could enable real‑time analysis of patient data and medical imaging, leading to faster diagnostics and personalized treatment plans.
  • The dynamic load balancing in MoE and efficient distributed training can support real‑time risk assessment, fraud detection, and automated trading systems where speed and cost‑efficiency are critical.
  • Extended context windows, such as those enabled by the YaRN method, are particularly useful for analyzing long legal documents or contracts, enabling better natural language understanding and more accurate summarization.
  • The MTP objective can improve code generation and debugging tools, providing developers with more coherent and context‑aware code suggestions in integrated development environments (IDEs).
  • Models capable of processing extensive context and performing multi‑token reasoning can be deployed as intelligent tutoring systems, offering detailed explanations in subjects like mathematics and science.

All these innovations suggest that the next generation of LLMs will be more resource‑efficient and adaptable, opening up advanced AI applications to smaller companies and broader industries.


7. Conclusion

DeepSeek‑V3 is a testament to the power of incremental progress in AI research. It refines established ideas – from efficient attention and dynamic expert routing to multi‑token prediction and low‑precision training – to create a cost‑effective, scalable model. By understanding these transferable innovations, researchers and practitioners can gain insight into the future trajectory of LLM development. This post (Part 1) provides an in‑depth look at DeepSeek‑V3, setting the stage for Part 2, which will explore the R1 model’s reasoning capabilities in detail.

Again, for those eager to explore further, a full APA‑style reference list with links is provided at the end. While this review strives for balance, it is not free from bias, and readers are encouraged to consult the original literature for a deeper understanding.


References

Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., & Weller, A. (2020). Rethinking attention with performers. arXiv preprint arXiv:2009.14794. Retrieved from https://arxiv.org/abs/2009.14794

Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339. Retrieved from https://arxiv.org/abs/2208.07339

Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961. Retrieved from https://arxiv.org/abs/2101.03961

Li, S., & Hoefler, T. (2021). Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. arXiv preprint arXiv:2107.06925. Retrieved from https://arxiv.org/abs/2107.06925

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668. Retrieved from https://arxiv.org/abs/2006.16668

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., & Wu, H. (2018). Mixed Precision Training. arXiv preprint arXiv:1710.03740. Retrieved from https://arxiv.org/abs/1710.03740

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G., & Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. Retrieved from https://arxiv.org/abs/1701.06538

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008). Retrieved from https://arxiv.org/abs/1706.03762

Wang, S., Li, B. Z., Khabsa, M., Fang, H., & Ma, H. (2020). Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768. Retrieved from https://arxiv.org/abs/2006.04768

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903. Retrieved from https://arxiv.org/abs/2201.11903

Xia, H., Ge, T., Wang, P., Chen, S. Q., Wei, F., & Sui, Z. (2023, December). Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 3909-3925). Retrieved from https://aclanthology.org/2023.findings-emnlp.257.pdf

DeepSeek-AI. (2024, December 27). DeepSeek‑V3 Technical Report. Retrieved from https://github.com/deepseek-ai/DeepSeek-V3

Wolpe, Z. (2025). DeepSeek’s key innovations. Medium. Retrieved from https://zachcolinwolpe.medium.com/deepseeks-key-innovations-67f847ffdb35

Business Insider. (2025, January 28). The tech industry is in a frenzy over DeepSeek. Retrieved from https://www.businessinsider.com/silicon-valley-reacts-deepseek-chinese-ai-upending-tech-2025-1

The Times. (2025, January 29). China shocked the US in the AI race. Retrieved from https://www.thetimes.co.uk/article/china-ai-chatbot-us-tech-race-s709xjx9f

Financial Times. (2025, January 30). Transcript: Tech in 2025—China’s AI ‘Sputnik moment’. Retrieved from https://www.ft.com/content/7b69ca53-79fe-43e6-9305-498d07318993

The Verge. (2025, January 28). Why everyone is freaking out about DeepSeek. Retrieved from https://www.theverge.com/ai-artificial-intelligence/598846/deepseek-big-tech-ai-industry-nvidia-impac

Wikipedia. (2025, January 29). DeepSeek. Retrieved from https://en.wikipedia.org/wiki/DeepSeek

Wikipedia. (2025, January 30). Reflection (artificial intelligence). Retrieved from https://en.wikipedia.org/wiki/Reflection_%28artificial_intelligence%29