Disclaimer: Despite efforts to remain fair and balanced, some residual bias may remain in these views.
Deep learning has long been driven by scaling—making models larger, training on more data, and increasing computational heft. In recent years, however, researchers have shifted some focus from training-time scaling to inference-time scaling: the idea that allocating additional compute at test time can unlock improved model performance without necessarily enlarging the model itself. In this post, we explore this emerging paradigm, review how OpenAI’s o1-preview model has already influenced the field, and then dive into DeepSeek R1—a Chinese innovation that leverages these principles to enhance reasoning capabilities at a fraction of conventional costs.
Aside: This is the second article of a two-part series. Also check out part 1, where we focus on DeepSeek-V3's technical innovations and their roots in prior research.
Inference-Time Scaling: The New Frontier
Traditional scaling laws emphasized increasing model size and training data to improve performance. More recently, however, studies have shown that using extra computational resources during inference can yield significant improvements in accuracy, robustness, and reasoning ability. For example, research on diffusion models has demonstrated that beyond simply increasing denoising steps, a search framework over potential noise inputs can lead to better generated outputs (Ma et al., 2025). Similar ideas have been applied to language models, where techniques such as repeated sampling (as seen in “Large Language Monkeys”) help boost performance by evaluating multiple candidate outputs (Brown et al., 2024).
This approach is analogous to a chess player taking extra time to consider several potential moves and ramifications before committing to one—a process that can markedly improve decision-making. In the AI context, “test-time compute” is emerging as a cost-effective means to boost a model’s capabilities without necessitating massive increases in training cost.
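To make the idea concrete, here is a minimal, hypothetical sketch of repeated sampling (best-of-n selection). The generator and scorer below are placeholders standing in for a real language model and verifier, not any specific system's API.

```python
import random  # stand-in randomness; a real setup would call an LLM and a verifier


def generate_candidate(prompt: str) -> str:
    """Placeholder for one stochastic sample from a language model."""
    return f"candidate answer #{random.randint(0, 9)} for: {prompt[:30]}"


def score(prompt: str, candidate: str) -> float:
    """Placeholder verifier (e.g., unit tests for code, an exact-match checker for math)."""
    return random.random()


def best_of_n(prompt: str, n: int = 16) -> str:
    """Spend extra inference compute: sample n candidates, keep the highest-scoring one."""
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


print(best_of_n("Prove that the sum of two even integers is even."))
```

Increasing n trades additional inference compute for a better chance that at least one candidate passes the verifier, which is exactly the effect studied in the repeated-sampling work cited above.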
The Impact of OpenAI’s o1-Preview on AI Research
OpenAI’s o1-preview marked a turning point for inference-time scaling in reasoning models. By deliberately spending more time “thinking”—that is, generating a longer chain of thought before finalizing an answer—o1-preview achieved groundbreaking improvements in complex tasks. For instance, it scored 83% on advanced math benchmarks compared to only 13% for its predecessor, GPT-4o (OpenAI, 2024).
For practitioners, o1-preview provided a new tool to tackle problems in science, coding, and mathematics more reliably. Its approach of iterative reasoning and search during inference allowed researchers and developers to explore multiple candidate responses and select the best one—an approach that not only enhances performance but also opens the door to more transparent, step-by-step problem solving.
DeepSeek R1: Unpacking Its Unique Reasoning Pipeline
Building upon these foundational ideas, DeepSeek R1 introduces a suite of transferable innovations designed specifically for reasoning tasks. Unlike its predecessor DeepSeek V3—which focused on non-reasoning foundation tasks—R1 targets the challenges of complex problem solving by investing inference-time compute into a search framework and a novel reinforcement learning (RL) strategy.
Reinforcement Learning: RL-Only vs. RL-SFT
DeepSeek R1’s training pipeline experiments with two main RL-based strategies:
- R1 Zero (RL-only approach): Here, the model is trained using pure reinforcement learning. It is rewarded solely based on the accuracy and coherence of its generated final answers and intermediate reasoning steps. While this approach improves problem-solving abilities, it sometimes leads to unstable learning dynamics and overfitting on certain query types.
- R1 (RL-SFT, reinforcement learning with supervised fine-tuning): To address these challenges, DeepSeek R1 incorporates a hybrid method. The model is first fine-tuned via supervised learning (SFT) on a large dataset of high-quality chain-of-thought examples. This stage establishes a strong baseline of coherent reasoning. Next, reinforcement learning—using its customized GRPO algorithm—is applied to further refine the model’s decision-making. Notably, the RL phase rewards not only the correctness of the final answer but also the transparency and consistency of the intermediate “thinking tokens.” Experiments revealed that this RL-SFT hybrid produces a model with more robust and interpretable reasoning.
A simplified diagram of the training pipeline is shown below:
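Complementing the diagram, the following is a rough, hypothetical sketch of the two-stage flow described above; every function here is a placeholder stub rather than DeepSeek's released training code.

```python
def supervised_fine_tune(model, cot_dataset):
    """Stage 1 stub: fit the base model on curated chain-of-thought examples."""
    print(f"SFT on {len(cot_dataset)} chain-of-thought examples")
    return model  # a real implementation would return updated weights


def grpo_train(model, prompts, reward_fns):
    """Stage 2 stub: refine the SFT model with GRPO using rule-based rewards."""
    print(f"GRPO on {len(prompts)} prompts with rewards: {sorted(reward_fns)}")
    return model


def train_r1_style(base_model, cot_dataset, prompts, reward_fns):
    # R1 Zero would skip stage 1 and apply RL directly to the base model;
    # R1 runs both stages for more stable, interpretable reasoning.
    sft_model = supervised_fine_tune(base_model, cot_dataset)
    return grpo_train(sft_model, prompts, reward_fns)


train_r1_style(
    base_model="base-checkpoint",
    cot_dataset=["curated chain-of-thought example"],
    prompts=["math problem", "coding task"],
    reward_fns={"accuracy", "format"},
)
```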
Policy Optimization with GRPO
At the heart of DeepSeek R1’s RL process is Group Relative Policy Optimization (GRPO)—a variant of PPO designed specifically for reasoning tasks. GRPO introduces several key modifications:
- Elimination of the Value Model: Unlike PPO, which relies on an additional value function network to estimate long-term rewards, GRPO omits this component entirely. Instead, for each state (or prompt), the model generates a group of candidate outputs and calculates rewards using rule-based functions (e.g., accuracy and format rewards).
- Group-Based Advantage Estimation: When given a specific prompt, the GRPO method generates multiple candidate responses. Each response is then evaluated using a reward function, which assigns a numerical score to indicate its quality. To understand how well each candidate performs relative to others in the group, the method computes two key statistics from these reward scores: the mean reward across all candidates and the standard deviation. Using these, the relative advantage of each candidate is standardized (i.e., by subtracting the mean reward from its individual reward and then dividing the result by the standard deviation). This normalized advantage indicates how much better (or worse) a candidate is relative to the group baseline; a code sketch of this computation appears after this list.
- Clipped Policy Update: The GRPO algorithm updates its policy using a technique inspired by Proximal Policy Optimization (PPO), specifically its clipped surrogate loss function. The goal of this update is to improve the policy while preventing excessively large changes that could destabilize learning. For each candidate response, the update process evaluates an objective function that incorporates the ratio between the probabilities assigned by the current and previous policies. This ratio is then modified using a clipping mechanism, controlled by a hyper-parameter (denoted ε below), which restricts how much the policy can be adjusted in a single update.
The objective function is computed as follows:
- If the advantage is positive (indicating that the action was beneficial), the ratio is clipped at slightly more than 1 (at 1 + ε) to prevent excessive reinforcement.
- If the action was suboptimal, the ratio is clipped at slightly less than 1 (at 1 - ε) to prevent too strong a penalty. This can be summarized mathematically as:

$$
L(\theta) = \frac{1}{G} \sum_{i=1}^{G} \mathbb{E}_{(s, a_i) \sim \pi_{\theta_t}}\left[
\begin{cases}
\min\left(\frac{\pi_\theta(a_i \mid s)}{\pi_{\theta_t}(a_i \mid s)},\, 1 + \epsilon \right) A(s, a_i) & \text{if } A(s, a_i) > 0, \\
\max\left(\frac{\pi_\theta(a_i \mid s)}{\pi_{\theta_t}(a_i \mid s)},\, 1 - \epsilon \right) A(s, a_i) & \text{if } A(s, a_i) < 0.
\end{cases}
\right]
$$
By applying this clipping mechanism, GRPO ensures that the policy does not change too drastically in a single update step. This helps maintain stability and prevents overfitting to specific reward signals while still allowing gradual policy improvement based on relative advantages.
- KL-Divergence Constraint: To prevent the updated policy from drifting too far from the original (or reference) policy, a KL divergence penalty is integrated into the loss. This term constrains updates token-by-token, ensuring that the model’s output remains coherent and aligned with pretraining, thereby mitigating reward exploitation.
GRPO’s design reduces memory consumption and simplifies training by dropping the value model—a significant benefit when training large reasoning models with limited hardware resources (Das, 2025).
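As a rough illustration (not DeepSeek's actual implementation), the sketch below combines the ingredients described above: group-standardized advantages, the PPO-style clipped surrogate, and a KL penalty toward a reference policy. For simplicity it scores whole responses with summed log-probabilities, whereas the method applies these quantities token by token; the KL term is a crude per-sequence estimate, and the function name, arguments, and hyper-parameter defaults are illustrative.

```python
import torch


def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, kl_coef=0.04):
    """Sketch of a GRPO-style objective for one prompt with G sampled responses.

    logp_new, logp_old, logp_ref: summed log-probabilities of each response under
    the current, old (sampling-time), and frozen reference policies, shape (G,).
    rewards: rule-based reward for each response (accuracy, format, ...), shape (G,).
    """
    # Group-based advantage: standardize rewards within the group (no value network).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between the current and old policies.
    ratio = torch.exp(logp_new - logp_old)

    # PPO-style clipped surrogate: limits how far one update can move the policy.
    surrogate = torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # Crude KL penalty keeping the updated policy close to the reference policy.
    kl = logp_new - logp_ref

    # Maximize the surrogate minus the KL penalty, i.e. minimize its negation.
    return -(surrogate - kl_coef * kl).mean()


# Toy usage with G = 4 candidate responses for one prompt.
G = 4
loss = grpo_loss(
    logp_new=torch.randn(G),
    logp_old=torch.randn(G),
    logp_ref=torch.randn(G),
    rewards=torch.tensor([1.0, 0.0, 0.0, 1.0]),  # e.g., 1 if the final answer is correct
)
print(loss)
```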
Transparent Thinking Tokens
For many researchers who are not focused on the heavy mathematics, one of the standout outcomes of the RL-SFT process is the production of transparent thinking tokens. These tokens capture the intermediate reasoning steps and are embedded within designated markers (e.g., <reasoning> and </reasoning>). This transparency offers multiple benefits:
- Interpretability: Users and developers can inspect the chain-of-thought to understand how the model arrived at its final answer. This transparency builds trust and facilitates debugging, especially in high-stakes domains (see the parsing sketch after this list).
- Alignment and Safety: Transparent reasoning helps in verifying that the model’s thought process adheres to safety guidelines. It makes it easier to detect and mitigate potential biases or harmful reasoning patterns.
- Distillation for Smaller Models: By explicitly capturing reasoning steps, DeepSeek R1 lays the groundwork for model distillation. Distilled models—derived from R1’s comprehensive reasoning process—retain much of the performance benefits with a fraction of the parameters. This is particularly valuable for on-premise solutions.
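As a small illustration of how downstream tooling might consume these markers, here is a minimal parsing sketch. It assumes the <reasoning>…</reasoning> tags used as an example above; actual released models may use different marker names.

```python
import re


def split_reasoning(output: str) -> tuple[str, str]:
    """Split a model response into (reasoning, final answer).

    Assumes the chain of thought is wrapped in <reasoning>...</reasoning> markers,
    as in the example above; real models may use different tag names.
    """
    match = re.search(r"<reasoning>(.*?)</reasoning>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<reasoning>.*?</reasoning>", "", output, flags=re.DOTALL).strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "<reasoning>2 and 4 are even, and 2 + 4 = 6, which is even.</reasoning>Yes, the sum is even."
)
print(reasoning)  # 2 and 4 are even, and 2 + 4 = 6, which is even.
print(answer)     # Yes, the sum is even.
```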
Cost Efficiency and On-Premise Benefits
DeepSeek R1’s innovations lead to significant downstream benefits:
- Efficient Distillation: The transparent, structured reasoning data generated during GRPO-based training enables the distillation of smaller, more efficient models. These distilled models maintain robust reasoning performance but require far less compute and memory.
- Secure On-Premise Deployment: Organizations with strict data security requirements—such as healthcare providers managing sensitive patient information—can deploy these distilled models on-premise. This minimizes reliance on cloud services and helps ensure regulatory compliance.
Research Transparency and Repeatability
A persistent challenge in AI research is ensuring that results are transparent and reproducible. OpenAI’s o1 model, like many of its predecessors, is closed-source, and its training data, architectural details, and fine-tuning methods remain proprietary. This lack of openness makes independent validation and benchmarking extremely challenging, limiting rigorous external scrutiny to a small group of insider researchers.
In contrast, DeepSeek R1 embraces a more open approach. Although its training data is not fully public, DeepSeek has released its model architecture and code as open source. This openness has invigorated third-party validation efforts—researchers and independent labs are actively benchmarking DeepSeek R1 on open datasets. These external validations not only help confirm the model’s performance claims but also expose potential limitations and biases, fostering a more robust and collaborative research ecosystem.
The ability for third parties to replicate experiments using open datasets creates an environment of continuous improvement and accountability—a stark contrast to the “black-box” nature of closed-source models. This transparency and repeatability are crucial for building trust in AI systems, particularly in high-stakes applications.
Broader Influences and Benefits of DeepSeek R1
DeepSeek R1’s innovative methods have positive ramifications for both the research community and end-user organizations:
Enhanced Transparency and Trust
- Auditability: The model’s explicit chain-of-thought tokens allow organizations to audit its decision-making process. In sectors like healthcare or finance, where decisions must be explainable, this transparency is a game changer.
- Safety and Alignment: Transparent reasoning facilitates better alignment with human values and safety protocols. Stakeholders can review and refine the model’s internal logic, reducing the risk of unintended behavior.
Distilled Models for Resource-Constrained Deployments
- Efficient Distillation: The structured reasoning produced by R1 makes it possible to create distilled models that maintain high reasoning performance while requiring significantly fewer parameters. This leads to lower computational demands and faster inference speeds.
- On-Premise Security: For organizations that handle sensitive data—such as patient records in healthcare or proprietary information in finance—deploying smaller, distilled models on-premise is crucial. This not only mitigates privacy risks by keeping data local but also addresses regulatory and security concerns.
Enabling On-Premise and Secure AI Solutions
- Customizable Deployments: DeepSeek R1’s framework is designed to be adaptable, allowing enterprises to tailor the model for on-premise use. This is particularly beneficial for organizations that need to ensure data never leaves their secure infrastructure.
- Reduced Reliance on Cloud Services: By providing a pathway to efficient, smaller models, R1 enables organizations to reduce reliance on third-party cloud providers. This minimizes exposure to data breaches and enhances overall security.
Conclusion
DeepSeek R1 represents a pivotal advance in training reasoning models. By leveraging inference-time compute and a sophisticated RL strategy—anchored by Group Relative Policy Optimization (GRPO)—R1 achieves excellent reasoning performance with reduced memory and compute demands. Its innovative use of group-based advantage estimation, transparent chain-of-thought tokens, and cost-efficient distillation not only improves model accuracy and interpretability but also facilitates secure on-premise deployments.
Moreover, DeepSeek’s commitment to research transparency and repeatability empowers the wider AI community. Unlike closed-source models such as OpenAI’s o1, DeepSeek R1’s open-source approach (even though its training data remains proprietary) enables vigorous third-party validations and collaborative improvements. This openness fosters trust, encourages innovation, and ultimately drives the field toward more robust and accessible AI systems.
These innovations pave the way for more accessible, reliable, and secure AI systems capable of tackling complex, real-world challenges. As the field continues to evolve, DeepSeek R1 and its training techniques are poised to redefine what is possible in AI reasoning.
References
- Briski, K. (2025, February 11). How Scaling Laws Drive Smarter, More Powerful AI. NVIDIA Blog. Retrieved from https://blogs.nvidia.com/blog/ai-scaling-laws/
- Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., & Mirhoseini, A. (2024). Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787. Retrieved from https://arxiv.org/abs/2407.21787
- Das, S. (2025, February 26). Understanding the Math Behind GRPO — DeepSeek-R1-Zero. Yugen.ai. Retrieved from https://medium.com/yugen-ai-technology-blog/understanding-the-math-behind-grpo-deepseek-r1-zero-9fb15e103a0a
- Ma, N., Tong, S., Jia, H., Hu, H., Su, Y. C., Zhang, M., … & Xie, S. (2025). Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732. Retrieved from https://arxiv.org/abs/2501.09732
- OpenAI. (2024, September 12). Introducing OpenAI o1-preview. OpenAI. Retrieved from https://openai.com/index/introducing-openai-o1-preview/
- Pichka, E. (2025, January 31). Group Relative Policy Optimization (GRPO) Illustrated Breakdown & Explanation. Towards AI. Retrieved from https://pub.towardsai.net/group-relative-policy-optimization-grpo-illustrated-breakdown-explanation-684e71b8a3f2
- Schoeninger, G. (2025, February 11). Why GRPO is Important and How It Works. Retrieved from here
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., … & Guo, D. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Retrieved from https://arxiv.org/abs/2402.03300
- Time. (2025, January 27). What to Know About DeepSeek, the Chinese AI Company Causing Stock Market Chaos. Time. Retrieved from https://www.yahoo.com/news/know-deepseek-chinese-ai-company-215549554.html