A simplified intro to GRPO, an efficient policy optimization method used for LLM reasoning training
Reinforcement Learning (RL) has emerged as a powerful tool for enhancing Large Language Models (LLMs) after their initial training, particularly in reasoning-intensive tasks. DeepSeek’s recent breakthroughs with DeepSeek-Math [2] and DeepSeek-R1 [3] models have demonstrated the remarkable potential of RL in improving mathematical reasoning and problem-solving abilities of LLMs.
These achievements were made possible through an innovative RL approach called Group Relative Policy Optimization (GRPO), which addresses the unique challenges of applying RL to language models. In this post, we’ll dive deep into how GRPO works and why it represents a significant advancement in LLM training.
Proximal Policy Optimization (PPO) [1] has been the go-to algorithm for RL fine-tuning of language models. At its core, PPO is a policy gradient method that uses clipping to limit policy updates (gradients), preventing destructively large policy changes. The objective function for PPO can be written as:
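$$
\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{q \sim P(Q),\, o \sim \pi_{\theta_{old}}(O|q)}\ \frac{1}{|o|}\sum_{t=1}^{|o|} \min\left[\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}\, A_t,\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})},\ 1-\varepsilon,\ 1+\varepsilon\right) A_t\right]
$$

This follows the token-level notation of [2]: $\pi_\theta$ and $\pi_{\theta_{old}}$ are the current and old policy models, $q$ and $o$ are questions and outputs sampled from the question dataset and the old policy, $A_t$ is the advantage of token $o_t$ (which PPO computes with a separately trained value network), and $\varepsilon$ is the clipping range.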
GRPO, introduced in [2], builds upon PPO’s foundation but introduces several key innovations that make it more efficient and better suited for language models:
In GRPO, the language model serves as the policy network (actor), taking a question q as input and producing a sequence of output tokens as its actions, one token per step.
Note: In the original paper [2], they use
The generation process is inherently sequential because of the auto-regressive nature of transformers/LLMs:
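$$
\pi_\theta(o \mid q) = \prod_{t=1}^{|o|} \pi_\theta(o_t \mid q, o_{<t})
$$

Each token $o_t$ is sampled conditioned on the question $q$ and all previously generated tokens $o_{<t}$, so the $|o|$ decoding steps cannot be parallelized at generation time.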
For each generated sequence, GRPO computes per-token rewards as follows:
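$$
r_t = r_\varphi(q, o_{\le t}) - \beta \log \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{ref}(o_t \mid q, o_{<t})}
$$

where $r_\varphi$ is the reward model, $\pi_{ref}$ is a frozen reference policy (typically the initial SFT model), and $\beta$ controls the strength of the KL penalty. This is the reward shaping used in [2]; GRPO itself later moves the KL term out of the reward and adds it directly to the loss.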
Instead of using a value network, GRPO estimates advantages relative to a group baseline: each output’s reward is compared against the rewards of the other outputs sampled for the same question.
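With outcome supervision, each sampled output $o_i$ receives a single scalar reward $r_i$, and every token of $o_i$ is assigned the same group-normalized advantage [2]:

$$
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \cdots, r_G\})}{\operatorname{std}(\{r_1, r_2, \cdots, r_G\})}
$$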
For each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \cdots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model by maximizing the GRPO objective. The complete GRPO objective brings everything together:
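As written in [2], it averages the clipped, advantage-weighted ratios over all tokens of all $G$ sampled outputs and subtracts a KL penalty toward the reference model:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)}\ \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\left[\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}\hat{A}_{i,t},\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})},\ 1-\varepsilon,\ 1+\varepsilon\right)\hat{A}_{i,t}\right] - \beta\, \mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right]\right\}
$$

where $\varepsilon$ and $\beta$ are hyper-parameters and the KL term is estimated per token with the unbiased estimator

$$
\mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1,
$$

which is guaranteed to be non-negative.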
This objective keeps PPO’s clipped probability ratio to bound each policy update, weights every token by the group-relative advantage instead of a critic’s value estimate, and adds a KL penalty toward the reference model so the policy does not drift too far from its starting point.
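To make the moving parts concrete, here is a minimal PyTorch-style sketch of the two GRPO-specific computations: group-normalized advantages and the clipped, KL-penalized per-token objective. The function name, tensor names, and shapes are illustrative assumptions, not the implementation used in [2] or [3].

```python
import torch

def grpo_loss(policy_logps, old_logps, ref_logps, rewards, mask,
              clip_eps=0.2, beta=0.04):
    """Sketch of the GRPO loss for one question with G sampled outputs.

    policy_logps, old_logps, ref_logps: (G, T) per-token log-probs under the
        current, old, and frozen reference policies.
    rewards: (G,) scalar outcome reward for each sampled output.
    mask:    (G, T) 1 for response tokens, 0 for padding.
    """
    # Group-relative advantage: normalize rewards within the group and
    # broadcast the same advantage to every token of that output.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(-1)                          # (G, 1)

    # PPO-style clipped surrogate on the per-token probability ratio.
    ratio = torch.exp(policy_logps - old_logps)      # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * adv, clipped * adv)

    # Unbiased per-token estimate of KL(pi_theta || pi_ref), added to the
    # loss rather than folded into the reward.
    log_ref_ratio = ref_logps - policy_logps
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0

    per_token = surrogate - beta * kl
    # Length-normalize per output, average over the group, negate to minimize.
    per_output = (per_token * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_output.mean()
```

In a full training loop, policy_logps would be recomputed with gradients enabled for the current model while old_logps and ref_logps stay detached, and rewards would come from a reward model or a rule-based verifier applied to the G sampled outputs.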
GRPO represents a significant advancement in applying RL to language models. By eliminating the need for a value network and introducing group-relative advantage estimation, it provides a more efficient and stable training process. The success of DeepSeek-Math and DeepSeek-R1 demonstrates the practical benefits of this approach.
The key innovations of GRPO - group sampling, relative advantage estimation, and the elimination of the value network - provide a blueprint for future developments in LLM training. As we continue to push the boundaries of what language models can achieve, techniques like GRPO will be crucial in unlocking their full potential.
[1] Schulman, John, et al. Proximal Policy Optimization Algorithms. arXiv:1707.06347, arXiv, 28 Aug. 2017. arXiv.org, https://doi.org/10.48550/arXiv.1707.06347.
[2] Shao, Zhihong, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300, arXiv, 27 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2402.03300.
[3] DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, arXiv, 22 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12948.