A simplified intro to GRPO, an efficient policy optimization method used for LLM reasoning training
Reinforcement Learning (RL) has emerged as a powerful tool for enhancing Large Language Models (LLMs) after their initial training, particularly in reasoning-intensive tasks. DeepSeek's recent breakthroughs with the DeepSeek-Math [2] and DeepSeek-R1 [3] models have demonstrated the remarkable potential of RL to improve the mathematical reasoning and problem-solving abilities of LLMs.
These achievements were made possible through an innovative RL approach called Group Relative Policy Optimization (GRPO), which addresses the unique challenges of applying RL to language models. In this post, we'll dive deep into how GRPO works and why it represents a significant advancement in LLM training.
Proximal Policy Optimization (PPO) [1] has been the go-to algorithm for RL fine-tuning of language models. At its core, PPO is a policy gradient method that clips the probability ratio between the new and old policies, limiting the size of each update and preventing destructively large policy changes. The objective function for PPO can be written as:
\[J_{PPO}(\theta) = \mathbb{E}_{s \sim P(S),\, a \sim \pi_{\theta_{old}}(\cdot\mid s)} \left[\min\left(\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{old}}(a\mid s)}A(s,a),\; \text{clip}\left(\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{old}}(a\mid s)},\, 1-\epsilon,\, 1+\epsilon\right)A(s,a)\right)\right]\]
where \(A(s,a)\) is the advantage estimate and \(\epsilon\) controls the width of the clipping range.

GRPO, introduced in [2], builds on PPO's foundation but adds several key innovations that make it more efficient and better suited for language models.
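Since GRPO keeps PPO's clipped-ratio surrogate at its core, a minimal sketch of that term in code may help ground the formulas that follow. The tensor names (`logp_new`, `logp_old`, `advantages`) are illustrative placeholders, not anything defined in the papers.

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Per-token clipped surrogate shared by PPO and GRPO.

    logp_new:   log-probs of the sampled tokens under the current policy
    logp_old:   log-probs of the same tokens under the policy that generated them
    advantages: advantage estimates, one per token
    """
    ratio = torch.exp(logp_new - logp_old)                       # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages  # ratio confined to [1-eps, 1+eps]
    return torch.minimum(unclipped, clipped)                     # elementwise min of the two terms
```

During training, the mean of this per-token term is maximized (equivalently, its negative is minimized).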
In GRPO, the language model serves as the policy network (actor): it takes a question \(q\) as the input observation (the state \(s\) above) and produces a sequence of tokens as its actions. The policy distribution factorizes across tokens:
\[\pi_\theta(a\mid q) = \prod_{t=1}^N \pi_\theta(a_t \mid q, a_{<t})\]
Note: the original paper [2] uses \(o_t\) to denote the output token at timestep \(t\); we use \(a_t\) instead to conform with the standard RL notation for actions.
The generation process is inherently sequential because of the auto-regressive nature of transformers/LLMs: each token is sampled conditioned on the question and on all previously generated tokens, as sketched below.
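Here is a minimal sketch of that sampling loop. The `model` callable, which maps a token-ID prefix to next-token logits, is a hypothetical stand-in for a real LLM forward pass, and all names are illustrative.

```python
import torch

@torch.no_grad()
def sample_sequence(model, question_ids, eos_id, max_new_tokens=256):
    """Auto-regressive sampling: one token per step, each conditioned on the full prefix.

    `model(ids)` is assumed to return logits of shape (len(ids), vocab_size),
    giving the next-token distribution at every position.
    """
    ids = list(question_ids)
    logps = []                                    # log pi_theta(a_t | q, a_<t), one per generated token
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(ids))         # forward pass over the question plus tokens so far
        next_logp = torch.log_softmax(logits[-1], dim=-1)
        a_t = torch.multinomial(next_logp.exp(), num_samples=1).item()
        logps.append(next_logp[a_t].item())
        ids.append(a_t)
        if a_t == eos_id:                         # stop once the model emits end-of-sequence
            break
    return ids[len(question_ids):], sum(logps)    # generated tokens and their total log-prob
```

The summed log-probabilities are exactly the log of the factorized product above; during optimization they are recomputed with gradients enabled rather than inside a `no_grad` sampling loop.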
For each generated sequence, GRPO computes per-token rewards as follows:
\[r_t = r_\phi(q,a_{\leq t}) - \beta \log\frac{\pi_\theta(a_t \mid q,a_{<t})}{\pi_{ref}(a_t \mid q, a_{<t})}\]
where \(r_\phi\) is the reward model scoring the question and the answer so far, and the second term is a per-token KL penalty with coefficient \(\beta\) that keeps the policy close to the reference policy \(\pi_{ref}\).
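A direct translation of that formula, assuming the reward-model scores and the per-token log-probs under both policies are already available (all names are illustrative):

```python
def per_token_rewards(rm_scores, logp_policy, logp_ref, beta):
    """r_t = r_phi(q, a_<=t) - beta * log(pi_theta(a_t | q, a_<t) / pi_ref(a_t | q, a_<t)).

    rm_scores:   reward-model scores for each prefix (q, a_<=t), shape (T,)
    logp_policy: per-token log-probs under the current policy, shape (T,)
    logp_ref:    per-token log-probs under the frozen reference policy, shape (T,)
    beta:        KL-penalty coefficient
    """
    return rm_scores - beta * (logp_policy - logp_ref)
```

In practice the reward model often scores only the complete answer; per-prefix scores are shown here simply to mirror the formula.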
Instead of training a separate value network as a baseline, GRPO estimates advantages by normalizing the rewards of a group (batch) of outputs sampled from the old policy for the same question. For each question \(q\), it samples \(G\) outputs \(\{a_1, a_2, \ldots, a_G\}\) from \(\pi_{\theta_{old}}\) and sets
\[\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}\]
so every token of output \(i\) shares the same group-relative advantage.
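A minimal sketch of this normalization, assuming one scalar reward per sampled output (names are illustrative):

```python
import torch

def group_relative_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: (r_i - mean(r)) / std(r) over the G outputs for one question.

    Every token of output i inherits the same advantage, so no value
    network is needed to supply a baseline.
    """
    r = torch.as_tensor(group_rewards, dtype=torch.float32)  # shape (G,)
    return (r - r.mean()) / (r.std() + eps)                   # shape (G,)
```

For example, with binary correctness rewards `[1.0, 0.0, 0.0, 1.0]`, the two correct outputs receive a positive advantage and the two incorrect ones a negative advantage of equal magnitude.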
The policy model is then optimized by maximizing the complete GRPO objective, which brings everything together:
\[J_{GRPO}(\theta) = \frac{1}{G}\sum_{i=1}^G \frac{1}{|a_i|}\sum_{t=1}^{|a_i|} \left\{\min\left[\frac{\pi_\theta(a_{i,t} \mid q, a_{i,<t})}{\pi_{\theta_{old}}(a_{i,t}\mid q, a_{i,<t})}\hat{A}_{i,t},\; \text{clip}\left(\frac{\pi_\theta(a_{i,t}\mid q, a_{i,<t})}{\pi_{\theta_{old}}(a_{i,t} \mid q, a_{i,<t})},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t}\right] - \beta\, D_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right]\right\}\]
This objective:
- averages the clipped PPO-style surrogate over the \(G\) sampled outputs and over the tokens within each output,
- replaces a learned value baseline with the group-relative advantages \(\hat{A}_{i,t}\), and
- applies the KL penalty toward the reference policy \(\pi_{ref}\) directly in the loss rather than inside the per-token reward.
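Putting the pieces together, here is a compact sketch of the per-question GRPO loss under the assumptions above: the per-token log-probs for each of the \(G\) sampled outputs have already been gathered under the current, old, and reference policies, and the KL term uses the unbiased per-token estimator from [2]. All names and default values are illustrative.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """GRPO objective for one question, negated so it can be minimized.

    Each argument is a list of G 1-D tensors, one per sampled output a_i
    (ragged lengths are fine): per-token log-probs under the current, old,
    and reference policies, plus the per-token advantages A_hat_{i,t}.
    """
    per_output = []
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = torch.exp(lp_new - lp_old)
        surrogate = torch.minimum(ratio * adv,
                                  torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
        # Unbiased per-token KL estimator from [2]: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
        log_ref_ratio = lp_ref - lp_new
        kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0
        per_output.append((surrogate - beta * kl).mean())   # 1/|a_i| * sum over tokens
    return -torch.stack(per_output).mean()                  # 1/G * sum over outputs, negated
```

In a typical training step, the old-policy and reference log-probs are computed once for the sampled batch and treated as constants, and this loss is averaged over all questions in the batch before back-propagation.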
GRPO represents a significant advancement in applying RL to language models. By eliminating the need for a value network and introducing group-relative advantage estimation, it provides a more efficient and stable training process. The success of DeepSeek-Math and DeepSeek-R1 demonstrates the practical benefits of this approach.
The key innovations of GRPO - group sampling, relative advantage estimation, and the elimination of the value network - provide a blueprint for future developments in LLM training. As we continue to push the boundaries of what language models can achieve, techniques like GRPO will be crucial in unlocking their full potential.
[1] Schulman, John, et al. Proximal Policy Optimization Algorithms. arXiv:1707.06347, arXiv, 28 Aug. 2017. arXiv.org, https://doi.org/10.48550/arXiv.1707.06347.
[2] Shao, Zhihong, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300, arXiv, 27 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2402.03300.
[3] DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, arXiv, 22 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12948.