Group Relative Policy Optimization (GRPO) Illustrated Breakdown

A simplified intro to GRPO, an efficient policy optimization method used for LLM reasoning training

Introduction

Reinforcement Learning (RL) has emerged as a powerful tool for enhancing Large Language Models (LLMs) after their initial training, particularly in reasoning-intensive tasks. DeepSeek's recent breakthroughs with the DeepSeek-Math [2] and DeepSeek-R1 [3] models have demonstrated the remarkable potential of RL in improving the mathematical reasoning and problem-solving abilities of LLMs.

These achievements were made possible through an innovative RL approach called Group Relative Policy Optimization (GRPO), which addresses the unique challenges of applying RL to language models. In this post, we'll dive deep into how GRPO works and why it represents a significant advancement in LLM training.

PPO vs GRPO

PPO vs GRPO Comparison

Proximal Policy Optimization (PPO) [1] has been the go-to algorithm for RL fine-tuning of language models. At its core, PPO is a policy gradient method that uses clipping to limit policy updates (gradients), preventing destructive large policy changes. The objective function for PPO can be written as:

\[J_{PPO}(\theta) = \mathbb{E}[s \sim P(S), a \sim \pi_{\theta_{old}}(A\mid s)] \left[\min\left(\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{old}}(a\mid s)}A(s,a), \text{clip}\left(\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{old}}(a\mid s)}, 1-\epsilon, 1+\epsilon\right)A(s,a)\right)\right]\]
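To make the clipping mechanism concrete, here is a minimal PyTorch sketch of the per-token clipped surrogate term. The tensor names and shapes are illustrative assumptions, not DeepSeek's (or any library's) implementation:

```python
import torch

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-token PPO clipped surrogate, averaged over a batch of tokens.

    logp_new:   log pi_theta(a|s) under the current policy   (tensor [T])
    logp_old:   log pi_theta_old(a|s) under the old policy   (tensor [T])
    advantages: advantage estimates A(s, a)                   (tensor [T])
    """
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()            # quantity to maximize
```

Taking the elementwise minimum of the unclipped and clipped terms means the objective never rewards pushing the probability ratio far outside \([1-\epsilon, 1+\epsilon]\), which is what keeps updates conservative.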

GRPO, introduced in [2], builds upon PPO's foundation but introduces several key innovations that make it more efficient and better suited to language models:

  1. Eliminates the need for a value network, reducing memory and compute usage
  2. Uses group sampling for more efficient and stable advantage estimation
  3. Uses a more conservative update mechanism, penalizing deviations from the reference policy in both the per-token rewards and the objective

GRPO: A Closer Look

GRPO Detailed

LLM as a Policy

In GRPO, the language model serves as the policy network (actor), taking a question \(q\) as input observation \(s\) and producing a sequence of tokens as actions. The policy distribution factors across tokens:

\[\pi_\theta(a\mid q) = \prod_{t=1}^N \pi_\theta(a_t \mid q, a_{ < t})\]

Note: The original paper [2] uses \(o_t\) to denote the output token at timestep \(t\); we use \(a_t\) instead to conform with the standard RL notation for actions.
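As a quick sketch of this factorization (assuming we already have the per-position logits a causal LM assigns to the generated answer; the function and argument names are my own), the sequence log-probability is just the sum of per-token log-probabilities:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, answer_ids):
    """log pi_theta(a | q) = sum_t log pi_theta(a_t | q, a_<t).

    logits:     [T, vocab] scores for each answer position (already shifted
                so that logits[t] predicts answer_ids[t])
    answer_ids: [T] generated token ids a_1 .. a_T (long tensor)
    """
    logp = F.log_softmax(logits, dim=-1)                                # per-token distributions
    token_logp = logp.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)  # log pi(a_t | q, a_<t)
    return token_logp.sum()                                             # product of probs -> sum of logs
```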

Sequential Token Generation

Sequential Generation

The generation process is inherently sequential because of the auto-regressive nature of transformers/LLMs (a short sketch follows the list below):

  1. Each token is generated conditionally on previous tokens
  2. The policy network (LLM) maintains a running context
  3. Each token generation step can be viewed as an action \(a_t\) in the RL framework
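Here is a minimal sketch of that rollout loop. It assumes a hypothetical `model(input_ids)` that returns one row of next-token logits per position; it is only meant to show the context-growing, token-by-token structure:

```python
import torch

def generate(model, question_ids, max_new_tokens=64, eos_id=2):
    """Autoregressive rollout: each sampled token a_t is an RL action,
    appended to the running context before the next step."""
    context = question_ids.clone()            # running context: q, a_1, ..., a_{t-1}
    actions = []
    for _ in range(max_new_tokens):
        logits = model(context)[-1]           # hypothetical model call: logits for the next token
        probs = torch.softmax(logits, dim=-1)
        a_t = torch.multinomial(probs, 1)     # sample action a_t ~ pi_theta(. | q, a_<t)
        actions.append(a_t.item())
        context = torch.cat([context, a_t])   # condition the next step on a_t
        if a_t.item() == eos_id:
            break
    return actions
```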

Reward and Advantage Calculation

For each generated sequence, GRPO computes per-token rewards as follows:

\[r_t = r_\phi(q,a_{\leq t}) - \beta \log\frac{\pi_\theta(a_t \mid q,a_{ < t})}{\pi_{ref}(a_t \mid q, a_{ < t})}\]
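As a sketch of this reward shaping (given per-token log-probabilities under the trained and reference models and the reward-model scores; the names and the value of \(\beta\) are illustrative assumptions):

```python
import torch

def shaped_token_rewards(rm_scores, logp_policy, logp_ref, beta=0.04):
    """Per-token reward: reward-model score minus a KL-style penalty.

    rm_scores:   [T] reward-model score r_phi(q, a_<=t) at each step
                 (often zero everywhere except the final token)
    logp_policy: [T] log pi_theta(a_t | q, a_<t)
    logp_ref:    [T] log pi_ref(a_t | q, a_<t)
    """
    kl_penalty = logp_policy - logp_ref      # log-ratio against the reference model
    return rm_scores - beta * kl_penalty
```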

Instead of using a value network, GRPO estimates the baseline advantage \(A\) by normalizing a group (batch) of rewards obtained by sampling multiple different outputs from the old policy \(\pi_{\theta_{old}}\) in response to the same input question:

\[\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}\]
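In code, this group-relative normalization is essentially a one-liner (a sketch; `rewards` is assumed to hold one scalar reward per sampled output, and every token of output \(i\) then shares the same advantage):

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of G sequence-level rewards;
    A_hat_{i,t} = (r_i - mean(r)) / std(r) for every token t of output i."""
    rewards = rewards.float()
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```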

The GRPO Objective

For each question \(q\), GRPO samples a group of outputs \(\{a_1, a_2, \ldots, a_G\}\) from the old policy \(\pi_{\theta_{old}}\) and then optimizes the policy model by maximizing the GRPO objective. The complete GRPO objective brings everything together:

\[J_{GRPO}(\theta) = \frac{1}{G}\sum_{i=1}^G \frac{1}{\mid a_i \mid}\sum_{t=1}^{\mid a_i \mid} \left\{\min\left[\frac{\pi_\theta(a_{i,t} \mid q, a_{i, < t})}{\pi_{\theta_{old}}(a_{i,t} \mid q, a_{i, < t})}\hat{A}_{i,t}, \text{clip}\left(\frac{\pi_\theta(a_{i,t} \mid q, a_{i, < t})}{\pi_{\theta_{old}}(a_{i,t} \mid q, a_{i, < t})}, 1-\epsilon, 1+\epsilon\right)\hat{A}_{i,t}\right] - \beta D_{KL}[\pi_\theta \mid \mid \pi_{ref}]\right\}\]

This objective:

  1. Averages over both groups and sequence lengths
  2. Uses clipping for conservative updates
  3. Includes an estimate of the KL divergence as a penalty to prevent large deviations from the reference model
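Putting the pieces together, below is a simplified, self-contained sketch of the objective for a single group of \(G\) sampled outputs (not the paper's implementation). Per-token log-probabilities are assumed to be precomputed and padded to a common length with a mask, and the per-token KL term uses the estimator \(\pi_{ref}/\pi_\theta - \log(\pi_{ref}/\pi_\theta) - 1\) described in [2]:

```python
import torch

def grpo_objective(logp_new, logp_old, logp_ref, advantages, mask,
                   eps=0.2, beta=0.04):
    """Simplified GRPO objective for one group of G sampled outputs.

    All tensors have shape [G, T] (outputs padded to a common length T):
      logp_new:   log pi_theta(a_{i,t} | q, a_{i,<t}) under the current policy
      logp_old:   same quantity under the old (sampling) policy
      logp_ref:   same quantity under the frozen reference policy
      advantages: group-relative advantages A_hat_{i,t}
      mask:       1 for real answer tokens, 0 for padding
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # Per-token KL estimate from [2]: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1

    per_token = (surrogate - beta * kl) * mask
    # Average over the tokens of each output, then over the group.
    per_output = per_token.sum(dim=1) / mask.sum(dim=1)
    return per_output.mean()          # maximize this (negate to use as a loss)
```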

GRPO Objective

Conclusion

GRPO represents a significant advancement in applying RL to language models. By eliminating the need for a value network and introducing group-relative advantage estimation, it provides a more efficient and stable training process. The success of DeepSeek-Math and DeepSeek-R1 demonstrates the practical benefits of this approach.

The key innovations of GRPO - group sampling, relative advantage estimation, and the elimination of the value network - provide a blueprint for future developments in LLM training. As we continue to push the boundaries of what language models can achieve, techniques like GRPO will be crucial in unlocking their full potential.

References

[1] Schulman, John, et al. Proximal Policy Optimization Algorithms. arXiv:1707.06347, arXiv, 28 Aug. 2017. arXiv.org, https://doi.org/10.48550/arXiv.1707.06347.

[2] Shao, Zhihong, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300, arXiv, 27 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2402.03300.

[3] DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, arXiv, 22 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12948.