Reinforcement learning (RL) plays a crucial role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, achieving stable and reliable training dynamics is a challenge when scaling RL with larger computational resources. Current state-of-the-art algorithms, such as GRPO, struggle with serious stability issues during the training of gigantic language models, often resulting in catastrophic failures. These instabilities arise from the incorrect application of importance sampling weights, which introduces high-variance noise. This noise accumulates with longer responses and is worsened by clipping mechanisms, causing model collapse and hindering progress.
Existing methods like PPO and GRPO rely on mechanisms like clipping to handle the off-policy learning challenges that arise when responses are sampled from outdated policies. However, these approaches face limitations due to their ill-posed objectives, particularly in large models handling long-response tasks. GRPO's token-level importance sampling introduces high-variance noise and irreversible model collapse. Attempts to recover from collapse through hyperparameter tuning or checkpoint restoration fail, highlighting a fundamental design flaw. The mismatch between token-level corrections and sequence-level rewards underscores the need for a new approach that optimizes directly at the sequence level to ensure stability and scalability.
Researchers from Alibaba Inc. have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm designed to train LLMs. GSPO's main innovation lies in its theoretically grounded importance ratio, derived from sequence likelihood, which aligns with the principles of importance sampling. Moreover, it computes normalized rewards as advantages for multiple responses to a query, promoting consistency between sequence-level rewards and the optimization objective. Empirical evaluations show that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By resolving the stability challenges of training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.
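To make the objective concrete, here is a minimal NumPy sketch of a GSPO-style surrogate, under the formulation described above: the importance ratio is the length-normalized sequence-likelihood ratio, advantages are group-normalized rewards, and clipping is applied to whole responses. The function name and small epsilon defaults are illustrative, not taken from the paper's code.

```python
import numpy as np

def gspo_surrogate(logp_new, logp_old, rewards, eps_left=3e-4, eps_right=4e-4):
    """Sketch of a GSPO-style surrogate for one query with a group of responses.

    logp_new, logp_old: lists of per-token log-prob arrays (one array per response),
                        under the current and the old (rollout) policy.
    rewards: scalar reward for each response in the group.
    """
    # Group-normalized advantages: standardize rewards within the group.
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)

    objective = 0.0
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        # Length-normalized sequence-level importance ratio:
        # s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        s = np.exp((np.sum(lp_new) - np.sum(lp_old)) / len(lp_new))
        # Clip the whole response's ratio, not individual tokens.
        s_clip = np.clip(s, 1.0 - eps_left, 1.0 + eps_right)
        objective += min(s * a, s_clip * a)
    return objective / len(r)
```

With identical old and new policies the ratio is exactly 1 for every response, so the clipped and unclipped terms coincide and the surrogate reduces to the mean advantage.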
The researchers use a cold-start model fine-tuned from Qwen3-30B-A3B-Base for the experiments, reporting training reward curves and model performance curves on the AIME'24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with clipping ranges set to 3e-4 and 4e-4 in its formulation. This leads to a two-order-of-magnitude difference in clipped token fractions compared to GRPO. Despite removing more tokens from gradient estimation, GSPO achieves higher training efficiency, a result that highlights the inefficiency of GRPO's noisy token-level estimates.
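As a rough illustration of why the clipped-token accounting differs, the sketch below contrasts the two schemes: GRPO-style clipping excludes individual tokens whose ratio leaves the trust region, while GSPO-style clipping excludes entire responses, and thus all of their tokens, at once. The thresholds and inputs here are invented for illustration; only the mechanics mirror the description above.

```python
import numpy as np

def grpo_clipped_token_fraction(token_ratios, eps=0.2):
    """GRPO-style: each token's importance ratio is tested independently."""
    ratios = np.concatenate(token_ratios)
    return np.mean((ratios < 1 - eps) | (ratios > 1 + eps))

def gspo_clipped_token_fraction(seq_ratios, lengths, eps_left=3e-4, eps_right=4e-4):
    """GSPO-style: one ratio per response; a clipped response drops all its tokens."""
    seq_ratios = np.asarray(seq_ratios)
    lengths = np.asarray(lengths)
    clipped = (seq_ratios < 1 - eps_left) | (seq_ratios > 1 + eps_right)
    return lengths[clipped].sum() / lengths.sum()

# Two responses of 10 and 30 tokens with small per-token deviations:
# no single token leaves GRPO's wide band, but the second response's
# sequence-level ratio falls outside GSPO's much tighter band.
grpo_clipped_token_fraction([np.full(10, 1.01), np.full(30, 0.99)])  # 0.0
gspo_clipped_token_fraction([1.0002, 0.999], [10, 30])               # 0.75
```

The contrast also hints at why GSPO's tighter, sequence-level band can exclude far more tokens while still yielding a lower-variance gradient estimate: exclusion happens at the level where the reward is actually assigned.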
GSPO offers significant advantages for MoE training by stabilizing the process through consistent expert activations across gradient updates, unlike GRPO, which struggles with expert-activation volatility. This removes the need for complex workarounds like Routing Replay, simplifying the infrastructure and allowing models to utilize their full capacity. On the infrastructure side, GSPO's sequence-level optimization reduces the dependency on token-level likelihoods, making it more robust to precision mismatches. This permits direct use of likelihoods from the inference engine, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL. GSPO thereby also streamlines the RL infrastructure for large-scale language model training.
In conclusion, the researchers introduced Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training LLMs. GSPO builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency seen in GRPO. Its superior performance in training stability, efficiency, and scalability, particularly for MoE models, underscores its importance as a strong algorithmic foundation. The advances made possible by GSPO played a key role in the remarkable performance of the Qwen3 models. Building on GSPO as a foundational approach, the researchers plan to expand their RL methods, opening the door for groundbreaking progress in AI.
Check out the Paper. You can also visit our GitHub Page for Tutorials, Codes, and Notebooks.
The post Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models appeared first on MarkTechPost.