Ztoog
    Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models


    Reinforcement learning (RL) plays a crucial role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, achieving stable and reliable training dynamics is a challenge when scaling RL with larger computational resources. Current state-of-the-art algorithms, such as GRPO, struggle with serious stability issues when training very large language models, often resulting in catastrophic failures. These instabilities arise from misapplied importance sampling weights, which introduce high-variance noise. This noise accumulates with longer responses and is worsened by clipping mechanisms, causing model collapse and hindering progress.

    Existing methods like PPO and GRPO rely on mechanisms like clipping to address off-policy learning challenges, where responses are sampled from outdated policies. However, these approaches face limitations due to their ill-posed objectives, particularly in large models handling long-response tasks. GRPO’s token-level importance sampling introduces high-variance noise and can lead to irreversible model collapse. Attempts to recover from collapse through hyperparameter tuning or checkpoint restoration fail, highlighting a fundamental design flaw. The mismatch between token-level corrections and sequence-level rewards emphasizes the need for a new approach that optimizes directly at the sequence level to ensure stability and scalability.
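
The mismatch described above can be made concrete with a minimal sketch. It assumes hypothetical per-token log-probabilities for a single four-token response, and it assumes the sequence-level ratio is the length-normalized (geometric mean) form; the article itself does not give the formula. Token-level weighting produces one ratio per token even though the response earns a single reward, while the sequence-level view produces one ratio per response.

```python
import math

def token_level_ratios(logp_new, logp_old):
    """One importance ratio per token: pi_new(y_t | ...) / pi_old(y_t | ...)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """One ratio per response: the sequence likelihood ratio raised to the
    power 1/length, which equals the geometric mean of the token ratios.
    The length normalization is an assumption of this sketch."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Hypothetical log-probabilities of the same 4 tokens under the old
# (sampling) policy and the new (updated) policy.
logp_old = [-1.20, -0.70, -2.10, -0.40]
logp_new = [-1.00, -0.90, -1.80, -0.45]

token_ratios = token_level_ratios(logp_new, logp_old)
seq_ratio = sequence_level_ratio(logp_new, logp_old)

print("token-level ratios:", [round(r, 3) for r in token_ratios])
print("sequence-level ratio:", round(seq_ratio, 3))
```

In this toy case the token ratios disagree with one another (some above 1, some below), so a token-level clip would treat parts of the same response differently. The single sequence-level ratio always lies between the smallest and largest token ratio.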

    Researchers from Alibaba Inc. have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm designed to train LLMs. GSPO’s main innovation lies in its theoretically grounded importance ratio, derived from sequence likelihood, which aligns with the principles of importance sampling. Moreover, it calculates normalized rewards as advantages for multiple responses to a query, promoting consistency between sequence-level rewards and optimization goals. Empirical evaluations reveal that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By resolving stability challenges in training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.
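
The article does not write out the formulas, so the following is a sketch of the objective as we understand it from the GSPO paper, not a verbatim copy. The notation is an assumption of this sketch: $x$ is a query, $y_1, \dots, y_G$ are the $G$ responses sampled for it from the old policy $\pi_{\theta_{\text{old}}}$, $r(x, y_i)$ is the reward of response $y_i$, and $\varepsilon$ is the clipping range.

```latex
\begin{align}
  s_i(\theta)
    &= \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|}
     = \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
         \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}
                   {\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \right), \\
  \hat{A}_i
    &= \frac{r(x, y_i) - \operatorname{mean}\bigl(\{ r(x, y_j) \}_{j=1}^{G}\bigr)}
            {\operatorname{std}\bigl(\{ r(x, y_j) \}_{j=1}^{G}\bigr)}, \\
  J(\theta)
    &= \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
         \min\!\Bigl( s_i(\theta)\, \hat{A}_i,\;
                      \operatorname{clip}\bigl(s_i(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\bigr)\, \hat{A}_i \Bigr) \right].
\end{align}
```

The first line is the importance ratio derived from sequence likelihood, the second is the normalized reward used as the advantage, and the third applies clipping to the whole response through $s_i(\theta)$ rather than to each token.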

    Researchers use a cold-start model fine-tuned from Qwen3-30B-A3B-Base for the experiment, reporting the training reward curves and the model performance curves across the AIME’24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with clipping ranges set to 3e-4 and 4e-4 in its formulation. As a result, the fraction of clipped tokens differs from GRPO’s by two orders of magnitude, with GSPO clipping more. Despite removing more tokens from gradient estimation, GSPO achieves higher training efficiency. This result highlights the inefficiency of GRPO’s noisy token-level estimates.
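
To show what clipping entire responses looks like, here is a minimal, dependency-free sketch of a sequence-level clipped surrogate. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the ratios and rewards are invented, the population standard deviation is used for normalization, and 3e-4 and 4e-4 are treated as the lower and upper clipping ranges.

```python
import statistics

def group_advantages(rewards):
    """Normalize rewards within one group of responses to the same query.
    Assumption: population standard deviation; zero spread gives zeros."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def gspo_surrogate(seq_ratios, rewards, eps_low=3e-4, eps_high=4e-4):
    """Return (objective, clipped_fraction) for one group of responses.
    seq_ratios holds one sequence-level importance ratio per response."""
    advantages = group_advantages(rewards)
    terms, clipped = [], 0
    for s, a in zip(seq_ratios, advantages):
        s_clipped = max(1.0 - eps_low, min(1.0 + eps_high, s))
        unclipped_term = s * a
        clipped_term = s_clipped * a
        # A response contributes no gradient when the clipped branch is the
        # strictly smaller one, so the whole response counts as clipped.
        if clipped_term < unclipped_term:
            clipped += 1
        terms.append(min(unclipped_term, clipped_term))
    return sum(terms) / len(terms), clipped / len(terms)

# Hypothetical group of 4 responses with binary rewards.
seq_ratios = [1.0010, 0.9990, 1.0002, 0.9999]
rewards = [1.0, 0.0, 1.0, 0.0]

objective, clipped_fraction = gspo_surrogate(seq_ratios, rewards)
print(objective, clipped_fraction)
```

Because the clipping window is so narrow, the first two responses fall outside it and are dropped from the gradient in full, while the last two stay inside. The fraction returned here counts responses; a clipped-token fraction like the one the researchers report would additionally weight each response by its length.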

    GSPO offers significant advantages for MoE training by stabilizing the process through consistent expert activations across gradient updates, unlike GRPO, which struggles with expert-activation volatility. This removes the need for complex workarounds like Routing Replay, simplifying the infrastructure and allowing models to use their full capacity. In RL infrastructure, GSPO’s sequence-level optimization reduces dependency on token-level likelihoods, making it more robust to precision mismatch. This enables direct use of inference engine likelihoods, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL. Together, these properties streamline RL infrastructure for large-scale language model training.
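
The robustness to precision mismatch can be illustrated with a small sketch. It assumes hypothetical per-token log-likelihood discrepancies between a training engine and an inference engine, and it assumes a length-normalized sequence-level ratio; neither the numbers nor the function names come from the researchers.

```python
import math

def max_token_ratio_deviation(log_errors):
    """Largest distance from 1 among per-token ratios exp(error)."""
    return max(abs(math.exp(e) - 1.0) for e in log_errors)

def sequence_ratio_deviation(log_errors):
    """Distance from 1 of the length-normalized sequence ratio,
    exp(mean(error)), which averages the per-token errors."""
    return abs(math.exp(sum(log_errors) / len(log_errors)) - 1.0)

# Hypothetical log-likelihood discrepancies for one 8-token response,
# e.g. caused by different numeric precision in the two engines.
log_errors = [0.004, -0.003, 0.002, -0.005, 0.001, -0.002, 0.003, -0.001]

token_dev = max_token_ratio_deviation(log_errors)
seq_dev = sequence_ratio_deviation(log_errors)
print("worst token-level deviation:", token_dev)
print("sequence-level deviation:", seq_dev)
```

The sequence-level deviation can never exceed the worst token-level one, because a geometric mean lies between the smallest and largest token ratio. How much smaller it gets depends on how far the per-token errors cancel; errors that all share a sign would not average out.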

    In conclusion, researchers introduced Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training LLMs. GSPO builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency seen in GRPO. Its superior performance in training stability, efficiency, and scalability, particularly for MoE models, emphasizes its importance as a strong algorithmic foundation. The advancements made possible by GSPO have played a key role in the remarkable performance of the Qwen3 models. Building on GSPO as a foundational approach, researchers plan to expand RL methods, opening the door for groundbreaking progress in AI.


    Check out the Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

    The post Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models appeared first on MarkTechPost.

    © 2026 Ztoog.
