    Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models


Reinforcement learning (RL) plays an important role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, achieving stable and reliable training dynamics is a challenge when scaling RL to larger computational budgets. Current state-of-the-art algorithms, such as GRPO, suffer from severe stability issues when training very large language models, often resulting in catastrophic failures. These instabilities arise from an incorrect application of importance-sampling weights, which introduces high-variance noise. The noise accumulates over longer responses and is worsened by clipping mechanisms, ultimately causing model collapse and stalling progress.

Existing methods like PPO and GRPO rely on mechanisms such as clipping to handle off-policy learning, where responses are drawn from outdated policies. However, these approaches face limitations due to their ill-posed objectives, particularly in large models handling long-response tasks. GRPO's token-level importance sampling introduces high-variance noise and irreversible model collapse. Attempts to recover from collapse through hyperparameter tuning or checkpoint restoration fail, pointing to a fundamental design flaw. The mismatch between token-level corrections and sequence-level rewards underscores the need for a new approach that optimizes directly at the sequence level to ensure stability and scalability.
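To make that mismatch concrete, here is a minimal sketch in plain Python with toy numbers. The function names and the log-probability values are our illustration, not code from the paper: it contrasts GRPO-style per-token importance ratios with a single sequence-level ratio computed from the whole-response likelihood and normalized by response length.

```python
import math

def token_level_ratios(logp_new, logp_old):
    # GRPO-style: one importance ratio per token, r_t = pi_new(y_t)/pi_old(y_t).
    # Each ratio carries its own noise, which accumulates over long responses.
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    # GSPO-style: a single ratio from the whole-sequence likelihood,
    # raised to 1/|y| (length-normalized) so long responses stay bounded.
    T = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / T)

# Toy per-token log-probabilities for a 4-token response (illustrative values).
logp_old = [-1.0, -2.0, -0.5, -1.5]
logp_new = [-0.9, -2.2, -0.4, -1.6]

print(token_level_ratios(logp_new, logp_old))    # four separate noisy ratios
print(sequence_level_ratio(logp_new, logp_old))  # one length-normalized ratio
```

The sequence-level ratio matches the granularity of the reward, which is assigned to the whole response, whereas the per-token ratios correct at a granularity where no reward signal exists.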

Researchers from Alibaba have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training LLMs. GSPO's central innovation is its theoretically grounded importance ratio, derived from sequence likelihood, which aligns with the principles of importance sampling. It also computes normalized rewards as advantages across multiple responses to a query, promoting consistency between sequence-level rewards and the optimization objective. Empirical evaluations show that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By resolving stability challenges in training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.
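The "normalized rewards as advantages" step can be sketched as the standard group baseline: for a group of responses to the same query, each response's advantage is its reward minus the group mean, divided by the group standard deviation. The helper below is our illustration of that computation, not the paper's code:

```python
def group_advantages(rewards):
    # Group-normalized advantages for G responses to one query:
    # A_i = (r_i - mean(r)) / std(r). Every response in the group shares
    # the same baseline, so no learned value model is needed.
    G = len(rewards)
    mean = sum(rewards) / G
    std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5
    std = std or 1.0  # guard: identical rewards would divide by zero
    return [(r - mean) / std for r in rewards]

# Four sampled responses to one query, with scalar rewards (toy values).
print(group_advantages([1.0, 0.0, 0.5, 0.5]))
```

Because the advantage is defined per response, it pairs naturally with a per-response (sequence-level) importance ratio rather than per-token ratios.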

For the experiments, the researchers use a cold-start model fine-tuned from Qwen3-30B-A3B-Base, reporting training reward curves and model performance curves on the AIME'24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with clipping ranges set to 3e-4 and 4e-4 in its formulation. This leads to a two-order-of-magnitude difference in the fraction of clipped tokens compared with GRPO. Despite excluding more tokens from gradient estimation, GSPO achieves higher training efficiency, a result that highlights the inefficiency of GRPO's noisy token-level estimates.
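A sketch of what sequence-level clipping with those ranges might look like, assuming a PPO-style pessimistic surrogate applied once per response (the function name and exact form are our illustration under that assumption):

```python
def gspo_clipped_objective(seq_ratio, advantage, eps_low=3e-4, eps_high=4e-4):
    # Clip the single sequence-level ratio into [1 - eps_low, 1 + eps_high]
    # and take the pessimistic minimum, PPO-style, but per response rather
    # than per token. If the ratio leaves the band, the WHOLE response is
    # excluded from the gradient, not individual tokens.
    clipped = min(max(seq_ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(seq_ratio * advantage, clipped * advantage)

# A response whose policy drifted slightly (ratio 1.001) gets capped at 1.0004.
print(gspo_clipped_objective(1.001, 1.0))
```

Note how tight the band is compared with PPO's typical 0.1 to 0.2: because the ratio summarizes an entire response, small deviations already signal a meaningful policy shift.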

GSPO offers significant benefits for MoE training by stabilizing the process through consistent expert activations across gradient updates, unlike GRPO, which struggles with expert-activation volatility. This removes the need for complex workarounds such as Routing Replay, simplifying the infrastructure and allowing models to use their full capacity. On the infrastructure side, GSPO's sequence-level optimization reduces its dependence on token-level likelihoods, making it more robust to precision mismatches. This enables the direct use of inference-engine likelihoods, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL. GSPO thereby streamlines RL infrastructure for large-scale language model training.

In conclusion, the researchers introduced Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training LLMs. GSPO builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency seen in GRPO. Its superior training stability, efficiency, and scalability, particularly for MoE models, establish it as a strong algorithmic foundation. The advances enabled by GSPO have played a key role in the remarkable performance of the Qwen3 models. Building on GSPO as a foundation, the researchers plan to extend their RL methods, opening the door to further progress in AI.


Check out the Paper. Feel free to look at our GitHub Page for Tutorials, Codes and Notebooks. Also, follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

The post Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models appeared first on MarkTechPost.
