    Meet Medusa: An Efficient Machine Learning Framework for Accelerating Large Language Models (LLMs) Inference with Multiple Decoding Heads


    The most recent development in the field of Artificial Intelligence (AI), Large Language Models (LLMs), has brought a marked improvement in language generation. With model sizes reaching billions of parameters, these models are entering every domain, from healthcare and finance to education.

    Though these models have shown remarkable capabilities, the growth in model size has led to increased inference latency, which poses a problem for real-world applications. Memory-bound operations are the main bottleneck in LLM inference, because every auto-regressive decoding step has to move all model parameters from High Bandwidth Memory (HBM) to the accelerator's cache, only to produce a single token.
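
The memory-bound argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that one decoding step reads every weight from HBM exactly once and ignores compute, the KV cache, and batching. The function name and the example numbers (7e9 parameters, 2 bytes per parameter, 2e12 bytes/s of bandwidth) are hypothetical and are not taken from the study.

```python
def min_decode_step_seconds(num_params: float, bytes_per_param: float,
                            hbm_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decoding step when every weight is read once from HBM."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / hbm_bandwidth_bytes_per_s


# Hypothetical example: 7e9 parameters in 16-bit precision, 2e12 bytes/s of HBM bandwidth.
step_s = min_decode_step_seconds(7e9, 2, 2e12)
print(f"{step_s * 1e3:.1f} ms per token, at most {1 / step_s:.0f} tokens/s")
```

Under these assumptions the bound does not depend on how many tokens a step verifies, which is why accepting several tokens per step pays off.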

    Researchers have been working on solutions to these limitations. One approach is to reduce the number of decoding steps and increase the arithmetic intensity of the decoding process. Speculative decoding has been suggested for this purpose: a smaller draft model produces a sequence of tokens, which the larger original model then verifies. However, incorporating a draft model into a distributed system is difficult.
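
The draft-then-verify idea behind speculative decoding can be sketched in a few lines. This is a minimal greedy variant, not the sampling-based algorithm used in practice. The names `speculative_step`, `draft_model`, and `target_model` are hypothetical, and both "models" are toy functions that map a token prefix to the next token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The target model checks each position. A real system scores all k
    #    positions in one batched forward pass; this loop only mimics the result.
    accepted: List[int] = []
    for token in proposal:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    # All k drafts matched; the same pass also yields one extra target token.
    accepted.append(target(prefix + accepted))
    return accepted


# Toy stand-ins (hypothetical): the target counts upward; the draft agrees
# except that it guesses 0 whenever the correct token is a multiple of 4.
def target_model(tokens: List[int]) -> int:
    return tokens[-1] + 1


def draft_model(tokens: List[int]) -> int:
    nxt = tokens[-1] + 1
    return 0 if nxt % 4 == 0 else nxt


print(speculative_step([1], draft_model, target_model, k=5))
```

In this greedy form the accepted tokens are exactly what the target model would have produced on its own, so the draft model only affects speed, not output.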

    To overcome these challenges, a team of researchers has presented MEDUSA in a recent study. MEDUSA is an efficient method that speeds up LLM inference by adding extra decoding heads to the backbone model so that it predicts multiple subsequent tokens in parallel. Because these heads predict several tokens at once, they avoid the difficulties of speculative decoding.

    Unlike speculative decoding, MEDUSA does not require a separate draft model, which makes it easy to integrate into existing LLM systems, even in distributed settings. The team has shared that MEDUSA builds multiple candidate continuations in each decoding step and verifies them simultaneously using a tree-based attention mechanism. By exploiting parallel processing, MEDUSA reduces the number of decoding steps required while adding very little overhead to single-step latency.
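
One way to picture the tree-based attention mechanism: combine the top predictions of each head into a tree of candidate continuations, then let every tree node attend only to itself and its ancestors, so all candidates can be scored in a single forward pass. The sketch below illustrates that idea under simplifying assumptions (a full Cartesian-product tree, and only the mask among tree nodes, not the attention to the existing prefix). The function names and token ids are hypothetical.

```python
from itertools import product
from typing import List, Tuple

Path = Tuple[int, ...]  # a tree node, identified by the tokens leading to it


def build_tree(top_tokens_per_head: List[List[int]]) -> List[Path]:
    """List every tree node as the path of tokens leading to it, parents first."""
    nodes: List[Path] = []
    for depth in range(1, len(top_tokens_per_head) + 1):
        nodes.extend(product(*top_tokens_per_head[:depth]))
    return nodes


def tree_attention_mask(nodes: List[Path]) -> List[List[bool]]:
    """mask[i][j] is True when node i may attend to node j (j is i or an ancestor of i)."""
    return [[query[:len(key)] == key for key in nodes] for query in nodes]


# Hypothetical token ids: head 1 proposes {11, 12}, head 2 proposes {21, 22, 23}.
nodes = build_tree([[11, 12], [21, 22, 23]])
mask = tree_attention_mask(nodes)
candidates = [n for n in nodes if len(n) == 2]
print(len(nodes), "tree nodes,", len(candidates), "candidate continuations")
```

Because sibling branches never attend to each other, the six candidates here share one pass over eight tree tokens instead of needing six separate passes.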

    MEDUSA builds on two key insights. First, multiple candidate continuations are generated using the MEDUSA heads and verified concurrently. Second, an acceptance procedure is used to choose suitable candidates. The team has shared that the rejection sampling strategy used in speculative decoding can be effectively replaced by a temperature-based threshold to handle deviations.
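
A threshold-based acceptance rule of this kind can be sketched as follows. The specific form used here, accepting a token when its probability under the original model exceeds min(epsilon, delta * exp(-entropy)), is an assumption based on the typical acceptance rule as commonly described for MEDUSA. This article does not state the formula, and the `epsilon` and `delta` values are illustrative only.

```python
import math
from typing import Dict, List


def entropy(probs: Dict[int, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def accepted_prefix_length(candidate: List[int], step_probs: List[Dict[int, float]],
                           epsilon: float, delta: float) -> int:
    """Count how many leading candidate tokens pass the threshold test.

    step_probs[i] is the original model's distribution at candidate position i,
    already scaled by the sampling temperature.
    """
    accepted = 0
    for token, probs in zip(candidate, step_probs):
        threshold = min(epsilon, delta * math.exp(-entropy(probs)))
        if probs.get(token, 0.0) <= threshold:
            break  # the first rejected token ends the accepted prefix
        accepted += 1
    return accepted


# Hypothetical distributions for a three-token candidate [5, 7, 9].
step_probs = [{5: 0.9, 6: 0.1}, {7: 0.5, 8: 0.5}, {9: 0.02, 10: 0.98}]
print(accepted_prefix_length([5, 7, 9], step_probs, epsilon=0.09, delta=0.3))
```

The entropy term loosens the threshold when the original model itself is uncertain, so plausible tokens are not rejected just because probability mass is spread over many options.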

    The study proposes two methods for fine-tuning an LLM's predictive MEDUSA heads:

    1. MEDUSA-1: This enables lossless inference acceleration by fine-tuning the MEDUSA heads directly on top of a frozen backbone LLM. MEDUSA-1 is recommended when adding MEDUSA to an existing model or when computational resources are limited. It uses less memory and can be made even more efficient by applying quantization techniques.
    2. MEDUSA-2: This method fine-tunes the MEDUSA heads and the main LLM together. While it offers a greater speedup and improved prediction accuracy for the MEDUSA heads, it requires a special training recipe to preserve the backbone model's capabilities. MEDUSA-2 is appropriate when resources are plentiful, and it allows the MEDUSA heads and the backbone model to be trained jointly without sacrificing output quality or next-token prediction ability.
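
The training signal for the heads can be illustrated without a deep learning framework. The sketch below assumes that head k (counting from 1) is trained to predict the token k + 1 positions ahead, since the backbone's own head already predicts the next token, and that each head's cross-entropy term is weighted by a decaying factor. Both the indexing and the decay value of 0.8 are assumptions for illustration, and the function names are hypothetical. With a frozen backbone, as in MEDUSA-1, only the head parameters would receive gradients from this loss.

```python
import math
from typing import List, Sequence


def head_targets(tokens: Sequence[int], num_heads: int) -> List[List[int]]:
    """For each position t, head k (1-based) is trained to predict tokens[t + k + 1]."""
    targets: List[List[int]] = []
    for k in range(1, num_heads + 1):
        targets.append([tokens[t + k + 1] for t in range(len(tokens) - k - 1)])
    return targets


def heads_loss(prob_of_target: Sequence[float], decay: float = 0.8) -> float:
    """Weighted cross-entropy at one position.

    prob_of_target[k - 1] is the probability head k assigns to its true target;
    later heads are down-weighted by decay**k because their task is harder.
    """
    return sum((decay ** k) * -math.log(p)
               for k, p in enumerate(prob_of_target, start=1))


print(head_targets([10, 11, 12, 13, 14], num_heads=2))
print(round(heads_loss([0.5, 0.25]), 4))
```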

    The research also proposes several extensions to enhance or broaden the use of MEDUSA. These include a typical acceptance scheme, which raises the acceptance rate without sacrificing generation quality, and a self-distillation method for cases where no training data is available. The team has shared that the evaluation of MEDUSA included testing on models of various sizes and training protocols. The results show that MEDUSA-1 can speed up inference by more than 2.2× without sacrificing generation quality, and that MEDUSA-2 improves the speedup to 2.3-3.6×.
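
Speedup figures like these combine two effects: more tokens accepted per decoding step, and a slightly slower step. A simple way to relate them, assuming wall-clock speedup is the average number of tokens per step divided by the relative cost of a step, is sketched below. The function name and the input numbers are hypothetical and are not results from the study.

```python
def expected_speedup(tokens_per_step: float, step_latency_ratio: float) -> float:
    """Speedup over one-token-per-step decoding.

    tokens_per_step: average tokens accepted per decoding step.
    step_latency_ratio: latency of one accelerated step divided by a plain step.
    """
    return tokens_per_step / step_latency_ratio


# Hypothetical inputs: 2.5 tokens accepted per step, each step 10% slower.
speedup = expected_speedup(tokens_per_step=2.5, step_latency_ratio=1.1)
print(f"{speedup:.2f}x")
```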


    Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


    Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
    She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.


