New techniques efficiently accelerate sparse tensors for massive AI models

Researchers from MIT and NVIDIA have developed two techniques that accelerate the processing of sparse tensors, a type of data structure used for high-performance computing tasks. The complementary techniques could result in significant improvements to the performance and energy efficiency of systems like the massive machine-learning models that drive generative artificial intelligence.

Tensors are data structures used by machine-learning models. Both of the new methods seek to efficiently exploit what is known as sparsity (zero values) in the tensors. When processing these tensors, one can skip over the zeros and save on both computation and memory. For instance, anything multiplied by zero is zero, so the hardware can skip that operation. And it can compress the tensor (zeros don't need to be stored) so a larger portion can be kept in on-chip memory.
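To make the savings concrete, here is a minimal sketch in NumPy of the two ideas above: compressing a sparse matrix by storing only its nonzero values with their coordinates, and multiplying it by a vector while skipping every zero. The function names are illustrative, not from the papers; real accelerators do this in hardware with far more elaborate formats.

```python
import numpy as np

def compress_sparse(tensor):
    """Store only the nonzero values plus their coordinates.

    This mimics why sparsity saves memory: the zeros are dropped
    entirely, so only the nonzero entries occupy buffer space.
    """
    coords = np.argwhere(tensor != 0)   # positions of the nonzeros
    values = tensor[tensor != 0]        # the nonzero values themselves
    return coords, values, tensor.shape

def sparse_dot(coords, values, shape, vector):
    """Multiply a compressed matrix by a dense vector, skipping zeros.

    Every multiply-accumulate below involves a nonzero entry, so all
    of the "anything times zero is zero" work is never performed.
    """
    out = np.zeros(shape[0])
    for (i, j), v in zip(coords, values):
        out[i] += v * vector[j]
    return out

# A 4x4 matrix that is 75 percent zeros: only 4 of 16 values are stored,
# and only 4 of 16 multiply-accumulates are executed.
A = np.array([[0, 0, 3, 0],
              [0, 0, 0, 0],
              [1, 0, 0, 2],
              [0, 4, 0, 0]], dtype=float)
coords, values, shape = compress_sparse(A)
x = np.array([1.0, 2.0, 3.0, 4.0])
print(sparse_dot(coords, values, shape, x))  # matches A @ x: [9. 0. 9. 8.]
```

The same bookkeeping that enables skipping is also what makes finding the nonzeros hard at scale, which is the challenge the next paragraphs describe.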

However, there are several challenges to exploiting sparsity. Finding the nonzero values in a large tensor is no easy task. Existing approaches often limit the locations of nonzero values by enforcing a sparsity pattern to simplify the search, but this limits the variety of sparse tensors that can be processed efficiently.

Another challenge is that the number of nonzero values can vary in different regions of the tensor. This makes it difficult to determine how much space is required to store different regions in memory. To make sure a region fits, more space is often allocated than is needed, causing the storage buffer to be underutilized. This increases off-chip memory traffic, which increases energy consumption.

The MIT and NVIDIA researchers crafted two solutions to address these problems. For one, they developed a technique that allows the hardware to efficiently find the nonzero values for a wider variety of sparsity patterns.

For the other solution, they created a method that can handle the case where the data don't fit in memory, which increases the utilization of the storage buffer and reduces off-chip memory traffic.

Both methods boost the performance and reduce the energy demands of hardware accelerators specifically designed to speed up the processing of sparse tensors.

“Typically, when you use more specialized or domain-specific hardware accelerators, you lose the flexibility that you would get from a more general-purpose processor, like a CPU. What stands out with these two works is that we show that you can still maintain flexibility and adaptability while being specialized and efficient,” says Vivienne Sze, associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the Research Laboratory of Electronics (RLE), and co-senior author of papers on both advances.

Her co-authors include lead authors Yannan Nellie Wu PhD ’23 and Zi Yu Xue, an electrical engineering and computer science graduate student; and co-senior author Joel Emer, an MIT professor of the practice in computer science and electrical engineering and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), as well as others at NVIDIA. Both papers will be presented at the IEEE/ACM International Symposium on Microarchitecture.

HighLight: Efficiently finding zero values

Sparsity can arise in the tensor for a variety of reasons. For example, researchers sometimes “prune” unnecessary pieces of the machine-learning models by replacing some values in the tensor with zeros, creating sparsity. The degree of sparsity (percentage of zeros) and the locations of the zeros can vary for different models.

To make it easier to find the remaining nonzero values in a model with billions of individual values, researchers often restrict the locations of the nonzero values so they fall into a certain pattern. However, each hardware accelerator is typically designed to support one specific sparsity pattern, limiting its flexibility.

By contrast, the hardware accelerator the MIT researchers designed, called HighLight, can handle a wide variety of sparsity patterns and still perform well when running models that don’t have any zero values.

They use a technique they call “hierarchical structured sparsity” to efficiently represent a wide variety of sparsity patterns that are composed of several simple sparsity patterns. This approach divides the values in a tensor into smaller blocks, where each block has its own simple sparsity pattern (perhaps two zeros and two nonzeros in a block with four values).

Then, they combine the blocks into a hierarchy, where each collection of blocks also has its own simple sparsity pattern (perhaps one zero block and three nonzero blocks in a level with four blocks). They continue combining blocks into larger levels, but the patterns remain simple at each step.
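The two-level idea can be sketched with a small checker: each block of four values may hold at most two nonzeros, and each group of four blocks may hold at most three blocks that contain any nonzero at all. The specific pattern parameters here are illustrative examples, not the exact hierarchy used in the HighLight paper.

```python
import numpy as np

def conforms(flat, block_size=4, max_nnz_block=2,
             group_size=4, max_nnz_group=3):
    """Check a flat tensor against a two-level structured-sparsity pattern.

    Level 1: each block of `block_size` values may hold at most
             `max_nnz_block` nonzeros (e.g. two nonzeros per four values).
    Level 2: each group of `group_size` blocks may hold at most
             `max_nnz_group` blocks that contain any nonzero at all.
    """
    blocks = flat.reshape(-1, block_size)
    nnz_per_block = np.count_nonzero(blocks, axis=1)
    if np.any(nnz_per_block > max_nnz_block):
        return False                       # a block violates the level-1 pattern
    block_is_nonzero = nnz_per_block > 0
    groups = block_is_nonzero.reshape(-1, group_size)
    return not np.any(groups.sum(axis=1) > max_nnz_group)

# 16 values = 4 blocks of 4 values = 1 group of 4 blocks.
ok = np.array([5, 0, 0, 1,  0, 0, 0, 0,  0, 2, 0, 0,  7, 0, 3, 0], dtype=float)
bad = np.array([5, 6, 7, 1,  0, 0, 0, 0,  0, 2, 0, 0,  7, 0, 3, 0], dtype=float)
print(conforms(ok))   # True:  at most 2 nonzeros per block, 3 nonzero blocks
print(conforms(bad))  # False: the first block has 4 nonzeros
```

Because each level only ever checks a simple, fixed-size pattern, the hardware never needs to search an arbitrary layout of nonzeros, which is what keeps the skipping logic cheap.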

This simplicity enables HighLight to more efficiently find and skip zeros, so it can take full advantage of the opportunity to cut excess computation. On average, their accelerator design had about six times better energy-delay product (a metric related to energy efficiency) than other approaches.

    “In the end, the HighLight accelerator is able to efficiently accelerate dense models because it does not introduce a lot of overhead, and at the same time it is able to exploit workloads with different amounts of zero values based on hierarchical structured sparsity,” Wu explains.

In the future, she and her collaborators want to apply hierarchical structured sparsity to more types of machine-learning models and different types of tensors in the models.

    Tailors and Swiftiles: Effectively “overbooking” to accelerate workloads

Researchers can also leverage sparsity to more efficiently move and process data on a computer chip.

Since the tensors are often larger than what can be stored in the memory buffer on the chip, the chip only grabs and processes a chunk of the tensor at a time. The chunks are called tiles.

To maximize the utilization of that buffer and limit the number of times the chip must access off-chip memory, which often dominates energy consumption and limits processing speed, researchers seek to use the largest tile that will fit into the buffer.

But in a sparse tensor, many of the data values are zero, so an even larger tile can fit into the buffer than one might expect based on its capacity. Zero values don’t need to be stored.

But the number of zero values can vary across different regions of the tensor, so it can also vary for each tile. This makes it difficult to determine a tile size that will fit in the buffer. As a result, existing approaches often conservatively assume there are no zeros and end up selecting a smaller tile, which results in wasted blank spaces in the buffer.

To address this uncertainty, the researchers propose the use of “overbooking” to allow them to increase the tile size, as well as a method to tolerate it when a tile doesn’t fit in the buffer.

It works the same way an airline overbooks tickets for a flight: if all the passengers show up, the airline must compensate those who are bumped from the plane, but usually not all the passengers show up.

In a sparse tensor, a tile size can be chosen such that the tiles will usually have enough zeros that most still fit into the buffer. But occasionally, a tile will have more nonzero values than will fit. In this case, those data are bumped out of the buffer.

The researchers enable the hardware to re-fetch only the bumped data, without grabbing and processing the entire tile again. They modify the “tail end” of the buffer to handle this, hence the name of the technique, Tailors.

Then they also created an approach for finding the size of the tiles that takes advantage of overbooking. This method, called Swiftiles, swiftly estimates the ideal tile size so that a specific percentage of tiles, set by the user, are overbooked. (The names “Tailors” and “Swiftiles” pay homage to Taylor Swift, whose recent Eras tour was fraught with overbooked presale codes for tickets.)
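A software sketch of the overbooking idea, under stated assumptions: given a flat tensor and a buffer that can hold a fixed number of nonzero values, pick the largest tile size for which at most a user-set fraction of tiles overflow. This is a hypothetical brute-force estimator in the spirit of Swiftiles (which estimates the size without exhaustively scanning), not the paper's algorithm; the function name and parameters are invented for illustration.

```python
import numpy as np

def estimate_tile_size(tensor_1d, buffer_capacity, overbook_frac=0.1):
    """Pick the largest tile size for which at most `overbook_frac`
    of tiles hold more nonzeros than the buffer can store.

    Overbooking lets most tiles exceed the buffer's dense capacity,
    because only nonzeros occupy buffer space; the rare overflowing
    tile is tolerated and its bumped data re-fetched later (the role
    of the Tailors mechanism).
    """
    n = len(tensor_1d)
    best = buffer_capacity                  # dense fallback: always fits
    for tile in range(buffer_capacity, n + 1, buffer_capacity):
        pad = -n % tile                     # zero-pad so tiles divide evenly
        padded = np.pad(tensor_1d, (0, pad))
        nnz = np.count_nonzero(padded.reshape(-1, tile), axis=1)
        if np.mean(nnz > buffer_capacity) <= overbook_frac:
            best = tile                     # a bigger tile is still acceptable
    return best

rng = np.random.default_rng(0)
data = rng.random(4096) * (rng.random(4096) < 0.1)  # roughly 90% zeros
tile = estimate_tile_size(data, buffer_capacity=64, overbook_frac=0.1)
print(tile)  # several times larger than the dense capacity of 64
```

With about 90 percent zeros, tiles several times larger than the buffer's dense capacity still fit almost every time, which is exactly the headroom that overbooking exploits.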

Swiftiles reduces the number of times the hardware needs to check the tensor to identify an ideal tile size, saving on computation. The combination of Tailors and Swiftiles more than doubles the speed while requiring only half the energy demands of existing hardware accelerators that cannot handle overbooking.

    “Swiftiles allows us to estimate how large these tiles need to be without requiring multiple iterations to refine the estimate. This only works because overbooking is supported. Even if you are off by a decent amount, you can still extract a fair bit of speedup because of the way the non-zeros are distributed,” Xue says.

In the future, the researchers want to apply the idea of overbooking to other aspects of computer architecture, and also work to improve the process for estimating the optimal level of overbooking.

This research is funded, in part, by the MIT AI Hardware Program.
