    NVIDIA Researchers Introduce KVTC Transform Coding Pipeline to Compress Key-Value Caches by 20x for Efficient LLM Serving


    Serving Large Language Models (LLMs) at scale is a hard engineering problem, largely because of Key-Value (KV) cache management. As models grow in size and reasoning capability, the KV cache footprint increases and becomes a major bottleneck for throughput and latency. For modern Transformers, this cache can occupy several gigabytes.
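To make "several gigabytes" concrete, here is a back-of-envelope footprint calculation for an illustrative Llama-3.1-8B-style configuration (32 layers, 8 grouped-query KV heads, head dimension 128, fp16); these numbers are assumptions for illustration, not taken from the paper:

```python
# Illustrative KV cache footprint: 32 layers, 8 KV heads (GQA),
# head_dim 128, fp16 (2 bytes per element).
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

for seq_len in (8_192, 128_000):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB per sequence")
```

At 128 KiB per token under these assumptions, a single 128K-token context already needs over 15 GiB of KV cache, which is why a 20x compressor materially changes the serving economics.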

    NVIDIA researchers have introduced KVTC (KV Cache Transform Coding), a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. It achieves up to 20x compression while maintaining reasoning and long-context accuracy, and for specific use cases it can reach 40x or higher.

    https://arxiv.org/pdf/2511.01815

    The Memory Dilemma in LLM Inference

    In production, inference frameworks treat local KV caches like databases. Techniques such as prefix sharing encourage cache reuse to speed up responses, but stale caches consume scarce GPU memory. Developers currently face a difficult choice:

    • Keep the cache: occupies memory needed for other users.
    • Discard the cache: incurs the high cost of recomputation.
    • Offload the cache: moves data to CPU DRAM or SSDs, leading to transfer overheads.

    KVTC largely mitigates this dilemma by reducing both the cost of on-chip retention and the bandwidth required for offloading.


    How the KVTC Pipeline Works

    The method is inspired by classical media compression: it applies a learned orthonormal transform, followed by adaptive quantization and entropy coding.

    1. Feature Decorrelation (PCA)

    Different attention heads often show similar patterns and a high degree of correlation. KVTC uses Principal Component Analysis (PCA) to linearly decorrelate features. Unlike methods that calculate a separate decomposition for every prompt, KVTC computes the PCA basis matrix V once on a calibration dataset; this matrix is then reused for all future caches at inference time.
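The offline-calibration idea can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the calibration matrix, feature dimension, and helper names are invented for illustration. The key property is that an orthonormal basis fitted once is lossless to apply and invert:

```python
import numpy as np

# One-time calibration: fit an orthonormal PCA basis on KV feature
# vectors gathered from a calibration corpus. The 64-dim feature size
# and synthetic data are illustrative only.
rng = np.random.default_rng(0)
calib = rng.standard_normal((10_000, 64)) @ rng.standard_normal((64, 64))

mean = calib.mean(axis=0)
# Rows of vt (columns of V) are principal directions, ordered by variance
_, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
V = vt.T

def decorrelate(kv: np.ndarray) -> np.ndarray:
    """Project KV features onto the fixed PCA basis (reused at inference)."""
    return (kv - mean) @ V

def reconstruct(coeffs: np.ndarray) -> np.ndarray:
    """Invert the projection; exact because V is orthonormal."""
    return coeffs @ V.T + mean

x = rng.standard_normal((5, 64))
assert np.allclose(reconstruct(decorrelate(x)), x)
```

All of the actual information loss comes from the quantization stage that follows; the transform itself only concentrates variance into the leading coordinates.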

    2. Adaptive Quantization

    The system exploits the PCA ordering to allocate a fixed bit budget across coordinates: high-variance components receive more bits, while others receive fewer. KVTC uses a dynamic programming (DP) algorithm to find the bit allocation that minimizes reconstruction error. Crucially, the DP often assigns 0 bits to trailing principal components, allowing for early dimensionality reduction and faster performance.
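A DP bit allocation of this kind can be sketched with a standard rate-distortion-style cost, where quantizing coordinate i with b bits costs roughly var_i * 2^(-2b). The cost model, budget, and variance values here are illustrative assumptions, not the paper's exact formulation:

```python
def allocate_bits(variances, total_bits, max_bits=8):
    """DP bit allocation: minimize sum of per-coordinate quantization
    error var_i * 2**(-2*b_i) subject to sum(b_i) <= total_bits.
    Trailing low-variance components naturally receive 0 bits."""
    d = len(variances)
    INF = float("inf")
    # cost[i][r]: min error for coordinates i.. with r bits remaining
    cost = [[INF] * (total_bits + 1) for _ in range(d)] + \
           [[0.0] * (total_bits + 1)]
    choice = [[0] * (total_bits + 1) for _ in range(d)]
    for i in range(d - 1, -1, -1):
        for r in range(total_bits + 1):
            for b in range(min(max_bits, r) + 1):
                c = variances[i] * 2.0 ** (-2 * b) + cost[i + 1][r - b]
                if c < cost[i][r]:
                    cost[i][r], choice[i][r] = c, b
    bits, r = [], total_bits
    for i in range(d):  # backtrack the optimal choices
        bits.append(choice[i][r])
        r -= bits[-1]
    return bits

# PCA ordering: variances decay fast, so leading components get the bits
print(allocate_bits([16.0, 4.0, 1.0, 0.25, 0.01, 0.001], total_bits=12))
```

With decaying variances, the optimum concentrates bits on the leading components and leaves the smallest ones at 0 bits, which is exactly what enables the early dimensionality reduction the paper describes.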

    3. Entropy Coding

    The quantized symbols are packed and compressed using the DEFLATE algorithm. To maintain speed, KVTC leverages the nvCOMP library, which enables parallel compression and decompression directly on the GPU.
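The effect of this stage can be demonstrated with CPU-side zlib as a stand-in for GPU-side nvCOMP (both implement DEFLATE); the skewed synthetic symbol distribution below is an assumption meant to mimic quantized PCA coefficients:

```python
import zlib

import numpy as np

# Quantized transform coefficients tend to be small integers with a
# skewed distribution, which DEFLATE exploits. Synthetic stand-in data.
rng = np.random.default_rng(0)
symbols = rng.geometric(p=0.5, size=100_000).clip(max=15).astype(np.uint8)

raw = symbols.tobytes()
packed = zlib.compress(raw, level=6)
print(f"raw {len(raw)} B -> deflate {len(packed)} B "
      f"({len(raw) / len(packed):.1f}x)")

# Entropy coding is lossless: the exact symbols come back
restored = np.frombuffer(zlib.decompress(packed), dtype=np.uint8)
assert np.array_equal(restored, symbols)
```

This lossless stage is where the reported ratio grows from roughly 16x (quantization alone) to roughly 20x overall.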

    Protecting Critical Tokens

    Not all tokens are compressed equally. KVTC avoids compressing two specific kinds of tokens because they contribute disproportionately to attention accuracy:

    • Attention sinks: the 4 oldest tokens in the sequence.
    • Sliding window: the 128 most recent tokens.
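The protection rule above amounts to a simple partition of the cache along the sequence axis. This is a minimal sketch under assumed shapes ([seq_len, dim] per layer) and an invented helper name, not the paper's code:

```python
import numpy as np

SINK_TOKENS = 4      # oldest "attention sink" tokens, kept uncompressed
WINDOW_TOKENS = 128  # most recent tokens, kept uncompressed

def split_for_compression(kv: np.ndarray):
    """Partition a per-layer KV tensor of shape [seq_len, dim] into
    protected tokens (stored as-is) and a middle span routed to KVTC."""
    seq_len = kv.shape[0]
    if seq_len <= SINK_TOKENS + WINDOW_TOKENS:
        return kv, kv[:0]  # sequence too short: protect everything
    sinks = kv[:SINK_TOKENS]
    middle = kv[SINK_TOKENS:seq_len - WINDOW_TOKENS]  # compressed span
    window = kv[seq_len - WINDOW_TOKENS:]
    protected = np.concatenate([sinks, window])
    return protected, middle

kv = np.zeros((1000, 64), dtype=np.float16)
protected, middle = split_for_compression(kv)
print(protected.shape, middle.shape)  # (132, 64) (868, 64)
```

For long contexts the protected region is a small constant (132 tokens here), so nearly the whole cache still flows through the compressor.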

    Ablation studies show that compressing these tokens can significantly lower, or even collapse, accuracy at high compression ratios.

    Benchmarks and Efficiency

    The research team tested KVTC with models including Llama-3.1, Mistral-NeMo, and R1-Qwen-2.5.

    • Accuracy: at 16x compression (roughly 20x after DEFLATE), results consistently stay within 1 score point of vanilla models.
    • TTFT reduction: for an 8K context length, KVTC can cut Time-To-First-Token (TTFT) by up to 8x compared with full recomputation.
    • Speed: calibration is fast; for a 12B model it completes within 10 minutes on an NVIDIA H100 GPU.
    • Storage overhead: the extra data stored per model is small, only 2.4% of model parameters for Llama-3.3-70B.

    KVTC is a practical building block for memory-efficient LLM serving. It does not modify model weights and is directly compatible with other token eviction methods.


    Key Takeaways

    • High compression with low accuracy loss: KVTC achieves a typical 20x compression ratio while staying within 1 score point of vanilla (uncompressed) models across most reasoning and long-context benchmarks.
    • Transform coding pipeline: the method uses a pipeline inspired by classical media compression, combining PCA-based feature decorrelation, adaptive quantization via dynamic programming, and lossless entropy coding (DEFLATE).
    • Critical token protection: to preserve model performance, KVTC avoids compressing the 4 oldest "attention sink" tokens and a "sliding window" of the 128 most recent tokens.
    • Operational efficiency: the system is tuning-free, requiring only a brief initial calibration (under 10 minutes for a 12B model) that leaves model parameters unchanged and adds minimal storage overhead (only 2.4% for a 70B model).
    • Significant latency reduction: by shrinking the amount of data stored and transferred, KVTC can cut Time-To-First-Token (TTFT) by up to 8x compared with fully recomputing KV caches for long contexts.

    Check out the paper here. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter. On Telegram? You can join us there as well.

    The post NVIDIA Researchers Introduce KVTC Transform Coding Pipeline to Compress Key-Value Caches by 20x for Efficient LLM Serving appeared first on MarkTechPost.
