Combining next-token prediction and video diffusion in computer vision and robotics

In the current AI zeitgeist, sequence models have skyrocketed in popularity for their ability to analyze data and predict what to do next. For instance, you've likely used next-token prediction models like ChatGPT, which anticipate each word (token) in a sequence to form answers to users' queries. There are also full-sequence diffusion models like Sora, which convert words into dazzling, realistic visuals by successively "denoising" an entire video sequence.

Researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a simple change to the diffusion training scheme that makes this sequence denoising considerably more flexible.

When applied to fields like computer vision and robotics, next-token and full-sequence diffusion models come with capability trade-offs. Next-token models can produce sequences that vary in length. However, they generate without awareness of desirable states in the far future, such as steering generation toward a goal 10 tokens away, and thus require additional mechanisms for long-horizon (long-term) planning. Diffusion models can perform such future-conditioned sampling, but lack the ability of next-token models to generate variable-length sequences.

The CSAIL researchers wanted to combine the strengths of both model types, so they created a sequence model training technique called "Diffusion Forcing." The name comes from "Teacher Forcing," the conventional training scheme that breaks full-sequence generation down into the smaller, easier steps of next-token generation (much like a good teacher simplifying a complex concept).

Diffusion Forcing found common ground between diffusion models and teacher forcing: both use training schemes that involve predicting masked (noisy) tokens from unmasked ones. In the case of diffusion models, they gradually add noise to data, which can be viewed as fractional masking. The MIT researchers' Diffusion Forcing method trains neural networks to cleanse a collection of tokens, removing different amounts of noise within each one while simultaneously predicting the next few tokens. The result: a flexible, reliable sequence model that produced higher-quality synthetic videos and more precise decision-making for robots and AI agents.
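The core training idea, independently sampling a noise level for each token rather than one level for the whole sequence, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the model signature, the linear noise schedule, and the simple noise-prediction loss are all assumptions, and the sketch omits the causal next-token prediction component of the full method.

```python
import torch


def diffusion_forcing_training_step(model, tokens, num_noise_levels=1000):
    """One hypothetical training step in the spirit of Diffusion Forcing:
    every token in the sequence receives its own independently sampled
    noise level (fractional masking), and the model denoises all tokens
    jointly, conditioned on each token's noise level."""
    batch, seq_len, dim = tokens.shape

    # Independent noise level per token, instead of one level per sequence
    # (full-sequence diffusion) or binary masking (teacher forcing).
    t = torch.randint(0, num_noise_levels, (batch, seq_len))
    alpha = 1.0 - t.float() / num_noise_levels  # assumed linear schedule

    # Mix each token with Gaussian noise according to its own level.
    noise = torch.randn_like(tokens)
    noisy = alpha[..., None].sqrt() * tokens + (1 - alpha[..., None]).sqrt() * noise

    # The network sees the noisy sequence plus per-token noise levels and
    # predicts the injected noise at every position.
    pred = model(noisy, t)
    return torch.nn.functional.mse_loss(pred, noise)
```

A stub network with the assumed `(noisy_tokens, noise_levels)` signature is enough to exercise the step; the point is the per-token `t`, which unifies the binary masking of teacher forcing with the uniform noising of standard diffusion.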

By sorting through noisy data and reliably predicting the next steps in a task, Diffusion Forcing can help a robot ignore visual distractions while completing manipulation tasks. It can also generate stable, consistent video sequences and even guide an AI agent through digital mazes. This method could eventually enable household and factory robots to generalize to new tasks and improve AI-generated entertainment.

"Sequence models aim to condition on the known past and predict the unknown future, a type of binary masking. However, masking doesn't need to be binary," says lead author, MIT electrical engineering and computer science (EECS) PhD student, and CSAIL member Boyuan Chen. "With Diffusion Forcing, we add different levels of noise to each token, effectively serving as a type of fractional masking. At test time, our system can 'unmask' a collection of tokens and diffuse a sequence in the near future at a lower noise level. It knows what to trust within its data to overcome out-of-distribution inputs."
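The test-time behavior Chen describes, keeping the near future at low noise (trusted) and the far future at high noise (uncertain), amounts to assigning a noise level per position. A toy schedule under assumed parameters (`horizon` tokens ramping linearly up to a maximum level) might look like this; the actual sampling schedule used in the paper may differ.

```python
import numpy as np


def pyramid_noise_schedule(seq_len, horizon, max_level=1000):
    """Hypothetical test-time noise schedule: positions within `horizon`
    steps of the present ramp from fully denoised (level 0) up to fully
    noisy (`max_level`); everything beyond stays at maximum noise. This
    contrasts with standard full-sequence diffusion, which denoises all
    positions at one shared level."""
    positions = np.arange(seq_len)
    levels = np.clip(positions / max(horizon, 1) * max_level, 0, max_level)
    return levels.astype(int)
```

For `seq_len=8, horizon=4` this yields `[0, 250, 500, 750, 1000, 1000, 1000, 1000]`: the immediate next tokens are nearly clean while the distant future remains fully "masked" by noise.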

In several experiments, Diffusion Forcing thrived at ignoring misleading data to execute tasks while anticipating future actions.

When implemented in a robotic arm, for example, it helped swap two toy fruits across three circular mats, a minimal example of a family of long-horizon tasks that require memory. The researchers trained the robot by controlling it from a distance (teleoperating it) in virtual reality, teaching it to mimic the user's movements as seen from its camera. Despite starting from random positions and encountering distractions like a shopping bag blocking the markers, it placed the objects in their target spots.

To generate videos, they trained Diffusion Forcing on "Minecraft" gameplay and colorful digital environments created within Google's DeepMind Lab Simulator. When given a single frame of footage, the method produced more stable, higher-resolution videos than comparable baselines, including a Sora-like full-sequence diffusion model and ChatGPT-like next-token models. Those approaches created videos that appeared inconsistent, with the latter sometimes failing to generate working video past just 72 frames.

Diffusion Forcing not only generates fancy videos, but can also serve as a motion planner that steers toward desired outcomes or rewards. Thanks to its flexibility, Diffusion Forcing can uniquely generate plans with varying horizons, perform tree search, and incorporate the intuition that the distant future is more uncertain than the near future. In the task of solving a 2D maze, Diffusion Forcing outperformed six baselines by generating faster plans leading to the goal location, indicating that it could be an effective planner for robots in the future.
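One way to read "plans with varying horizons" is as a search over candidate trajectories sampled at several lengths, scored by a task reward. The loop below is a speculative sketch of that outer planning loop only: `sample_trajectory` stands in for the model's sampler and `reward_fn` for a task-specific reward, both placeholders rather than anything from the paper.

```python
import numpy as np


def plan_with_variable_horizon(sample_trajectory, reward_fn,
                               horizons=(8, 16, 32), n_samples=4, seed=0):
    """Hypothetical planning loop: draw candidate trajectories at several
    horizons from a Diffusion-Forcing-style sampler, score each with a
    reward function, and keep the best-scoring plan. Because the model can
    generate variable-length sequences, short and long plans compete on
    equal footing."""
    rng = np.random.default_rng(seed)
    best_plan, best_reward = None, -np.inf
    for h in horizons:
        for _ in range(n_samples):
            traj = sample_trajectory(h, rng)   # placeholder for the model
            r = reward_fn(traj)                # placeholder task reward
            if r > best_reward:
                best_plan, best_reward = traj, r
    return best_plan, best_reward
```

With a reward that penalizes plan length, for instance, the loop naturally prefers the shortest horizon that still reaches the goal, which is one reason variable-length generation matters for planning.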

Across each demo, Diffusion Forcing acted as a full-sequence model, a next-token prediction model, or both. According to Chen, this versatile approach could potentially serve as a powerful backbone for a "world model," an AI system that can simulate the dynamics of the world by training on billions of internet videos. This would allow robots to perform novel tasks by imagining what they need to do based on their surroundings. For example, if you asked a robot to open a door without it having been trained to do so, the model could produce a video showing the machine how to do it.

The team is currently looking to scale their method up to larger datasets and the latest transformer models to improve performance. They intend to broaden this work to build a ChatGPT-like robot brain that helps robots perform tasks in new environments without human demonstration.

"With Diffusion Forcing, we are taking a step toward bringing video generation and robotics closer together," says senior author Vincent Sitzmann, MIT assistant professor and member of CSAIL, where he leads the Scene Representation Group. "In the end, we hope that we can use all the knowledge stored in videos on the internet to enable robots to help in everyday life. Many more exciting research challenges remain, like how robots can learn to imitate humans by watching them even when their own bodies are so different from our own!"

Chen and Sitzmann wrote the paper alongside recent MIT visiting researcher Diego Martí Monsó and CSAIL affiliates: Yilun Du, an EECS graduate student; Max Simchowitz, former postdoc and incoming Carnegie Mellon University assistant professor; and Russ Tedrake, the Toyota Professor of EECS, Aeronautics and Astronautics, and Mechanical Engineering at MIT, vice president of robotics research at the Toyota Research Institute, and CSAIL member. Their work was supported, in part, by the U.S. National Science Foundation, the Singapore Defence Science and Technology Agency, Intelligence Advanced Research Projects Activity via the U.S. Department of the Interior, and the Amazon Science Hub. They will present their research at NeurIPS in December.
