Combining next-token prediction and video diffusion in computer vision and robotics

In the current AI zeitgeist, sequence models have skyrocketed in popularity for their ability to analyze data and predict what to do next. For instance, you’ve likely used next-token prediction models like ChatGPT, which anticipate each word (token) in a sequence to form answers to users’ queries. There are also full-sequence diffusion models like Sora, which convert words into dazzling, realistic visuals by successively “denoising” an entire video sequence.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a simple change to the diffusion training scheme that makes this sequence denoising considerably more flexible.

When applied to fields like computer vision and robotics, the next-token and full-sequence diffusion models have capability trade-offs. Next-token models can spit out sequences that vary in length. However, they make these generations while being unaware of desirable states in the far future (such as steering their sequence generation toward a certain goal 10 tokens away), and thus require additional mechanisms for long-horizon (long-term) planning. Diffusion models can perform such future-conditioned sampling, but lack the ability of next-token models to generate variable-length sequences.

Researchers from CSAIL wanted to combine the strengths of both models, so they created a sequence model training technique called “Diffusion Forcing.” The name comes from “Teacher Forcing,” the conventional training scheme that breaks down full-sequence generation into the smaller, easier steps of next-token generation (much like a good teacher simplifying a complex concept).

Diffusion Forcing found common ground between diffusion models and teacher forcing: they both use training schemes that involve predicting masked (noisy) tokens from unmasked ones. In the case of diffusion models, they gradually add noise to data, which can be viewed as fractional masking. The MIT researchers’ Diffusion Forcing method trains neural networks to cleanse a collection of tokens, removing different amounts of noise within each one while simultaneously predicting the next few tokens. The result: a flexible, reliable sequence model that produced higher-quality artificial videos and more precise decision-making for robots and AI agents.
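
The core training change can be sketched in a few lines. Below is a minimal, hypothetical PyTorch-style sketch of the per-token noising idea (an independent noise level for every token, with one shared denoising loss); `model`, `alpha_bar`, and the tensor shapes are illustrative assumptions, not the authors’ actual implementation.

```python
import torch

def diffusion_forcing_loss(model, x, alpha_bar, num_levels=1000):
    """One training step in which every token gets its own noise level.

    x:         (batch, seq_len, dim) clean token sequence
    alpha_bar: (num_levels,) cumulative noise schedule, alpha_bar[0] close to 1
    """
    b, t, _ = x.shape
    # Independent noise level per token: this is the "fractional masking".
    k = torch.randint(0, num_levels, (b, t), device=x.device)
    a = alpha_bar[k].unsqueeze(-1)                    # (b, t, 1)
    eps = torch.randn_like(x)
    x_noisy = a.sqrt() * x + (1 - a).sqrt() * eps     # noise each token differently
    # The network is conditioned on each token's noise level and, attending
    # causally to earlier tokens, predicts the injected noise.
    eps_hat = model(x_noisy, k)
    return ((eps_hat - eps) ** 2).mean()              # standard denoising objective
```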

By sorting through noisy data and reliably predicting the next steps in a task, Diffusion Forcing can aid a robot in ignoring visual distractions to complete manipulation tasks. It can also generate stable and consistent video sequences and even guide an AI agent through digital mazes. This method could potentially enable household and factory robots to generalize to new tasks and improve AI-generated entertainment.

“Sequence models aim to condition on the known past and predict the unknown future, a type of binary masking. However, masking doesn’t need to be binary,” says lead author Boyuan Chen, an MIT electrical engineering and computer science (EECS) PhD student and CSAIL member. “With Diffusion Forcing, we add different levels of noise to each token, effectively serving as a type of fractional masking. At test time, our system can ‘unmask’ a set of tokens and diffuse a sequence in the near future at a lower noise level. It knows what to trust within its data to overcome out-of-distribution inputs.”
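
Chen’s “fractional masking” remark can be illustrated with a toy noise schedule over a rollout window: near-future tokens are kept at a low noise level (nearly “unmasked”), while distant tokens stay noisy. The linear ramp and the level count below are assumptions for illustration only, not values from the paper.

```python
import torch

def rollout_noise_levels(horizon, num_levels=1000):
    # Near-future tokens get low noise (mostly trusted); far-future tokens
    # stay close to pure noise (still uncertain). The linear ramp is illustrative.
    return torch.linspace(0, num_levels - 1, steps=horizon).round().long()

levels = rollout_noise_levels(8)
# e.g. tensor([  0, 143, 285, 428, 571, 714, 856, 999]); the model would then
# denoise this window, committing to imminent tokens before distant ones.
```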

In several experiments, Diffusion Forcing thrived at ignoring misleading data to execute tasks while anticipating future actions.

When implemented in a robotic arm, for example, it helped swap two toy fruits across three circular mats, a minimal example of a family of long-horizon tasks that require memory. The researchers trained the robot by controlling it from a distance (or teleoperating it) in virtual reality. The robot is trained to mimic the user’s movements from its camera. Despite starting from random positions and seeing distractions like a shopping bag blocking the markers, it placed the objects into its target spots.

To generate videos, they trained Diffusion Forcing on “Minecraft” gameplay and colorful digital environments created within Google’s DeepMind Lab Simulator. When given a single frame of footage, the method produced more stable, higher-resolution videos than comparable baselines like a Sora-like full-sequence diffusion model and ChatGPT-like next-token models. Those approaches created videos that appeared inconsistent, with the latter sometimes failing to generate working video past just 72 frames.

Diffusion Forcing not only generates fancy videos, but can also serve as a motion planner that steers toward desired outcomes or rewards. Thanks to its flexibility, Diffusion Forcing can uniquely generate plans with varying horizons, perform tree search, and incorporate the intuition that the distant future is more uncertain than the near future. In the task of solving a 2D maze, Diffusion Forcing outperformed six baselines by generating faster plans leading to the goal location, indicating that it could be an effective planner for robots in the future. A rough sketch of this planner-style use appears below.
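
As an illustration of how a sequence model like this can double as a planner, the sketch below samples several candidate futures of varying horizon and keeps the one with the highest predicted reward. This is a generic sampling-based planning loop under assumed interfaces (`sample_future`, `reward`), not the paper’s exact guidance or tree-search procedure.

```python
import torch

def plan(sample_future, reward, state, horizons=(8, 16, 32), candidates=64):
    """Pick the best candidate plan across several rollout horizons."""
    best_plan, best_score = None, -float("inf")
    for h in horizons:                                            # variable-length plans
        futures = sample_future(state, horizon=h, n=candidates)   # (n, h, dim) rollouts
        scores = torch.stack([reward(f) for f in futures])        # (n,) scalar rewards
        idx = scores.argmax()
        if scores[idx] > best_score:                              # keep the best so far
            best_plan, best_score = futures[idx], scores[idx].item()
    return best_plan
```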

Across each demo, Diffusion Forcing acted as a full-sequence model, a next-token prediction model, or both. According to Chen, this versatile approach could potentially serve as a powerful backbone for a “world model,” an AI system that can simulate the dynamics of the world by training on billions of internet videos. This would allow robots to perform novel tasks by imagining what they need to do based on their surroundings. For example, if you asked a robot to open a door without being trained on how to do it, the model could produce a video that shows the machine how to do it.

The team is currently looking to scale up their method to larger datasets and the latest transformer models to improve performance. They intend to expand their work to build a ChatGPT-like robot brain that helps robots perform tasks in new environments without human demonstration.

“With Diffusion Forcing, we are taking a step toward bringing video generation and robotics closer together,” says senior author Vincent Sitzmann, MIT assistant professor and member of CSAIL, where he leads the Scene Representation Group. “In the end, we hope that we can use all the knowledge stored in videos on the internet to enable robots to help in everyday life. Many more exciting research challenges remain, like how robots can learn to imitate humans by watching them even when their own bodies are so different from our own!”

Chen and Sitzmann wrote the paper alongside recent MIT visiting researcher Diego Martí Monsó, and CSAIL affiliates: Yilun Du, an EECS graduate student; Max Simchowitz, former postdoc and incoming Carnegie Mellon University assistant professor; and Russ Tedrake, the Toyota Professor of EECS, Aeronautics and Astronautics, and Mechanical Engineering at MIT, vice president of robotics research at the Toyota Research Institute, and CSAIL member. Their work was supported, in part, by the U.S. National Science Foundation, the Singapore Defence Science and Technology Agency, Intelligence Advanced Research Projects Activity via the U.S. Department of the Interior, and the Amazon Science Hub. They will present their research at NeurIPS in December.
