AI learns how vision and sound are connected, without human intervention


Humans naturally learn by making connections between sight and sound. For instance, we can watch someone playing the cello and recognize that the cellist’s movements are producing the music we hear.

A new approach developed by researchers from MIT and elsewhere improves an AI model’s ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help with curating multimodal content through automatic video and audio retrieval.

In the longer term, this work could be used to improve a robot’s ability to understand real-world environments, where auditory and visual information are often closely linked.

Improving upon prior work from their group, the researchers created a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels.

They adjusted how their original model is trained so it learns a finer-grained correspondence between a particular video frame and the audio that occurs in that moment. The researchers also made some architectural tweaks that help the system balance two distinct learning objectives, which improves performance.

Taken together, these relatively simple improvements boost the accuracy of their approach in video retrieval tasks and in classifying the action in audiovisual scenes. For instance, the new method could automatically and precisely match the sound of a door slamming with the visual of it closing in a video clip.

“We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, like large language models, it could open up a lot of new applications,” says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.

He is joined on the paper by lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoc; Saurabhchand Bhati, a current MIT postdoc; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky of IBM Research; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work will be presented at the Conference on Computer Vision and Pattern Recognition.

Syncing up

This work builds upon a machine-learning method the researchers developed a few years ago, which provided an efficient way to train a multimodal model to simultaneously process audio and visual data without the need for human labels.

The researchers feed this model, called CAV-MAE, unlabeled video clips, and it encodes the visual and audio data separately into representations called tokens. Using the natural audio from the recording, the model automatically learns to map corresponding pairs of audio and visual tokens close together within its internal representation space.
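The article includes no code, but the idea of pulling matching audio and visual representations together can be illustrated with a small contrastive loss in PyTorch. The function name, embedding size, and temperature below are illustrative assumptions, not details of CAV-MAE itself; this is a minimal sketch of clip-level audio-visual alignment.

```python
import torch
import torch.nn.functional as F

def clip_level_contrastive_loss(audio_emb: torch.Tensor,
                                visual_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Pull each clip's audio embedding toward its own visual embedding and
    push it away from the other clips in the batch. Both inputs are
    (batch, dim) clip-level summaries from separate (hypothetical) encoders."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                       # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)     # matching pairs sit on the diagonal
    # symmetric loss: audio-to-visual and visual-to-audio retrieval
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# toy usage: 8 clips, 512-dimensional embeddings
loss = clip_level_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```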

They found that using two learning objectives balances the model’s learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to retrieve video clips that match user queries.

But CAV-MAE treats audio and visual samples as one unit, so a 10-second video clip and the sound of a door slamming are mapped together, even if that audio event happens in only one second of the video.

In their improved model, called CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data, so it generates separate representations that correspond to each smaller window of audio.

During training, the model learns to associate one video frame with the audio that occurs during just that frame.

“By doing that, the model learns a finer-grained correspondence, which helps with performance later when we aggregate this information,” Araujo says.
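As a rough illustration of that windowing step, the sketch below slices one clip-level audio spectrogram into equal windows, one per sampled video frame, so each window can be encoded and matched against its own frame. The shapes, the equal-split strategy, and the helper name are assumptions made for illustration, not the paper’s actual preprocessing.

```python
import torch

def split_audio_into_windows(spectrogram: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Slice one clip-level audio spectrogram of shape (time, freq) into
    `num_frames` equal windows so each video frame gets its own audio chunk.
    Returns a tensor of shape (num_frames, window_time, freq)."""
    time_steps = spectrogram.size(0)
    window = time_steps // num_frames            # drop any remainder for simplicity
    trimmed = spectrogram[: window * num_frames]
    return trimmed.view(num_frames, window, -1)

# toy usage: a 1000-step, 128-bin spectrogram paired with 10 sampled video frames
windows = split_audio_into_windows(torch.randn(1000, 128), num_frames=10)
# each windows[i] would then be encoded and contrasted against frame i,
# rather than contrasting one audio summary against the whole clip
```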

They also included architectural improvements that help the model balance its two learning objectives.

Adding “wiggle room”

The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover specific audio and visual data based on user queries.

In CAV-MAE Sync, the researchers introduced two new kinds of data representations, or tokens, to improve the model’s learning ability.

They include dedicated “global tokens” that assist with the contrastive learning objective and dedicated “register tokens” that help the model focus on important details for the reconstruction objective.

“Essentially, we add a bit more wiggle room to the model so it can perform each of these two tasks, contrastive and reconstructive, a bit more independently. That benefitted overall performance,” Araujo adds.
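The paper’s exact architecture is not reproduced here, but one way to picture these extra tokens is as learnable vectors prepended to a modality’s patch tokens before a transformer encoder, with the global tokens routed to the contrastive loss and the register tokens left to absorb detail that serves reconstruction. The class below is a hypothetical schematic under those assumptions, not the CAV-MAE Sync implementation.

```python
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    """Schematic encoder that prepends learnable global and register tokens
    to one modality's patch tokens. The global-token outputs feed the
    contrastive objective; the patch outputs feed the reconstruction objective,
    with the register tokens giving that objective extra capacity."""
    def __init__(self, dim: int = 256, num_global: int = 1, num_register: int = 4):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, num_register, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.num_global = num_global
        self.num_register = num_register

    def forward(self, patch_tokens: torch.Tensor):
        b = patch_tokens.size(0)
        g = self.global_tokens.expand(b, -1, -1)
        r = self.register_tokens.expand(b, -1, -1)
        out = self.encoder(torch.cat([g, r, patch_tokens], dim=1))
        global_out = out[:, : self.num_global]                      # for the contrastive loss
        patch_out = out[:, self.num_global + self.num_register :]   # for reconstruction
        return global_out, patch_out

# toy usage: a batch of 2 clips, each with 16 patch tokens of width 256
enc = TokenAugmentedEncoder()
global_repr, patch_repr = enc(torch.randn(2, 16, 256))
```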

While the researchers had some intuition that these enhancements would improve the performance of CAV-MAE Sync, it took a careful combination of strategies to shift the model in the direction they wanted it to go.

“Because we have multiple modalities, we need a good model for both modalities by themselves, but we also need to get them to fuse together and collaborate,” Rouditchenko says.

In the end, their enhancements improved the model’s ability to retrieve videos based on an audio query and to predict the class of an audio-visual scene, like a dog barking or an instrument playing.

Its results were more accurate than their prior work, and it also performed better than more complex, state-of-the-art methods that require larger amounts of training data.

“Sometimes, very simple ideas or little patterns you see in the data have big value when applied on top of a model you are working on,” Araujo says.

In the future, the researchers want to incorporate new models that generate better data representations into CAV-MAE Sync, which could improve performance. They also want to enable their system to handle text data, which would be an important step toward creating an audiovisual large language model.

This work is funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.
