AI learns how vision and sound are connected, without human intervention

Humans naturally learn by making connections between sight and sound. For instance, we can watch someone playing the cello and recognize that the cellist's movements are producing the music we hear.

A new approach developed by researchers from MIT and elsewhere improves an AI model's ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help curate multimodal content through automatic video and audio retrieval.

In the longer term, this work could be used to improve a robot's ability to understand real-world environments, where auditory and visual information are often closely linked.

Improving upon prior work from their group, the researchers created a technique that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels.

They adjusted how their original model is trained so it learns a finer-grained correspondence between a particular video frame and the audio that occurs in that moment. The researchers also made some architectural tweaks that help the system balance its two distinct learning objectives, which improves performance.

Taken together, these relatively simple improvements boost the accuracy of their approach in video retrieval tasks and in classifying the action in audiovisual scenes. For instance, the new method could automatically and precisely match the sound of a door slamming with the visual of it closing in a video clip.

"We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, like large language models, it could open up a lot of new applications," says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.

He is joined on the paper by lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoc; Saurabhchand Bhati, a current MIT postdoc; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky of IBM Research; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work will be presented at the Conference on Computer Vision and Pattern Recognition.

Syncing up

This work builds upon a machine-learning method the researchers developed a few years ago, which provided an efficient way to train a multimodal model to simultaneously process audio and visual data without the need for human labels.

The researchers feed this model, called CAV-MAE, unlabeled video clips, and it encodes the visual and audio data separately into representations called tokens. Using the natural audio from the recording, the model automatically learns to map corresponding pairs of audio and visual tokens close together within its internal representation space.

They found that using two learning objectives balances the model's learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to retrieve video clips that match user queries.
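
To make the "map close together" idea concrete, here is a minimal sketch of a contrastive alignment objective of the kind described above, written in PyTorch. It is an illustration rather than the authors' implementation: the embedding dimension, batch size, and temperature are assumptions, and the encoders that would produce these embeddings are omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb: torch.Tensor,
                               visual_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, visual_emb: (batch, dim) pooled clip embeddings.
    Pairs at the same batch index come from the same video; all other
    combinations act as negatives."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = audio_emb @ visual_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=logits.device)
    # Symmetric cross-entropy: audio-to-video and video-to-audio retrieval
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example with random stand-in embeddings (dimension 256 is arbitrary)
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```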

But CAV-MAE treats audio and visual samples as one unit, so a 10-second video clip and the sound of a door slamming are mapped together, even if that audio event happens in just one second of the video.

In their improved model, called CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data, so it generates separate representations that correspond to each smaller window of audio.

During training, the model learns to associate one video frame with the audio that occurs during just that frame.

"By doing that, the model learns a finer-grained correspondence, which helps with performance later when we aggregate this information," Araujo says.
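
A rough sketch of that finer-grained setup follows, assuming evenly spaced audio windows and sampled video frames; the window count, sampling rate, and nearest-window alignment rule are illustrative choices, not the paper's exact procedure.

```python
import torch

def split_audio_into_windows(audio: torch.Tensor, num_windows: int) -> torch.Tensor:
    """audio: (batch, samples). Returns (batch, num_windows, samples_per_window),
    truncating any remainder so each window covers an equal slice of the clip."""
    batch, samples = audio.shape
    per_window = samples // num_windows
    return audio[:, : per_window * num_windows].view(batch, num_windows, per_window)

def frame_window_pairs(frame_emb: torch.Tensor, window_emb: torch.Tensor):
    """frame_emb: (batch, frames, dim); window_emb: (batch, windows, dim).
    Pairs each video frame with the audio window nearest to it in time."""
    frames, windows = frame_emb.size(1), window_emb.size(1)
    idx = torch.linspace(0, windows - 1, frames).round().long()
    return frame_emb, window_emb[:, idx]      # both now aligned along the time axis

# Example: a 10-second clip at 16 kHz, 4 sampled frames, audio split into 8 windows
audio = torch.randn(2, 160_000)
windows = split_audio_into_windows(audio, num_windows=8)     # (2, 8, 20000)
frames_v, frames_a = frame_window_pairs(torch.randn(2, 4, 256), torch.randn(2, 8, 256))
```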

They also included architectural improvements that help the model balance its two learning objectives.

Adding “wiggle room”

The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover specific audio and visual data based on user queries.

In CAV-MAE Sync, the researchers introduced two new kinds of data representations, or tokens, to improve the model’s learning ability.

These include dedicated “global tokens” that help with the contrastive learning objective and dedicated “register tokens” that help the model focus on important details for the reconstruction objective.

“Essentially, we add a bit more wiggle room to the model so it can perform each of these two tasks, contrastive and reconstructive, a bit more independently. That benefitted overall performance,” Araujo adds.
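
As a loose illustration of what such dedicated tokens could look like, the sketch below prepends learned “global” and “register” embeddings to a transformer encoder’s input, then routes the global token toward a contrastive head and the remaining patch tokens toward reconstruction. The class name, depth, dimensions, and routing are assumptions for the sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_register: int = 4, depth: int = 2, heads: int = 4):
        super().__init__()
        self.global_token = nn.Parameter(torch.zeros(1, 1, dim))               # serves the contrastive objective
        self.register_tokens = nn.Parameter(torch.zeros(1, num_register, dim)) # extra capacity for reconstruction
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens: torch.Tensor):
        """patch_tokens: (batch, tokens, dim) audio or visual patch embeddings."""
        b = patch_tokens.size(0)
        extra = torch.cat([self.global_token, self.register_tokens], dim=1).expand(b, -1, -1)
        out = self.encoder(torch.cat([extra, patch_tokens], dim=1))
        global_emb = out[:, 0]                                   # read by the contrastive loss
        patch_out = out[:, 1 + self.register_tokens.size(1):]    # passed on to the reconstruction path
        return global_emb, patch_out

# Example usage with random patch tokens
encoder = TokenAugmentedEncoder()
global_emb, patch_out = encoder(torch.randn(2, 32, 256))   # shapes: (2, 256) and (2, 32, 256)
```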

While the researchers had some intuition that these enhancements would improve the performance of CAV-MAE Sync, it took a careful combination of strategies to shift the model in the direction they wanted it to go.

“Because we have multiple modalities, we need a good model for both modalities by themselves, but we also need to get them to fuse together and collaborate,” Rouditchenko says.

In the end, their enhancements improved the model’s ability to retrieve videos based on an audio query and to predict the class of an audio-visual scene, like a dog barking or an instrument playing.

Its results were more accurate than their prior work, and it also performed better than more complex, state-of-the-art methods that require larger amounts of training data.

“Sometimes, very simple ideas or little patterns you see in the data have big value when applied on top of a model you are working on,” Araujo says.

In the future, the researchers want to incorporate new models that generate better data representations into CAV-MAE Sync, which could improve performance. They also want to enable their system to handle text data, which would be an important step toward producing an audiovisual large language model.

This work is funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.
