Scaling audio-visual learning without labels

Researchers from MIT, the MIT-IBM Watson AI Lab, IBM Research, and elsewhere have developed a new technique for analyzing unlabeled audio and visual data that could improve the performance of machine-learning models used in applications like speech recognition and object detection. The work, for the first time, combines two architectures of self-supervised learning, contrastive learning and masked data modeling, in an effort to scale machine-learning tasks like event classification in single- and multimodal data without the need for annotation, thereby replicating how humans understand and perceive our world.

“A larger portion of human knowledge is learned in a self-supervised way, because we don’t always get supervision signals, and we want to enable the machine-learning model to have the same ability,” says Yuan Gong, an MIT postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

“So, another way to put it is that self-supervised learning often forms the foundation of an initial model, because it can learn on vast amounts of unlabeled data. And then you can use classical, supervised learning or reinforcement learning to fine tune the model to something particular if you want to,” says Jim Glass, an MIT senior research scientist and member of the MIT-IBM Watson AI Lab.

The technique, called the contrastive audio-visual masked autoencoder (CAV-MAE), is a type of neural network that can learn to extract and map meaningful latent representations into high-dimensional space from acoustic and visual data by training on large YouTube datasets of 10-second audio and video clips. The researchers say the technique is more effective than previous approaches because it explicitly models the relationships between audio and visual data in a way that other methods do not.

Joining Gong and Glass on the study are graduate students Andrew Rouditchenko and Alexander H. Liu of MIT, David Harwath PhD ’18 of the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne. Kuehne is also affiliated with Goethe University Frankfurt. The method was recently presented at the International Conference on Learning Representations.

A joint and coordinated approach

CAV-MAE works by “learning by prediction” and “learning by comparison,” says Gong. The masked data modeling, or the prediction method, takes a video along with its coordinated audio waveform, converts the audio to a spectrogram, and masks 75 percent of both. The unmasked data is tokenized, then fed into separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference (reconstruction loss) between the resulting reconstructed prediction and the original audio-visual combination is then used to train the model for better performance. An example of this would be covering part of a video of a piano and part of a spectrogram of piano music, and then asking the model to try to determine the masked inputs. Unfortunately, this method may not capture the association between the video and audio pair, whereas contrastive learning leverages this, but may discard some modality-unique information, like the background in a video.
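
To make the prediction pathway concrete, here is a minimal PyTorch sketch of masked reconstruction. It is an illustration under stated assumptions, not the authors’ implementation: the single linear “encoder” and “decoder,” the embedding width, and the patch dimensions are toy stand-ins; only the 75 percent masking and the reconstruction loss on masked positions follow the description above.

```python
import torch
import torch.nn as nn

EMBED_DIM = 64  # toy width; the real CAV-MAE encoders are transformers (assumption here)

class ToyMaskedReconstructor(nn.Module):
    """Minimal masked-prediction sketch: embed patches, hide 75 percent of them,
    reconstruct the hidden ones, and score with a mean-squared reconstruction loss."""

    def __init__(self, patch_dim: int):
        super().__init__()
        self.embed = nn.Linear(patch_dim, EMBED_DIM)    # stand-in for tokenizer + encoder
        self.decode = nn.Linear(EMBED_DIM, patch_dim)   # stand-in for the joint decoder
        self.mask_token = nn.Parameter(torch.zeros(EMBED_DIM))

    def forward(self, patches: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
        batch, n_patches, _ = patches.shape
        n_masked = int(n_patches * mask_ratio)

        # Choose a random subset of patches per clip; True marks a hidden patch.
        order = torch.rand(batch, n_patches).argsort(dim=1)
        mask = torch.zeros(batch, n_patches, dtype=torch.bool)
        mask.scatter_(1, order[:, :n_masked], True)

        tokens = self.embed(patches)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.decode(tokens)

        # Reconstruction loss is computed only on the masked positions.
        return ((recon - patches)[mask] ** 2).mean()

# Example: a batch of 2 clips, each tokenized into 16 patches of dimension 32.
model = ToyMaskedReconstructor(patch_dim=32)
loss = model(torch.randn(2, 16, 32))
loss.backward()
```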

Contrastive learning aims to map representations that are similar close to each other. For example, the model will attempt to place different video and audio data of different parrots close to each other and farther away from pairs of video and audio of guitars playing. In a similar fashion to masked autoencoding, audio-visual pairs are passed into separate modality encoders; however, the audio and visual components are kept separate within the joint encoder before the model performs pooling and contrastive loss. In this way, contrastive learning tries to identify the parts of each audio or video clip that are most relevant to the other. For example, if a video shows someone speaking and the corresponding audio clip contains speech, the autoencoder will learn to associate the mouth movements of the speaker with the words being spoken. It will then adjust the model’s parameters so that those inputs are represented close to each other. Ultimately, the CAV-MAE method combines both techniques with multiple forward data streams with masking as a first step, modality-specific encoders, and layer normalization so that the representation strengths are similar.
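
The comparison pathway can be sketched just as compactly. Below is a minimal InfoNCE-style contrastive loss between pooled audio and visual embeddings; the temperature value and the symmetric two-direction form are common conventions assumed here, not details taken from the paper. Matched clip pairs sit on the diagonal of the similarity matrix and are pulled together, while mismatched pairs are pushed apart.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb: torch.Tensor,
                                  visual_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive loss.

    audio_emb, visual_emb: (batch, dim) pooled clip embeddings, where row i
    of each tensor comes from the same 10-second clip.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature           # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0))          # matched pairs lie on the diagonal
    # Symmetric loss: audio-to-visual and visual-to-audio matching.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: a batch of 4 clips with 64-dimensional pooled embeddings per modality.
loss = audio_visual_contrastive_loss(torch.randn(4, 64), torch.randn(4, 64))
```

In a combined CAV-MAE-style objective, a loss of this kind would be summed with the reconstruction loss from the previous sketch, with the relative weighting left as a training hyperparameter.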

“We [then] wanted to compare the proposed CAV-MAE with a model trained only with a masked autoencoder and a model trained only with contrastive learning, because we want to show that by combining masked autoencoder and contrastive learning, we can get some performance improvement,” says Gong, “and the results support our hypothesis that there’s obvious improvement.”

The researchers tested CAV-MAE, as well as their method without contrastive loss or a masked autoencoder, against other state-of-the-art methods on audio-visual retrieval and audio-visual event classification tasks using the standard AudioSet (20K and 2M) and VGGSound datasets, which contain labeled, realistic short clips that can include multiple sounds. Audio-visual retrieval means that the model sees either the audio or visual component of a query pair and searches for the missing one; event classification consists of identifying actions or sounds within data, like a person singing or a car driving.
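
Retrieval with such embeddings reduces to nearest-neighbor search: embed the query from one modality, then rank candidates from the other modality by cosine similarity. A minimal sketch follows; the function and variable names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def retrieve(query: torch.Tensor, candidates: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return the indices of the k candidates most similar to the query.

    query: (dim,) embedding of, say, an audio clip;
    candidates: (n, dim) embeddings of, say, video clips.
    """
    sims = F.normalize(candidates, dim=-1) @ F.normalize(query, dim=0)
    return sims.topk(k).indices

# Example: find the 5 video embeddings that best match one audio query.
top = retrieve(torch.randn(64), torch.randn(100, 64))
```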

Overall, they found that contrastive learning and masked data modeling are complementary methods. CAV-MAE was able to outperform previous techniques (with fully self-supervised pre-training) by about 2 percent for event classification performance versus models with comparable computation and, more impressively, kept pace with or outperformed models with industry-level computational resources. The team’s model ranked similarly to models trained with only the contrastive loss. And surprisingly, the team says, the incorporation of multi-modal data into CAV-MAE pre-training greatly improves the fine-tuning of single-modality representation via supervised learning (with some labeled data) and performance on audio-only event classification tasks. This demonstrates that, like humans, multi-modal information provides an additional “soft label” boost even for audio-only or visual-only tasks; for instance, it helps the model to understand whether it is looking for an electric or acoustic guitar, a richer supervision signal.

“I think people like the elegance of this model for combining information in the different audio and visual streams. It has the contrastive and the reconstruction loss, and compared to models that have been evaluated with similar data, it clearly does very well across a range of these tasks,” says Glass.

Building on this, “one special thing is, our model can do both classification and the retrieval, which is not common,” Gong adds. “Before this work, these methods are used separately, but after this work, I see that most of the audio-visual learning frameworks use contrastive loss and the masked autoencoder together, implicitly or explicitly.”

Bringing self-supervised audio-visual learning into our world

The researchers see their contribution of the contrastive audio-visual masked autoencoder (CAV-MAE) as an important milestone and a step forward for applications, which are increasingly moving from single modality to multi-modality and which require or leverage audio-visual fusion. They hypothesize that in the future it could be used for action recognition in realms like sports, education, entertainment, motor vehicles, and public safety. It could also, one day, extend to other modalities. At this time, the fact that “this only applies to audio-visual data may be a limitation, but we are targeting multi-modal learning, which is trend of machine learning,” says Gong. “As humans, we have multi-modalities — we have smell, touch — many more things than just audio-visual. So, when we try to build AI, we try to mimic humans somehow, not necessarily from the biological perspective, and this method could [potentially be] generalized to other unexplored modalities.”

As machine-learning models continue to play an increasingly important role in our lives, techniques like this one will become increasingly valuable.

This research was supported by the MIT-IBM Watson AI Lab.
