    This Paper from NYU and Google Explains How Joint Speech-Text Encoders Overcome Sequence-Length Mismatch in Cross-Modal Representations


It is becoming increasingly clear that very large models trained on massive unsupervised corpora in a single modality can achieve remarkable results. This has been shown both in the audio domain, where a single model has adapted to a surprisingly wide range of acoustic tasks, and in the text domain, where language models have attained exceptional zero-shot capabilities. These successes have prompted the question of how to apply similar techniques to settings that combine two modalities, which have traditionally relied on manually paired data.

One promising approach is to train a single large encoder on both modalities, so that either modality can be presented as an unpaired example and the encoder learns to map the two to nearby locations in a shared representation space. In the image/text domain, such a representation has proved feasible: a single model can achieve state-of-the-art performance on numerous image- and text-comprehension tasks.

New research from New York University and Google investigates whether the performance gains obtained with explicit alignments can also be achieved by applying consistency regularization to the implicit alignments learned in upsampling approaches. To do this, the team developed a method, motivated by dynamic time warping, that optimally aligns the encoder's representations of a speech example and a text example. They show that, even without an explicit alignment model, the optimal alignment is not only learned during training but also improves progressively through the network's layers.
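
The article only says the alignment method is "motivated by dynamic time warping", so the sketch below is textbook DTW, not the authors' implementation. It assumes each speech frame and each text position has already been encoded into a vector (here, hand-written 1-D toy vectors) and uses squared Euclidean distance; those choices and the function names are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dtw_align(
    speech: List[Vector], text: List[Vector]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic dynamic time warping between two embedding sequences.

    Returns the minimum total distance and the monotonic alignment path,
    where each pair (i, j) matches speech frame i with text position j.
    """
    n, m = len(speech), len(text)
    # cost[i][j]: cheapest way to align speech[:i] with text[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j - 1],  # advance both sequences
                       cost[i - 1][j],      # extra speech frame on the same text position
                       cost[i][j - 1])      # extra text position on the same speech frame
            cost[i][j] = sq_dist(speech[i - 1], text[j - 1]) + step
    # Walk back from the end, always moving to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy 1-D "embeddings" (illustrative assumption): six speech frames
# covering three text tokens unevenly.
speech = [[0.0], [0.0], [1.0], [1.0], [1.0], [2.0]]
text = [[0.0], [1.0], [2.0]]
total, path = dtw_align(speech, text)
print(total)  # 0.0
print(path)   # [(0, 0), (1, 0), (2, 1), (3, 1), (4, 1), (5, 2)]
```

The recovered path lets several speech frames share one text position, which is the freedom a strict frame-by-frame comparison lacks.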

In speech recognition, there has been a recent trend toward models with a joint speech and text encoder, which make it possible to pretrain on unpaired speech and text data. Because both modalities are sequences, and speech is represented by much longer sequences than text, this setup poses a problem specific to speech recognition: even though both modalities are represented in the same embedding space, comparing an encoder's speech representation with its text representation frame by frame becomes difficult.
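
A toy illustration of the length mismatch, under stated assumptions: the "embeddings" are hand-written 1-D vectors, the speech spends an uneven number of frames on each token, and `upsample_uniform` is a hypothetical even-repetition heuristic standing in for upsampling approaches in general rather than any specific published scheme.

```python
from typing import List, Sequence

Vector = Sequence[float]


def framewise_mse(speech: List[Vector], text: List[Vector]) -> float:
    """Mean squared error between two sequences compared position by position.

    A frame-wise comparison only makes sense when both sequences have the
    same number of frames, so mismatched lengths are rejected.
    """
    if len(speech) != len(text):
        raise ValueError(
            f"length mismatch: {len(speech)} speech frames vs {len(text)} text frames"
        )
    total = sum(
        sum((x - y) ** 2 for x, y in zip(s, t)) for s, t in zip(speech, text)
    )
    return total / len(speech)


def upsample_uniform(text: List[Vector], target_len: int) -> List[Vector]:
    """Hypothetical heuristic: repeat text frames evenly until target_len frames.

    Output frame i copies text frame floor(i * len(text) / target_len).
    """
    return [text[i * len(text) // target_len] for i in range(target_len)]


# Toy 1-D "embeddings": the speech spends 2, 3 and 1 frames on the three tokens.
speech = [[0.0], [0.0], [1.0], [1.0], [1.0], [2.0]]
text = [[0.0], [1.0], [2.0]]

try:
    framewise_mse(speech, text)
except ValueError as err:
    print(err)  # length mismatch: 6 speech frames vs 3 text frames

stretched = upsample_uniform(text, len(speech))
print(stretched)  # [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
print(round(framewise_mse(speech, stretched), 3))  # 0.167
```

Even after the lengths are forced to match, the even stretch assigns the fifth frame to the wrong token, so the frame-wise error stays above zero although the content is identical.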

Finally, the work demonstrates that, in both monolingual and multilingual settings, significant word error rate (WER) improvements can be achieved over strong semi-supervised baselines without any learned alignment model. The key change is to modify the consistency-regularization criterion so that it encourages consistency under some alignment rather than through a direct frame-wise comparison. Based on these findings, tolerating misalignment appears to be all that is needed to enforce consistency in cross-modal representations.
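
The article does not give the paper's exact loss. As one plausible reading of "consistency under some alignment", the sketch below scores a speech/text pair by the cost of its cheapest monotonic (DTW) alignment and compares that with a frame-wise loss after uniform stretching. The function names, squared-Euclidean distance, per-frame normalization and toy vectors are all assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_alignment_loss(speech: List[Vector], text: List[Vector]) -> float:
    """Consistency loss under the cheapest monotonic alignment (DTW cost).

    Only two rows of the DTW table are kept. The total is divided by the
    number of speech frames, an arbitrary normalization for this sketch.
    """
    n, m = len(speech), len(text)
    prev = [0.0] + [math.inf] * m  # row 0 of the DTW table
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cur[j] = sq_dist(speech[i - 1], text[j - 1]) + min(
                prev[j - 1], prev[j], cur[j - 1]
            )
        prev = cur
    return prev[m] / n


def uniform_framewise_loss(speech: List[Vector], text: List[Vector]) -> float:
    """Baseline: stretch text evenly to the speech length, then compare frame by frame."""
    n = len(speech)
    stretched = [text[i * len(text) // n] for i in range(n)]
    return sum(sq_dist(s, t) for s, t in zip(speech, stretched)) / n


text = [[0.0], [1.0], [2.0]]
# Same content as the text, but with uneven durations per token.
speech_match = [[0.0], [0.0], [1.0], [1.0], [1.0], [2.0]]
# Middle frames disagree with the text, so no alignment can hide the error.
speech_mismatch = [[0.0], [0.0], [3.0], [3.0], [3.0], [2.0]]

print(best_alignment_loss(speech_match, text))               # 0.0
print(round(uniform_framewise_loss(speech_match, text), 3))  # 0.167
print(round(best_alignment_loss(speech_mismatch, text), 3))  # 0.667
```

On this toy data the alignment-tolerant loss is zero when the speech simply spends different amounts of time on each token, while the uniformly stretched frame-wise loss is not; content that genuinely disagrees with the text still receives a positive loss.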


Dhanshree Shenwai is a Computer Science Engineer with solid experience at FinTech companies covering the financial, cards and payments, and banking domains, and she has a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today's evolving world that make everyone's life easier.


    © 2025 Ztoog.
