Ztoog
    This Paper from NYU and Google Explains How Joint Speech-Text Encoders Overcome Sequence-Length Mismatch in Cross-Modal Representations


    It is becoming increasingly apparent that very large models trained on massive unsupervised corpora in a single modality can achieve remarkable results. This has been shown both in the audio domain, where a single model has been shown to adapt to a surprisingly wide range of acoustic tasks, and in the text domain, where language models have attained exceptional zero-shot capabilities. These achievements have prompted the question of how to use similar techniques in settings that combine two modalities, which have traditionally relied on manually paired data.

    One interesting approach is to train a large encoder on both modalities so that either one can be presented as an unpaired example, and the encoder learns to map the two to similar places in representation space. Such a representation has been shown to be feasible in the image/text domain, where a single model achieves state-of-the-art performance on numerous image and text comprehension tasks.

    New research by New York University and Google investigates whether the performance gains found with explicit alignments can also be achieved by applying consistency regularization to the implicit alignments learned in upsampling methods. The researchers do this by developing a method, motivated by dynamic time warping, that optimally aligns the encoder's representations of a speech example and a text example. In the absence of an explicit alignment model, the team shows that the optimal alignment is not only acquired during training but also improves as one progresses through the network's layers.
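
The article says the method is motivated by dynamic time warping (DTW) but does not give the paper's exact formulation. As a rough illustration only, the sketch below runs textbook DTW between a longer "speech" sequence and a shorter "text" sequence of vectors and returns the lowest-cost alignment path. The function name `dtw_align`, the squared Euclidean distance, and the toy inputs are all assumptions made for this example, not details from the paper.

```python
import numpy as np


def dtw_align(speech, text):
    """Textbook dynamic time warping between two sequences of vectors.

    speech: array of shape (T, D); text: array of shape (U, D).
    Returns (total_cost, path), where path lists (speech_idx, text_idx) pairs.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    T, U = len(speech), len(text)
    # Squared Euclidean distance between every speech frame and every text token.
    dist = ((speech[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)

    # acc[i, j] = cheapest cost of aligning the first i frames with the first j tokens.
    acc = np.full((T + 1, U + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance in both sequences
                acc[i - 1, j],      # next frame, same token
                acc[i, j - 1],      # same frame, next token
            )

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = T, U
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return float(acc[T, U]), path


# Toy example: two text tokens, five speech frames with uneven durations.
text = np.array([[0.0], [1.0]])
speech = np.array([[0.0], [0.0], [1.0], [1.0], [1.0]])
cost, path = dtw_align(speech, text)
print(cost, path)
```

In this toy case the first two frames align to the first token and the remaining three to the second, even though nothing told the procedure how long each token lasts.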

    To enable pretraining on unpaired speech and text data, there has been a recent trend in speech recognition toward models with a joint speech and text encoder. Because speech recognition involves two sequence modalities, the longer sequence used to represent speech poses a unique difficulty. As a result, comparing an encoder's speech representation to its text representation frame by frame becomes harder, even though both modalities are represented in the same embedding space.

    Finally, the work demonstrates that, in both monolingual and multilingual settings, significant word error rate (WER) improvements can be achieved against strong semi-supervised baselines without any learned alignment model. This is done by modifying the criterion of the consistency regularization to encourage consistency under some alignment rather than under a direct frame-wise comparison. Based on these findings, it appears that tolerating misalignment is all that is needed to enforce consistency in cross-modal representations.
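
To make the contrast between a direct frame-wise comparison and consistency "under some alignment" concrete, here is a minimal sketch, assuming a squared-error consistency term and a simple monotonic alignment in which every speech frame is assigned to exactly one text token. The function names, the fixed-ratio upsampling baseline, and the toy data are illustrative assumptions; the paper's actual loss may differ.

```python
import numpy as np


def framewise_loss(speech, text):
    """Squared error after repeating each text token a fixed number of times.

    Assumes len(speech) is a multiple of len(text) (hypothetical baseline).
    """
    T, U = len(speech), len(text)
    assert T % U == 0, "fixed-ratio upsampling needs T divisible by U"
    upsampled = np.repeat(text, T // U, axis=0)
    return float(((speech - upsampled) ** 2).sum(axis=-1).mean())


def best_alignment_loss(speech, text):
    """Squared error under the cheapest monotonic frame-to-token alignment.

    Each frame maps to one token, tokens are visited in order, none is skipped.
    """
    T, U = len(speech), len(text)
    dist = ((speech[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)
    acc = np.full((T + 1, U + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            # Either stay on the same token or move on to the next one.
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i - 1, j - 1])
    return float(acc[T, U] / T)


# Three tokens whose spoken durations are 1, 4 and 1 frames respectively.
text = np.array([[0.0], [1.0], [2.0]])
speech = np.array([[0.0], [1.0], [1.0], [1.0], [1.0], [2.0]])

print(framewise_loss(speech, text))       # penalizes the uneven durations
print(best_alignment_loss(speech, text))  # forgives them
```

Because fixed-ratio upsampling is itself one valid monotonic alignment, the best-alignment value can never exceed the frame-wise value on the same inputs; the gap between the two is the penalty that comes purely from duration mismatch.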


    Check out the Paper. All credit for this research goes to the researchers on this project. Also, don't forget to join our 28k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.


    Dhanshree Shenwai is a Computer Science Engineer with experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.

