
    Unsupervised speech-to-speech translation from monolingual data – Google Research Blog


    Posted by Eliya Nachmani, Research Scientist, and Michelle Tadmor Ramanovich, Software Engineer, Google Research

    Speech-to-speech translation (S2ST) is a type of machine translation that converts speech in one language into speech in another. This technology has the potential to break down language barriers and facilitate communication between people from different cultures and backgrounds.

    Previously, we introduced Translatotron 1 and Translatotron 2, the first models able to directly translate speech between two languages. However, they were trained in supervised settings with parallel speech data. The scarcity of parallel speech data is a major challenge in this field, so much so that most public datasets are semi- or fully-synthesized from text. This adds further hurdles to learning the translation and reconstruction of speech attributes that are not represented in the text and are thus not reflected in the synthesized training data.

    Here we present Translatotron 3, a novel unsupervised speech-to-speech translation architecture. In Translatotron 3, we show that it is possible to learn a speech-to-speech translation task from monolingual data alone. This method opens the door not only to translation between more language pairs but also toward translation of non-textual speech attributes such as pauses, speaking rates, and speaker identity. Our method does not include any direct supervision from target languages, and we therefore believe it is the right direction for preserving paralinguistic characteristics of the source speech (such as tone and emotion) across translation. To enable speech-to-speech translation, we use back-translation, a technique from unsupervised machine translation (UMT) in which a synthetic translation of the source language is used to translate texts without bilingual text datasets. Experimental results on speech-to-speech translation tasks between Spanish and English show that Translatotron 3 outperforms a baseline cascade system.

    Translatotron 3

    Translatotron 3 addresses the problem of unsupervised S2ST, which can eliminate the requirement for bilingual speech datasets. To do this, Translatotron 3's design incorporates three key aspects:

    1. Pre-training the entire model as a masked autoencoder with SpecAugment, a simple data augmentation method for speech recognition that operates on the logarithmic mel spectrogram of the input audio (rather than the raw audio itself) and has been shown to effectively improve the generalization capabilities of the encoder (a minimal sketch follows this list).
    2. Unsupervised embedding mapping based on multilingual unsupervised embeddings (MUSE), which is trained on unpaired languages but allows the model to learn an embedding space that is shared between the source and target languages.
    3. A reconstruction loss based on back-translation, to train an encoder-decoder direct S2ST model in a fully unsupervised manner.
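
    As a concrete illustration of the first aspect, here is a minimal SpecAugment sketch using torchaudio's masking transforms; the mask widths and spectrogram shape are illustrative assumptions, not the settings used in the paper.

        import torch
        import torchaudio.transforms as T

        # Minimal SpecAugment sketch: randomly mask frequency bands and time
        # spans of a log-mel spectrogram. Mask widths below are illustrative.
        spec_augment = torch.nn.Sequential(
            T.FrequencyMasking(freq_mask_param=27),  # mask up to 27 mel bins
            T.TimeMasking(time_mask_param=100),      # mask up to 100 frames
        )

        log_mel = torch.randn(1, 80, 600)  # (batch, mel bins, frames), stand-in input
        augmented = spec_augment(log_mel)  # same shape, with random regions zeroed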

    The model is trained using a combination of the unsupervised MUSE embedding loss, the reconstruction loss, and the S2S back-translation loss. During inference, the shared encoder encodes the input into a multilingual embedding space, which is subsequently decoded by the target-language decoder.
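
    As a rough sketch, the combination might look as follows; the distance functions and unit weights are assumptions for illustration, not the published recipe.

        import torch
        import torch.nn.functional as F

        def total_loss(semantic_out: torch.Tensor, muse_emb: torch.Tensor,
                       recon_spec: torch.Tensor, input_spec: torch.Tensor,
                       bt_spec: torch.Tensor,
                       w_muse: float = 1.0, w_rec: float = 1.0, w_bt: float = 1.0):
            """Weighted sum of the three training terms (illustrative only)."""
            muse = F.mse_loss(semantic_out, muse_emb)  # encoder output vs. MUSE embeddings
            rec = F.l1_loss(recon_spec, input_spec)    # auto-encoding reconstruction
            bt = F.l1_loss(bt_spec, input_spec)        # back-translation round trip
            return w_muse * muse + w_rec * rec + w_bt * bt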

    Architecture

    Translatotron 3 employs a shared encoder to encode both the source and target languages. Like Translatotron 2, the decoder is composed of a linguistic decoder, an acoustic synthesizer (responsible for acoustic generation of the translated speech), and a single attention module. Unlike Translatotron 2, however, Translatotron 3 has two decoders, one for the source language and another for the target language. During training, we use monolingual speech-text datasets (i.e., the data consist of speech-text pairs; they are not translations).

    Encoder

    The encoder has the same architecture as the speech encoder in Translatotron 2. Its output is split into two parts: the first part contains semantic information, while the second part contains acoustic information. Through the MUSE loss, the first half of the output is trained to match the MUSE embeddings of the text of the input speech spectrogram; the latter half is updated without the MUSE loss. Importantly, the same encoder is shared between the source and target languages, and the MUSE embedding is multilingual in nature. As a result, the encoder learns a multilingual embedding space across the source and target languages, allowing a more efficient and effective encoding of the input, because the encoder can encode speech from both languages into a common embedding space rather than maintaining a separate embedding space for each language.
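
    A minimal sketch of that split, assuming an even half-and-half division along the feature dimension and MSE as the distance (both are assumptions for illustration):

        import torch
        import torch.nn.functional as F

        def split_encoder_output(encoder_out: torch.Tensor):
            """Split encoder frames (batch, time, dim): the first half of each
            feature vector is the semantic part, the second half the acoustic part."""
            half = encoder_out.size(-1) // 2
            return encoder_out[..., :half], encoder_out[..., half:]

        def muse_loss(semantic: torch.Tensor, muse_emb: torch.Tensor) -> torch.Tensor:
            """Train only the semantic half toward the pre-trained MUSE embeddings
            of the input transcript. MSE, and a frame-to-token alignment that makes
            the shapes match, are assumptions glossed over here."""
            return F.mse_loss(semantic, muse_emb)

        semantic, acoustic = split_encoder_output(torch.randn(2, 600, 512))
        loss = muse_loss(semantic, torch.randn(2, 600, 256))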

    Decoder

    Like Translatotron 2, the decoder is composed of three distinct components, namely the linguistic decoder, the acoustic synthesizer, and the attention module. To effectively handle the different properties of the source and target languages, however, Translatotron 3 has two separate decoders, one for the source language and one for the target language.

    Two-part training

    The training method consists of two parts: (1) auto-encoding with reconstruction and (2) a back-translation term. In the first part, the network is trained to auto-encode the input into a multilingual embedding space using the MUSE loss and the reconstruction loss; this phase aims to ensure that the network generates meaningful multilingual representations. In the second part, the network is further trained to translate the input spectrogram by utilizing the back-translation loss. To mitigate the issue of catastrophic forgetting, and to enforce that the latent space is multilingual, the MUSE loss and the reconstruction loss are also applied in this second part of training. To ensure that the encoder learns meaningful properties of the input rather than simply reconstructing it, we apply SpecAugment to the encoder input in both phases; it has been shown to effectively improve the generalization capabilities of the encoder by augmenting the input data.
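
    The schedule could be sketched as follows; the phase boundary and the stub loss terms are illustrative stand-ins for the MUSE, reconstruction, and back-translation losses described in this post.

        import torch

        # Stub loss terms so the sketch runs; real ones are defined elsewhere.
        muse_term = lambda spec: spec.pow(2).mean()
        rec_term = lambda spec: spec.abs().mean()
        bt_term = lambda spec: spec.abs().mean()

        AUTOENCODE_STEPS = 100_000  # assumed phase boundary, not from the paper

        def two_phase_loss(step: int, spec: torch.Tensor) -> torch.Tensor:
            """Phase 1: auto-encode with MUSE + reconstruction losses. Phase 2
            adds back-translation while keeping the phase-1 terms, mitigating
            catastrophic forgetting. SpecAugment (omitted here) is applied to
            the encoder input in both phases."""
            loss = muse_term(spec) + rec_term(spec)
            if step >= AUTOENCODE_STEPS:
                loss = loss + bt_term(spec)
            return loss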

    Training objective

    During the back-translation training phase (illustrated in the figure below), the network is trained to translate the input spectrogram to the target language and then back to the source language. The goal of back-translation is to enforce that the latent space is multilingual. To achieve this, the following losses are applied:

    • MUSE loss: measures the similarity between the multilingual embedding of the input spectrogram and the multilingual embedding of the back-translated spectrogram.
    • Reconstruction loss: measures the similarity between the input spectrogram and the back-translated spectrogram.

    In addition to these losses, SpecAugment is applied to the encoder input in both phases. Before the back-translation training phase, the network is trained to auto-encode the input into a multilingual embedding space using the MUSE loss and the reconstruction loss; a runnable sketch of the round trip follows.
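
    In this sketch, linear layers stand in for the real shared encoder and the two language-specific decoders, and MSE/L1 are assumed distances:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        DIM = 80  # stand-in feature size; real spectrogram dims differ
        encoder = nn.Linear(DIM, DIM)         # shared multilingual encoder (stub)
        source_decoder = nn.Linear(DIM, DIM)  # source-language decoder (stub)
        target_decoder = nn.Linear(DIM, DIM)  # target-language decoder (stub)

        def back_translation_losses(src_spec: torch.Tensor):
            """Translate source -> target -> source and score the round trip."""
            z_src = encoder(src_spec)                      # embedding of the input
            synth_tgt = target_decoder(z_src)              # synthetic translation
            back_src = source_decoder(encoder(synth_tgt))  # back-translated spectrogram
            muse_l = F.mse_loss(encoder(back_src), z_src)  # embeddings should coincide
            rec_l = F.l1_loss(back_src, src_spec)          # round trip reconstructs input
            return muse_l, rec_l

        muse_l, rec_l = back_translation_losses(torch.randn(2, 600, DIM))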

    MUSE loss

    To ensure that the encoder generates multilingual representations that are meaningful for both decoders, we employ a MUSE loss during training. The MUSE loss forces the encoder to generate such a representation by using pre-trained MUSE embeddings. During training, given an input text transcript, we extract the corresponding MUSE embeddings from the embeddings of the input language, and the error between the MUSE embeddings and the output vectors of the encoder is minimized. Note that the encoder is indifferent to the language of the input during inference due to the multilingual nature of the embeddings.

    Training and inference in Translatotron 3. Training includes the reconstruction loss via the auto-encoding path and employs the reconstruction loss via back-translation.

    Audio samples

    Following are examples of direct speech-to-speech translation from Translatotron 3:

    Spanish-to-English, on the Conversational dataset, the CommonVoice11 Synthesized dataset, and the CommonVoice11 dataset. For each dataset, the original post embeds three audio clips: the Spanish input, a TTS-synthesized English reference, and the Translatotron 3 English output.

    Performance

    To empirically evaluate the performance of the proposed approach, we conducted experiments on English and Spanish using various datasets, including the Common Voice 11 dataset, as well as two synthesized datasets derived from the Conversational and Common Voice 11 datasets.

    Translation quality was measured by BLEU (higher is better) on ASR (automatic speech recognition) transcriptions of the translated speech, compared against the corresponding reference translation text. Speech quality was measured by the MOS score (higher is better), and speaker similarity by the average cosine similarity (higher is better).
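
    Two of these metrics are straightforward to compute; a sketch follows, in which the speaker-embedding extractor and ASR system are left unspecified (MOS comes from human raters and has no closed-form implementation):

        import torch
        import torch.nn.functional as F
        import sacrebleu

        def speaker_similarity(in_emb: torch.Tensor, out_emb: torch.Tensor) -> float:
            """Average cosine similarity between speaker embeddings of the input
            speech and the translated speech."""
            return F.cosine_similarity(in_emb, out_emb, dim=-1).mean().item()

        def translation_bleu(asr_transcripts: list[str], references: list[str]) -> float:
            """Corpus BLEU of ASR transcriptions of the translated speech against
            the reference translation texts."""
            return sacrebleu.corpus_bleu(asr_transcripts, [references]).score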

    Because Translatotron 3 is an unsupervised method, we used as a baseline a cascaded S2ST system composed of ASR, unsupervised machine translation (UMT), and TTS (text-to-speech). Specifically, the UMT component uses the nearest neighbor in the embedding space to create the translation.
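
    That nearest-neighbor step can be sketched as follows, assuming word vectors already aligned into a shared (e.g., MUSE-style) space with unit-normalized rows, so the dot product equals cosine similarity:

        import numpy as np

        def nearest_neighbor_translate(src_vecs: np.ndarray, tgt_vecs: np.ndarray,
                                       tgt_vocab: list[str]) -> list[str]:
            """Map each source-word vector to the closest target-word vector in
            the shared embedding space and emit the corresponding target word."""
            sims = src_vecs @ tgt_vecs.T  # (num_src_words, num_tgt_words)
            return [tgt_vocab[i] for i in sims.argmax(axis=1)]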

    Translatotron 3 outperforms the baseline by large margins in every aspect we measured: translation quality, speaker similarity, and speech quality. It performed especially well on the conversational corpus. Moreover, Translatotron 3 achieves speech naturalness similar to that of the ground-truth audio samples (as measured by MOS, higher is better).

    Translation quality (measured by BLEU, where higher is better) evaluated on three Spanish-English corpora.
    Speaker similarity (measured by the average cosine similarity between input speaker and output speaker, where higher is better) evaluated on three Spanish-English corpora.
    Mean opinion score (average MOS, where higher is better) evaluated on three Spanish-English corpora.

    Future work

    As future work, we would like to extend this work to more languages and investigate whether zero-shot S2ST can be applied with the back-translation technique. We would also like to examine the use of back-translation with different types of speech data, such as noisy speech and low-resource languages.

    Acknowledgments

    The direct contributors to this work include Eliya Nachmani, Alon Levkovitch, Yifan Ding, Chulayutsh Asawaroengchai, Heiga Zen, and Michelle Tadmor Ramanovich. We also thank Yu Zhang, Yuma Koizumi, Soroosh Mariooryad, RJ Skerry-Ryan, Neil Zeghidour, Christian Frank, Marco Tagliasacchi, Nadav Bar, Benny Schlesinger and Yonghui Wu.
