Ztoog

    Microsoft Researchers Unveil PromptTTS 2: Revolutionizing Text-to-Speech with Enhanced Voice Variability and Cost-Effective Prompt Generation

    The intelligibility and naturalness of synthesized speech have improved thanks to recent advances in text-to-speech (TTS) systems. Large-scale TTS systems have been built for multi-speaker settings, and some TTS systems have reached a quality on par with single-speaker recordings. Despite these advances, modeling voice variability remains difficult, since different ways of saying the same word can convey additional information, such as emotion and tone. Traditional TTS methods often rely on speaker information or speech prompts to simulate voice variability. Still, these methods are not user-friendly, because the speaker ID is pre-defined and a suitable speech prompt is hard to find or does not exist.

    A more promising approach to modeling voice variability is to use text prompts that describe voice characteristics, since natural language is a convenient interface for users to convey their intent about voice generation. This approach makes it easy to create voices from text prompts. TTS systems based on text prompts are typically trained on a dataset of speech paired with the corresponding text prompts. The text prompt describing the variability or style of the voice is used to condition how the model generates the voice.

    Text prompt-based TTS systems still face two main difficulties:

    • One-to-Many Challenge: Because voice characteristics vary from person to person, it is hard for written descriptions to capture every aspect of speech accurately. Different voice samples will inevitably correspond to the same prompt. This one-to-many phenomenon makes TTS model training harder and can lead to over-fitting or mode collapse. As far as the authors know, no methods have been designed specifically to address the one-to-many problem in text prompt-based TTS systems.

    • Data-Scale Challenge: Since text prompts describing voices are rare on the internet, compiling a dataset of such prompts is not easy.

    As a result, vendors are hired to write prompts, which is both expensive and time-consuming. The prompt datasets are typically small or private, making further research on prompt-based TTS systems difficult. To overcome these challenges, the researchers present PromptTTS 2, which proposes a variation network to model the voice variability information of speech that the prompts do not capture, and which uses a large language model to produce high-quality prompts. For the one-to-many challenge, the variation network predicts the voice variability information that is missing from the text prompt. The variation network is trained using the reference speech, which is assumed to contain all information about voice variability.

    The TTS model in PromptTTS 2 consists of a text prompt encoder for text prompts, a reference speech encoder for reference speech, and a TTS module that synthesizes speech based on the representations extracted by the two encoders. A variation network is trained to predict the reference representation produced by the reference speech encoder from the prompt representation produced by the text prompt encoder. Because the variation network uses a diffusion model, the characteristics of the synthesized speech can be varied by sampling different voice variability information from Gaussian noise conditioned on the text prompt, giving users more freedom when generating voices.
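
    The sampling idea can be illustrated with a toy sketch. It is not the PromptTTS 2 implementation: the hash-based `encode_prompt`, the 4-dimensional representation, the noise schedule, and the closed-form `predict_noise` (which assumes that valid reference representations are Gaussian around the prompt representation) are all assumptions made so the example runs with the standard library alone. In a real system a trained variation network would take the place of `predict_noise`.

```python
import hashlib
import math
import random

DIM = 4          # size of the toy reference representation (assumption)
STEPS = 50       # number of diffusion steps (assumption)
DATA_STD = 0.3   # spread of plausible voices around the prompt (assumption)

# Linear noise schedule beta_1..beta_T and cumulative products of (1 - beta).
BETAS = [0.001 + (0.2 - 0.001) * i / (STEPS - 1) for i in range(STEPS)]
ALPHA_BARS = []
_prod = 1.0
for _beta in BETAS:
    _prod *= 1.0 - _beta
    ALPHA_BARS.append(_prod)


def encode_prompt(prompt, dim=DIM):
    """Toy stand-in for a text prompt encoder: hash text to values in [-1, 1]."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def predict_noise(x_t, t, prompt_repr):
    """Closed-form noise predictor for data ~ N(prompt_repr, DATA_STD**2).

    Hypothetical placeholder: a trained variation network would go here.
    """
    ab = ALPHA_BARS[t]
    var_t = ab * DATA_STD ** 2 + (1.0 - ab)   # variance of the noised data
    scale = math.sqrt(1.0 - ab) / var_t
    return [scale * (x - math.sqrt(ab) * m) for x, m in zip(x_t, prompt_repr)]


def sample_reference_repr(prompt, seed):
    """DDPM-style ancestral sampling that starts from Gaussian noise."""
    rng = random.Random(seed)
    cond = encode_prompt(prompt)
    x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    for t in reversed(range(STEPS)):
        beta, ab = BETAS[t], ALPHA_BARS[t]
        eps = predict_noise(x, t, cond)
        # Remove the predicted noise and undo the step's scaling.
        x = [(xi - beta / math.sqrt(1.0 - ab) * ei) / math.sqrt(1.0 - beta)
             for xi, ei in zip(x, eps)]
        if t > 0:  # no fresh noise is added on the final step
            x = [xi + math.sqrt(beta) * rng.gauss(0.0, 1.0) for xi in x]
    return x


if __name__ == "__main__":
    prompt = "a deep male voice, speaking slowly"
    for seed in (0, 1, 2):
        print(seed, [round(v, 3) for v in sample_reference_repr(prompt, seed)])
```

    With the same prompt, each seed draws different starting noise and therefore yields a different reference representation, while reusing a seed reproduces the same one. That is the sense in which sampling controls voice variability in this sketch.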

    To address the data-scale challenge, the Microsoft researchers propose a pipeline that automatically creates text prompts for speech, using a speech understanding model to recognize voice characteristics from speech and a large language model to compose text prompts from the recognition results. In particular, they use a speech understanding model to identify the attribute values of each speech sample in a speech dataset, describing the voice from several aspects. The text prompt is then created by putting these descriptions together, with each attribute described in its own sentence. In contrast to earlier studies, which relied on vendors to write and combine the phrases, PromptTTS 2 uses large language models, which have proven capable of performing a range of tasks at a level comparable to that of a person.
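
    As a rough sketch of how such a pipeline might be wired together, the snippet below turns recognized attributes into one sentence each and wraps them in an instruction for a language model. The attribute names (`gender`, `pitch`, `speed`, `volume`), the sentence template, and the instruction wording are illustrative assumptions and are not taken from the paper. The speech understanding model and the LLM call are left out, so the input dictionary stands in for the recognition results.

```python
def attribute_sentences(attributes):
    """Write one short sentence per recognized voice attribute."""
    return [f"The {name} of the voice is {value}."
            for name, value in attributes.items()]


def build_llm_instruction(attributes):
    """Compose the instruction that would be sent to a large language model.

    The wording is a hypothetical example, not the prompt used by the authors.
    """
    header = ("Combine the following descriptions of a voice into one fluent "
              "text prompt that keeps every attribute:")
    bullets = "\n".join(f"- {s}" for s in attribute_sentences(attributes))
    return header + "\n" + bullets


if __name__ == "__main__":
    # Stand-in for the output of a speech understanding model on one sample.
    recognized = {"gender": "female", "pitch": "high",
                  "speed": "slow", "volume": "low"}
    print(build_llm_instruction(recognized))
```

    Keeping the per-attribute sentences separate from the final instruction mirrors the two stages described above: recognition results are first verbalized, and the language model is then asked to merge them.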

    They instruct the LLM to write high-quality prompts that describe the attributes and to combine the sentences into a comprehensive prompt. Thanks to this fully automated workflow, human intervention is no longer needed for prompt writing. The paper's contributions are summarized as follows:

    • To address the one-to-many problem in text prompt-based TTS systems, they build a diffusion model-based variation network to model the voice variability not covered by the text prompt. During inference, the voice variability can be controlled by sampling from different Gaussian noise conditioned on the text prompt.

    • They build and release a text prompt dataset produced by a prompt generation pipeline that uses a large language model. The pipeline produces high-quality prompts and reduces the dependency on vendors.

    • They evaluate PromptTTS 2 on a large-scale speech dataset containing 44K hours of speech data. According to the experimental results, PromptTTS 2 outperforms earlier work in generating voices that more closely match the text prompt, while supporting control of voice variability by sampling from Gaussian noise.


    Check out the Paper and Samples. All credit for this research goes to the researchers on this project.



    Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.


    © 2025 Ztoog.
