    Microsoft Researchers Unveil PromptTTS 2: Revolutionizing Text-to-Speech with Enhanced Voice Variability and Cost-Effective Prompt Generation


    The intelligibility and naturalness of synthesized speech have improved thanks to recent advances in text-to-speech (TTS) systems. Large-scale TTS systems have been built for multi-speaker settings, and some TTS systems have reached quality on par with single-speaker recordings. Despite these advances, modeling voice variability remains difficult, because different ways of saying the same word can convey additional information, such as emotion and tone. Traditional TTS methods usually rely on speaker information or speech prompts to model voice variability. These methods are not user-friendly, however, because the speaker ID is predefined and a suitable speech prompt is hard to find or does not exist.

    A more promising approach to modeling voice variability is to use text prompts that describe voice characteristics, since natural language is a convenient interface for users to express their intent for voice generation. This approach makes it simple to create voices from text prompts. TTS systems based on text prompts are typically trained on a dataset of speech paired with the corresponding text prompts. The text prompt describing the variability or style of the voice conditions how the model generates the voice.

    Text-prompt TTS systems still face two main difficulties:

    • One-to-Many Challenge: Because voice quality varies from person to person, written descriptions cannot capture every aspect of speech precisely. Different voice samples will inevitably correspond to the same prompt. This one-to-many phenomenon makes TTS model training harder and can lead to over-fitting or mode collapse. As far as the authors know, no methods have been designed specifically to address the one-to-many problem in TTS systems based on text prompts.

    • Data-Scale Challenge: Because text prompts describing voices are rare on the internet, compiling a dataset of such prompts is not easy.

    As a result, vendors are hired to write prompts, which is both expensive and time-consuming. Prompt datasets are typically small or private, which makes further research on prompt-based TTS systems difficult. To overcome these challenges, the researchers present PromptTTS 2, which proposes a variation network to model the voice variability information in speech that the prompts do not capture, and uses a large language model to produce high-quality prompts. For the one-to-many challenge, the variation network predicts the missing information about voice variability from the text prompt. It is trained using the reference speech, which is assumed to contain all information about voice variability.

    The TTS model in PromptTTS 2 consists of a text prompt encoder for text prompts, a reference speech encoder for reference speech, and a TTS module that synthesizes speech based on the representations extracted by the two encoders. A variation network is trained to predict the reference representation produced by the reference speech encoder, based on the prompt representation produced by the text prompt encoder. Because the variation network uses a diffusion model, it can sample different information about voice variability from Gaussian noise conditioned on the text prompt, which lets the characteristics of the synthesized speech be varied and gives users more flexibility when generating voices.
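    The sampling idea in the paragraph above can be illustrated with a toy sketch. It is not the PromptTTS 2 variation network: the function name, the vector values, the step count, and the fixed update rule are all assumptions made for illustration, and a real system would use a learned denoiser. The sketch only shows that one prompt representation combined with different Gaussian noise yields different reference representations.

```python
import random

def sample_reference_rep(prompt_rep, seed, num_steps=4, step_size=0.3):
    """Toy stand-in for diffusion sampling conditioned on a text prompt.

    Starts from Gaussian noise and repeatedly nudges the sample toward the
    prompt representation. A real variation network would use a learned
    denoiser; this fixed update is a hypothetical placeholder.
    """
    rng = random.Random(seed)
    # Initial sample: Gaussian noise with the same size as the prompt vector.
    x = [rng.gauss(0.0, 1.0) for _ in prompt_rep]
    for _ in range(num_steps):
        # Hypothetical "denoising" step conditioned on the prompt.
        x = [xi + step_size * (pi - xi) for xi, pi in zip(x, prompt_rep)]
    return x

prompt_rep = [0.5, -1.0, 2.0]  # made-up text prompt representation
voice_a = sample_reference_rep(prompt_rep, seed=1)
voice_b = sample_reference_rep(prompt_rep, seed=2)
voice_a_again = sample_reference_rep(prompt_rep, seed=1)
```

    Here `voice_a` and `voice_b` differ even though the prompt is the same, while reusing a seed reproduces the same result. Part of the initial noise survives the updates, which is what leaves room for variation beyond what the prompt specifies.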

    To address the data-scale challenge, the Microsoft researchers propose a pipeline that automatically creates text prompts for speech, using a speech understanding model to recognize voice characteristics from speech and a large language model to compose text prompts from the recognition results. Specifically, a speech understanding model identifies the attribute values for each speech sample in a speech dataset, describing the voice from several perspectives. The text prompt is then created by combining these descriptions, with each attribute described in its own sentence. In contrast to earlier studies, which relied on vendors to write and combine phrases, PromptTTS 2 uses large language models, which have proven capable of performing a range of tasks at a level comparable to that of a human.
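    A minimal sketch of the data flow described above, under stated assumptions: the attribute names and values are invented for illustration (the article does not list them), the function names are hypothetical, and the instruction wording is not taken from the paper. No language model is called; the code only assembles the instruction that would be sent to one.

```python
def attribute_phrases(attributes: dict[str, str]) -> list[str]:
    """Turn recognized attribute values into one short phrase per attribute."""
    # Sorting by attribute name keeps the output order deterministic.
    return [f"{value} {name}" for name, value in sorted(attributes.items())]

def build_llm_instruction(attributes: dict[str, str]) -> str:
    """Assemble a hypothetical instruction asking an LLM for a full prompt."""
    bullet_list = "\n".join(f"- {phrase}" for phrase in attribute_phrases(attributes))
    return (
        "Write one natural-sounding description of a speaker's voice "
        "that covers every property below:\n" + bullet_list
    )

# Made-up output of a speech understanding model for one speech sample.
recognized = {"speed": "fast", "pitch": "low"}
instruction = build_llm_instruction(recognized)
```

    Keeping attribute recognition separate from prompt writing means the same recognized attributes can be rephrased many ways without relabeling the speech.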

    They instruct the LLM to write high-quality prompts that describe the attributes and to combine the sentences into a complete prompt. Thanks to this fully automated workflow, human intervention in prompt writing is no longer needed. The paper's contributions are summarized as follows:

    • To address the one-to-many problem in text prompt-based TTS systems, they build a diffusion model-based variation network to model the voice variability not covered by the text prompt. During inference, voice variability can be controlled by sampling from different Gaussian noise conditioned on the text prompt.

    • They build and release a text prompt dataset produced by a text prompt generation pipeline and a large language model. The pipeline produces high-quality prompts and reduces reliance on vendors.

    • They evaluate PromptTTS 2 on a large-scale speech dataset containing 44K hours of speech data. According to the experimental results, PromptTTS 2 outperforms earlier work in generating voices that more closely match the text prompt, while supporting control of voice variability by sampling from Gaussian noise.


    Check out the Paper and Samples. All credit for this research goes to the researchers on this project.


    Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

