    Rapid text-to-image generation on-device – Google Research Blog


    Posted by Yang Zhao, Senior Software Engineer, and Tingbo Hou, Senior Staff Software Engineer, Core ML

    Text-to-image diffusion models have shown exceptional capabilities in generating high-quality images from text prompts. However, leading models feature billions of parameters and are consequently expensive to run, requiring powerful desktops or servers (e.g., Stable Diffusion, DALL·E, and Imagen). While recent advances in inference solutions on Android via MediaPipe and iOS via Core ML have been made in the past year, rapid (sub-second) text-to-image generation on mobile devices has remained out of reach.

    To that end, in “MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices”, we introduce a novel approach with the potential for rapid text-to-image generation on-device. MobileDiffusion is an efficient latent diffusion model specifically designed for mobile devices. We also adopt DiffusionGAN to achieve one-step sampling during inference, which fine-tunes a pre-trained diffusion model while leveraging a GAN to model the denoising step. We have tested MobileDiffusion on premium iOS and Android devices, and it can run in half a second to generate a 512×512 high-quality image. Its comparably small model size of just 520M parameters makes it uniquely suited for mobile deployment.

          
    Rapid text-to-image generation on-device.

    Background

    The relative inefficiency of text-to-image diffusion models arises from two principal challenges. First, the inherent design of diffusion models requires iterative denoising to generate images, necessitating multiple evaluations of the model. Second, the complexity of the network architecture in text-to-image diffusion models involves a substantial number of parameters, regularly reaching into the billions and resulting in computationally expensive evaluations. As a consequence, despite the potential benefits of deploying generative models on mobile devices, such as enhancing user experience and addressing emerging privacy concerns, this remains relatively unexplored in the current literature.

    The optimization of inference efficiency in text-to-image diffusion models has been an active research area. Previous studies predominantly concentrate on addressing the first challenge, seeking to reduce the number of function evaluations (NFEs). Leveraging advanced numerical solvers (e.g., DPM) or distillation techniques (e.g., progressive distillation, consistency distillation), the number of necessary sampling steps has been reduced significantly, from several hundred to single digits. Some recent techniques, like DiffusionGAN and Adversarial Diffusion Distillation, even reduce it to a single necessary step.
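Why NFEs dominate is easy to see with back-of-envelope arithmetic: total sampling latency is roughly the number of UNet forward passes times the per-pass cost. The sketch below is ours, not from the paper, and the per-evaluation latency is an assumed, illustrative figure.

```python
# Back-of-envelope diffusion sampler cost as a function of NFEs
# (UNet forward passes). The per-evaluation latency is an assumed number.

def sampler_latency_ms(nfe: int, ms_per_eval: float = 100.0) -> float:
    """Total sampling latency, dominated by NFE model evaluations."""
    return nfe * ms_per_eval

hundreds = sampler_latency_ms(250)  # early samplers: several hundred steps
solver = sampler_latency_ms(8)      # advanced solver / distillation: single digits
one_step = sampler_latency_ms(1)    # DiffusionGAN-style one-step sampling
```

Under these assumptions, going from 250 steps to one step cuts sampling latency by 250×, which is why NFE reduction was the first lever the field pulled.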

    However, on mobile devices, even a small number of evaluation steps can be slow due to the complexity of the model architecture. Thus far, the architectural efficiency of text-to-image diffusion models has received comparatively less attention. A handful of earlier works briefly touch upon this subject, involving the removal of redundant neural network blocks (e.g., SnapFusion). However, these efforts lack a comprehensive analysis of each component within the model architecture, thereby falling short of providing a holistic guide for designing highly efficient architectures.

    MobileDiffusion

    Effectively overcoming the challenges imposed by the limited computational power of mobile devices requires an in-depth and holistic exploration of the model’s architectural efficiency. In pursuit of this objective, our research undertakes a detailed examination of each constituent and computational operation within Stable Diffusion’s UNet architecture. We present a comprehensive guide for crafting highly efficient text-to-image diffusion models, culminating in MobileDiffusion.

    The design of MobileDiffusion follows that of latent diffusion models. It contains three components: a text encoder, a diffusion UNet, and an image decoder. For the text encoder, we use CLIP-ViT/L14, which is a small model (125M parameters) suitable for mobile. We then turn our focus to the diffusion UNet and image decoder.
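The three components above can be sketched as a shape trace (our sketch, not code from the paper; the 77-token / 768-dimension text-embedding shape and the 8-channel latent size are illustrative assumptions):

```python
# Shape trace of the latent-diffusion pipeline: text encoder -> diffusion
# UNet (operating on a latent) -> image decoder. Shapes are illustrative.

def text_encoder(prompt: str, seq_len: int = 77, embed_dim: int = 768):
    """CLIP-style encoders pad/truncate prompts to a fixed token length."""
    return (seq_len, embed_dim)          # text-embedding shape

def diffusion_unet(latent_shape, text_shape):
    """One (or a few) denoising steps; the latent shape is preserved."""
    return latent_shape

def image_decoder(latent_shape, upsample: int = 8):
    """Decode the latent back to an RGB image, 8x larger spatially."""
    h, w, _ = latent_shape
    return (h * upsample, w * upsample, 3)

text = text_encoder("a photo of a corgi")   # (77, 768)
latent = diffusion_unet((64, 64, 8), text)  # 8-channel latent
image = image_decoder(latent)               # (512, 512, 3)
```

Only the UNet runs repeatedly during sampling; the encoder and decoder each run once, which is why the following sections optimize all three but focus hardest on the UNet.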

    Diffusion UNet

    As illustrated in the figure below, diffusion UNets commonly interleave transformer blocks and convolution blocks. We conduct a comprehensive investigation of these two fundamental building blocks. Throughout the study, we control the training pipeline (e.g., data, optimizer) to study the effects of different architectures.

    In classic text-to-image diffusion models, a transformer block consists of a self-attention layer (SA) for modeling long-range dependencies among visual features, a cross-attention layer (CA) to capture interactions between text conditioning and visual features, and a feed-forward layer (FF) to post-process the output of the attention layers. These transformer blocks hold a pivotal role in text-to-image diffusion models, serving as the primary components responsible for text comprehension. However, they also pose a significant efficiency challenge, given the computational expense of the attention operation, which is quadratic in the sequence length. We follow the idea of the UViT architecture, which places more transformer blocks at the bottleneck of the UNet. This design choice is motivated by the fact that the attention computation is less resource-intensive at the bottleneck due to its lower dimensionality.

    Our UNet architecture incorporates more transformers in the middle, and skips self-attention (SA) layers at higher resolutions.
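The quadratic cost can be made concrete with a rough FLOP count (our illustrative sketch; the channel width is an assumption, not the actual MobileDiffusion configuration). An H×W feature map flattens to H·W tokens, so the QKᵀ and attention-weighted-sum terms grow with the square of the resolution:

```python
# Rough FLOP count for one self-attention layer over an HxW feature map,
# showing why attention is far cheaper at the low-resolution bottleneck.

def self_attention_flops(height: int, width: int, channels: int) -> int:
    tokens = height * width                 # sequence length L = H*W
    qkv = 3 * tokens * channels * channels  # Q, K, V projections: linear in L
    scores = tokens * tokens * channels     # Q @ K^T: quadratic in L
    weighted = tokens * tokens * channels   # softmax(QK^T) @ V: quadratic in L
    out = tokens * channels * channels      # output projection: linear in L
    return qkv + scores + weighted + out

hi = self_attention_flops(64, 64, 512)  # high-resolution level
lo = self_attention_flops(16, 16, 512)  # bottleneck level
print(f"64x64: {hi:,} FLOPs; 16x16: {lo:,} FLOPs; ratio ~{hi / lo:.0f}x")
```

Under these assumed sizes, the same layer is about 64× more expensive at 64×64 than at 16×16, which is the motivation for skipping SA at high resolutions and concentrating transformers at the bottleneck.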

    Convolution blocks, specifically ResNet blocks, are deployed at each level of the UNet. While these blocks are instrumental for feature extraction and information flow, their associated computational costs, especially at high-resolution levels, can be substantial. One proven approach in this context is separable convolution. We observed that replacing regular convolution layers with lightweight separable convolution layers in the deeper segments of the UNet yields similar performance.
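The savings from that substitution are easy to count. A depthwise-separable convolution factors a k×k convolution into a per-channel k×k depthwise filter plus a 1×1 pointwise projection. The channel counts below are illustrative assumptions, not MobileDiffusion's actual widths:

```python
# Parameter count: standard kxk convolution vs. a depthwise-separable
# replacement (depthwise kxk + pointwise 1x1). Channel counts are illustrative.

def conv_params(k: int, c_in: int, c_out: int) -> int:
    return k * k * c_in * c_out

def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    depthwise = k * k * c_in   # one kxk filter per input channel
    pointwise = c_in * c_out   # 1x1 projection to c_out channels
    return depthwise + pointwise

full = conv_params(3, 512, 512)           # 2,359,296 parameters
sep = separable_conv_params(3, 512, 512)  # 266,752 parameters
print(f"standard: {full:,}  separable: {sep:,}  (~{full / sep:.1f}x fewer)")
```

At these sizes the separable form uses roughly 8.8× fewer parameters (and proportionally fewer multiply-adds per pixel), which is why the swap is attractive where it does not hurt quality.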

    In the figure below, we compare the UNets of several diffusion models. Our MobileDiffusion exhibits superior efficiency in terms of FLOPs (floating-point operations) and number of parameters.

    Comparison of some diffusion UNets.

    Image decoder

    In addition to the UNet, we also optimized the image decoder. We trained a variational autoencoder (VAE) to encode an RGB image to an 8-channel latent variable with 8× smaller spatial size than the image. A latent variable can be decoded back to an image, becoming 8× larger in size. To further improve efficiency, we designed a lightweight decoder architecture by pruning the original’s width and depth. The resulting lightweight decoder delivers a significant performance boost, with nearly 50% latency improvement and better quality. For more details, please refer to our paper.

    VAE reconstruction. Our VAE decoders have better visual quality than SD (Stable Diffusion).

    Decoder     #Params (M)   PSNR↑   SSIM↑   LPIPS↓
    SD          49.5          26.7    0.76    0.037
    Ours        39.3          30.0    0.83    0.032
    Ours-Lite    9.8          30.2    0.84    0.032
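The latent sizing above implies a large compression factor, which is what makes running the UNet in latent space rather than pixel space cheap. A quick sanity check of the arithmetic (our sketch):

```python
# Element-count arithmetic for the VAE: a 512x512 RGB image vs. its
# 8-channel latent at 8x smaller spatial resolution.

def pixel_elems(h: int, w: int) -> int:
    return h * w * 3                         # RGB image

def latent_elems(h: int, w: int, down: int = 8, ch: int = 8) -> int:
    return (h // down) * (w // down) * ch    # 8-channel, 8x downsampled

px = pixel_elems(512, 512)    # 786,432 values
lt = latent_elems(512, 512)   # 32,768 values
print(px // lt)               # 24x fewer elements in latent space
```

So every denoising step operates on 24× fewer values than a pixel-space model would at the same output resolution.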

    One-step sampling

    In addition to optimizing the model architecture, we adopt a DiffusionGAN hybrid to achieve one-step sampling. Training DiffusionGAN hybrid models for text-to-image generation encounters several intricacies. Notably, the discriminator, a classifier distinguishing real data from generated data, must make judgments based on both texture and semantics. Moreover, the cost of training text-to-image models can be extremely high, particularly in the case of GAN-based models, where the discriminator introduces additional parameters. Purely GAN-based text-to-image models (e.g., StyleGAN-T, GigaGAN) confront similar complexities, resulting in highly intricate and expensive training.

    To overcome these challenges, we use a pre-trained diffusion UNet to initialize both the generator and the discriminator. This design enables seamless initialization with the pre-trained diffusion model. We postulate that the internal features within the diffusion model contain rich information about the intricate interplay between textual and visual data. This initialization strategy significantly streamlines training.

    The figure below illustrates the training procedure. After initialization, a noisy image is sent to the generator for one-step diffusion. The result is evaluated against the ground truth with a reconstruction loss, similar to diffusion model training. We then add noise to the output and send it to the discriminator, whose result is evaluated with a GAN loss, effectively adopting the GAN to model a denoising step. By using pre-trained weights to initialize the generator and the discriminator, the training becomes a fine-tuning process, which converges in fewer than 10K iterations.

    Illustration of DiffusionGAN fine-tuning.
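The control flow of one fine-tuning step can be sketched in a few lines. This is a toy stand-in of ours, not the paper's code: the real generator and discriminator are pre-trained diffusion UNets, while here they are trivial functions so the step is runnable end to end.

```python
import math
import random

random.seed(0)

def generator(noisy):
    """Stand-in one-step denoiser (real model: a pre-trained diffusion UNet)."""
    return [0.9 * v for v in noisy]

def discriminator(x):
    """Stand-in real/fake logit (real model: also initialized from the UNet)."""
    return sum(x) / len(x)

def training_step(clean):
    noisy = [v + random.gauss(0, 1) for v in clean]        # forward diffusion
    denoised = generator(noisy)                            # one-step generation
    recon = sum((d - c) ** 2                               # reconstruction loss
                for d, c in zip(denoised, clean)) / len(clean)
    renoised = [v + random.gauss(0, 1) for v in denoised]  # add noise again
    logit = discriminator(renoised)                        # judge the re-noised output
    gan = math.log1p(math.exp(-logit))                     # non-saturating GAN loss
    return recon + gan                                     # combined objective

loss = training_step([random.gauss(0, 1) for _ in range(16)])
```

The key structural points survive even in this toy: the generator denoises in a single step, the reconstruction loss anchors it to diffusion training, and the discriminator only ever sees re-noised outputs, so the GAN models the denoising step rather than raw images.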

    Results

    Below we show example images generated by our MobileDiffusion with DiffusionGAN one-step sampling. With such a compact model (520M parameters in total), MobileDiffusion can generate high-quality, diverse images across various domains.

    Images generated by our MobileDiffusion

    We measured the performance of our MobileDiffusion on both iOS and Android devices, using different runtime optimizers. The latency numbers are reported below. We see that MobileDiffusion is very efficient and can run within half a second to generate a 512×512 image. This lightning speed potentially enables many interesting use cases on mobile devices.

    Latency measurements (s) on mobile devices.

    Conclusion

    With superior efficiency in terms of latency and size, MobileDiffusion has the potential to be a very friendly option for mobile deployments, given its capability to enable a rapid image-generation experience while the user types a text prompt. And we will ensure any application of this technology is in line with Google’s responsible AI practices.

    Acknowledgments

    We would like to thank our collaborators and contributors who helped bring MobileDiffusion on-device: Zhisheng Xiao, Yanwu Xu, Jiuqiang Tang, Haolin Jia, Lutz Justen, Daniel Fenner, Ronald Wotzlaw, Jianing Wei, Raman Sarokin, Juhyun Lee, Andrei Kulik, Chuo-Ling Chang, and Matthias Grundmann.
