    A large language model for zero-shot video generation – Google Research Blog

    Posted by Dan Kondratyuk and David Ross, Software Engineers, Google Research

    A recent wave of video generation models has burst onto the scene, in many cases showcasing stunning picturesque quality. One of the current bottlenecks in video generation is the ability to produce coherent large motions. In many cases, even the current leading models either generate small motion or, when producing larger motions, exhibit noticeable artifacts.

    To explore the application of language models in video generation, we introduce VideoPoet, a large language model (LLM) that is capable of a wide variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. One notable observation is that the leading video generation models are almost exclusively diffusion-based (for one example, see Imagen Video). On the other hand, LLMs are widely recognized as the de facto standard due to their exceptional learning capabilities across various modalities, including language, code, and audio (e.g., AudioPaLM). In contrast to other models in this space, our approach seamlessly integrates many video generation capabilities within a single LLM, rather than relying on separately trained components that specialize in each task.

    Overview

    The diagram below illustrates VideoPoet’s capabilities. Input images can be animated to produce motion, and (optionally cropped or masked) video can be edited for inpainting or outpainting. For stylization, the model takes in a video representing the depth and optical flow, which capture the motion, and paints contents on top to produce the text-guided style.

    An overview of VideoPoet, capable of multitasking on a variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for text-to-video, image-to-video, video-to-audio, stylization, and outpainting tasks. Resources used: Wikimedia Commons and DAVIS.

    Language models as video generators

    One key advantage of using LLMs for training is that one can reuse many of the scalable efficiency improvements that have been introduced in existing LLM training infrastructure. However, LLMs operate on discrete tokens, which can make video generation challenging. Fortunately, there exist video and audio tokenizers, which serve to encode video and audio clips as sequences of discrete tokens (i.e., integer indices), and which can also be converted back into the original representation.
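
    As a rough sketch of that interface (the class and method names below are hypothetical, and the "tokenizer" is a toy patch-averaging stand-in rather than a learned codebook such as MAGVIT V2 or SoundStream), encoding turns a clip into integer indices and decoding maps them back to something viewable:

```python
import numpy as np

# Toy stand-in for a learned video tokenizer. A real tokenizer uses a learned
# codebook; here we simply bucket mean patch intensities into a small
# vocabulary to illustrate the encode/decode interface.
class ToyVideoTokenizer:
    def __init__(self, patch: int = 16, vocab_size: int = 256):
        self.patch = patch
        self.vocab_size = vocab_size

    def encode(self, frames: np.ndarray) -> np.ndarray:
        """Map a (T, H, W) grayscale clip to a 1-D array of integer token ids."""
        t, h, w = frames.shape
        p = self.patch
        cropped = frames[:, : h - h % p, : w - w % p]
        patches = cropped.reshape(t, h // p, p, w // p, p).mean(axis=(2, 4))
        return np.clip(patches, 0, self.vocab_size - 1).astype(np.int64).ravel()

    def decode(self, tokens: np.ndarray, shape: tuple) -> np.ndarray:
        """Approximately invert encode(): paint each token back over its patch."""
        t, h, w = shape
        p = self.patch
        grid = tokens.reshape(t, h // p, w // p)
        return np.kron(grid, np.ones((1, p, p))).astype(np.uint8)

tokenizer = ToyVideoTokenizer()
clip = np.random.randint(0, 256, size=(17, 128, 224)).astype(np.float32)
ids = tokenizer.encode(clip)               # sequence of discrete integer indices
recon = tokenizer.decode(ids, clip.shape)  # back to a (blurry) viewable clip
```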

    VideoPoet trains an autoregressive language model to learn across video, image, audio, and text modalities through the use of multiple tokenizers (MAGVIT V2 for video and image, and SoundStream for audio). Once the model generates tokens conditioned on some context, these can be converted back into a viewable representation with the tokenizer decoders.

    A detailed look at the VideoPoet task design, showing the training and inference inputs and outputs of various tasks. Modalities are converted to and from tokens using tokenizer encoders and decoders. Each modality is surrounded by boundary tokens, and a task token indicates the type of task to perform.
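
    As a minimal sketch of that sequence layout (the special-token names and integer ids below are illustrative, not VideoPoet’s actual vocabulary), a text-to-video training example might be assembled as a task token followed by boundary-wrapped text and video spans:

```python
# Illustrative multimodal sequence layout: a task token, then each modality
# wrapped in boundary tokens. All names and ids here are made up; real
# vocabularies keep special and content token ids in disjoint ranges.
SPECIAL = {
    "<task:text_to_video>": 0,
    "<bos_text>": 1, "<eos_text>": 2,
    "<bos_video>": 3, "<eos_video>": 4,
}

def build_sequence(task: str, text_tokens: list[int], video_tokens: list[int]) -> list[int]:
    """Concatenate a task token with boundary-delimited text and video spans."""
    return (
        [SPECIAL[task]]
        + [SPECIAL["<bos_text>"]] + text_tokens + [SPECIAL["<eos_text>"]]
        + [SPECIAL["<bos_video>"]] + video_tokens + [SPECIAL["<eos_video>"]]
    )

# The text span conditions the model; the video span is the prediction target.
sequence = build_sequence("<task:text_to_video>",
                          text_tokens=[101, 57, 902],
                          video_tokens=[12, 8, 44, 3])
print(sequence)
```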

    Examples generated by VideoPoet

    Some examples generated by our model are shown below.

    Videos generated by VideoPoet from various text prompts. For the specific text prompts, refer to the website.

    For text-to-video, video outputs are variable length and can apply a range of motions and styles depending on the text content. To ensure responsible practices, we reference artworks and styles in the public domain, e.g., Van Gogh’s “Starry Night”.

    Text prompts (video outputs are shown on the website): “A Raccoon dancing in Times Square”; “A horse galloping through Van Gogh’s ‘Starry Night’”; “Two pandas playing cards”; “A large blob of exploding splashing rainbow paint, with an apple emerging, 8k”.

    For image-to-video, VideoPoet can take the input image and animate it with a prompt.

    An example of image-to-video with text prompts to guide the motion. Each video is paired with an image to its left. Left: “A ship navigating the rough seas, thunderstorm and lightning, animated oil on canvas”. Middle: “Flying through a nebula with many twinkling stars”. Right: “A wanderer on a cliff with a cane looking down at the swirling sea fog below on a windy day”. Reference: Wikimedia Commons, public domain**.

    For video stylization, we predict the optical flow and depth information before feeding it into VideoPoet with some additional input text.

    Examples of video stylization on top of VideoPoet text-to-video generated videos with text prompts, depth, and optical flow used as conditioning. The left video in each pair is the input video, the right is the stylized output. Left: “Wombat wearing sunglasses holding a beach ball on a sunny beach.” Middle: “Teddy bears ice skating on a crystal clear frozen lake.” Right: “A metal lion roaring in the light of a forge.”

    VideoPoet is also capable of generating audio. Here we first generate 2-second clips from the model and then try to predict the audio without any text guidance. This enables generation of video and audio from a single model.

    An example of video-to-audio, generating audio from a video example without any text input.
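
    A minimal sketch of that two-stage flow, assuming hypothetical generation functions (both are stubbed with random data here so the snippet runs; the shapes, frame rate, and sample rate are assumptions, not the model’s actual settings):

```python
import numpy as np

def generate_video_clip(prompt: str, seconds: int = 2, fps: int = 8) -> np.ndarray:
    """Stage 1 (stub): text-to-video, returning a (T, H, W, C) clip."""
    return np.random.randint(0, 256, size=(seconds * fps, 64, 64, 3), dtype=np.uint8)

def predict_audio(video: np.ndarray, fps: int = 8, sample_rate: int = 16000) -> np.ndarray:
    """Stage 2 (stub): audio predicted from the generated video alone, with no text."""
    seconds = video.shape[0] // fps
    return np.random.uniform(-1.0, 1.0, size=seconds * sample_rate).astype(np.float32)

clip = generate_video_clip("a short text prompt")  # placeholder prompt
waveform = predict_audio(clip)                     # audio conditioned only on the video
```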

    By default, the VideoPoet model generates videos in portrait orientation to tailor its output towards short-form content. To showcase its capabilities, we have produced a short movie composed of many short clips generated by VideoPoet. For the script, we asked Bard to write a short story about a traveling raccoon with a scene-by-scene breakdown and a list of accompanying prompts. We then generated video clips for each prompt and stitched together all resulting clips to produce the final video below.

    When we developed VideoPoet, we noticed some nice properties of the model’s capabilities, which we highlight below.

    Long video

    We are able to generate longer videos simply by conditioning on the last 1 second of video and predicting the next 1 second. By chaining this repeatedly, we show that the model can not only extend the video well but also faithfully preserve the appearance of all objects even over several iterations.
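
    A minimal sketch of this chaining loop, with the model call replaced by a stub so the snippet runs; the frame rate and one-second conditioning window are assumptions for illustration:

```python
import numpy as np

FPS = 8  # assumed frame rate; one "chunk" below is one second of video

def extend_video(generate_next, seed_clip: np.ndarray, extra_seconds: int) -> np.ndarray:
    """Condition on the last second, generate the next second, append, repeat."""
    video = seed_clip
    for _ in range(extra_seconds):
        last_second = video[-FPS:]                # conditioning window
        next_second = generate_next(last_second)  # model call (stubbed below)
        video = np.concatenate([video, next_second], axis=0)
    return video

def fake_generate_next(context: np.ndarray) -> np.ndarray:
    """Stub standing in for VideoPoet: repeats the last frame with slight noise."""
    jitter = np.random.randint(-2, 3, size=(FPS,) + context.shape[1:])
    return np.clip(context[-1:] + jitter, 0, 255).astype(context.dtype)

seed = np.random.randint(0, 256, size=(FPS, 64, 64, 3), dtype=np.uint8)
long_clip = extend_video(fake_generate_next, seed, extra_seconds=5)
print(long_clip.shape)  # (48, 64, 64, 3): one seed second plus five generated seconds
```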

    Here are two examples of VideoPoet generating long videos from text input:

    Text prompts (video outputs are shown on the website): “An astronaut starts dancing on Mars. Colorful fireworks then explode in the background.”; “FPV footage of a very sharp elven city of stone in the jungle with a brilliant blue river, waterfall, and large steep vertical cliff faces.”

    It is also possible to interactively edit existing video clips generated by VideoPoet. If we supply an input video, we can change the motion of objects to perform different actions. The object manipulation can be centered on the first frame or the middle frames, which allows for a high degree of editing control.

    For example, we can randomly generate some clips from the input video and select the desired next clip.

    An input video on the left is used as conditioning to generate four choices given the initial prompt: “Closeup of an adorable rusty broken-down steampunk robot covered in moss moist and budding vegetation, surrounded by tall grass”. For the first three outputs we show what would happen for unprompted motions. For the last video in the list below, we add “powering up with smoke in the background” to the prompt to guide the motion.
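
    A minimal sketch of that select-the-next-clip loop, with the model call stubbed by random frames so it runs; the function names and clip shapes are assumptions:

```python
import numpy as np

def generate_continuation(input_clip: np.ndarray, prompt: str, seed: int) -> np.ndarray:
    """Stub for a model call that continues input_clip under the given prompt."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=input_clip.shape, dtype=np.uint8)

def propose_candidates(input_clip: np.ndarray, prompt: str, n: int = 4) -> list[np.ndarray]:
    """Sample n differently-seeded continuations for a person to choose from."""
    return [generate_continuation(input_clip, prompt, seed=i) for i in range(n)]

clip = np.zeros((8, 64, 64, 3), dtype=np.uint8)
candidates = propose_candidates(
    clip, "Closeup of an adorable rusty broken-down steampunk robot covered in moss", n=4)
chosen = candidates[2]  # in practice, the user picks the continuation they prefer
```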

    Image-to-video control

    Similarly, we can apply motion to an input image to edit its contents towards the desired state, conditioned on a text prompt.

    Animating a painting with different prompts. Left: “A woman turning to look at the camera.” Right: “A woman yawning.” **

    Camera motion

    We can also accurately control camera movements by appending the type of desired camera motion to the text prompt. As an example, we generated an image with our model from the prompt, “Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river”. The examples below append the given text suffix to apply the desired motion.

    Prompts from left to right: “Zoom out”, “Dolly zoom”, “Pan left”, “Arc shot”, “Crane shot”, “FPV drone shot”.
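
    Assembling such prompts is just string concatenation; the exact separator the model expects is not specified here, so the formatting below is only an assumption:

```python
# Append a camera-motion suffix to a base prompt; the suffixes mirror the examples above.
base_prompt = ("Adventure game concept art of a sunrise over a snowy mountain "
               "by a crystal clear river")
camera_moves = ["Zoom out", "Dolly zoom", "Pan left", "Arc shot", "Crane shot", "FPV drone shot"]
prompts = [f"{base_prompt}. {move}" for move in camera_moves]
for p in prompts:
    print(p)
```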

    Evaluation results

    We evaluate VideoPoet on text-to-video generation with a variety of benchmarks to compare the results to other approaches. To ensure a neutral evaluation, we ran all models on a wide variation of prompts without cherry-picking examples and asked people to rate their preferences. The figure below highlights the percentage of the time VideoPoet was chosen as the preferred option, in green, for the following questions.

    Text fidelity

    User preference ratings for text fidelity, i.e., what percentage of videos are preferred in terms of accurately following a prompt.

    Motion interestingness

    User preference ratings for motion interestingness, i.e., what percentage of videos are preferred in terms of producing interesting motion.

    Based on the above, on average people selected 24–35% of examples from VideoPoet as following prompts better than a competing model, vs. 8–11% for competing models. Raters also preferred 41–54% of examples from VideoPoet for more interesting motion, compared to 11–21% for other models.

    Conclusion

    Through VideoPoet, we have demonstrated LLMs’ highly competitive video generation quality across a wide variety of tasks, especially in producing interesting and high-quality motions within videos. Our results suggest the promising potential of LLMs in the field of video generation. For future directions, our framework should be able to support “any-to-any” generation, e.g., extending to text-to-audio, audio-to-video, and video captioning should be possible, among many others.

    To view more examples in original quality, see the website demo.

    Acknowledgements

    This research has been made possible by a large body of contributors, including Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, and Lu Jiang.

    We give special thanks to Alex Siegman and Victor Gomes for managing computing resources. We also give thanks to Aren Jansen, Marco Tagliasacchi, Neil Zeghidour, and John Hershey for audio tokenization and processing, Angad Singh for storyboarding in “Rookie the Raccoon”, Cordelia Schmid for research discussions, Alonso Martinez for graphic design, David Salesin, Tomas Izo, and Rahul Sukthankar for their support, and Jay Yagnik as architect of the initial concept.

    **

    (a) The Storm on the Sea of Galilee, by Rembrandt, 1633, public domain.

    (b) Pillars of Creation, by NASA, 2014, public domain.

    (c) Wanderer above the Sea of Fog, by Caspar David Friedrich, 1818, public domain.

    (d) Mona Lisa, by Leonardo da Vinci, 1503, public domain.
