
    A large language model for zero-shot video generation – Google Research Blog


    Posted by Dan Kondratyuk and David Ross, Software Engineers, Google Research

    A recent wave of video generation models has burst onto the scene, in many cases showcasing stunning picturesque quality. One of the current bottlenecks in video generation is the ability to produce coherent large motions. In many cases, even the current leading models either generate small motion or, when producing larger motions, exhibit noticeable artifacts.

    To explore the application of language models in video generation, we introduce VideoPoet, a large language model (LLM) that is capable of a wide variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. One notable observation is that the leading video generation models are almost exclusively diffusion-based (for one example, see Imagen Video). On the other hand, LLMs are widely recognized as the de facto standard due to their exceptional learning capabilities across various modalities, including language, code, and audio (e.g., AudioPaLM). In contrast to alternative models in this space, our approach seamlessly integrates many video generation capabilities within a single LLM, rather than relying on separately trained components that specialize on each task.

    Overview

    The diagram below illustrates VideoPoet’s capabilities. Input images can be animated to produce motion, and (optionally cropped or masked) video can be edited for inpainting or outpainting. For stylization, the model takes in a video representing the depth and optical flow, which represent the motion, and paints contents on top to produce the text-guided style.

    An overview of VideoPoet, capable of multitasking on a variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for text-to-video, image-to-video, video-to-audio, stylization, and outpainting tasks. Resources used: Wikimedia Commons and DAVIS.

    Language models as video generators

    One key advantage of using LLMs for training is that one can reuse many of the scalable efficiency improvements that have been introduced in existing LLM training infrastructure. However, LLMs operate on discrete tokens, which can make video generation challenging. Fortunately, there exist video and audio tokenizers, which serve to encode video and audio clips as sequences of discrete tokens (i.e., integer indices), and which can also be converted back into the original representation.
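
The encode/decode round trip described above can be illustrated with a toy quantizer. This is only a minimal sketch under stated assumptions: the tiny 2-D codebook, the nearest-neighbor lookup, and the function names `encode` and `decode` are all hypothetical, and this is not how MAGVIT V2 or SoundStream are actually implemented. It only shows the idea of mapping continuous features to integer indices and back.

```python
import math

# Hypothetical 4-entry codebook of 2-D vectors; a real tokenizer learns a far
# larger vocabulary, and MAGVIT V2 itself does not work exactly like this.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def encode(patches):
    """Map each 2-D patch feature to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: math.dist(patch, CODEBOOK[i]))
        for patch in patches
    ]


def decode(tokens):
    """Map integer tokens back to codebook vectors (a lossy reconstruction)."""
    return [CODEBOOK[t] for t in tokens]


patches = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]
tokens = encode(patches)          # discrete integer indices
reconstruction = decode(tokens)   # approximate version of the input
print(tokens)                     # [0, 3, 2]
```

The reconstruction is approximate rather than exact, which is the usual trade-off of a discrete tokenizer: the language model only ever sees the integer indices.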

    VideoPoet trains an autoregressive language model to learn across video, image, audio, and text modalities through the use of multiple tokenizers (MAGVIT V2 for video and image and SoundStream for audio). Once the model generates tokens conditioned on some context, these can be converted back into a viewable representation with the tokenizer decoders.

    A detailed look at the VideoPoet task design, showing the training and inference inputs and outputs of various tasks. Modalities are converted to and from tokens using tokenizer encoder and decoders. Each modality is surrounded by boundary tokens, and a task token indicates the type of task to perform.
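
One way to picture the task design described above is as a flat token sequence. The sketch below is illustrative only: the token spellings (`<task:...>`, `<bo_...>`, `<eo_...>`), the ordering, and the helper `build_sequence` are assumptions made for this example, and readable strings stand in for the integer indices a real vocabulary would use.

```python
# All token names below are hypothetical placeholders for illustration.
def build_sequence(task, segments):
    """Lay out one example as a task token followed by each modality's
    tokens wrapped in that modality's begin/end boundary tokens."""
    sequence = [f"<task:{task}>"]
    for modality, tokens in segments:
        sequence.append(f"<bo_{modality}>")   # boundary: modality begins
        sequence.extend(tokens)
        sequence.append(f"<eo_{modality}>")   # boundary: modality ends
    return sequence


example = build_sequence(
    "text_to_video",
    [("text", ["t17", "t4"]), ("video", ["v301", "v12", "v88"])],
)
print(example)
```

With a layout like this, the same autoregressive model can serve several tasks: the task token and the conditioning segments form the context, and the model continues the sequence with tokens for the output modality.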

    Examples generated by VideoPoet

    Some examples generated by our model are shown below.

    Videos generated by VideoPoet from various text prompts. For specific text prompts refer to the website.

    For text-to-video, video outputs are variable length and can apply a range of motions and styles depending on the text content. To ensure responsible practices, we reference artworks and styles in the public domain, e.g., Van Gogh’s “Starry Night”.

    Text Input:
    • “A Raccoon dancing in Times Square”
    • “A horse galloping through Van-Gogh’s ‘Starry Night’”
    • “Two pandas playing cards”
    • “A large blob of exploding splashing rainbow paint, with an apple emerging, 8k”
    Video Output: [one generated video per prompt]

    For image-to-video, VideoPoet can take the input image and animate it with a prompt.

    An example of image-to-video with text prompts to guide the motion. Each video is paired with an image to its left. Left: “A ship navigating the rough seas, thunderstorm and lightning, animated oil on canvas”. Middle: “Flying through a nebula with many twinkling stars”. Right: “A wanderer on a cliff with a cane looking down at the swirling sea fog below on a windy day”. Reference: Wikimedia Commons, public domain**.

    For video stylization, we predict the optical flow and depth information before feeding into VideoPoet with some additional input text.

    Examples of video stylization on top of VideoPoet text-to-video generated videos with text prompts, depth, and optical flow used as conditioning. The left video in each pair is the input video, the right is the stylized output. Left: “Wombat wearing sunglasses holding a beach ball on a sunny beach.” Middle: “Teddy bears ice skating on a crystal clear frozen lake.” Right: “A metal lion roaring in the light of a forge.”

    VideoPoet is also capable of generating audio. Here we first generate 2-second clips from the model and then try to predict the audio without any text guidance. This enables generation of video and audio from a single model.

    An example of video-to-audio, generating audio from a video example without any text input.

    By default, the VideoPoet model generates videos in portrait orientation to tailor its output towards short-form content. To showcase its capabilities, we have produced a brief movie composed of many short clips generated by VideoPoet. For the script, we asked Bard to write a short story about a traveling raccoon with a scene-by-scene breakdown and a list of accompanying prompts. We then generated video clips for each prompt, and stitched together all resulting clips to produce the final video below.

    When we developed VideoPoet, we noticed some nice properties of the model’s capabilities, which we highlight below.

    Long video

    We are able to generate longer videos simply by conditioning on the last 1 second of video and predicting the next 1 second. By chaining this repeatedly, we show that the model can not only extend the video well but also faithfully preserve the appearance of all objects even over several iterations.
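
The chaining procedure described above can be sketched as a simple loop. This is a minimal illustration, not the authors' code: `extend_video`, `predict_next_second`, and the `fps` parameter are hypothetical names, frames are represented by plain integers, and the stand-in "model" just continues a counter so the chaining is easy to verify.

```python
def extend_video(frames, predict_next_second, fps, extra_seconds):
    """Repeatedly condition on the last second of frames and append the
    predicted next second."""
    frames = list(frames)
    for _ in range(extra_seconds):
        context = frames[-fps:]                      # last 1 second of video
        frames.extend(predict_next_second(context))  # next 1 second
    return frames


# Stand-in for the model (an assumption): continues a frame counter.
def fake_model(context):
    start = context[-1] + 1
    return list(range(start, start + len(context)))


video = extend_video(
    frames=[0, 1, 2, 3], predict_next_second=fake_model, fps=4, extra_seconds=2
)
print(len(video))  # 12 frames: 1 original second plus 2 generated seconds
```

Because each step sees only the most recent second, the context length stays fixed no matter how long the final video becomes.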

    Here are two examples of VideoPoet generating long video from text input:

    Text Input:
    • “An astronaut starts dancing on Mars. Colorful fireworks then explode in the background.”
    • “FPV footage of a very sharp elven city of stone in the jungle with a brilliant blue river, waterfall, and large steep vertical cliff faces.”
    Video Output: [one generated video per prompt]

    It is also possible to interactively edit existing video clips generated by VideoPoet. If we supply an input video, we can change the motion of objects to perform different actions. The object manipulation can be centered at the first frame or the middle frames, which allows for a high degree of editing control.

    For example, we can randomly generate some clips from the input video and select the desired next clip.

    An input video on the left is used as conditioning to generate four choices given the initial prompt: “Closeup of an adorable rusty broken-down steampunk robot covered in moss moist and budding vegetation, surrounded by tall grass”. For the first three outputs we show what would happen for unprompted motions. For the last video in the list below, we add to the prompt, “powering up with smoke in the background” to guide the action.

    Image to video control

    Similarly, we can apply motion to an input image to edit its contents towards the desired state, conditioned on a text prompt.

    Animating a painting with different prompts. Left: “A woman turning to look at the camera.” Right: “A woman yawning.” **

    Camera motion

    We can also accurately control camera movements by appending the type of desired camera motion to the text prompt. As an example, we generated an image by our model with the prompt, “Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river”. The examples below append the given text suffix to apply the desired motion.

    Prompts from left to right: “Zoom out”, “Dolly zoom”, “Pan left”, “Arc shot”, “Crane shot”, “FPV drone shot”.
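
As a small illustration of the suffix approach described above, the prompts for each camera motion could be assembled as follows. The helper name `with_camera_motion` and the comma separator are assumptions for this sketch; the text only states that the camera motion is appended to the prompt, not the exact formatting.

```python
BASE_PROMPT = (
    "Adventure game concept art of a sunrise over a snowy mountain "
    "by a crystal clear river"
)
CAMERA_MOTIONS = [
    "Zoom out", "Dolly zoom", "Pan left",
    "Arc shot", "Crane shot", "FPV drone shot",
]


def with_camera_motion(prompt, motion):
    """Append the desired camera motion to the text prompt.
    The ", " separator is an assumption made for this example."""
    return f"{prompt}, {motion}"


prompts = [with_camera_motion(BASE_PROMPT, motion) for motion in CAMERA_MOTIONS]
for p in prompts:
    print(p)
```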

    Evaluation results

    We evaluate VideoPoet on text-to-video generation with a variety of benchmarks to compare the results to other approaches. To ensure a neutral evaluation, we ran all models on a wide variation of prompts without cherry-picking examples and asked people to rate their preferences. The figure below highlights, in green, the percentage of the time VideoPoet was chosen as the preferred option for the following questions.

    Text fidelity

    User preference ratings for text fidelity, i.e., what percentage of videos are preferred in terms of accurately following a prompt.

    Motion interestingness

    User preference ratings for motion interestingness, i.e., what percentage of videos are preferred in terms of producing interesting motion.

    Based on the above, on average people selected 24–35% of examples from VideoPoet as following prompts better than a competing model, vs. 8–11% for competing models. Raters also preferred 41–54% of examples from VideoPoet for more interesting motion, compared with 11–21% for other models.
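
To make the percentages above concrete, here is a minimal sketch of how pairwise preference shares could be tallied. The vote labels, the `preference_shares` helper, and the vote counts are invented for illustration and are not the study's data. The "no_preference" option is also an assumption, suggested by the fact that the reported shares for the two sides do not sum to 100%.

```python
from collections import Counter

CHOICES = ("videopoet", "other", "no_preference")


def preference_shares(votes):
    """Return the percentage of votes received by each choice."""
    counts = Counter(votes)
    total = len(votes)
    return {choice: 100.0 * counts[choice] / total for choice in CHOICES}


# Hypothetical ratings for one comparison (not real data).
votes = ["videopoet"] * 6 + ["other"] * 2 + ["no_preference"] * 12
shares = preference_shares(votes)
print(shares)  # {'videopoet': 30.0, 'other': 10.0, 'no_preference': 60.0}
```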

    Conclusion

    Through VideoPoet, we have demonstrated LLMs’ highly competitive video generation quality across a wide variety of tasks, especially in producing interesting and high quality motions within videos. Our results suggest the promising potential of LLMs in the field of video generation. For future directions, our framework should be able to support “any-to-any” generation, e.g., extending to text-to-audio, audio-to-video, and video captioning should be possible, among many others.

    To view more examples in original quality, see the website demo.

    Acknowledgements

    This research has been supported by a large body of contributors, including Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, and Lu Jiang.

    We give special thanks to Alex Siegman and Victor Gomes for managing computing resources. We also give thanks to Aren Jansen, Marco Tagliasacchi, Neil Zeghidour, and John Hershey for audio tokenization and processing, Angad Singh for storyboarding in “Rookie the Raccoon”, Cordelia Schmid for research discussions, Alonso Martinez for graphic design, David Salesin, Tomas Izo, and Rahul Sukthankar for their support, and Jay Yagnik as architect of the initial concept.

    **

    (a) The Storm on the Sea of Galilee, by Rembrandt, 1633, public domain.

    (b) Pillars of Creation, by NASA, 2014, public domain.

    (c) Wanderer above the Sea of Fog, by Caspar David Friedrich, 1818, public domain.

    (d) Mona Lisa, by Leonardo da Vinci, 1503, public domain.
