    A large language model for zero-shot video generation – Google Research Blog

    Posted by Dan Kondratyuk and David Ross, Software Engineers, Google Research

    A recent wave of video generation models has burst onto the scene, in many cases showcasing stunning picturesque quality. One of the current bottlenecks in video generation is the ability to produce coherent large motions. In many cases, even the current leading models either generate small motion or, when producing larger motions, exhibit noticeable artifacts.

    To explore the application of language models in video generation, we introduce VideoPoet, a large language model (LLM) that is capable of a wide variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. One notable observation is that the leading video generation models are almost exclusively diffusion-based (for one example, see Imagen Video). On the other hand, LLMs are widely recognized as the de facto standard due to their exceptional learning capabilities across various modalities, including language, code, and audio (e.g., AudioPaLM). In contrast to other models in this space, our approach seamlessly integrates many video generation capabilities within a single LLM, rather than relying on separately trained components that specialize in each task.

    Overview

    The diagram below illustrates VideoPoet’s capabilities. Input images can be animated to produce motion, and (optionally cropped or masked) video can be edited for inpainting or outpainting. For stylization, the model takes in a video representing the depth and optical flow, which capture the motion, and paints contents on top to produce the text-guided style.

    An overview of VideoPoet, capable of multitasking on a variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for text-to-video, image-to-video, video-to-audio, stylization, and outpainting tasks. Resources used: Wikimedia Commons and DAVIS.

    Language models as video generators

    One key advantage of using LLMs for training is that one can reuse many of the scalable efficiency improvements that have been introduced in existing LLM training infrastructure. However, LLMs operate on discrete tokens, which can make video generation challenging. Fortunately, there exist video and audio tokenizers, which serve to encode video and audio clips as sequences of discrete tokens (i.e., integer indices), and which can also be converted back into the original representation.
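
    As a rough sketch of that interface (the class and method names below are hypothetical, and the "tokenizer" is a toy patch-averaging stand-in rather than a learned codebook such as MAGVIT V2 or SoundStream), encoding turns a clip into integer indices and decoding maps them back to something viewable:

```python
import numpy as np

# Toy stand-in for a learned video tokenizer. A real tokenizer uses a learned
# codebook; here we simply bucket mean patch intensities into a small
# vocabulary to illustrate the encode/decode interface.
class ToyVideoTokenizer:
    def __init__(self, patch: int = 16, vocab_size: int = 256):
        self.patch = patch
        self.vocab_size = vocab_size

    def encode(self, frames: np.ndarray) -> np.ndarray:
        """Map a (T, H, W) grayscale clip to a 1-D array of integer token ids."""
        t, h, w = frames.shape
        p = self.patch
        cropped = frames[:, : h - h % p, : w - w % p]
        patches = cropped.reshape(t, h // p, p, w // p, p).mean(axis=(2, 4))
        return np.clip(patches, 0, self.vocab_size - 1).astype(np.int64).ravel()

    def decode(self, tokens: np.ndarray, shape: tuple) -> np.ndarray:
        """Approximately invert encode(): paint each token back over its patch."""
        t, h, w = shape
        p = self.patch
        grid = tokens.reshape(t, h // p, w // p)
        return np.kron(grid, np.ones((1, p, p))).astype(np.uint8)

tokenizer = ToyVideoTokenizer()
clip = np.random.randint(0, 256, size=(17, 128, 224)).astype(np.float32)
ids = tokenizer.encode(clip)               # sequence of discrete integer indices
recon = tokenizer.decode(ids, clip.shape)  # back to a (blurry) viewable clip
```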

    VideoPoet trains an autoregressive language model to learn across video, image, audio, and text modalities through the use of multiple tokenizers (MAGVIT V2 for video and image, and SoundStream for audio). Once the model generates tokens conditioned on some context, these can be converted back into a viewable representation with the tokenizer decoders.

    A detailed look at the VideoPoet task design, showing the training and inference inputs and outputs of various tasks. Modalities are converted to and from tokens using tokenizer encoders and decoders. Each modality is surrounded by boundary tokens, and a task token indicates the type of task to perform.
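
    As a minimal sketch of that sequence layout (the special-token names and integer ids below are illustrative, not VideoPoet’s actual vocabulary), a text-to-video training example might be assembled as a task token followed by boundary-wrapped text and video spans:

```python
# Illustrative multimodal sequence layout: a task token, then each modality
# wrapped in boundary tokens. All names and ids here are made up; real
# vocabularies keep special and content token ids in disjoint ranges.
SPECIAL = {
    "<task:text_to_video>": 0,
    "<bos_text>": 1, "<eos_text>": 2,
    "<bos_video>": 3, "<eos_video>": 4,
}

def build_sequence(task: str, text_tokens: list[int], video_tokens: list[int]) -> list[int]:
    """Concatenate a task token with boundary-delimited text and video spans."""
    return (
        [SPECIAL[task]]
        + [SPECIAL["<bos_text>"]] + text_tokens + [SPECIAL["<eos_text>"]]
        + [SPECIAL["<bos_video>"]] + video_tokens + [SPECIAL["<eos_video>"]]
    )

# The text span conditions the model; the video span is the prediction target.
sequence = build_sequence("<task:text_to_video>",
                          text_tokens=[101, 57, 902],
                          video_tokens=[12, 8, 44, 3])
print(sequence)
```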

    Examples generated by VideoPoet

    Some examples generated by our model are shown below.

    Videos generated by VideoPoet from various text prompts. For the specific text prompts, refer to the website.

    For text-to-video, video outputs are variable length and can apply a range of motions and styles depending on the text content. To ensure responsible practices, we reference artworks and styles in the public domain, e.g., Van Gogh’s “Starry Night”.

    Text prompts (video outputs are shown on the website): “A Raccoon dancing in Times Square”; “A horse galloping through Van Gogh’s ‘Starry Night’”; “Two pandas playing cards”; “A large blob of exploding splashing rainbow paint, with an apple emerging, 8k”.

    For image-to-video, VideoPoet can take the input image and animate it with a prompt.

    An example of image-to-video with text prompts to guide the motion. Each video is paired with an image to its left. Left: “A ship navigating the rough seas, thunderstorm and lightning, animated oil on canvas”. Middle: “Flying through a nebula with many twinkling stars”. Right: “A wanderer on a cliff with a cane looking down at the swirling sea fog below on a windy day”. Reference: Wikimedia Commons, public domain**.

    For video stylization, we predict the optical flow and depth information before feeding it into VideoPoet with some additional input text.

    Examples of video stylization on top of VideoPoet text-to-video generated videos with text prompts, depth, and optical flow used as conditioning. The left video in each pair is the input video, the right is the stylized output. Left: “Wombat wearing sunglasses holding a beach ball on a sunny beach.” Middle: “Teddy bears ice skating on a crystal clear frozen lake.” Right: “A metal lion roaring in the light of a forge.”

    VideoPoet is also capable of generating audio. Here we first generate 2-second clips from the model and then try to predict the audio without any text guidance. This enables generation of video and audio from a single model.

    An example of video-to-audio, generating audio from a video example without any text input.
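
    A minimal sketch of that two-stage flow, assuming hypothetical generation functions (both are stubbed with random data here so the snippet runs; the shapes, frame rate, and sample rate are assumptions, not the model’s actual settings):

```python
import numpy as np

def generate_video_clip(prompt: str, seconds: int = 2, fps: int = 8) -> np.ndarray:
    """Stage 1 (stub): text-to-video, returning a (T, H, W, C) clip."""
    return np.random.randint(0, 256, size=(seconds * fps, 64, 64, 3), dtype=np.uint8)

def predict_audio(video: np.ndarray, fps: int = 8, sample_rate: int = 16000) -> np.ndarray:
    """Stage 2 (stub): audio predicted from the generated video alone, with no text."""
    seconds = video.shape[0] // fps
    return np.random.uniform(-1.0, 1.0, size=seconds * sample_rate).astype(np.float32)

clip = generate_video_clip("a short text prompt")  # placeholder prompt
waveform = predict_audio(clip)                     # audio conditioned only on the video
```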

    By default, the VideoPoet model generates videos in portrait orientation to tailor its output towards short-form content. To showcase its capabilities, we have produced a short movie composed of many short clips generated by VideoPoet. For the script, we asked Bard to write a short story about a traveling raccoon with a scene-by-scene breakdown and a list of accompanying prompts. We then generated video clips for each prompt and stitched together all resulting clips to produce the final video below.

    When we developed VideoPoet, we noticed some nice properties of the model’s capabilities, which we highlight below.

    Long video

    We are able to generate longer videos simply by conditioning on the last 1 second of video and predicting the next 1 second. By chaining this repeatedly, we show that the model can not only extend the video well but also faithfully preserve the appearance of all objects even over several iterations.
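
    A minimal sketch of this chaining loop, with the model call replaced by a stub so the snippet runs; the frame rate and one-second conditioning window are assumptions for illustration:

```python
import numpy as np

FPS = 8  # assumed frame rate; one "chunk" below is one second of video

def extend_video(generate_next, seed_clip: np.ndarray, extra_seconds: int) -> np.ndarray:
    """Condition on the last second, generate the next second, append, repeat."""
    video = seed_clip
    for _ in range(extra_seconds):
        last_second = video[-FPS:]                # conditioning window
        next_second = generate_next(last_second)  # model call (stubbed below)
        video = np.concatenate([video, next_second], axis=0)
    return video

def fake_generate_next(context: np.ndarray) -> np.ndarray:
    """Stub standing in for VideoPoet: repeats the last frame with slight noise."""
    jitter = np.random.randint(-2, 3, size=(FPS,) + context.shape[1:])
    return np.clip(context[-1:] + jitter, 0, 255).astype(context.dtype)

seed = np.random.randint(0, 256, size=(FPS, 64, 64, 3), dtype=np.uint8)
long_clip = extend_video(fake_generate_next, seed, extra_seconds=5)
print(long_clip.shape)  # (48, 64, 64, 3): one seed second plus five generated seconds
```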

    Here are two examples of VideoPoet generating long videos from text input:

    Text prompts (video outputs are shown on the website): “An astronaut starts dancing on Mars. Colorful fireworks then explode in the background.”; “FPV footage of a very sharp elven city of stone in the jungle with a brilliant blue river, waterfall, and large steep vertical cliff faces.”

    It is also possible to interactively edit existing video clips generated by VideoPoet. If we supply an input video, we can change the motion of objects to perform different actions. The object manipulation can be centered on the first frame or the middle frames, which allows for a high degree of editing control.

    For example, we can randomly generate some clips from the input video and select the desired next clip.

    An input video on the left is used as conditioning to generate four choices given the initial prompt: “Closeup of an adorable rusty broken-down steampunk robot covered in moss moist and budding vegetation, surrounded by tall grass”. For the first three outputs we show what would happen for unprompted motions. For the last video in the list below, we add “powering up with smoke in the background” to the prompt to guide the motion.
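
    A minimal sketch of that select-the-next-clip loop, with the model call stubbed by random frames so it runs; the function names and clip shapes are assumptions:

```python
import numpy as np

def generate_continuation(input_clip: np.ndarray, prompt: str, seed: int) -> np.ndarray:
    """Stub for a model call that continues input_clip under the given prompt."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=input_clip.shape, dtype=np.uint8)

def propose_candidates(input_clip: np.ndarray, prompt: str, n: int = 4) -> list[np.ndarray]:
    """Sample n differently-seeded continuations for a person to choose from."""
    return [generate_continuation(input_clip, prompt, seed=i) for i in range(n)]

clip = np.zeros((8, 64, 64, 3), dtype=np.uint8)
candidates = propose_candidates(
    clip, "Closeup of an adorable rusty broken-down steampunk robot covered in moss", n=4)
chosen = candidates[2]  # in practice, the user picks the continuation they prefer
```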

    Image-to-video control

    Similarly, we can apply motion to an input image to edit its contents towards the desired state, conditioned on a text prompt.

    Animating a painting with different prompts. Left: “A woman turning to look at the camera.” Right: “A woman yawning.” **

    Camera motion

    We can also accurately control camera movements by appending the type of desired camera motion to the text prompt. As an example, we generated an image with our model from the prompt, “Adventure game concept art of a sunrise over a snowy mountain by a crystal clear river”. The examples below append the given text suffix to apply the desired motion.

    Prompts from left to right: “Zoom out”, “Dolly zoom”, “Pan left”, “Arc shot”, “Crane shot”, “FPV drone shot”.
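
    Assembling such prompts is just string concatenation; the exact separator the model expects is not specified here, so the formatting below is only an assumption:

```python
# Append a camera-motion suffix to a base prompt; the suffixes mirror the examples above.
base_prompt = ("Adventure game concept art of a sunrise over a snowy mountain "
               "by a crystal clear river")
camera_moves = ["Zoom out", "Dolly zoom", "Pan left", "Arc shot", "Crane shot", "FPV drone shot"]
prompts = [f"{base_prompt}. {move}" for move in camera_moves]
for p in prompts:
    print(p)
```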

    Evaluation results

    We evaluate VideoPoet on text-to-video generation with a variety of benchmarks to compare the results to other approaches. To ensure a neutral evaluation, we ran all models on a wide variation of prompts without cherry-picking examples and asked people to rate their preferences. The figure below highlights the percentage of the time VideoPoet was chosen as the preferred option, in green, for the following questions.

    Text fidelity

    User preference ratings for text fidelity, i.e., what percentage of videos are preferred in terms of accurately following a prompt.

    Motion interestingness

    User preference ratings for motion interestingness, i.e., what percentage of videos are preferred in terms of producing interesting motion.

    Based on the above, on average people selected 24–35% of examples from VideoPoet as following prompts better than a competing model, vs. 8–11% for competing models. Raters also preferred 41–54% of examples from VideoPoet for more interesting motion, compared to 11–21% for other models.

    Conclusion

    Through VideoPoet, we have demonstrated LLMs’ highly competitive video generation quality across a wide variety of tasks, especially in producing interesting and high-quality motions within videos. Our results suggest the promising potential of LLMs in the field of video generation. For future directions, our framework should be able to support “any-to-any” generation, e.g., extending to text-to-audio, audio-to-video, and video captioning should be possible, among many others.

    To view more examples in original quality, see the website demo.

    Acknowledgements

    This research has been made possible by a large body of contributors, including Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, and Lu Jiang.

    We give special thanks to Alex Siegman and Victor Gomes for managing computing resources. We also give thanks to Aren Jansen, Marco Tagliasacchi, Neil Zeghidour, and John Hershey for audio tokenization and processing, Angad Singh for storyboarding in “Rookie the Raccoon”, Cordelia Schmid for research discussions, Alonso Martinez for graphic design, David Salesin, Tomas Izo, and Rahul Sukthankar for their support, and Jay Yagnik as architect of the initial concept.

    **

    (a) The Storm on the Sea of Galilee, by Rembrandt, 1633, public domain.

    (b) Pillars of Creation, by NASA, 2014, public domain.

    (c) Wanderer above the Sea of Fog, by Caspar David Friedrich, 1818, public domain.

    (d) Mona Lisa, by Leonardo da Vinci, 1503, public domain.
