    A foundational visual encoder for video understanding – Google Research Blog


    Posted by Long Zhao, Senior Research Scientist, and Ting Liu, Senior Staff Software Engineer, Google Research

An astounding variety of videos can be found on the Web, covering a wide range of content from everyday moments people share to historical events to scientific observations, each of which contains a unique record of the world. The right tools could help researchers analyze these videos, transforming how we understand the world around us.

Videos offer dynamic visual content far richer than static images, capturing movement, changes, and dynamic relationships between entities. Analyzing this complexity, along with the immense diversity of publicly available video data, demands models that go beyond traditional image understanding. Consequently, many of the approaches that perform best on video understanding still rely on specialized models tailored for particular tasks. Recently, there has been exciting progress in this area using video foundation models (ViFMs), such as VideoCLIP, InternVideo, VideoCoCa, and UMT. However, building a ViFM that handles the sheer diversity of video data remains a challenge.

With the goal of building a single model for general-purpose video understanding, we introduced “VideoPrism: A Foundational Visual Encoder for Video Understanding”. VideoPrism is a ViFM designed to handle a wide spectrum of video understanding tasks, including classification, localization, retrieval, captioning, and question answering (QA). We propose innovations in both the pre-training data and the modeling strategy. We pre-train VideoPrism on a massive and diverse dataset: 36 million high-quality video-text pairs and 582 million video clips with noisy or machine-generated parallel text. Our pre-training approach is designed for this hybrid data, to learn both from the video-text pairs and from the videos themselves. VideoPrism is remarkably easy to adapt to new video understanding challenges, and achieves state-of-the-art performance using a single frozen model.
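To illustrate what adapting “a single frozen model” can look like in practice, here is a minimal sketch in which a lightweight, trainable head is attached to frozen clip embeddings. The `FrozenVideoEncoder` class, the embedding size, and the classification head are illustrative placeholders, not the released VideoPrism model or its API.

```python
# Minimal sketch of adapting a frozen video encoder to a new task.
# `FrozenVideoEncoder` is a stand-in for VideoPrism; its structure,
# input format, and embedding size are assumptions for illustration.
import torch
import torch.nn as nn

class FrozenVideoEncoder(nn.Module):
    """Placeholder encoder: maps a clip (B, T, C, H, W) to one embedding per clip."""
    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(3, embed_dim)  # stand-in for the real backbone

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # Average over time and space, then project: (B, T, C, H, W) -> (B, D).
        pooled = video.mean(dim=(1, 3, 4))
        return self.proj(pooled)

encoder = FrozenVideoEncoder()
encoder.requires_grad_(False)             # the backbone stays frozen
head = nn.Linear(1024, 400)               # e.g. a 400-way classification head

clips = torch.randn(8, 16, 3, 224, 224)   # batch of 8 clips, 16 frames each
with torch.no_grad():
    features = encoder(clips)             # frozen clip-level features
logits = head(features)                   # only the task head is trained
```

Only the task head carries trainable parameters per downstream task; the same frozen features can be reused across classification, retrieval, and QA heads.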

VideoPrism is a general-purpose video encoder that enables state-of-the-art results over a wide spectrum of video understanding tasks, including classification, localization, retrieval, captioning, and question answering, by producing video representations from a single frozen model.

Pre-training data

A powerful ViFM needs a very large collection of videos on which to train, much like other foundation models (FMs), such as those behind large language models (LLMs). Ideally, we would want the pre-training data to be a representative sample of all the videos in the world. While naturally most of these videos do not have perfect captions or descriptions, even imperfect text can provide useful information about the semantic content of the video.

To give our model the best possible starting point, we put together a massive pre-training corpus consisting of several public and private datasets, including YT-Temporal-180M, InternVid, VideoCC, WTS-70M, and others. This includes 36 million carefully selected videos with high-quality captions, along with an additional 582 million clips with varying levels of noisy text (like auto-generated transcripts). To our knowledge, this is the largest and most diverse video training corpus of its kind.

Statistics on the video-text pre-training data. The large variations in CLIP similarity scores (the higher, the better) demonstrate the diverse caption quality of our pre-training data, which is a byproduct of the various methods used to harvest the text.

Two-stage training

The VideoPrism model architecture stems from the standard vision transformer (ViT), with a factorized design that sequentially encodes spatial and temporal information following ViViT. Our training approach leverages both the high-quality video-text data and the video data with noisy text mentioned above. To start, we use contrastive learning (an approach that minimizes the distance between positive video-text pairs while maximizing the distance between negative video-text pairs) to teach our model to match videos with their own text descriptions, including imperfect ones. This builds a foundation for matching semantic language content to visual content.
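The contrastive objective described above can be written as a standard symmetric InfoNCE loss over a batch of video-text pairs. The sketch below is a generic CLIP-style formulation under an assumed batch size, embedding dimension, and temperature; it is not the authors' exact implementation.

```python
# Sketch of a symmetric video-text contrastive (InfoNCE) loss.
# Batch construction, temperature, and embedding sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # Normalize so that dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature           # (B, B) similarity matrix
    targets = torch.arange(v.size(0))        # the i-th video matches the i-th text
    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2

loss = video_text_contrastive_loss(torch.randn(32, 1024), torch.randn(32, 1024))
```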

After video-text contrastive training, we leverage the collection of videos without text descriptions. Here, we build on the masked video modeling framework to predict masked patches in a video, with a couple of improvements. We train the model to predict both the video-level global embedding and the token-wise embeddings from the first-stage model, to effectively leverage the knowledge acquired in that stage. We then randomly shuffle the predicted tokens to prevent the model from learning shortcuts.
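To make the second-stage objective concrete, the sketch below mirrors its structure: a student encoder sees a masked video, is trained to predict the frozen first-stage model's token-wise and global embeddings, and the token order seen by the prediction head is randomly shuffled to discourage positional shortcuts. The module names, shapes, masking ratio, and use of an MSE loss are illustrative assumptions, not the paper's code.

```python
# Sketch of the second-stage objective: predict the frozen stage-1 model's
# global and token-wise embeddings from a masked video, with the token order
# shuffled before prediction. All shapes and modules are assumptions.
import torch
import torch.nn.functional as F

def stage2_loss(student_encoder, prediction_head, teacher, video_tokens, mask_ratio=0.8):
    B, N, D = video_tokens.shape
    with torch.no_grad():                          # the stage-1 model provides targets
        target_tokens = teacher(video_tokens)      # (B, N, D) token-wise targets
        target_global = target_tokens.mean(dim=1)  # (B, D) video-level target

    # Mask a random subset of patch tokens in the student's input.
    mask = torch.rand(B, N) < mask_ratio           # True = masked
    masked_input = video_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    encoded = student_encoder(masked_input)        # (B, N, D)

    # Shuffle the token order seen by the prediction head so it cannot rely on
    # positional shortcuts (a simplification of the paper's token shuffling).
    perm = torch.randperm(N)
    pred_tokens = prediction_head(encoded[:, perm])

    token_loss = F.mse_loss(pred_tokens, target_tokens[:, perm])
    global_loss = F.mse_loss(pred_tokens.mean(dim=1), target_global)
    return token_loss + global_loss

# Illustrative stand-ins for the modules (the real models are transformers).
teacher = torch.nn.Linear(768, 768)
student_encoder = torch.nn.Linear(768, 768)
prediction_head = torch.nn.Linear(768, 768)
loss = stage2_loss(student_encoder, prediction_head, teacher, torch.randn(4, 196, 768))
```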

What is unique about VideoPrism's setup is that we use two complementary pre-training signals: text descriptions and the visual content within a video. Text descriptions often focus on what things look like, while the video content provides information about movement and visual dynamics. This enables VideoPrism to excel in tasks that demand an understanding of both appearance and motion.

    Results

We conducted extensive evaluation of VideoPrism across four broad categories of video understanding tasks: video classification and localization, video-text retrieval, video captioning and question answering, and scientific video understanding. VideoPrism achieves state-of-the-art performance on 30 out of 33 video understanding benchmarks, all with minimal adaptation of a single, frozen model.

VideoPrism compared to the previous best-performing FMs.

    Classification and localization

We evaluate VideoPrism on an existing large-scale video understanding benchmark (VideoGLUE) covering classification and localization tasks. We found that (1) VideoPrism outperforms all of the other state-of-the-art FMs, and (2) no other single model consistently came in second place. This tells us that VideoPrism has learned to effectively pack a variety of video signals into one encoder, from semantics at different granularities to appearance and motion cues, and that it works well across a variety of video sources.

    Combining with LLMs

We further explore combining VideoPrism with LLMs to unlock its ability to handle various video-language tasks. In particular, when paired with a text encoder (following LiT) or a language decoder (such as PaLM-2), VideoPrism can be utilized for video-text retrieval, video captioning, and video QA tasks. We compare the combined models on a broad and challenging set of vision-language benchmarks. VideoPrism sets the new state of the art on most benchmarks. From the visual results, we observe that VideoPrism is capable of understanding complex motions and appearances in videos (e.g., the model can recognize the different colors of spinning objects in the window in the visual examples below). These results demonstrate that VideoPrism is strongly compatible with language models.
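As a rough sketch of the LiT-style retrieval setup, text queries can be scored against candidate videos by cosine similarity between their embeddings and then ranked; these similarities are what the blue bars in the figure below indicate. The random tensors here are placeholders for the frozen VideoPrism and text-encoder outputs, not the released models.

```python
# Sketch of video-text retrieval with a frozen video encoder:
# score each text query against each candidate video by cosine similarity and rank.
# Random embeddings stand in for VideoPrism / text-encoder outputs.
import torch
import torch.nn.functional as F

video_emb = F.normalize(torch.randn(100, 1024), dim=-1)  # 100 candidate videos
query_emb = F.normalize(torch.randn(5, 1024), dim=-1)    # 5 text queries

similarity = query_emb @ video_emb.T          # (5, 100) cosine similarities
top_videos = similarity.topk(k=3, dim=-1)     # best-matching videos per query
print(top_videos.indices)                     # indices of the retrieved clips
```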



We show qualitative results using VideoPrism with a text encoder for video-text retrieval (first row) and adapted to a language decoder for video QA (second and third rows). For the video-text retrieval examples, the blue bars indicate the embedding similarities between the videos and the text queries.

Scientific applications

Finally, we tested VideoPrism on datasets used by scientists across domains, including fields such as ethology, behavioral neuroscience, and ecology. These datasets typically require domain expertise to annotate, for which we leverage existing scientific datasets open-sourced by the community, including Fly vs. Fly, CalMS21, ChimpACT, and KABR. VideoPrism not only performs exceptionally well, but actually surpasses models designed specifically for those tasks. This suggests tools like VideoPrism have the potential to transform how scientists analyze video data across different fields.

VideoPrism outperforms the domain experts on various scientific benchmarks. We show the absolute score differences to highlight the relative improvements of VideoPrism. We report mean average precision (mAP) for all datasets, except for KABR, which uses class-averaged top-1 accuracy.

    Conclusion

With VideoPrism, we introduce a powerful and versatile video encoder that sets a new standard for general-purpose video understanding. Our emphasis on both building a massive and varied pre-training dataset and on innovative modeling techniques has been validated through our extensive evaluations. Not only does VideoPrism consistently outperform strong baselines, but its unique ability to generalize positions it well for tackling an array of real-world applications. Because of its potential broad use, we are committed to continuing further responsible research in this space, guided by our AI Principles. We hope VideoPrism paves the way for future breakthroughs at the intersection of AI and video analysis, helping to realize the potential of ViFMs across domains such as scientific discovery, education, and healthcare.

    Acknowledgements

This blog post is made on behalf of all the VideoPrism authors: Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, and Boqing Gong. We sincerely thank David Hendon for their product management efforts, and Alex Siegman, Ramya Ganeshan, and Victor Gomes for their program and resource management efforts. We also thank Hassan Akbari, Sherry Ben, Yoni Ben-Meshulam, Chun-Te Chu, Sam Clearwater, Yin Cui, Ilya Figotin, Anja Hauth, Sergey Ioffe, Xuhui Jia, Yeqing Li, Lu Jiang, Zu Kim, Dan Kondratyuk, Bill Mark, Arsha Nagrani, Caroline Pantofaru, Sushant Prakash, Cordelia Schmid, Bryan Seybold, Mojtaba Seyedhosseini, Amanda Sadler, Rif A. Saurous, Rachel Stigler, Paul Voigtlaender, Pingmei Xu, Chaochao Yan, Xuan Yang, and Yukun Zhu for the discussions, support, and feedback that greatly contributed to this work. We are grateful to Jay Yagnik, Rahul Sukthankar, and Tomas Izo for their enthusiastic support of this project. Lastly, we thank Tom Small, Jennifer J. Sun, Hao Zhou, Nitesh B. Gundavarapu, Luke Friedman, and Mikhail Sirotenko for their tremendous help with making this blog post.
