    Researchers from Stanford and AWS AI Labs Unveil S4: A Groundbreaking Approach to Pre-Training Vision-Language Models Using Web Screenshots

    In artificial intelligence, bridging the gap between vision and language has been a formidable challenge, yet it holds immense potential to change how machines perceive and interact with the world. This article looks at the research paper that introduces Strongly Supervised pre-training with ScreenShots (S4), a method for improving Vision-Language Models (VLMs) by exploiting the large and complex data available through web screenshots. S4 offers a fresh perspective on pre-training paradigms and significantly boosts model performance across a range of downstream tasks, marking a substantial step forward in the field.

    Traditionally, foundation models for language and vision tasks have relied heavily on extensive pre-training on large datasets to achieve generalization. For VLMs, this involves training on image-text pairs to learn representations that can be fine-tuned for specific tasks. However, the heterogeneity of vision tasks and the scarcity of fine-grained, supervised datasets pose limitations. S4 addresses these challenges by leveraging the rich semantic and structural information in web screenshots. The method uses an array of pre-training tasks designed to closely mimic downstream applications, giving models a deeper understanding of visual elements and their textual descriptions.

    The essence of S4's approach lies in its novel pre-training framework, which systematically captures and uses the varied supervision signals embedded in web pages. By rendering web pages into screenshots, the method gains access not only to the visual representation but also to the textual content, layout, and hierarchical structure of the HTML elements. This comprehensive capture of web data enables the construction of ten specific pre-training tasks, illustrated in Figure 2, ranging from Optical Character Recognition (OCR) and Image Grounding to more sophisticated Node Relation Prediction and Layout Analysis. Each task is designed to strengthen the model's ability to discern and interpret the intricate relationships between visual and textual cues, improving its performance on a variety of VLM applications.
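
    To make the idea of deriving supervision from a rendered page concrete, here is a minimal sketch. It is not the S4 implementation: the `Node` class, the `walk`, `ocr_targets`, and `relation_targets` helpers, and the toy page are all hypothetical, and the bounding boxes are hard-coded where a real pipeline would read them from a browser. It shows only how an element tree annotated with screenshot coordinates could yield OCR-style (text, box) targets and parent-child relation targets.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in screenshot pixels; hypothetical convention.
BBox = Tuple[int, int, int, int]


@dataclass
class Node:
    """Hypothetical stand-in for one rendered HTML element."""
    tag: str
    bbox: BBox
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, parent: Optional[Node] = None) -> Iterator[Tuple[Node, Optional[Node]]]:
    """Yield (node, parent) pairs in depth-first, document order."""
    yield node, parent
    for child in node.children:
        yield from walk(child, node)


def ocr_targets(root: Node) -> List[Tuple[str, BBox]]:
    """OCR-style supervision: each text-bearing element with its box."""
    return [(n.text, n.bbox) for n, _ in walk(root) if n.text]


def relation_targets(root: Node) -> List[Tuple[BBox, BBox, str]]:
    """Relation-style supervision: (parent box, child box, label) triples."""
    return [(p.bbox, n.bbox, "parent") for n, p in walk(root) if p is not None]


# Toy page standing in for a rendered screenshot plus its element tree.
page = Node("body", (0, 0, 800, 600), children=[
    Node("h1", (20, 10, 780, 60), text="Hello"),
    Node("div", (20, 80, 780, 580), children=[
        Node("p", (30, 90, 770, 140), text="First paragraph"),
    ]),
])

if __name__ == "__main__":
    for text, box in ocr_targets(page):
        print("OCR target:", repr(text), box)
    for parent_box, child_box, label in relation_targets(page):
        print("Relation target:", parent_box, child_box, label)
```

    The point of the sketch is that both kinds of targets come from the same annotated tree at no extra labeling cost, which is the property the paragraph above attributes to web screenshots.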

    Empirical results (shown in Table 1) underscore the effectiveness of S4, with notable improvements in model performance across nine varied and popular downstream tasks. In particular, the method achieved up to a 76.1% improvement on Table Detection, along with consistent gains on Widget Captioning, Screen Summarization, and other tasks. This performance gain is attributed to the method's strategic use of screenshot data, which enriches the model's training regimen with diverse and relevant visual-textual interactions. The research also presents an in-depth analysis of the impact of each pre-training task, revealing how specific tasks contribute to the model's overall ability to understand and generate language in the context of visual information.

    In conclusion, S4 advances vision-language pre-training by methodically harnessing the wealth of visual and textual data available through web screenshots. Its approach pushes forward the state of the art in VLMs and opens up new avenues for research and application in multimodal AI. By closely aligning pre-training tasks with real-world scenarios, S4 aims to produce models that are not just trained but actually grasp the nuanced interplay between vision and language, paving the way for more intelligent, versatile, and effective AI systems in the future.


    Check out the Paper. All credit for this research goes to the researchers of this project.


    Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS at the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast, passionate about research and the latest developments in deep learning, computer vision, and related fields.

