    Modular visual question answering via code generation – Google Research Blog


    Posted by Sanjay Subramanian, PhD student, UC Berkeley, and Arsha Nagrani, Research Scientist, Google Research, Perception Team

    Visual question answering (VQA) is a machine learning task that requires a model to answer a question about an image or a set of images. Conventional VQA approaches need a large amount of labeled training data consisting of thousands of human-annotated question-answer pairs associated with images. In recent years, advances in large-scale pre-training have led to the development of VQA methods that perform well with fewer than fifty training examples (few-shot) and without any human-annotated VQA training data (zero-shot). However, there is still a significant performance gap between these methods and state-of-the-art fully supervised VQA methods, such as MaMMUT and VinVL. In particular, few-shot methods struggle with spatial reasoning, counting, and multi-hop reasoning. Furthermore, few-shot methods have generally been limited to answering questions about single images.

    To improve accuracy on VQA examples that involve complex reasoning, in "Modular Visual Question Answering via Code Generation," to appear at ACL 2023, we introduce CodeVQA, a framework that answers visual questions using program synthesis. Specifically, when given a question about an image or set of images, CodeVQA generates a Python program (code) with simple visual functions that allow it to process images, and executes this program to determine the answer. We demonstrate that in the few-shot setting, CodeVQA outperforms prior work by roughly 3% on the COVR dataset and 2% on the GQA dataset.

    CodeVQA

    The CodeVQA approach uses a code-writing large language model (LLM), such as PaLM, to generate Python programs (code). We guide the LLM to correctly use visual functions by crafting a prompt consisting of a description of these functions and fewer than fifteen "in-context" examples of visual questions paired with the associated Python code for them. To choose these examples, we compute embeddings for the input question and for all of the questions for which we have annotated programs (a randomly chosen set of fifty). Then, we select the questions with the highest similarity to the input and use them as in-context examples. Given the prompt and the question that we want to answer, the LLM generates a Python program representing that question.
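The similarity-based example selection described above can be sketched as follows. This is a minimal illustration, not the paper's code: `select_examples`, its arguments, and the use of cosine similarity over precomputed embeddings are assumptions for the sketch.

```python
# Sketch of in-context example selection: rank the ~50 annotated questions
# by cosine similarity to the input question's embedding and keep the top k.
import numpy as np

def select_examples(question_emb, example_embs, k=12):
    """question_emb: (d,) embedding of the input question.
    example_embs: (n, d) embeddings of the annotated questions.
    Returns the indices of the k most similar annotated questions."""
    q = question_emb / np.linalg.norm(question_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q                   # cosine similarity to each annotated question
    return np.argsort(-sims)[:k]   # indices of the top-k matches
```

The selected questions and their annotated programs are then concatenated, together with the function descriptions, into the prompt given to the LLM.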

    We instantiate the CodeVQA framework using three visual functions: (1) query, (2) get_pos, and (3) find_matching_image.

    • query, which answers a question about a single image, is implemented using the few-shot Plug-and-Play VQA (PnP-VQA) method. PnP-VQA generates captions using BLIP, an image-captioning transformer pre-trained on millions of image-caption pairs, and feeds these into an LLM that outputs the answers to the question.
    • get_pos, which is an object localizer that takes a description of an object as input and returns its position in the image, is implemented using GradCAM. Specifically, the description and the image are passed through the BLIP joint text-image encoder, which predicts an image-text matching score. GradCAM takes the gradient of this score with respect to the image features to find the region most relevant to the text.
    • find_matching_image, which is used in multi-image questions to find the image that best matches a given input phrase, is implemented by using the BLIP text and image encoders to compute a text embedding for the phrase and an image embedding for each image. Then the dot products of the text embedding with each image embedding represent the relevance of each image to the phrase, and we pick the image that maximizes this relevance.

    These three functions can be implemented using models that require very little annotation (e.g., text and image-text pairs collected from the web and a small number of VQA examples). Furthermore, the CodeVQA framework can easily be generalized beyond these functions to others that a user might implement (e.g., object detection, image segmentation, or knowledge base retrieval).

    Illustration of the CodeVQA technique. First, a large language model generates a Python program (code), which invokes visual functions that represent the question. In this example, a simple VQA method (query) is used to answer one part of the question, and an object localizer (get_pos) is used to find the positions of the objects mentioned. Then the program produces an answer to the original question by combining the outputs of these functions.

    Results

    The CodeVQA framework correctly generates and executes Python programs not only for single-image questions, but also for multi-image questions. For example, if given two images, each showing two pandas, one might ask, "Is it true that there are four pandas?" In this case, the LLM converts the counting question about the pair of images into a program in which an object count is obtained for each image (using the query function). Then the counts for both images are added to compute a total count, which is then compared to the number in the original question to yield a yes or no answer.
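A program of the kind described above might look like the following. This is a hypothetical reconstruction, not output taken from the paper: open_image and query are the framework's visual functions, stubbed out here so the sketch runs standalone, and the image filenames and count parsing are invented for illustration.

```python
# Hypothetical CodeVQA-style program for the two-image question
# "Is it true that there are four pandas?".

def open_image(path):
    """Stub: would load the image at the given path."""
    return path

def query(img, question):
    """Stub: would run few-shot PnP-VQA on the image.
    Here we pretend each image shows two pandas."""
    return "2"

img1 = open_image("ImageA.jpg")
img2 = open_image("ImageB.jpg")
count1 = int(query(img1, "How many pandas are there?"))
count2 = int(query(img2, "How many pandas are there?"))
total = count1 + count2                  # combine per-image counts
answer = "yes" if total == 4 else "no"   # compare to the number in the question
```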

    We evaluate CodeVQA on three visual reasoning datasets: GQA (single-image), COVR (multi-image), and NLVR2 (multi-image). For GQA, we provide 12 in-context examples to each method, and for COVR and NLVR2, we provide six in-context examples to each method. The table below shows that CodeVQA improves consistently over the baseline few-shot VQA method on all three datasets.

    Method              GQA      COVR     NLVR2
    Few-shot PnP-VQA    46.56    49.06    63.37
    CodeVQA             49.03    54.11    64.04

    Results on the GQA, COVR, and NLVR2 datasets, showing that CodeVQA consistently improves over few-shot PnP-VQA. The metric is exact-match accuracy, i.e., the percentage of examples in which the predicted answer exactly matches the ground-truth answer.
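Exact-match accuracy as used in the table is simple to state in code. A minimal sketch, assuming answers are compared as plain strings (the datasets' own normalization rules, if any, are not reproduced here):

```python
# Exact-match accuracy: the percentage of examples whose predicted answer
# string is identical to the ground-truth answer string.
def exact_match_accuracy(predictions, ground_truths):
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)
```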

    We find that in GQA, CodeVQA's accuracy is roughly 30% higher than the baseline on spatial reasoning questions, 4% higher on "and" questions, and 3% higher on "or" questions. The third category includes multi-hop questions such as "Are there salt shakers or skateboards in the picture?", for which the generated program is shown below.

    img = open_image("Image13.jpg")
    salt_shakers_exist = query(img, "Are there any salt shakers?")
    skateboards_exist = query(img, "Are there any skateboards?")
    if salt_shakers_exist == "yes" or skateboards_exist == "yes":
        answer = "yes"
    else:
        answer = "no"
    

    In COVR, we find that CodeVQA's gain over the baseline is higher when the number of input images is larger, as shown in the table below. This trend indicates that breaking the problem down into single-image questions is beneficial.

                         Number of images
    Method              1       2       3       4       5
    Few-shot PnP-VQA    91.7    51.5    48.3    47.0    46.9
    CodeVQA             75.0    53.3    48.7    53.2    53.4

    Conclusion

    We present CodeVQA, a framework for few-shot visual question answering that relies on code generation to perform multi-step visual reasoning. Exciting directions for future work include expanding the set of modules used and creating a similar framework for visual tasks beyond VQA. We note that care should be taken when considering whether to deploy a system such as CodeVQA, since vision-language models like the ones used in our visual functions have been shown to exhibit social biases. At the same time, compared to monolithic models, CodeVQA offers additional interpretability (via the Python program) and controllability (by modifying the prompts or visual functions), which are useful in production systems.

    Acknowledgements

    This research was a collaboration between UC Berkeley's Artificial Intelligence Research lab (BAIR) and Google Research, and was conducted by Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, and Dan Klein.
