    Modular visual question answering via code generation – Google Research Blog


    Posted by Sanjay Subramanian, PhD student, UC Berkeley, and Arsha Nagrani, Research Scientist, Google Research, Perception Team

    Visual question answering (VQA) is a machine learning task that requires a model to answer a question about an image or a set of images. Conventional VQA approaches need a large amount of labeled training data consisting of thousands of human-annotated question-answer pairs associated with images. In recent years, advances in large-scale pre-training have led to the development of VQA methods that perform well with fewer than fifty training examples (few-shot) and without any human-annotated VQA training data (zero-shot). However, there is still a significant performance gap between these methods and state-of-the-art fully supervised VQA methods, such as MaMMUT and VinVL. In particular, few-shot methods struggle with spatial reasoning, counting, and multi-hop reasoning. Furthermore, few-shot methods have generally been limited to answering questions about single images.

    To improve accuracy on VQA examples that involve complex reasoning, in “Modular Visual Question Answering via Code Generation,” to appear at ACL 2023, we introduce CodeVQA, a framework that answers visual questions using program synthesis. Specifically, when given a question about an image or set of images, CodeVQA generates a Python program (code) with simple visual functions that allow it to process images, and executes this program to determine the answer. We demonstrate that in the few-shot setting, CodeVQA outperforms prior work by roughly 3% on the COVR dataset and 2% on the GQA dataset.
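
    As a rough illustration of this generate-then-execute flow, the sketch below shows one way the outer loop could be wired up. The generate_program callable, the visual_functions dictionary, and the convention that the generated code assigns a variable named answer are assumptions made for illustration, not the paper’s released implementation.

    # Minimal sketch of a generate-then-execute loop (assumed interfaces; see note above).
    def answer_with_codevqa(generate_program, visual_functions, question, images):
        # generate_program: prompts a code-writing LLM and returns Python source
        # that is assumed to assign a variable named `answer`.
        program = generate_program(question)
        # Expose the visual functions (query, get_pos, find_matching_image) and
        # the input images to the generated program, then run it.
        scope = dict(visual_functions, images=images)
        exec(program, scope)
        return scope["answer"]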

    CodeVQA

    The CodeVQA approach uses a code-writing large language model (LLM), such as PaLM, to generate Python programs (code). We guide the LLM to correctly use visual functions by crafting a prompt consisting of a description of these functions and fewer than fifteen “in-context” examples of visual questions paired with the Python code associated with them. To select these examples, we compute embeddings for the input question and for all of the questions for which we have annotated programs (a randomly chosen set of fifty). Then, we select the questions that have the highest similarity to the input and use them as in-context examples. Given the prompt and the question that we want to answer, the LLM generates a Python program representing that question.
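
    The post does not spell out this selection step in code; the sketch below shows a nearest-neighbor lookup over question embeddings, under the assumptions that the embeddings are precomputed and that cosine similarity is the similarity measure.

    import numpy as np

    def select_in_context_examples(question_emb, annotated_embs, k):
        # question_emb: (d,) embedding of the input question.
        # annotated_embs: (n, d) embeddings of the questions with annotated programs.
        # Returns the indices of the k annotated questions most similar to the input.
        q = question_emb / np.linalg.norm(question_emb)
        a = annotated_embs / np.linalg.norm(annotated_embs, axis=1, keepdims=True)
        sims = a @ q                      # cosine similarity to each candidate
        return np.argsort(-sims)[:k]      # top-k candidates, most similar first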

    We instantiate the CodeVQA framework using three visual functions: (1) query, (2) get_pos, and (3) find_matching_image.

    • Query, which answers a question about a single image, is implemented using the few-shot Plug-and-Play VQA (PnP-VQA) method. PnP-VQA generates captions using BLIP — an image-captioning transformer pre-trained on millions of image-caption pairs — and feeds these into an LLM that outputs the answers to the question.
    • Get_pos, which is an object localizer that takes a description of an object as input and returns its position in the image, is implemented using GradCAM. Specifically, the description and the image are passed through the BLIP joint text-image encoder, which predicts an image-text matching score. GradCAM takes the gradient of this score with respect to the image features to find the region most relevant to the text.
    • Find_matching_image, which is used in multi-image questions to find the image that best matches a given input phrase, is implemented by using BLIP text and image encoders to compute a text embedding for the phrase and an image embedding for each image. The dot products of the text embedding with each image embedding then represent the relevance of each image to the phrase, and we pick the image that maximizes this relevance (a sketch of this step follows the list).
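
    A minimal sketch of the dot-product selection inside find_matching_image, assuming the BLIP text and image embeddings have already been computed (the encoder calls themselves are omitted):

    import numpy as np

    def find_matching_image(image_embs, phrase_emb):
        # image_embs: (n, d) array with one embedding per candidate image.
        # phrase_emb: (d,) embedding of the input phrase.
        scores = image_embs @ phrase_emb   # dot product = relevance of each image
        return int(np.argmax(scores))      # index of the most relevant image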

    These three functions can be implemented using models that require very little annotation (e.g., text and image-text pairs collected from the web and a small number of VQA examples). Furthermore, the CodeVQA framework can easily be generalized beyond these functions to others that a user might implement (e.g., object detection, image segmentation, or knowledge base retrieval).

    Illustration of the CodeVQA method. First, a large language model generates a Python program (code), which invokes visual functions that represent the question. In this example, a simple VQA method (query) is used to answer one part of the question, and an object localizer (get_pos) is used to find the positions of the objects mentioned. Then the program produces an answer to the original question by combining the outputs of these functions.

    Results

    The CodeVQA framework correctly generates and executes Python programs not only for single-image questions, but also for multi-image questions. For example, if given two images, each showing two pandas, one might ask, “Is it true that there are four pandas?” In this case, the LLM converts the counting question about the pair of images into a program in which an object count is obtained for each image (using the query function). The counts for both images are then added to compute a total count, which is compared to the number in the original question to yield a yes or no answer.
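
    The post shows a generated program only for the salt-shaker example later in this section; for the panda question, a generated program of the kind described above might look like the following (the filenames and the exact query phrasing are hypothetical):

    img1 = open_image("Image1.jpg")
    img2 = open_image("Image2.jpg")
    count1 = int(query(img1, "How many pandas are there?"))
    count2 = int(query(img2, "How many pandas are there?"))
    if count1 + count2 == 4:
        answer = "yes"
    else:
        answer = "no"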

    We evaluate CodeVQA on three visual reasoning datasets: GQA (single-image), COVR (multi-image), and NLVR2 (multi-image). For GQA, we provide 12 in-context examples to each method, and for COVR and NLVR2, we provide six in-context examples to each method. The table below shows that CodeVQA improves consistently over the baseline few-shot VQA method on all three datasets.

    Method              GQA      COVR     NLVR2
    Few-shot PnP-VQA    46.56    49.06    63.37
    CodeVQA             49.03    54.11    64.04

    Results on the GQA, COVR, and NLVR2 datasets, showing that CodeVQA consistently improves over few-shot PnP-VQA. The metric is exact-match accuracy, i.e., the percentage of examples in which the predicted answer exactly matches the ground-truth answer.

    We find that on GQA, CodeVQA’s accuracy is roughly 30% higher than the baseline on spatial reasoning questions, 4% higher on “and” questions, and 3% higher on “or” questions. The third category includes multi-hop questions such as “Are there salt shakers or skateboards in the picture?”, for which the generated program is shown below.

    img = open_image("Image13.jpg")
    salt_shakers_exist = query(img, "Are there any salt shakers?")
    skateboards_exist = query(img, "Are there any skateboards?")
    if salt_shakers_exist == "yes" or skateboards_exist == "yes":
        answer = "yes"
    else:
        answer = "no"
    

    On COVR, we find that CodeVQA’s gain over the baseline is larger when the number of input images is greater, as shown in the table below. This trend indicates that breaking the problem down into single-image questions is beneficial.

                             Number of input images
    Method                1       2       3       4       5
    Few-shot PnP-VQA      91.7    51.5    48.3    47.0    46.9
    CodeVQA               75.0    53.3    48.7    53.2    53.4

    Conclusion

    We present CodeVQA, a framework for few-shot visual question answering that relies on code generation to perform multi-step visual reasoning. Exciting directions for future work include expanding the set of modules used and creating a similar framework for visual tasks beyond VQA. We note that care should be taken when considering whether to deploy a system such as CodeVQA, since vision-language models like the ones used in our visual functions have been shown to exhibit social biases. At the same time, compared to monolithic models, CodeVQA offers additional interpretability (through the Python program) and controllability (by modifying the prompts or visual functions), which are useful in production systems.

    Acknowledgements

    This research was a collaboration between UC Berkeley’s Artificial Intelligence Research lab (BAIR) and Google Research, and was conducted by Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, and Dan Klein.
