    Autonomous visual information seeking with large language models – Google Research Blog


    Posted by Ziniu Hu, Student Researcher, and Alireza Fathi, Research Scientist, Google Research, Perception Team

    There has been great progress toward adapting large language models (LLMs) to accommodate multimodal inputs for tasks including image captioning, visual question answering (VQA), and open-vocabulary recognition. Despite such achievements, current state-of-the-art visual language models (VLMs) perform inadequately on visual information-seeking datasets, such as Infoseek and OK-VQA, where external knowledge is required to answer the questions.

    Examples of visual information-seeking queries where external knowledge is required to answer the question. Images are taken from the OK-VQA dataset.

    In “AVIS: Autonomous Visual Information Seeking with Large Language Models”, we introduce a novel method that achieves state-of-the-art results on visual information-seeking tasks. Our method integrates LLMs with three types of tools: (i) computer vision tools for extracting visual information from images, (ii) a web search tool for retrieving open-world knowledge and facts, and (iii) an image search tool to glean relevant information from metadata associated with visually similar images. AVIS employs an LLM-powered planner to choose tools and queries at each step. It also uses an LLM-powered reasoner to analyze tool outputs and extract key information. A working memory component retains information throughout the process.

    An example of AVIS’s generated workflow for answering a challenging visual information-seeking question. The input image is taken from the Infoseek dataset.
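
    The architecture just described, a set of tools coordinated by a planner, a reasoner, and a working memory, can be captured with a couple of small interfaces. The sketch below is a minimal illustration in Python; the class and field names (Tool, WorkingMemory, records) are ours, not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import Callable

# A tool is anything the planner may invoke: a vision model (e.g. captioning, VQA,
# object detection), web search, or image search over visually similar images.
@dataclass
class Tool:
    name: str                      # e.g. "pali_caption", "web_search", "image_search"
    run: Callable[[str], str]      # takes a query string, returns a textual result

# The working memory retains the question and everything gathered from tool calls.
@dataclass
class WorkingMemory:
    question: str
    records: list[tuple[str, str, str]] = field(default_factory=list)  # (tool, query, output)

    def add(self, tool: str, query: str, output: str) -> None:
        self.records.append((tool, query, output))

    def as_context(self) -> str:
        # Serialize gathered information for inclusion in the planner's prompt.
        return "\n".join(f"[{t}] {q} -> {o}" for t, q, o in self.records)
```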

    Comparison to earlier work

    Recent research (e.g., Chameleon, ViperGPT and MM-ReAct) explored adding tools to LLMs for multimodal inputs. These methods follow a two-stage process: planning (breaking down questions into structured programs or instructions) and execution (using tools to gather information). Despite success on basic tasks, this approach often falters in complex real-world scenarios.

    There has also been a surge of interest in applying LLMs as autonomous agents (e.g., WebGPT and ReAct). These agents interact with their environment, adapt based on real-time feedback, and achieve goals. However, these methods do not restrict the tools that can be invoked at each stage, leading to an immense search space. Consequently, even the most advanced LLMs today can fall into infinite loops or propagate errors. AVIS tackles this through guided LLM use, informed by human decisions collected in a user study.

    Informing LLM decision making with a user study

    Many of the visual questions in datasets such as Infoseek and OK-VQA pose a challenge even for humans, often requiring the assistance of various tools and APIs. An example question from the OK-VQA dataset is shown below. We conducted a user study to understand human decision-making when using external tools.

    We conducted a user study to understand human decision-making when using external tools. Image is taken from the OK-VQA dataset.

    The users were equipped with the same set of tools as our method, including PALI, PaLM, and web search. They received input images, questions, detected object crops, and buttons linked to image search results. These buttons offered diverse information about the detected object crops, such as knowledge graph entities, similar image captions, related product titles, and identical image captions.

    We record user actions and outputs and use them as a guide for our system in two key ways. First, we construct a transition graph (shown below) by analyzing the sequence of decisions made by users. This graph defines distinct states and restricts the available set of actions at each state. For example, at the starting state, the system can take only one of these three actions: PALI caption, PALI VQA, or object detection. Second, we use the examples of human decision-making to guide our planner and reasoner with relevant contextual instances that enhance the performance and effectiveness of our system.

    AVIS transition graph.
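
    To make the role of the transition graph concrete, the snippet below encodes one as a plain mapping from state to permitted actions and filters out actions the planner has already tried. Only the start-state restriction (PALI caption, PALI VQA, or object detection) comes from the description above; the remaining states and edges are hypothetical placeholders.

```python
# State -> actions the planner may choose next. Only the START row reflects the post;
# the other states and edges are illustrative placeholders.
TRANSITION_GRAPH: dict[str, set[str]] = {
    "START":            {"pali_caption", "pali_vqa", "object_detection"},
    "object_detection": {"image_search", "pali_vqa", "web_search"},
    "image_search":     {"web_search", "pali_vqa"},
    "pali_caption":     {"object_detection", "web_search"},
    "pali_vqa":         {"web_search"},
    "web_search":       {"web_search"},  # follow-up searches are allowed
}

def allowed_actions(state: str, already_taken: set[str]) -> set[str]:
    """Actions permitted at `state`, minus those already taken (kept in working memory)."""
    return TRANSITION_GRAPH.get(state, set()) - already_taken
```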

    General framework

    Our approach employs a dynamic decision-making strategy designed to respond to visual information-seeking queries. Our system has three primary components. First, we have a planner to determine the next action, including the appropriate API call and the query it needs to process. Second, we have a working memory that retains information about the results obtained from API executions. Last, we have a reasoner, whose role is to process the outputs from the API calls. It determines whether the obtained information is sufficient to produce the final response, or if additional data retrieval is required.

    The planner undertakes a series of steps each time a decision is required regarding which tool to use and what query to send to it. Based on the current state, the planner provides a range of potential next actions. The potential action space may be so large that it makes the search intractable. To address this issue, the planner refers to the transition graph to eliminate irrelevant actions. The planner also excludes actions that have already been taken and are stored in the working memory.

    Next, the planner collects a set of relevant in-context examples that are assembled from the decisions previously made by humans during the user study. With these examples and the working memory that holds data collected from past tool interactions, the planner formulates a prompt. The prompt is then sent to the LLM, which returns a structured answer that determines the next tool to be activated and the query to be dispatched to it. This design allows the planner to be invoked multiple times throughout the process, thereby facilitating dynamic decision-making that gradually leads to answering the input query.
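
    A single planner invocation might look roughly like the sketch below, built on the earlier snippets: restrict candidate actions with the transition graph, fold the working memory and the human in-context examples into a prompt, and ask the LLM for a structured (tool, query) decision. The prompt wording and the call_llm callable are illustrative assumptions, not the authors’ implementation.

```python
import json

def plan_next_action(state, memory, in_context_examples, call_llm):
    """One planner step: restrict the action space, build a prompt, query the LLM."""
    taken = {tool for tool, _, _ in memory.records}
    candidates = allowed_actions(state, taken)    # transition graph minus used actions

    prompt = (
        "Decide which tool to call next to answer a visual question.\n"
        f"Question: {memory.question}\n"
        f"Information gathered so far:\n{memory.as_context() or '(none)'}\n"
        f"Allowed tools: {sorted(candidates)}\n"
        "Examples of human tool-use decisions:\n" + "\n".join(in_context_examples) + "\n"
        'Respond as JSON: {"tool": "<name>", "query": "<query>"}'
    )
    decision = json.loads(call_llm(prompt))       # expect a structured answer back
    return decision["tool"], decision["query"]
```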

    We employ a reasoner to analyze the output of the tool execution, extract the useful information, and decide into which category the tool output falls: informative, uninformative, or final answer. Our method uses the LLM with appropriate prompting and in-context examples to perform the reasoning. If the reasoner concludes that it is ready to produce an answer, it will output the final response, thus concluding the task. If it determines that the tool output is uninformative, it will revert back to the planner to select another action based on the current state. If it finds the tool output to be useful, it will modify the state and transfer control back to the planner to make a new decision at the new state.

    AVIS employs a dynamic decision-making strategy to respond to visual information-seeking queries.
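
    The overall loop can then be sketched as below: the planner picks a tool, the tool runs, and the reasoner labels the output as the final answer, informative, or uninformative, handing control back to the planner until the question is answered. This is an illustrative reconstruction on top of the earlier sketches (the verdict format is an assumption), not the authors’ code.

```python
def answer_query(question, tools, in_context_examples, call_llm, max_steps=10):
    """AVIS-style plan -> execute -> reason loop (illustrative)."""
    memory = WorkingMemory(question=question)
    state = "START"

    for _ in range(max_steps):
        tool_name, query = plan_next_action(state, memory, in_context_examples, call_llm)
        output = tools[tool_name].run(query)
        memory.add(tool_name, query, output)        # record the call so it is not retried

        # Reasoner: classify the tool output with the LLM.
        verdict = call_llm(
            f"Question: {question}\nTool {tool_name} returned: {output}\n"
            "Reply with one of: FINAL:<answer> | INFORMATIVE | UNINFORMATIVE"
        )
        if verdict.startswith("FINAL:"):
            return verdict[len("FINAL:"):].strip()  # enough information; task complete
        if verdict.startswith("INFORMATIVE"):
            state = tool_name                       # advance the state for the planner
        # UNINFORMATIVE: keep the state; the planner will pick a different action

    return None  # step budget exhausted without an answer
```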

    Results

    We evaluate AVIS on the Infoseek and OK-VQA datasets. As shown below, even strong visual-language models, such as OFA and PaLI, fail to yield high accuracy when fine-tuned on Infoseek. Our approach (AVIS), without fine-tuning, achieves 50.7% accuracy on the unseen entity split of this dataset.

    AVIS visual question answering results on the Infoseek dataset. AVIS achieves higher accuracy in comparison to previous baselines based on PaLI, PaLM and OFA.

    Our results on the OK-VQA dataset are shown below. AVIS with few-shot in-context examples achieves an accuracy of 60.2%, higher than most of the previous works. AVIS achieves lower but comparable accuracy relative to the PALI model fine-tuned on OK-VQA. This difference, compared to Infoseek where AVIS outperforms fine-tuned PALI, is due to the fact that most question-answer examples in OK-VQA rely on common-sense knowledge rather than on fine-grained knowledge. Therefore, PaLI is able to encode such generic knowledge in its model parameters and does not require external knowledge.

    Visual question answering results on A-OKVQA. AVIS achieves higher accuracy in comparison to previous works that use few-shot or zero-shot learning, including Flamingo, PaLI and ViperGPT. AVIS also achieves higher accuracy than most of the previous works fine-tuned on the OK-VQA dataset, including REVEAL, ReVIVE, KAT and KRISP, and achieves results that are close to the fine-tuned PaLI model.

    Conclusion

    We present a novel approach that equips LLMs with the ability to use a variety of tools for answering knowledge-intensive visual questions. Our method, anchored in human decision-making data collected from a user study, employs a structured framework that uses an LLM-powered planner to dynamically decide on tool selection and query formation. An LLM-powered reasoner is tasked with processing and extracting key information from the output of the selected tool. Our method iteratively employs the planner and reasoner to leverage different tools until all the necessary information required to answer the visual question is amassed.

    Acknowledgements

    This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A. Ross, Cordelia Schmid and Alireza Fathi.
