Mapping pictures to words for zero-shot composed image retrieval – Google Research Blog
    Posted by Kuniaki Saito, Student Researcher, Google Research, Cloud AI Team, and Kihyuk Sohn, Research Scientist, Google Research

    Image retrieval plays an important role in search engines. Typically, users rely on either an image or text as a query to retrieve a desired target image. However, text-based retrieval has its limitations, as describing the target image precisely in words can be difficult. For instance, when searching for a fashion item, users may want an item whose specific attribute, e.g., the color of a logo or the logo itself, differs from what they found on a website. Yet searching for the item in an existing search engine is not trivial, since precisely describing the fashion item in text can be hard. To address this, composed image retrieval (CIR) retrieves images based on a query that combines both an image and a text sample providing instructions on how to modify the image to match the intended retrieval target. Thus, CIR enables precise retrieval of the target image by combining image and text.

    However, CIR methods require large amounts of labeled data, i.e., triplets of a 1) query image, 2) description, and 3) target image. Collecting such labeled data is expensive, and models trained on this data are often tailored to a specific use case, limiting their ability to generalize to different datasets.

    To address these challenges, in “Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval”, we propose a task called zero-shot CIR (ZS-CIR). In ZS-CIR, we aim to build a single CIR model that performs a variety of CIR tasks, such as object composition, attribute editing, or domain conversion, without requiring labeled triplet data. Instead, we propose to train a retrieval model using large-scale image-caption pairs and unlabeled images, which are considerably easier to collect than supervised CIR datasets at scale. To encourage reproducibility and further advance this area, we also release the code.

    Description of the proposed composed image retrieval model.
    We train a composed image retrieval model using image-caption data only. Our model retrieves images aligned with the composition of the query image and text.

    Method overview

    We propose to leverage the language capabilities of the language encoder in the contrastive language-image pre-trained model (CLIP), which excels at producing semantically meaningful language embeddings for a wide range of textual concepts and attributes. To that end, we use a lightweight mapping sub-module in CLIP that is designed to map an input image (e.g., a photo of a cat) from the image embedding space to a word token (e.g., “cat”) in the textual input space. The whole network is optimized with the vision-language contrastive loss to again ensure the visual and text embedding spaces are as close as possible given a pair of an image and its textual description. Then, the query image can be treated as if it were a word. This enables the flexible and seamless composition of query image features and text descriptions by the language encoder. We call our method Pic2Word and provide an overview of its training process in the figure below. We want the mapped token s to represent the input image in the form of a word token. Then, we train the mapping network to reconstruct the image embedding in the language embedding, p. Specifically, we optimize the contrastive loss proposed in CLIP computed between the visual embedding v and the textual embedding p.

    Training of the mapping network (fM) using unlabeled images only. We optimize only the mapping network, with the visual and text encoders frozen.
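The contrastive objective between the visual embedding v and the textual embedding p can be sketched in plain Python. This is a minimal illustration of a CLIP-style symmetric-in-spirit InfoNCE loss over one direction, not the actual batched GPU implementation; the vectors and temperature value are placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(visual, textual, temperature=0.07):
    """InfoNCE-style loss: visual[i] and textual[i] are a positive pair,
    every other pairing in the batch serves as a negative."""
    n = len(visual)
    loss = 0.0
    for i in range(n):
        logits = [cosine(visual[i], t) / temperature for t in textual]
        # Cross-entropy of the softmax over logits, with i as the true class.
        log_prob = logits[i] - math.log(sum(math.exp(l) for l in logits))
        loss += -log_prob
    return loss / n
```

During Pic2Word training, `textual` would hold the language embeddings p produced from the mapped pseudo-word token, and only the mapping network's parameters would receive gradients.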

    Given the trained mapping network, we can regard an image as a word token and pair it with a text description to flexibly compose the joint image-text query, as shown in the figure below.

    With the trained mapping network, we regard the image as a word token and pair it with the text description to flexibly compose the joint image-text query.
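Query composition then amounts to dropping the pseudo-word token into a text prompt that the frozen language encoder embeds as a whole. The templates below are hypothetical, for illustration only; the actual prompts used per task are defined in the released code.

```python
# Hypothetical prompt templates, one per CIR task (not the paper's exact prompts).
TEMPLATES = {
    "domain": "a {token} in {text} style",
    "object": "a photo of {token} and {text}",
    "attribute": "a photo of {token}, {text}",
}

def compose_query(token, text, task="domain"):
    """Build the text prompt in which the mapped pseudo-word token
    stands in for the query image."""
    return TEMPLATES[task].format(token=token, text=text)
```

For domain conversion, for example, `compose_query("[s]", "origami")` yields a single prompt the language encoder can embed, combining the image content (via the token) with the desired domain.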

    Evaluation

    We conduct experiments to evaluate Pic2Word’s performance on a variety of CIR tasks.

    Domain conversion

    We first evaluate the compositional capability of the proposed method on domain conversion — given an image and the desired new image domain (e.g., sculpture, origami, cartoon, toy), the output of the system should be an image with the same content but in the new desired image domain or style. As illustrated below, we evaluate the ability to compose the category information and the domain description, given as an image and text, respectively. We evaluate the conversion from real images to four domains using ImageNet and ImageNet-R.

    To compare with approaches that do not require supervised training data, we pick three baselines: (i) image only performs retrieval with the visual embedding alone, (ii) text only employs only the text embedding, and (iii) image + text averages the visual and text embeddings to compose the query. The comparison with (iii) shows the importance of composing image and text using a language encoder. We also compare with Combiner, which trains the CIR model on Fashion-IQ or CIRR.
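Baseline (iii) and the retrieval step itself can be sketched as follows: average the two embeddings, renormalize, and rank the gallery by cosine similarity. A minimal sketch with toy vectors; real systems operate on normalized CLIP embeddings at scale.

```python
import math

def _norm(v):
    """L2-normalize an embedding vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def average_query(image_emb, text_emb):
    """Baseline (iii): elementwise average of visual and text embeddings."""
    avg = [(a + b) / 2 for a, b in zip(image_emb, text_emb)]
    return _norm(avg)

def retrieve(query, gallery, k=10):
    """Return indices of the top-k gallery images by cosine similarity."""
    q = _norm(query)
    scored = [(sum(a * b for a, b in zip(q, _norm(g))), i)
              for i, g in enumerate(gallery)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

Pic2Word replaces `average_query` with the language encoder applied to the composed prompt, which is where the gain over simple feature averaging comes from.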

    We aim to convert the domain of the input query image into the one described with text, e.g., origami.

    As shown in the figure below, our proposed approach outperforms the baselines by a large margin.

    Results (recall@10, i.e., the fraction of relevant instances among the first 10 images retrieved) on composed image retrieval for domain conversion.
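The recall@k metric used above is straightforward to compute: a query counts as a hit if its target image appears among the top-k retrieved results. A minimal sketch, assuming one relevant target per query:

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of queries whose relevant target appears in the top-k results.

    retrieved: list of ranked result-id lists, one per query.
    relevant:  list of the single relevant target id per query.
    """
    hits = sum(1 for ranks, target in zip(retrieved, relevant)
               if target in ranks[:k])
    return hits / len(relevant)
```

With multiple relevant targets per query, the hit test would instead check for a non-empty intersection with the top-k list.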

    Fashion attribute composition

    Next, we evaluate the composition of fashion attributes, such as the color of cloth, logo, and length of sleeve, using the Fashion-IQ dataset. The figure below illustrates the desired output given the query.

    Overview of CIR for fashion attributes.

    In the figure below, we present a comparison with baselines, including supervised baselines that utilized triplets for training the CIR model: (i) CB uses the same architecture as our approach, (ii) CIRPLANT, ALTEMIS, and MAAF use a smaller backbone, such as ResNet50. Comparison with these approaches gives us an understanding of how well our zero-shot approach performs on this task.

    Although CB outperforms our approach, our method performs better than the supervised baselines with smaller backbones. This result suggests that by utilizing a strong CLIP model, we can train a highly effective CIR model without requiring annotated triplets.

    Results (recall@10, i.e., the fraction of relevant instances among the first 10 images retrieved) on composed image retrieval for the Fashion-IQ dataset (higher is better). Light blue bars train the model using triplets. Note that our approach performs on par with these supervised baselines with shallow (smaller) backbones.

    Qualitative results

    We show several examples in the figure below. Compared to a baseline method that does not require supervised training data (text + image feature averaging), our approach does a better job of correctly retrieving the target image.

    Qualitative results on various query images and text descriptions.

    Conclusion and future work

    In this article, we introduce Pic2Word, a method for mapping pictures to words for ZS-CIR. We propose to convert the image into a word token to obtain a CIR model using only an image-caption dataset. Through a variety of experiments, we verify the effectiveness of the trained model on diverse CIR tasks, indicating that training on an image-caption dataset can build a strong CIR model. One potential future research direction is utilizing caption data to train the mapping network, though we use only image data in the present work.

    Acknowledgements

    This research was conducted by Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Thanks also to Zizhao Zhang and Sergey Ioffe for their valuable feedback.
