    Mapping pictures to words for zero-shot composed image retrieval – Google Research Blog

    Posted by Kuniaki Saito, Student Researcher, Google Research, Cloud AI Team, and Kihyuk Sohn, Research Scientist, Google Research

    Image retrieval plays a crucial role in search engines. Typically, users rely on either an image or text as a query to retrieve a desired target image. However, text-based retrieval has its limitations, because describing the target image accurately in words can be challenging. For instance, when searching for a fashion item, users may want an item whose specific attribute, e.g., the color of a logo or the logo itself, is different from what they find on a website. Yet searching for the item in an existing search engine is not trivial, since precisely describing the fashion item in text can be challenging. To address this, composed image retrieval (CIR) retrieves images based on a query that combines both an image and a text sample that provides instructions on how to modify the image to fit the intended retrieval target. Thus, CIR allows precise retrieval of the target image by combining image and text.

    However, CIR methods require large amounts of labeled data, i.e., triplets of 1) a query image, 2) a description, and 3) a target image. Collecting such labeled data is costly, and models trained on this data are often tailored to a specific use case, limiting their ability to generalize to different datasets.

    To address these challenges, in “Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval”, we propose a task called zero-shot CIR (ZS-CIR). In ZS-CIR, we aim to build a single CIR model that performs a variety of CIR tasks, such as object composition, attribute editing, or domain conversion, without requiring labeled triplet data. Instead, we propose to train a retrieval model using large-scale image-caption pairs and unlabeled images, which are considerably easier to collect at scale than supervised CIR datasets. To encourage reproducibility and further advance this space, we also release the code.

    Description of existing composed image retrieval model.
    We train a composed image retrieval model using image-caption data only. Our model retrieves images aligned with the composition of the query image and text.

    Method overview

    We propose to leverage the language capabilities of the language encoder in the contrastive language-image pre-trained model (CLIP), which excels at generating semantically meaningful language embeddings for a wide range of textual concepts and attributes. To that end, we use a lightweight mapping sub-module in CLIP that is designed to map an input picture (e.g., a photo of a cat) from the image embedding space to a word token (e.g., “cat”) in the textual input space. The whole network is optimized with the vision-language contrastive loss to again ensure that the visual and text embedding spaces are as close as possible given a pair of an image and its textual description. Then, the query image can be treated as if it were a word. This enables the flexible and seamless composition of query image features and text descriptions by the language encoder. We call our method Pic2Word and provide an overview of its training process in the figure below. We want the mapped token s to represent the input image in the form of a word token. Then, we train the mapping network to reconstruct the image embedding in the language embedding, p. Specifically, we optimize the contrastive loss proposed in CLIP, computed between the visual embedding v and the textual embedding p.

    Training of the mapping network (fM) using unlabeled images only. We optimize only the mapping network with a frozen visual and text encoder.
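    The contrastive objective between the visual embedding v and the textual embedding p can be illustrated with a minimal sketch. This is not the released Pic2Word code: the encoders and the mapping network fM are not modeled, the embeddings are hand-picked toy vectors, and the function names and the temperature value of 0.07 are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(visual, textual, temperature=0.07):
    """Symmetric contrastive loss over a batch of (v, p) pairs.

    visual[i] and textual[i] are assumed to come from the same image;
    every other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    p = [l2_normalize(x) for x in textual]
    logits = [
        [sum(a * b for a, b in zip(vi, pj)) / temperature for pj in p]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: two images with hypothetical 2-D embeddings.
v = [[1.0, 0.0], [0.0, 1.0]]
p_aligned = [[0.9, 0.1], [0.1, 0.9]]     # p[i] is close to v[i]
p_misaligned = [[0.1, 0.9], [0.9, 0.1]]  # p[i] is close to the wrong image

loss_aligned = contrastive_loss(v, p_aligned)
loss_misaligned = contrastive_loss(v, p_misaligned)
print(loss_aligned < loss_misaligned)
```

    In the setup described above, only the parameters of fM would receive gradients from this loss, since the visual and text encoders stay frozen.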

    Given the trained mapping network, we can regard an image as a word token and pair it with the text description to flexibly compose the joint image-text query, as shown in the figure below.

    With the trained mapping network, we regard the image as a word token and pair it with the text description to flexibly compose the joint image-text query.
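    One way to picture this composition step is as a substitution in the sequence of token embeddings that is fed to the language encoder. The sketch below is only an illustration under stated assumptions: the placeholder symbol `*`, the prompt template, the function name `compose_query_tokens`, and the toy embedding table are all hypothetical, and the frozen text encoder that would consume the sequence is not modeled.

```python
def compose_query_tokens(prompt, image_token, word_embeddings, placeholder="*"):
    """Build the token-embedding sequence for a composed query.

    Every ordinary word is looked up in the embedding table; the
    placeholder is replaced by the token produced from the query image.
    """
    words = prompt.split()
    if placeholder not in words:
        raise ValueError("prompt must contain the placeholder token")
    return [
        image_token if word == placeholder else word_embeddings[word]
        for word in words
    ]


# Hypothetical 2-D word embeddings; real token embeddings are far larger.
word_embeddings = {
    "a": [0.1, 0.0],
    "cartoon": [0.0, 1.0],
    "of": [0.1, 0.1],
}

# Stand-in for s = fM(v), the word token mapped from the query image.
image_token = [0.8, 0.2]

tokens = compose_query_tokens("a cartoon of *", image_token, word_embeddings)
print(len(tokens))
```

    Because the image enters the query as just another token, the text around it can be changed freely without retraining anything.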

    Evaluation

    We conduct a variety of experiments to evaluate Pic2Word’s performance on a range of CIR tasks.

    Domain conversion

    We first evaluate the compositionality of the proposed method on domain conversion: given an image and the desired new image domain (e.g., sculpture, origami, cartoon, toy), the output of the system should be an image with the same content but in the new desired image domain or style. As illustrated below, we evaluate the ability to compose the category information and domain description given as an image and text, respectively. We evaluate the conversion from real images to four domains using ImageNet and ImageNet-R.

    To compare with approaches that do not require supervised training data, we pick three approaches: (i) image only performs retrieval with the visual embedding alone, (ii) text only employs the text embedding alone, and (iii) image + text averages the visual and text embeddings to compose the query. The comparison with (iii) shows the importance of composing image and text using a language encoder. We also compare with Combiner, which trains the CIR model on Fashion-IQ or CIRR.
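    The three training-free baselines differ only in how the query vector is formed before nearest-neighbor search. The sketch below shows the mechanics on a toy gallery; the 2-D embeddings, the function names, and the choice to normalize before averaging are assumptions for illustration, and the toy example says nothing about the relative accuracy reported in the experiments.

```python
import math


def normalize(vec):
    """Return the unit-length version of a vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def average_query(image_emb, text_emb):
    """Baseline (iii): average the normalized visual and text embeddings."""
    v, t = normalize(image_emb), normalize(text_emb)
    return normalize([(a + b) / 2 for a, b in zip(v, t)])


def rank_gallery(query, gallery):
    """Return gallery indices sorted by decreasing similarity to the query."""
    scores = [cosine(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy space: axis 0 ~ "cat content", axis 1 ~ "origami style" (hypothetical).
image_emb = [1.0, 0.0]  # query image: a real photo of a cat
text_emb = [0.0, 1.0]   # query text: "origami"
gallery = [
    [1.0, 0.0],  # index 0: real cat photo
    [0.0, 1.0],  # index 1: origami of some other object
    [1.0, 1.0],  # index 2: origami cat (the intended target)
]

image_only_top = rank_gallery(image_emb, gallery)[0]
text_only_top = rank_gallery(text_emb, gallery)[0]
averaged_top = rank_gallery(average_query(image_emb, text_emb), gallery)[0]
print(image_only_top, text_only_top, averaged_top)
```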

    We aim to convert the domain of the input query image into the one described with text, e.g., origami.

    As shown in the figure below, our proposed approach outperforms the baselines by a large margin.

    Results (recall@10, i.e., the percentage of relevant instances in the first 10 images retrieved) on composed image retrieval for domain conversion.
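    For readers unfamiliar with the metric, recall@k can be computed along the following lines. This is a generic sketch rather than the evaluation script used for the reported numbers; the function names and the example IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear in the top-k retrieved items."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant item")
    top_k = set(ranked_ids[:k])
    hits = sum(1 for item in relevant_ids if item in top_k)
    return hits / len(relevant_ids)


def mean_recall_at_k(queries, k=10):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, as a percentage."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical retrieval results for two queries over a 20-image gallery.
ranking = list(range(20))          # image IDs in retrieved order
queries = [
    (ranking, {3, 15}),            # one of two targets is in the top 10
    (ranking, {7}),                # the single target is in the top 10
]

print(recall_at_k(ranking, {3, 15}, k=10))
print(mean_recall_at_k(queries, k=10))
```

    When each query has exactly one target image, recall@10 for that query is simply 1 if the target is among the first 10 results and 0 otherwise.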

    Fashion attribute composition

    Next, we evaluate the composition of fashion attributes, such as the color of the cloth, the logo, and the sleeve length, using the Fashion-IQ dataset. The figure below illustrates the desired output given the query.

    Overview of CIR for fashion attributes.

    In the figure below, we present a comparison with baselines, including supervised baselines that used triplets for training the CIR model: (i) CB uses the same architecture as our approach, and (ii) CIRPLANT, ARTEMIS, and MAAF use a smaller backbone, such as ResNet50. Comparing against these approaches shows how well our zero-shot approach performs on this task.

    Although CB outperforms our approach, our method performs better than the supervised baselines with smaller backbones. This result suggests that by utilizing a robust CLIP model, we can train a highly effective CIR model without requiring annotated triplets.

    Results (recall@10, i.e., the percentage of relevant instances in the first 10 images retrieved) on composed image retrieval for the Fashion-IQ dataset (higher is better). Light blue bars indicate models trained using triplets. Note that our approach performs on par with these supervised baselines with shallow (smaller) backbones.

    Qualitative results

    We show several examples in the figure below. Compared to a baseline method that does not require supervised training data (text + image feature averaging), our approach does a better job of correctly retrieving the target image.

    Qualitative results on diverse query images and text descriptions.

    Conclusion and future work

    In this article, we introduce Pic2Word, a method for mapping pictures to words for ZS-CIR. We propose to convert the image into a word token to achieve a CIR model using only an image-caption dataset. Through a variety of experiments, we verify the effectiveness of the trained model on diverse CIR tasks, indicating that training on an image-caption dataset can build a powerful CIR model. One potential future research direction is utilizing caption data to train the mapping network, although we use only image data in the present work.

    Acknowledgements

    This research was conducted by Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Thanks also to Zizhao Zhang and Sergey Ioffe for their valuable feedback.
