Close Menu
Ztoog
    What's Hot
    Crypto

    These 5 Crypto Analysts Signal Potential For Record-Shattering Bull Market In Early 2024

    AI

    Perplexity Unveils Two New Online LLM Models: ‘pplx-7b-online’ and ‘pplx-70b-online’

    Gadgets

    Get 6-8 Microsoft apps on your PC or Mac forever with this $69.99 lifetime license

    Important Pages:
    • About Us
    • Contact us
    • Privacy Policy
    • Terms & Conditions
    Facebook X (Twitter) Instagram Pinterest
    Facebook X (Twitter) Instagram Pinterest
    Ztoog
    • Home
    • The Future

      What is Project Management? 5 Best Tools that You Can Try

      Operational excellence strategy and continuous improvement

      Hannah Fry: AI isn’t as powerful as we think

      FanDuel goes all in on responsible gaming push with new Play with a Plan campaign

      Gettyimages.com Is the Best Website on the Internet Right Now

    • Technology

      Iran war: How could it end?

      Democratic senators question CFTC staffing cuts in Chicago enforcement office

      Google’s Cloud AI lead on the three frontiers of model capability

      AMD agrees to backstop a $300M loan from Goldman Sachs for Crusoe to buy AMD AI chips, the first known case of AMD chips used as debt collateral (The Information)

      Productivity apps failed me when I needed them most

    • Gadgets

      macOS Tahoe 26.3.1 update will “upgrade” your M5’s CPU to new “super” cores

      Lenovo Shows Off a ThinkBook Modular AI PC Concept With Swappable Ports and Detachable Displays at MWC 2026

      POCO M8 Review: The Ultimate Budget Smartphone With Some Cons

      The Mission: Impossible of SSDs has arrived with a fingerprint lock

      6 Best Phones With Headphone Jacks (2026), Tested and Reviewed

    • Mobile

      Android’s March update is all about finding people, apps, and your missing bags

      Watch Xiaomi’s global launch event live here

      Our poll shows what buyers actually care about in new smartphones (Hint: it’s not AI)

      Is Strava down for you? You’re not alone

      The Motorola Razr FIFA World Cup 2026 Edition was literally just unveiled, and Verizon is already giving them away

    • Science

      Big Tech Signs White House Data Center Pledge With Good Optics and Little Substance

      Inside the best dark matter detector ever built

      NASA’s Artemis moon exploration programme is getting a major makeover

      Scientists crack the case of “screeching” Scotch tape

      Blue-faced, puffy-lipped monkey scores a rare conservation win

    • AI

      Online harassment is entering its AI era

      Meet NullClaw: The 678 KB Zig AI Agent Framework Running on 1 MB RAM and Booting in Two Milliseconds

      New method could increase LLM training efficiency | Ztoog

      The human work behind humanoid robots is being hidden

      NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data

    • Crypto

      SEC Vs. Justin Sun Case Ends In $10M Settlement

      Google paid startup Form Energy $1B for its massive 100-hour battery

      Ethereum Breakout Alert: Corrective Channel Flip Sparks Impulsive Wave

      Show Your ID Or No Deal

      Jane Street sued for alleged front-running trades that accelerated Terraform Labs meltdown

    Ztoog
    Home » Mapping pictures to words for zero-shot composed image retrieval – Google Research Blog
    AI

    Mapping pictures to words for zero-shot composed image retrieval – Google Research Blog

    Facebook Twitter Pinterest WhatsApp
    Mapping pictures to words for zero-shot composed image retrieval – Google Research Blog
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp

    Posted by Kuniaki Saito, Student Researcher, Google Research, Cloud AI Team, and Kihyuk Sohn, Research Scientist, Google Research

    Image retrieval performs an important function in serps. Typically, their customers depend on both image or textual content as a question to retrieve a desired goal image. However, text-based retrieval has its limitations, as describing the goal image precisely utilizing words may be difficult. For occasion, when looking out for a vogue merchandise, customers might want an merchandise whose particular attribute, e.g., the colour of a emblem or the brand itself, is totally different from what they discover in an internet site. Yet looking out for the merchandise in an present search engine isn’t trivial since exactly describing the style merchandise by textual content may be difficult. To deal with this truth, composed image retrieval (CIR) retrieves photos based mostly on a question that mixes each an image and a textual content pattern that gives directions on how to modify the image to match the supposed retrieval goal. Thus, CIR permits exact retrieval of the goal image by combining image and textual content.

    However, CIR strategies require giant quantities of labeled information, i.e., triplets of a 1) question image, 2) description, and three) goal image. Collecting such labeled information is dear, and fashions educated on this information are sometimes tailor-made to a selected use case, limiting their capacity to generalize to totally different datasets.

    To deal with these challenges, in “Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval”, we suggest a job known as zero-shot CIR (ZS-CIR). In ZS-CIR, we purpose to construct a single CIR mannequin that performs a wide range of CIR duties, corresponding to object composition, attribute modifying, or area conversion, with out requiring labeled triplet information. Instead, we suggest to prepare a retrieval mannequin utilizing large-scale image-caption pairs and unlabeled photos, that are significantly simpler to accumulate than supervised CIR datasets at scale. To encourage reproducibility and additional advance this house, we additionally launch the code.

    Description of present composed image retrieval mannequin.
    We prepare a composed image retrieval mannequin utilizing image-caption information solely. Our mannequin retrieves photos aligned with the composition of the question image and textual content.

    Method overview

    We suggest to leverage the language capabilities of the language encoder within the contrastive language-image pre-trained mannequin (CLIP), which excels at producing semantically significant language embeddings for a variety of textual ideas and attributes. To that finish, we use a light-weight mapping sub-module in CLIP that’s designed to map an enter image (e.g., a photograph of a cat) from the image embedding house to a phrase token (e.g., “cat”) within the textual enter house. The entire community is optimized with the vision-language contrastive loss to once more make sure the visible and textual content embedding areas are as shut as attainable given a pair of an image and its textual description. Then, the question image may be handled as if it’s a phrase. This allows the versatile and seamless composition of question image options and textual content descriptions by the language encoder. We name our technique Pic2Word and supply an summary of its coaching course of within the determine under. We need the mapped token s to characterize the enter image within the type of phrase token. Then, we prepare the mapping community to reconstruct the image embedding within the language embedding, p. Specifically, we optimize the contrastive loss proposed in CLIP computed between the visible embedding v and the textual embedding p.

    Training of the mapping community (fM) utilizing unlabeled photos solely. We optimize solely the mapping community with a frozen visible and textual content encoder.

    Given the educated mapping community, we will regard an image as a phrase token and pair it with the textual content description to flexibly compose the joint image-text question as proven within the determine under.

    With the educated mapping community, we regard the image as a phrase token and pair it with the textual content description to flexibly compose the joint image-text question.

    Evaluation

    We conduct a wide range of experiments to consider Pic2Word’s efficiency on a wide range of CIR duties.

    Domain conversion

    We first consider the aptitude of compositionality of the proposed technique on area conversion — given an image and the specified new image area (e.g., sculpture, origami, cartoon, toy), the output of the system ought to be an image with the identical content material however within the new desired image area or model. As illustrated under, we consider the power to compose the class data and area description given as an image and textual content, respectively. We consider the conversion from actual photos to 4 domains utilizing ImageInternet and ImageInternet-R.

    To examine with approaches that don’t require supervised coaching information, we choose three approaches: (i) image solely performs retrieval solely with visible embedding, (ii) textual content solely employs solely textual content embedding, and (iii) image + textual content averages the visible and textual content embedding to compose the question. The comparability with (iii) reveals the significance of composing image and textual content utilizing a language encoder. We additionally examine with Combiner, which trains the CIR mannequin on Fashion-IQ or CIRR.

    We purpose to convert the area of the enter question image into the one described with textual content, e.g., origami.

    As proven in determine under, our proposed method outperforms baselines by a big margin.

    Results (recall@10, i.e., the proportion of related situations within the first 10 photos retrieved.) on composed image retrieval for area conversion.

    Fashion attribute composition

    Next, we consider the composition of vogue attributes, corresponding to the colour of fabric, emblem, and size of sleeve, utilizing the Fashion-IQ dataset. The determine under illustrates the specified output given the question.

    Overview of CIR for vogue attributes.

    In the determine under, we current a comparability with baselines, together with supervised baselines that utilized triplets for coaching the CIR mannequin: (i) CB makes use of the identical structure as our method, (ii) CIRPLANT, ALTEMIS, MAAF use a smaller spine, corresponding to ResNet50. Comparison to these approaches will give us the understanding on how properly our zero-shot method performs on this job.

    Although CB outperforms our method, our technique performs higher than supervised baselines with smaller backbones. This consequence means that by using a strong CLIP mannequin, we will prepare a extremely efficient CIR mannequin with out requiring annotated triplets.

    Results (recall@10, i.e., the proportion of related situations within the first 10 photos retrieved.) on composed image retrieval for Fashion-IQ dataset (greater is healthier). Light blue bars prepare the mannequin utilizing triplets. Note that our method performs on par with these supervised baselines with shallow (smaller) backbones.

    Qualitative outcomes

    We present a number of examples within the determine under. Compared to a baseline technique that doesn’t require supervised coaching information (textual content + image characteristic averaging), our method does a greater job of accurately retrieving the goal image.

    Qualitative outcomes on numerous question photos and textual content description.

    Conclusion and future work

    In this text, we introduce Pic2Word, a way for mapping pictures to words for ZS-CIR. We suggest to convert the image right into a phrase token to obtain a CIR mannequin utilizing solely an image-caption dataset. Through a wide range of experiments, we confirm the effectiveness of the educated mannequin on numerous CIR duties, indicating that coaching on an image-caption dataset can construct a robust CIR mannequin. One potential future analysis path is using caption information to prepare the mapping community, though we use solely image information within the current work.

    Acknowledgements

    This analysis was carried out by Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Also thanks to Zizhao Zhang and Sergey Ioffe for their priceless suggestions.

    Share. Facebook Twitter Pinterest LinkedIn WhatsApp

    Related Posts

    AI

    Online harassment is entering its AI era

    AI

    Meet NullClaw: The 678 KB Zig AI Agent Framework Running on 1 MB RAM and Booting in Two Milliseconds

    AI

    New method could increase LLM training efficiency | Ztoog

    AI

    The human work behind humanoid robots is being hidden

    AI

    NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data

    AI

    Personalization features can make LLMs more agreeable | Ztoog

    AI

    AI is already making online crimes easier. It could get much worse.

    AI

    NVIDIA Researchers Introduce KVTC Transform Coding Pipeline to Compress Key-Value Caches by 20x for Efficient LLM Serving

    Leave A Reply Cancel Reply

    Follow Us
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    Top Posts
    Crypto

    IMX Rockets To $2.21, Scaling Heights Unseen Since 2022

    Despite market volatility, the IMX token fights the downward strain and continues its earlier upward…

    Mobile

    The Pixel Watch 2 and Fitbit Charge 6 are death knells for the Sense series

    (*2*)Kaitlyn Cimino / Android AuthorityWe’ve all had our eye on Fitbit for a while now,…

    Crypto

    Path To New All-Time High Set?

    On-chain information reveals Ethereum has efficiently discovered a rebound at a serious help zone, a…

    Gadgets

    TDK PiezoTap Switches For Touch-Enabled Products

    Touch performance has change into an integral a part of shopper electronics and varied different…

    Gadgets

    Judge Bans Destiny 2 Serial Cheater And Orders $500K Payment

    A choose has issued a big ruling in opposition to a serial cheater within the…

    Our Picks
    The Future

    X sets up team to generate $100m from political ads: reports

    Gadgets

    37 Best Back-to-School Deals (2023): Laptops, Backpacks, Household Essentials

    Technology

    SmileDirectClub Shut Down: What We Know About Payments and Finding New Treatment

    Categories
    • AI (1,560)
    • Crypto (1,827)
    • Gadgets (1,870)
    • Mobile (1,910)
    • Science (1,939)
    • Technology (1,862)
    • The Future (1,716)
    Most Popular
    Technology

    This Holdover Black Friday Deal Saves You $400 on an M2 MacBook Pro

    Science

    Daily Telescope: The Milky Way soars above Devil’s Kitchen

    Science

    How to Set Your Thermostat—According to Science

    Ztoog
    Facebook X (Twitter) Instagram Pinterest
    • Home
    • About Us
    • Contact us
    • Privacy Policy
    • Terms & Conditions
    © 2026 Ztoog.

    Type above and press Enter to search. Press Esc to cancel.