Unifying image-caption and image-classification datasets with prefix conditioning


    Posted by Kuniaki Saito, Student Researcher, Cloud AI Team, and Kihyuk Sohn, Research Scientist, Perception Team

Pre-training vision-language (VL) models on web-scale image-caption datasets has recently emerged as a powerful alternative to traditional pre-training on image classification data. Image-caption datasets are considered to be more "open-domain" because they contain broader scene types and vocabulary words, which result in models with strong performance in few- and zero-shot recognition tasks. However, images with fine-grained class descriptions can be rare, and the class distribution can be imbalanced since image-caption datasets do not go through manual curation. By contrast, large-scale classification datasets, such as ImageNet, are often curated and can thus provide fine-grained categories with a balanced label distribution. While it may sound promising, directly combining caption and classification datasets for pre-training is often unsuccessful, as it can result in biased representations that do not generalize well to various downstream tasks.

In "Prefix Conditioning Unifies Language and Label Supervision", presented at CVPR 2023, we demonstrate a pre-training strategy that uses both classification and caption datasets to provide complementary benefits. First, we show that naïvely unifying the datasets results in sub-optimal performance on downstream zero-shot recognition tasks, as the model is affected by dataset bias: the coverage of image domains and vocabulary words is different in each dataset. We address this problem during training through prefix conditioning, a novel, simple, and effective method that uses prefix tokens to disentangle dataset biases from visual concepts. This approach allows the language encoder to learn from both datasets while also tailoring feature extraction to each dataset. Prefix conditioning is a generic method that can be easily integrated into existing VL pre-training objectives, such as Contrastive Language-Image Pre-training (CLIP) or Unified Contrastive Learning (UniCL).

High-level idea

We note that classification datasets tend to be biased in at least two ways: (1) the images mostly contain single objects from restricted domains, and (2) the vocabulary is limited and lacks the linguistic flexibility required for zero-shot learning. For example, the class embedding of "a photo of a dog" optimized for ImageNet usually resolves to a photo of one dog in the center of the image pulled from the ImageNet dataset, which does not generalize well to other datasets containing images of multiple dogs in different spatial locations or of a dog together with other subjects.

By contrast, caption datasets contain a wider variety of scene types and vocabularies. As shown below, if a model simply learns from the two datasets, the language embedding can entangle the bias from the image classification and caption datasets, which can decrease generalization in zero-shot classification. If we can disentangle the bias from the two datasets, we can use language embeddings that are tailored for the caption dataset to improve generalization.

Top: Language embedding entangling the bias from the image classification and caption datasets. Bottom: Language embedding disentangling the bias from the two datasets.

    Prefix conditioning

Prefix conditioning is partially inspired by prompt tuning, which prepends learnable tokens to the input token sequences to instruct a pre-trained model backbone to learn task-specific knowledge that can be used to solve downstream tasks. The prefix conditioning approach differs from prompt tuning in two ways: (1) it is designed to unify image-caption and classification datasets by disentangling the dataset bias, and (2) it is applied to VL pre-training, while standard prompt tuning is used to fine-tune models. Prefix conditioning is an explicit way to specifically steer the behavior of model backbones based on the type of datasets provided by users. This is especially helpful in production when the number of different dataset types is known ahead of time.

During training, prefix conditioning learns a text token (prefix token) for each dataset type, which absorbs the bias of the dataset and allows the remaining text tokens to focus on learning visual concepts. Specifically, it prepends a prefix token for each dataset type to the input tokens, which informs the language and visual encoders of the input data type (e.g., classification vs. caption). Prefix tokens are trained to learn the dataset-type-specific bias, which enables us to disentangle that bias in the language representations and utilize the embedding learned on the image-caption dataset during test time, even without an input caption.
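The mechanism above can be sketched in a few lines. This is a minimal, illustrative NumPy sketch, not the paper's implementation: the class name, dimensions, and the idea of storing one trainable vector per dataset type are assumptions made for clarity.

```python
import numpy as np

class PrefixConditionedTextEncoder:
    """Minimal sketch of prefix conditioning: one learnable prefix
    embedding per dataset type is prepended to the token embeddings,
    so the prefix can absorb dataset-specific bias while the remaining
    tokens focus on visual concepts."""

    def __init__(self, embed_dim=8, dataset_types=("classification", "caption"), seed=0):
        rng = np.random.default_rng(seed)
        # one trainable prefix vector per dataset type (randomly initialized here)
        self.prefix = {t: rng.normal(size=(1, embed_dim)) for t in dataset_types}

    def encode(self, token_embeddings, dataset_type):
        # prepend the dataset-specific prefix token to the input sequence;
        # a real model would then pass this sequence through a transformer
        return np.concatenate([self.prefix[dataset_type], token_embeddings], axis=0)

enc = PrefixConditionedTextEncoder(embed_dim=8)
tokens = np.zeros((5, 8))  # a 5-token caption, embedding dim 8
seq = enc.encode(tokens, "caption")
print(seq.shape)  # (6, 8): prefix token + 5 caption tokens
```

At test time, one would simply select which prefix to prepend; the original token embeddings are left untouched.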

We utilize prefix conditioning for CLIP using a language and a visual encoder. During test time, we employ the prefix used for the image-caption dataset, since that dataset is supposed to cover broader scene types and vocabulary words, leading to better performance in zero-shot recognition.

Illustration of prefix conditioning.
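Zero-shot recognition at test time then reduces to comparing an image embedding against text embeddings of the class names, all encoded with the caption prefix. The sketch below uses random stand-in embeddings (the class names and dimensions are illustrative); a real pipeline would produce them with the trained encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# stand-in text embeddings for each class name, all assumed to be
# encoded with the "caption" prefix learned during pre-training
class_names = ["dog", "cat", "car"]
text_embeds = l2_normalize(rng.normal(size=(len(class_names), 16)))

# stand-in image embedding from the visual encoder
image_embed = l2_normalize(rng.normal(size=(16,)))

# zero-shot prediction: the class whose text embedding is most similar
similarities = text_embeds @ image_embed
predicted = class_names[int(np.argmax(similarities))]
print(predicted)
```

Because the caption prefix has absorbed the caption dataset's bias, the remaining representation is the one expected to transfer best to unseen domains.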

Experimental results

We apply prefix conditioning to two types of contrastive loss, CLIP and UniCL, and evaluate their performance on zero-shot recognition tasks compared to models trained with ImageNet21K (IN21K) and Conceptual 12M (CC12M). CLIP and UniCL models trained with the two datasets using prefix conditioning show large improvements in zero-shot classification accuracy.

Zero-shot classification accuracy of models trained with only IN21K or CC12M, compared to CLIP and UniCL models trained with both datasets using prefix conditioning ("Ours").
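For reference, the CLIP-style objective that prefix conditioning plugs into is a symmetric contrastive (InfoNCE) loss over matched image-text pairs. The NumPy sketch below is a simplified stand-in (function name and temperature value are illustrative); in the paper's setup, the text embeddings would be produced from prefix-conditioned token sequences.

```python
import numpy as np

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style pre-training.
    Matched image-text pairs sit on the diagonal of the logits matrix."""
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # pairwise cosine similarities
    labels = np.arange(len(img))            # correct match for row i is column i

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # cross-entropy in both directions (image->text and text->image)
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(1)
loss = clip_contrastive_loss(rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
print(float(loss))
```

UniCL generalizes this by allowing multiple positives per image (e.g., all images sharing a class label), but the contrastive structure is the same.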

    Study on test-time prefix

The table below describes the performance change depending on the prefix used during test time. We demonstrate that using the prefix learned for the classification dataset ("Prompt") improves performance on the classification dataset (IN-1K), while using the prefix learned for the image-caption dataset ("Caption") improves performance on other datasets (Zero-shot AVG). This analysis illustrates that a prefix tailored for the image-caption dataset achieves better generalization across scene types and vocabulary words.

Analysis of the prefix used at test time.

Study on robustness to image distribution shift

We study robustness to image distribution shift using ImageNet variants. We see that the "Caption" prefix performs better than "Prompt" on ImageNet-R (IN-R) and ImageNet-Sketch (IN-S), but underperforms on ImageNet-V2 (IN-V2). This indicates that the "Caption" prefix achieves generalization on domains far from the classification dataset. Therefore, the optimal prefix probably differs depending on how far the test domain is from the classification dataset.

Analysis of robustness to image-level distribution shift. IN: ImageNet, IN-V2: ImageNet-V2, IN-R: art- and cartoon-style ImageNet, IN-S: ImageNet-Sketch.

    Conclusion and future work

We introduce prefix conditioning, a technique for unifying image-caption and classification datasets for better zero-shot classification. We show that this approach leads to better zero-shot classification accuracy and that the prefix can control the bias in the language embedding. One limitation is that the prefix learned on the caption dataset is not necessarily optimal for zero-shot classification. Identifying the optimal prefix for each test dataset is an interesting direction for future work.

    Acknowledgements

This research was conducted by Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, and Tomas Pfister. Thanks to Zizhao Zhang and Sergey Ioffe for their valuable feedback.
