    Achieving scalability and quality in text clustering – Google Research Blog


    Posted by Sara Ahmadian and Mehran Kazemi, Research Scientists, Google Research

    Clustering is a fundamental, ubiquitous problem in data mining and unsupervised machine learning, where the goal is to group together similar items. The standard forms of clustering are metric clustering and graph clustering. In metric clustering, a given metric space defines distances between data points, which are grouped together based on their separation. In graph clustering, a given graph connects similar data points by edges, and the clustering process groups data points together based on the connections between them. Both clustering forms are particularly useful for large corpora where class labels cannot be defined. Examples of such corpora are the ever-growing digital text collections of various internet platforms, with applications including organizing and searching documents, identifying patterns in text, and recommending related documents to users (see more examples in the following posts: clustering related queries based on user intent and practical differentially private clustering).

    The choice of text clustering method often presents a dilemma. One approach is to use embedding models, such as BERT or RoBERTa, to define a metric clustering problem. Another is to use cross-attention (CA) models, such as PaLM or GPT, to define a graph clustering problem. CA models can provide highly accurate similarity scores, but constructing the input graph may require a prohibitive quadratic number of inference calls to the model. On the other hand, a metric space can efficiently be defined by distances between embeddings produced by embedding models. However, these similarity distances are typically of substantially lower quality compared to the similarity signals of CA models, and hence the produced clustering can be of much lower quality.
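    To make the scalability gap concrete, here is a minimal sketch contrasting the two similarity regimes under stated assumptions: `embed` and `ca_score` are hypothetical stand-ins for an embedding model and a cross-attention scorer, not APIs from this post. The embedding route needs one model call per document, while building the full CA similarity graph needs one call per pair of documents.

```python
import itertools
import numpy as np

def embedding_similarities(docs, embed):
    # One model call per document, then cheap vector math for all pairs: O(n) inference.
    vecs = np.stack([embed(d) for d in docs])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs @ vecs.T  # cosine similarity matrix

def cross_attention_similarities(docs, ca_score):
    # One model call per *pair* of documents: O(n^2) inference, which is what
    # makes building the full similarity graph prohibitive at corpus scale.
    n = len(docs)
    sims = np.eye(n)
    for i, j in itertools.combinations(range(n), 2):
        sims[i, j] = sims[j, i] = ca_score(docs[i], docs[j])
    return sims
```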

    An overview of the embedding-based and cross-attention-based similarity scoring functions and their scalability vs. quality dilemma.

    Motivated by this, in “KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals”, presented at ICLR 2023, we describe a novel clustering algorithm that effectively combines the scalability benefits of embedding models and the quality of CA models. This graph clustering algorithm has query access to both the CA model and the embedding model; however, we impose a budget on the number of queries made to the CA model. The algorithm uses the CA model to answer edge queries, and benefits from unlimited access to similarity scores from the embedding model. We describe how this proposed setting bridges algorithm design and practical considerations, and can be applied to other clustering problems with similar available scoring functions, such as clustering problems on images and media. We demonstrate how this algorithm yields high-quality clusters with almost a linear number of query calls to the CA model. We have also open-sourced the data used in our experiments.

    The clustering algorithm

    The KwikBucks algorithm is an extension of the well-known KwikCluster algorithm (also known as the Pivot algorithm). The high-level idea is to first select a set of documents (i.e., centers) with no similarity edge between them, and then form clusters around these centers. To obtain the quality of CA models and the runtime efficiency of embedding models, we introduce the novel combo similarity oracle mechanism. In this approach, we use the embedding model to guide the selection of queries sent to the CA model. Given a set of center documents and a target document, the combo similarity oracle mechanism outputs a center from the set that is similar to the target document, if one exists. The combo similarity oracle enables us to save on budget by limiting the number of query calls to the CA model when selecting centers and forming clusters. It does this by first ranking centers based on their embedding similarity to the target document, and then querying the CA model for the pair (i.e., target document and ranked center), as shown below.

    A combo similarity oracle that, for a set of documents and a target document, returns a similar document from the set, if one exists.
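    A minimal sketch of how such a combo similarity oracle could be implemented, assuming hypothetical helpers `emb_sim` (cheap embedding similarity) and `ca_is_similar` (expensive cross-attention edge query), with `budget` capping the CA calls per target document; the paper's actual implementation may differ.

```python
def combo_similarity_oracle(centers, target, emb_sim, ca_is_similar, budget):
    # Rank centers by their (cheap) embedding similarity to the target document.
    ranked = sorted(centers, key=lambda c: emb_sim(c, target), reverse=True)
    # Query the (expensive) cross-attention model in that order, stopping early.
    for center in ranked[:budget]:
        if ca_is_similar(center, target):
            return center  # a similar center exists
    return None  # no similar center found within the budget
```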

    We then perform a post-processing step to merge clusters if there is a strong connection between two of them, i.e., when the number of connecting edges is higher than the number of missing edges between the two clusters. Additionally, we apply the following steps for further computational savings on queries made to the CA model, and to improve performance at runtime (a code sketch of these steps follows the illustration below):

    1. We leverage query-efficient correlation clustering to form a set of centers from a set of randomly selected documents instead of selecting these centers from all the documents (in the illustration below, the center nodes are red).
    2. We apply the combo similarity oracle mechanism to perform the cluster assignment step in parallel for all non-center documents and leave documents with no similar center as singletons. In the illustration below, the assignments are depicted by blue arrows, and initially two (non-center) nodes are left as singletons because they received no assignment.
    3. In the post-processing step, to ensure scalability, we use the embedding similarity scores to filter down the potential merges (in the illustration below, the green dashed boundaries show the merged clusters).

    Illustration of the progress of the clustering algorithm on a given graph instance.
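    The sketch below strings the three steps above together under the same assumptions as the previous snippets: `oracle` is a combo similarity oracle with its models bound in, `emb_sim` is the cheap embedding similarity used to filter candidate merges, and `should_merge` abstracts the budgeted edge counting between two clusters. Documents are assumed to be hashable ids; this is an illustrative simplification, not the paper's exact procedure.

```python
import random

def kwikbucks_sketch(docs, oracle, emb_sim, should_merge, sample_size, emb_threshold=0.8):
    # Step 1: choose centers from a random sample of documents, keeping a
    # sampled document only if no already-chosen center is similar to it.
    centers = []
    for d in random.sample(docs, min(sample_size, len(docs))):
        if oracle(centers, d) is None:
            centers.append(d)

    # Step 2: assign each non-center document to the center the oracle returns;
    # documents with no similar center are left as singletons.
    clusters = {c: [c] for c in centers}
    singletons = []
    for d in docs:
        if d in clusters:
            continue
        c = oracle(centers, d)
        (clusters[c] if c is not None else singletons).append(d)

    # Step 3: post-process, merging a pair of clusters only when their centers
    # are close in embedding space *and* the (budgeted) edge check passes.
    merged = list(clusters.values())
    i = 0
    while i < len(merged):
        j = i + 1
        while j < len(merged):
            # merged[k][0] is the center of cluster k (it was inserted first).
            if emb_sim(merged[i][0], merged[j][0]) >= emb_threshold and should_merge(merged[i], merged[j]):
                merged[i] = merged[i] + merged.pop(j)
            else:
                j += 1
        i += 1
    return merged + [[d] for d in singletons]
```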

    Results

    We evaluate the novel clustering algorithm on various datasets with different properties, using different embedding-based and cross-attention-based models. We compare the clustering algorithm’s performance with the two best-performing baselines (see the paper for more details).

    To evaluate the quality of clustering, we use precision and recall. Precision is the fraction of similar pairs out of all co-clustered pairs, and recall is the fraction of co-clustered similar pairs out of all similar pairs. To measure the quality of the solutions obtained from our experiments, we use the F1-score, which is the harmonic mean of precision and recall, where 1.0 is the highest possible value, indicating perfect precision and recall, and 0 is the lowest possible value, reached when either precision or recall is zero. The table below reports the F1-score for KwikBucks and various baselines in the case that we allow only a linear number of queries to the CA model. We show that KwikBucks offers a substantial boost in performance, with a 45% relative improvement compared to the best baseline when averaging across all datasets.
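    As a concrete reference for the metric just described, here is a small sketch of pairwise precision, recall, and F1, where `clusters` is a list of predicted clusters and `similar_pairs` is the set of ground-truth similar pairs (both over hashable document ids); these names are illustrative, not from the paper.

```python
from itertools import combinations

def pairwise_f1(clusters, similar_pairs):
    # All unordered pairs that end up in the same predicted cluster.
    co_clustered = {frozenset(p) for c in clusters for p in combinations(c, 2)}
    truth = {frozenset(p) for p in similar_pairs}
    if not co_clustered or not truth:
        return 0.0
    precision = len(co_clustered & truth) / len(co_clustered)
    recall = len(co_clustered & truth) / len(truth)
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```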

    The figure below compares the clustering algorithm’s performance with the baselines using different query budgets. We observe that KwikBucks consistently outperforms the other baselines at various budgets.

    A comparison of KwikBucks with the top-2 baselines when allowed different budgets for querying the cross-attention model.

    Conclusion

    Text clustering often presents a dilemma in the choice of similarity function: embedding models are scalable but lack quality, while cross-attention models offer quality but substantially hurt scalability. We present a clustering algorithm that provides the best of both worlds: the scalability of embedding models and the quality of cross-attention models. KwikBucks can also be applied to other clustering problems with multiple similarity oracles of varying accuracy levels. This is validated with an exhaustive set of experiments on various datasets with diverse properties. See the paper for more details.

    Acknowledgements

    This project was initiated during Sandeep Silwal’s summer internship at Google in 2022. We would like to express our gratitude to our co-authors, Andrew McCallum, Andrew Nystrom, Deepak Ramachandran, and Sandeep Silwal, for their valuable contributions to this work. We also thank Ravi Kumar and John Guilyard for help with this blog post.
