Clustering is a fundamental, ubiquitous problem in data mining and unsupervised machine learning, where the goal is to group together similar items. The standard forms of clustering are metric clustering and graph clustering. In metric clustering, a given metric space defines distances between data points, which are grouped together based on their separation. In graph clustering, a given graph connects similar data points through edges, and the clustering process groups data points together based on the connections between them. Both forms of clustering are particularly useful for large corpora where class labels cannot be defined. Examples of such corpora are the ever-growing digital text collections of various internet platforms, with applications including organizing and searching documents, identifying patterns in text, and recommending relevant documents to users (see more examples in the following posts: clustering related queries based on user intent and practical differentially private clustering).
The choice of text clustering method often presents a dilemma. One approach is to use embedding models, such as BERT or RoBERTa, to define a metric clustering problem. Another is to employ cross-attention (CA) models, such as PaLM or GPT, to define a graph clustering problem. CA models can provide highly accurate similarity scores, but constructing the input graph may require a prohibitive quadratic number of inference calls to the model. On the other hand, a metric space can be defined efficiently by distances between embeddings produced by embedding models. However, these similarity distances are typically of considerably lower quality compared to the similarity signals of CA models, and hence the produced clustering can be of much lower quality.
An overview of the embedding-based and cross-attention-based similarity scoring functions and their scalability vs. quality dilemma.
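To make the two signals concrete, below is a minimal, hypothetical Python sketch of what the two scoring functions might look like. The `embed` and `cross_attention_edge` functions are toy stand-ins, not the models used in the paper.

```python
import numpy as np


def embed(doc: str) -> np.ndarray:
    """Toy stand-in for an embedding model (cheap-weak signal): maps a
    document to a fixed-size vector. A real system would call BERT, etc."""
    rng = np.random.default_rng(abs(hash(doc)) % (2**32))
    return rng.standard_normal(64)


def embedding_similarity(a: str, b: str) -> float:
    """Cheap similarity: cosine similarity between embeddings. One
    embedding per document suffices, so all pairs are effectively free."""
    va, vb = embed(a), embed(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))


def cross_attention_edge(a: str, b: str) -> bool:
    """Toy stand-in for a CA model (expensive-strong signal): each call
    corresponds to one model inference, so querying all pairs would
    require a quadratic number of calls."""
    return embedding_similarity(a, b) > 0.5  # placeholder decision rule
```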
Motivated by this, in “KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals”, presented at ICLR 2023, we describe a novel clustering algorithm that effectively combines the scalability benefits of embedding models and the quality of CA models. This graph clustering algorithm has query access to both the CA model and the embedding model; however, we impose a budget on the number of queries made to the CA model. The algorithm uses the CA model to answer edge queries, and benefits from unlimited access to similarity scores from the embedding model. We describe how this proposed setting bridges algorithm design and practical considerations, and can be applied to other clustering problems with similarly available scoring functions, such as clustering problems on images and media. We demonstrate how this algorithm yields high-quality clusters with almost a linear number of queries to the CA model. We have also open-sourced the data used in our experiments.
The clustering algorithm
The KwikBucks algorithm is an extension of the well-known KwikCluster algorithm (the Pivot algorithm). The high-level idea is to first select a set of documents (i.e., centers) with no similarity edge between them, and then form clusters around these centers. To obtain the quality of CA models and the runtime efficiency of embedding models, we introduce the novel combo similarity oracle mechanism. In this approach, we utilize the embedding model to guide the queries sent to the CA model. Given a set of center documents and a target document, the combo similarity oracle mechanism outputs a center from the set that is similar to the target document, if one is present. The combo similarity oracle enables us to save budget by limiting the number of queries to the CA model when selecting centers and forming clusters. It does this by first ranking centers based on their embedding similarity to the target document, and then querying the CA model for each pair (i.e., target document and ranked center), as shown below.
A combo similarity oracle that, for a set of documents and a target document, returns a similar document from the set, if one is present.
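Under the assumptions of the toy scorers above, a minimal sketch of the combo similarity oracle might look as follows; the `max_probes` cap and the budget bookkeeping are illustrative choices, not the paper's exact parameters.

```python
def combo_similarity_oracle(centers, target, budget, max_probes=3):
    """Sketch: rank centers by the cheap embedding similarity to `target`,
    then spend at most `max_probes` expensive CA queries (while budget
    remains) looking for a center the strong signal confirms as similar."""
    ranked = sorted(centers,
                    key=lambda c: embedding_similarity(c, target),
                    reverse=True)
    for center in ranked[:max_probes]:
        if budget["remaining"] <= 0:
            break
        budget["remaining"] -= 1            # each CA call costs one query
        if cross_attention_edge(center, target):
            return center                   # similar center found
    return None                             # no similar center in the set
```

Ranking with the cheap signal means the expensive model is typically consulted on only the few most promising candidates, which is where the query savings come from.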
We then perform a post-processing step to merge clusters when there is a strong connection between two of them, i.e., when the number of connecting edges is higher than the number of missing edges between the two clusters. Additionally, we apply the following steps for further computational savings on queries made to the CA model, and to improve performance at runtime (a code sketch of the overall flow follows the illustration below):
- We leverage query-efficient correlation clustering to select the set of centers from a set of randomly chosen documents instead of selecting these centers from all the documents (in the illustration below, the center nodes are red).
- We apply the combo similarity oracle mechanism to perform the cluster assignment step in parallel for all non-center documents, and leave documents with no similar center as singletons. In the illustration below, the assignments are depicted by blue arrows, and initially two (non-center) nodes are left as singletons because they receive no assignment.
- In the post-processing step, to ensure scalability, we use the embedding similarity scores to filter down the potential merges (in the illustration below, the green dashed boundaries show these merged clusters).
Illustration of the clustering algorithm's progress on a given graph instance.
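Putting the pieces together, here is a loose end-to-end sketch of the flow described above, reusing the toy scorers and `combo_similarity_oracle` from the earlier snippets. The sample size, thresholds, and the simplified (sequential rather than parallel) assignment and merge loops are assumptions for illustration, and budget accounting for the merge step is elided.

```python
import random


def kwikbucks_sketch(docs, query_budget, sample_size=32):
    budget = {"remaining": query_budget}

    # 1) Center selection from a random sample rather than all documents.
    sample = random.sample(docs, min(sample_size, len(docs)))
    centers = []
    for doc in sample:
        # Keep `doc` as a new center only if no existing center is similar.
        if combo_similarity_oracle(centers, doc, budget) is None:
            centers.append(doc)

    # 2) Cluster assignment via the combo oracle; documents with no
    #    similar center remain singletons.
    clusters = {c: [c] for c in centers}
    singletons = []
    for doc in docs:
        if doc in clusters:
            continue
        center = combo_similarity_oracle(centers, doc, budget)
        if center is None:
            singletons.append(doc)
        else:
            clusters[center].append(doc)

    # 3) Post-processing merge: the cheap signal filters candidate cluster
    #    pairs; merge when connecting edges outnumber missing edges.
    parent = {c: c for c in centers}

    def root(c):
        while parent[c] != c:
            c = parent[c]
        return c

    for i, a in enumerate(centers):
        for b in centers[i + 1:]:
            ra, rb = root(a), root(b)
            if ra == rb or embedding_similarity(ra, rb) < 0.3:
                continue  # cheap filter skips unpromising pairs
            pairs = [(x, y) for x in clusters[ra] for y in clusters[rb]]
            connecting = sum(cross_attention_edge(x, y) for x, y in pairs)
            if connecting > len(pairs) - connecting:
                clusters[ra].extend(clusters.pop(rb))
                parent[rb] = ra

    return list(clusters.values()) + [[d] for d in singletons]
```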
Results
We evaluate the novel clustering algorithm on a variety of datasets with different properties, using different embedding-based and cross-attention-based models. We compare the clustering algorithm's performance with the two best-performing baselines (see the paper for more details).
To evaluate the quality of clustering, we use precision and recall. Precision measures the percentage of similar pairs out of all co-clustered pairs, and recall measures the percentage of co-clustered similar pairs out of all similar pairs. To score the solutions obtained in our experiments, we use the F1-score, the harmonic mean of precision and recall, where 1.0 is the highest possible value, indicating perfect precision and recall, and 0 is the lowest possible value, attained when either precision or recall is zero. The table below reports the F1-score for KwikBucks and various baselines in the case where we allow only a linear number of queries to the CA model. KwikBucks offers a substantial boost in performance, with a 45% relative improvement over the best baseline when averaging across all datasets.
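As a concrete illustration of this pair-based evaluation, here is a small self-contained Python example (a toy sketch, not the paper's evaluation code):

```python
from itertools import combinations


def pairwise_f1(clusters, similar_pairs):
    """Pair-based F1: precision is the fraction of co-clustered pairs that
    are truly similar; recall is the fraction of truly similar pairs that
    end up co-clustered."""
    co_clustered = {frozenset(p) for c in clusters for p in combinations(c, 2)}
    similar = {frozenset(p) for p in similar_pairs}
    if not co_clustered or not similar:
        return 0.0
    tp = len(co_clustered & similar)
    precision = tp / len(co_clustered)
    recall = tp / len(similar)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: two clusters and three ground-truth similar pairs.
# Co-clustered pairs: {a,b}, {a,c}, {b,c}; two of them are truly similar,
# so precision = recall = 2/3 and F1 ≈ 0.667.
print(pairwise_f1([["a", "b", "c"], ["d"]],
                  [("a", "b"), ("a", "c"), ("c", "d")]))
```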
The figure below compares the clustering algorithm's performance with baselines at different query budgets. We observe that KwikBucks consistently outperforms the other baselines across budgets.
A comparison of KwikBucks with the top-2 baselines when allowed different budgets for querying the cross-attention model.
Conclusion
Text clustering often presents a dilemma in the choice of similarity function: embedding models are scalable but lack quality, while cross-attention models offer quality but severely hurt scalability. We present a clustering algorithm that offers the best of both worlds: the scalability of embedding models and the quality of cross-attention models. KwikBucks can also be applied to other clustering problems with multiple similarity oracles of varying accuracy levels. This is validated with an exhaustive set of experiments on various datasets with diverse properties. See the paper for more details.
Acknowledgements
This project was initiated during Sandeep Silwal's summer internship at Google in 2022. We would like to express our gratitude to our co-authors, Andrew McCallum, Andrew Nystrom, Deepak Ramachandran, and Sandeep Silwal, for their valuable contributions to this work. We also thank Ravi Kumar and John Guilyard for assistance with this blog post.