Re-weighted gradient descent via distributionally robust optimization

Ramnath Kumar, Pre-Doctoral Researcher, and Arun Sai Suggala, Research Scientist, Google Research

Deep neural networks (DNNs) have become essential for solving a wide range of tasks, from standard supervised learning (image classification using ViT) to meta-learning. The most commonly used paradigm for learning DNNs is empirical risk minimization (ERM), which aims to identify a network that minimizes the average loss on training data points. Several algorithms, including stochastic gradient descent (SGD), Adam, and Adagrad, have been proposed for solving ERM. However, a drawback of ERM is that it weights all the samples equally, often ignoring the rare and more difficult samples and focusing on the easier and abundant ones. This leads to suboptimal performance on unseen data, especially when the training data is scarce.

To overcome this challenge, recent works have developed data re-weighting techniques for improving ERM performance. However, these approaches focus on specific learning tasks (such as classification) and/or require learning an additional meta model that predicts the weights of each data point. The presence of an additional model significantly increases the complexity of training and makes these approaches unwieldy in practice.

In “Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization” we introduce a variant of the classical SGD algorithm that re-weights data points during each optimization step based on their difficulty. Stochastic Re-weighted Gradient Descent (RGD) is a lightweight algorithm that comes with a simple closed-form expression, and can be applied to solve any learning task using just two lines of code. At any stage of the learning process, RGD simply reweights a data point as the exponential of its loss. We empirically demonstrate that the RGD reweighting algorithm improves the performance of numerous learning algorithms across various tasks, ranging from supervised learning to meta-learning. Notably, we show improvements over state-of-the-art methods on DomainBed and tabular classification. Moreover, the RGD algorithm also boosts performance for BERT on the GLUE benchmarks and for ViT on ImageNet-1K.

Distributionally robust optimization

Distributionally robust optimization (DRO) is an approach that assumes a “worst-case” data distribution shift may occur, which can harm a model’s performance. If a model has focused on identifying a few spurious features for prediction, these “worst-case” data distribution shifts could lead to the misclassification of samples and, thus, a performance drop. DRO optimizes the loss for samples in that “worst-case” distribution, making the model robust to perturbations in the data distribution (e.g., removing a small fraction of points from a dataset, or minor up/down weighting of data points). In the context of classification, this forces the model to place less emphasis on noisy features and more emphasis on useful and predictive features. Consequently, models optimized using DRO tend to have better generalization guarantees and stronger performance on unseen samples.

Inspired by these results, we develop the RGD algorithm as a technique for solving the DRO objective. Specifically, we focus on Kullback–Leibler divergence-based DRO, where one adds perturbations to create distributions that are close to the original data distribution in the KL divergence metric, enabling a model to perform well over all possible perturbations.
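
To make the link between KL-based DRO and exponential weights concrete, here is a short sketch of the standard derivation. The notation (loss ℓ, parameters θ, KL radius ρ, dual variable γ) is introduced here for illustration and is an assumption; the paper's exact formulation may differ.

```latex
% KL-constrained DRO: minimize the worst-case expected loss over
% distributions P' within KL radius \rho of the data distribution P.
\min_{\theta} \; \sup_{P' :\, D_{\mathrm{KL}}(P' \,\|\, P) \le \rho} \;
  \mathbb{E}_{z \sim P'}\big[\ell(z; \theta)\big]

% Standard dual form of the inner supremum (for \rho > 0):
\sup_{P' :\, D_{\mathrm{KL}}(P' \,\|\, P) \le \rho}
  \mathbb{E}_{P'}\big[\ell(z; \theta)\big]
  \;=\; \inf_{\gamma > 0} \Big\{ \gamma \rho
  + \gamma \log \mathbb{E}_{z \sim P}\big[ e^{\ell(z; \theta)/\gamma} \big] \Big\}

% For a fixed \gamma, the gradient in \theta is an exponentially
% re-weighted average of the per-sample gradients:
\nabla_{\theta} \, \gamma \log \mathbb{E}_{P}\big[ e^{\ell(z; \theta)/\gamma} \big]
  \;=\; \frac{\mathbb{E}_{P}\big[ e^{\ell(z; \theta)/\gamma} \,
  \nabla_{\theta} \ell(z; \theta) \big]}
  {\mathbb{E}_{P}\big[ e^{\ell(z; \theta)/\gamma} \big]}
```

Read this way, samples with larger loss receive exponentially larger weight in the gradient, which is consistent with the re-weighting rule described in this post.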

Figure illustrating DRO. In contrast to ERM, which learns a model that minimizes expected loss over the original data distribution, DRO learns a model that performs well on several perturbed versions of the original data distribution.

Stochastic re-weighted gradient descent

Consider a random subset of samples (called a mini-batch), where each data point has an associated loss L_i. Traditional algorithms like SGD give equal importance to all the samples in the mini-batch, and update the parameters of the model by descending along the averaged gradients of the loss of those samples. With RGD, we reweight each sample in the mini-batch and give more importance to points that the model identifies as more difficult. To be precise, we use the loss as a proxy for the difficulty of a point, and reweight it by the exponential of its loss. Finally, we update the model parameters by descending along the weighted average of the gradients of the samples.

Due to stability considerations, in our experiments we clip and scale the loss before computing its exponential. Specifically, we clip the loss at some threshold T, and multiply it by a scalar that is inversely proportional to the threshold. An important aspect of RGD is its simplicity, as it doesn’t rely on a meta model to compute the weights of data points. Furthermore, it can be implemented with two lines of code, and combined with any popular optimizer (such as SGD, Adam, and Adagrad).
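
The re-weighting step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `rgd_weights` and `rgd_direction`, the scaling constant of exactly 1/T, and the normalization by the sum of the weights are all assumptions made for this example.

```python
import numpy as np

def rgd_weights(losses, clip_threshold=1.0):
    """Per-sample weights: exponential of the clipped, scaled loss."""
    losses = np.asarray(losses, dtype=np.float64)
    # Clip the loss at the threshold T for numerical stability.
    clipped = np.minimum(losses, clip_threshold)
    # Scale by a factor inversely proportional to T (1/T is an assumption).
    return np.exp(clipped / clip_threshold)

def rgd_direction(per_sample_grads, losses, clip_threshold=1.0):
    """Weighted average of per-sample gradients (normalization assumed)."""
    grads = np.asarray(per_sample_grads, dtype=np.float64)
    w = rgd_weights(losses, clip_threshold)
    return (w[:, None] * grads).sum(axis=0) / w.sum()

# Toy mini-batch: three samples, two parameters.
losses = np.array([0.1, 0.5, 3.0])
grads = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = rgd_weights(losses, clip_threshold=1.0)
direction = rgd_direction(grads, losses, clip_threshold=1.0)
```

In an automatic-differentiation framework the weights would typically be treated as constants, so that gradients do not flow through the exponential. When all losses in the batch are equal, the weighted average reduces to the plain SGD average.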

Figure illustrating the intuitive idea behind RGD in a binary classification setting. Feature 1 and Feature 2 are the features available to the model for predicting the label of a data point. RGD upweights the data points with high losses that have been misclassified by the model.

Results

We present empirical results comparing RGD with state-of-the-art techniques on standard supervised learning and domain adaptation (refer to the paper for results on meta-learning). In all our experiments, we tune the clipping level and the learning rate of the optimizer using a held-out validation set.

Supervised learning

We evaluate RGD on several supervised learning tasks, including language, vision, and tabular classification. For the task of language classification, we apply RGD to the BERT model trained on the General Language Understanding Evaluation (GLUE) benchmark and show that RGD outperforms the BERT baseline by +1.94% with a standard deviation of 0.42%. To evaluate RGD’s performance on vision classification, we apply RGD to the ViT-S model trained on the ImageNet-1K dataset, and show that RGD outperforms the ViT-S baseline by +1.01% with a standard deviation of 0.23%. Moreover, we perform hypothesis tests to confirm that these results are statistically significant with a p-value that is less than 0.05.

RGD’s performance on language and vision classification using the GLUE and ImageNet-1K benchmarks. Note that MNLI, QQP, QNLI, SST-2, MRPC, RTE and COLA are the diverse datasets that comprise the GLUE benchmark.

For tabular classification, we use MET as our baseline, and consider various binary and multi-class datasets from UC Irvine’s machine learning repository. We show that applying RGD to the MET framework improves its performance by 1.51% and 1.27% on binary and multi-class tabular classification, respectively, achieving state-of-the-art performance in this domain.

Performance of RGD for classification of various tabular datasets.

Domain generalization

To evaluate RGD’s generalization capabilities, we use the standard DomainBed benchmark, which is commonly used to study a model’s out-of-domain performance. We apply RGD to FRR, a recent approach that improved out-of-domain benchmarks, and show that RGD with FRR performs an average of 0.7% better than the FRR baseline. Furthermore, we confirm with hypothesis tests that most benchmark results (except for Office Home) are statistically significant with a p-value less than 0.05.

Performance of RGD on the DomainBed benchmark for distributional shifts.

Class imbalance and fairness

To demonstrate that models learned using RGD perform well despite class imbalance, where certain classes in the dataset are underrepresented, we compare RGD’s performance with ERM on long-tailed CIFAR-10. We report that RGD improves the accuracy of baseline ERM by an average of 2.55% with a standard deviation of 0.23%. Furthermore, we perform hypothesis tests and confirm that these results are statistically significant with a p-value of less than 0.05.

Performance of RGD on the long-tailed CIFAR-10 benchmark for the class imbalance domain.

Limitations

The RGD algorithm was developed using popular research datasets, which were already curated to remove corruptions (e.g., noise and incorrect labels). Therefore, RGD may not provide performance improvements in scenarios where training data has a high volume of corruptions. A potential approach to handle such scenarios is to combine the RGD algorithm with an outlier removal technique. This technique should be capable of filtering out outliers from the mini-batch and sending the remaining points to our algorithm.
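
As a rough illustration of such a filtering step, the sketch below drops a fixed fraction of the highest-loss points in a mini-batch before computing exponential weights for the rest. Trimming by loss is a technique chosen here for illustration, not one proposed in the paper, and the function name `filter_then_reweight`, the `drop_fraction` parameter, and the 1/T scaling are all hypothetical.

```python
import numpy as np

def filter_then_reweight(losses, drop_fraction=0.1, clip_threshold=1.0):
    """Drop the highest-loss points, then weight the rest exponentially."""
    losses = np.asarray(losses, dtype=np.float64)
    n_drop = int(np.floor(drop_fraction * losses.size))
    keep = np.ones(losses.size, dtype=bool)
    if n_drop > 0:
        # Treat the n_drop largest losses as suspected outliers (assumption).
        keep[np.argsort(losses)[-n_drop:]] = False
    weights = np.zeros_like(losses)
    # Exponential of the clipped, scaled loss for the retained points only.
    weights[keep] = np.exp(
        np.minimum(losses[keep], clip_threshold) / clip_threshold
    )
    return weights, keep

# Toy batch in which the fourth point has a suspiciously large loss.
batch_losses = np.array([0.2, 0.4, 0.1, 9.0, 0.3, 0.6, 0.5, 0.2, 0.7, 0.3])
weights, keep = filter_then_reweight(batch_losses, drop_fraction=0.1)
```

One caveat with this particular choice: high loss is also the signal RGD uses to find genuinely hard samples, so a loss-based filter may discard exactly the points the re-weighting is meant to emphasize. A filter based on an independent signal may be preferable in practice.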

Conclusion

RGD has been shown to be effective on a variety of tasks, including out-of-domain generalization, tabular representation learning, and class imbalance. It is simple to implement and can be seamlessly integrated into existing algorithms with just two lines of code change. Overall, RGD is a promising technique for boosting the performance of DNNs, and could help push the boundaries in various domains.

Acknowledgements

The paper described in this blog post was written by Ramnath Kumar, Arun Sai Suggala, Dheeraj Nagaraj and Kushal Majmundar. We extend our sincere gratitude to the anonymous reviewers, Prateek Jain, Pradeep Shenoy, Anshul Nasery, Lovish Madaan, and the numerous dedicated members of the machine learning and optimization team at Google Research India for their invaluable feedback and contributions to this work.
