Sashank Reddi
Authored Publications
Efficient Training of Language Models using Few-Shot Learning
Shankar Krishnan
Satyen Kale
ICML (2023)
Large deep learning models have achieved state-of-the-art performance across various natural language processing (NLP) tasks and demonstrated remarkable few-shot learning performance. However, training them is often challenging and resource-intensive. In this paper, we study an efficient approach to train language models using few-shot learners. We show that, by leveraging the fast learning nature of few-shot learners, one can train language models efficiently in a stagewise manner. Our main insight is that stacking a good few-shot learner on a good small language model provides a good initializer for a larger language model. Using this insight and building upon progressive stacking approaches, we develop novel approaches for training such networks in a stagewise manner. Furthermore, we also provide a theoretical framework and accompanying empirical studies to support our insights, thereby creating a theoretical foundation for progressive stacking. Finally, we provide empirical results to demonstrate the effectiveness of our approach in reducing the training time of few-shot learners.
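The stagewise recipe above can be made concrete with a toy sketch. Under the illustrative assumption that a model is just a list of layer parameters, progressive stacking warm-starts a deeper model by copying a trained shallow stack on top of itself; this is a sketch of the general stacking idea, not the paper's training code.

```python
# A minimal sketch of progressive (depth-wise) stacking, assuming each "layer"
# is just a dict of NumPy parameter arrays; the real method applies the same
# idea to Transformer blocks.
import numpy as np

def init_layer(hidden_dim, rng):
    """Randomly initialize one toy feed-forward layer."""
    return {
        "w": rng.normal(scale=0.02, size=(hidden_dim, hidden_dim)),
        "b": np.zeros(hidden_dim),
    }

def stack(layers):
    """Warm-start a 2x-deeper model by copying the trained layers on top of themselves."""
    return [{k: v.copy() for k, v in layer.items()} for layer in layers * 2]

rng = np.random.default_rng(0)
small_model = [init_layer(64, rng) for _ in range(4)]   # train this 4-layer stage first
large_model = stack(small_model)                        # 8-layer warm start
print(len(small_model), len(large_model))               # -> 4 8
```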
On Emergence of Activation Sparsity in Trained Transformers
Zonglin Li
Chong You
Daliang Li
Ke Ye
International Conference on Learning Representations (ICLR) (2023)
This paper reveals a curious observation that modern large-scale machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Through extensive experiments, we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, and at layers of all depths. Moreover, larger Transformers with more layers and higher MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. To probe why sparsity emerges, we design experiments with random labels, random images, and infinite data, and find that sparsity may be due primarily to optimization while having little to do with the properties of the training dataset. We discuss how sparsity immediately implies a means for significantly reducing the FLOP count and improving efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that explicitly enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.
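The two measurements discussed in the abstract are easy to state in code. The sketch below (illustrative, not the paper's implementation) computes the fraction of nonzero ReLU activations in a toy Transformer-style MLP and applies Top-k thresholding to keep only the k largest activations per token.

```python
# A minimal sketch (not the paper's code) of the two quantities discussed:
# the fraction of nonzero ReLU activations in a Transformer-style MLP, and a
# Top-k thresholding that keeps only the k largest activations per token.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 64, 256, 32
w_in = rng.normal(scale=0.1, size=(d_model, d_ff))
x = rng.normal(size=(n_tokens, d_model))

h = np.maximum(x @ w_in, 0.0)              # ReLU activation map of the MLP
print(f"nonzero fraction: {np.mean(h > 0):.3f}")

def top_k(h, k):
    """Zero out all but the k largest activations in each row."""
    thresh = np.sort(h, axis=-1)[:, -k][:, None]
    return np.where(h >= thresh, h, 0.0)

h_sparse = top_k(h, k=16)
print(f"nonzero fraction after Top-16: {np.mean(h_sparse > 0):.3f}")
```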
In defense of dual-encoders for neural ranking
Aditya Krishna Menon
Sadeep Jayasumana
International Conference on Machine Learning (ICML) (2022)
Transformer-based models such as BERT have proven successful in information retrieval problems, which seek to identify relevant documents for a given query. There are two broad flavours of such models: cross-attention (CA) models, which learn a joint embedding for the query and document, and dual-encoder (DE) models, which learn separate embeddings for the query and document. Empirically, CA models are often found to be more accurate, which has motivated a series of works seeking to bridge this gap. However, a more fundamental question remains less explored: does this performance gap reflect an inherent limitation in the capacity of DE models, or a limitation in the training of such models? And does such an understanding suggest a principled means of improving DE models? In this paper, we study these questions, with three contributions. First, we establish theoretically that with a sufficiently large embedding dimension, DE models have the capacity to model a broad class of score distributions. Second, we show empirically that on real-world problems, DE models may overfit to spurious correlations in the training set, and thus under-perform on test samples. Third, to mitigate this behaviour, we propose a novel distillation strategy that leverages confidence margins, and confirm its practical efficacy on the MSMARCO-Passage benchmark.
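For readers unfamiliar with the dual-encoder setup, the sketch below shows how a DE model scores documents with a dot product of separately computed embeddings, together with a margin-based distillation loss in the spirit of the proposed strategy; the loss form and names here are illustrative assumptions, not the paper's exact method.

```python
# A rough sketch of dual-encoder scoring plus a margin-style distillation loss
# against a (stand-in) cross-attention teacher. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_docs = 32, 5
q = rng.normal(size=emb_dim)                     # query embedding (query tower)
docs = rng.normal(size=(n_docs, emb_dim))        # document embeddings (doc tower)

de_scores = docs @ q                             # dual-encoder scores: dot products
ca_scores = rng.normal(size=n_docs)              # stand-in for cross-attention teacher scores

def margin_distill_loss(student, teacher, pos=0):
    """Penalize the student whenever its margin between the positive and a
    negative is smaller than the teacher's margin (hinge on margin differences)."""
    s_margin = student[pos] - np.delete(student, pos)
    t_margin = teacher[pos] - np.delete(teacher, pos)
    return np.mean(np.maximum(0.0, t_margin - s_margin))

print(margin_distill_loss(de_scores, ca_scores))
```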
Adaptive Federated Optimization
Jakub Konečný
(2021)
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Yogi and Adam, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can improve the performance of federated learning.
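As a rough illustration of the server-side update, the sketch below applies an Adam-style step to the averaged client deltas, in the spirit of FedAdam; the hyperparameters and the toy client deltas are illustrative assumptions rather than the paper's exact recipe.

```python
# A minimal sketch of an adaptive server update in the spirit of FedAdam.
import numpy as np

def fedadam_round(x, client_deltas, state, lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server round: average client model deltas, then apply an Adam-style
    update with the averaged delta acting as a pseudo-gradient (descent direction)."""
    delta = np.mean(client_deltas, axis=0)                 # aggregate client updates
    state["m"] = beta1 * state["m"] + (1 - beta1) * delta
    state["v"] = beta2 * state["v"] + (1 - beta2) * delta**2
    return x + lr * state["m"] / (np.sqrt(state["v"]) + tau), state

x = np.zeros(4)
state = {"m": np.zeros(4), "v": np.zeros(4)}
rng = np.random.default_rng(0)
for _ in range(3):                                         # three communication rounds
    deltas = rng.normal(size=(10, 4))                      # stand-in for 10 clients' deltas
    x, state = fedadam_round(x, deltas, state)
print(x)
```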
RankDistil: Distillation for Ranking
Aditya Krishna Menon
AISTATS (2021)
Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a framework for distillation for top-k ranking and establish connections with the existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the k largest scores of student and teacher while penalizing large scores for items ranked low by the teacher. Building on our framework, we develop a novel distillation approach, RankDistil, specifically catered towards ranking problems with a large number of items to rank. Finally, we conduct experiments which demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems.
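The core idea can be illustrated with a toy objective that matches the student to the teacher on the teacher's top-k items while pushing down student scores on items the teacher ranks low. This is only in the spirit of RankDistil; the paper's actual losses differ.

```python
# A toy top-k distillation objective: agree with the teacher on its top-k items
# and penalize large student scores on low-ranked items. Illustrative only.
import numpy as np

def topk_distill_loss(student, teacher, k):
    order = np.argsort(-teacher)                 # teacher ranking, best first
    top, rest = order[:k], order[k:]
    match = np.mean((student[top] - teacher[top]) ** 2)   # match the k largest scores
    push_down = np.mean(np.maximum(0.0, student[rest]))   # discourage high scores elsewhere
    return match + push_down

rng = np.random.default_rng(0)
teacher = rng.normal(size=100)                   # teacher scores over 100 items
student = rng.normal(size=100)
print(topk_distill_loss(student, teacher, k=10))
```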
Efficient Training of Retrieval Models using Negative Cache
Erik Lindgren
Advances in Neural Information Processing Systems (NeurIPS) (2021)
Factorized models, such as two tower neural network models, are widely used for scoring (query, document) pairs in information retrieval tasks. These models are typically trained by optimizing the model parameters to score relevant "positive" pairs higher than the irrelevant "negative" ones. While a large set of negatives typically improves the model performance, limited computation and memory budgets place constraints on the number of negatives used during training. In this paper, we develop a novel negative sampling technique for accelerating training with softmax cross-entropy loss. By using cached (possibly stale) item embeddings, our technique enables training with a large pool of negatives with reduced memory and computation. We also develop a streaming variant of our algorithm geared towards very large datasets. Furthermore, we establish a theoretical basis for our approach by showing that updating a very small fraction of the cache at each iteration can still ensure fast convergence. Finally, we experimentally validate our approach and show that it is efficient and compares favorably with more complex, state-of-the-art approaches.
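A simplified sketch of the mechanism: score the positive pair with fresh embeddings, draw negatives from a cache of possibly stale item embeddings, and refresh only a small slice of the cache each step. The cache sizes and refresh rule below are illustrative assumptions, not the paper's algorithm.

```python
# Toy training loop with a negative cache: softmax cross-entropy where the
# negatives come from cached (possibly stale) item embeddings.
import numpy as np

rng = np.random.default_rng(0)
emb_dim, cache_size, refresh = 32, 1000, 10
cache = rng.normal(size=(cache_size, emb_dim))       # stale document embeddings

def cached_softmax_loss(q_emb, pos_emb, cache):
    """Softmax cross-entropy with the positive scored against cached negatives."""
    logits = np.concatenate([[q_emb @ pos_emb], cache @ q_emb])
    logits -= logits.max()                           # numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())

for step in range(5):
    q_emb, pos_emb = rng.normal(size=emb_dim), rng.normal(size=emb_dim)
    loss = cached_softmax_loss(q_emb, pos_emb, cache)
    # refresh only a small fraction of the cache with "fresh" embeddings
    idx = rng.choice(cache_size, size=refresh, replace=False)
    cache[idx] = rng.normal(size=(refresh, emb_dim))
print(f"last loss: {loss:.3f}")
```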
A statistical perspective on distillation
Aditya Krishna Menon
International Conference on Machine Learning (ICML) (2021)
Knowledge distillation is a technique for improving a "student" model by replacing its one-hot training labels with a label distribution obtained from a "teacher" model. Despite its broad success, several basic questions (e.g., why does distillation help? why do more accurate teachers not necessarily distill better?) have received limited formal study. In this paper, we present a statistical perspective on distillation which provides an answer to these questions. Our core observation is that a "Bayes teacher" providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the value of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a "good" teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.
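The core observation admits a short numerical illustration: for a fixed student prediction, cross-entropy against labels sampled from the true class-probabilities has the same mean as cross-entropy against the probabilities themselves, but higher variance. The toy 3-class setup below is illustrative, not from the paper.

```python
# Toy illustration: one-hot labels vs. a "Bayes teacher" distribution as targets.
import numpy as np

rng = np.random.default_rng(0)
p_bayes = np.array([0.7, 0.2, 0.1])                # true class-probabilities ("Bayes teacher")
log_student = np.log(np.array([0.6, 0.3, 0.1]))    # fixed student predictions

# One-hot objective: cross-entropy against labels sampled from p_bayes.
labels = rng.choice(3, size=100_000, p=p_bayes)
onehot_losses = -log_student[labels]

# Distilled objective: cross-entropy against the teacher distribution (constant here).
distilled_loss = -(p_bayes * log_student).sum()

print(f"one-hot  : mean {onehot_losses.mean():.4f}, var {onehot_losses.var():.4f}")
print(f"distilled: value {distilled_loss:.4f}, var 0.0")
```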
A Field Guide to Federated Optimization
Jianyu Wang
Gauri Joshi
Maruan Al-Shedivat
Galen Andrew
A. Salman Avestimehr
Katharine Daly
Deepesh Data
Suhas Diggavi
Hubert Eichner
Advait Gadhikar
Antonious M. Girgis
Filip Hanzely
Chaoyang He
Samuel Horvath
Martin Jaggi
Tara Javidi
Satyen Chandrakant Kale
Sai Praneeth Karimireddy
Jakub Konečný
Sanmi Koyejo
Tian Li
Peter Richtarik
Karan Singhal
Virginia Smith
Mahdi Soltanolkotabi
Weikang Song
Sebastian Stich
Ameet Talwalkar
Hongyi Wang
Blake Woodworth
Honglin Yuan
Mi Zhang
Tong Zhang
Chunxiang (Jake) Zheng
Chen Zhu
arXiv (2021)
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.
Disentangling sampling and labeling bias for learning in large-output spaces
Aditya Krishna Menon
Sadeep Jayasumana
International Conference on Machine Learning (ICML) (2021)
Negative sampling is a widely adopted technique to enable efficient training in settings with a large number of classes. Typically, negative sampling approaches aim at approximating the value or gradient of the computationally expensive loss function that takes all the negative labels into account. In this work, we study the connection between negative sampling approaches and loss modification techniques for countering label imbalance. We show that different (bias) correction strategies that accompany negative sampling approaches can have unintended consequences on the model's performance on various data sub-populations. We then propose a unified approach to tackle both sampling bias, arising from working with a subset of all negative classes, and labeling bias, which is inherently present in the data due to label-imbalance. Finally, we verify our analysis and demonstrate the utility of our unified approach through empirical evaluation on standard image classification and retrieval benchmarks.
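As background for the sampling-bias issue, the sketch below shows the standard "logQ" correction used with sampled softmax, in which sampled negative logits are shifted by the log of their sampling probability; the paper's unified approach additionally handles labeling bias, which this sketch does not attempt to model.

```python
# Sampled softmax with the usual log-probability correction for sampling bias.
import numpy as np

def sampled_softmax_loss(logits, pos, neg_idx, q):
    """Cross-entropy over the positive plus sampled negatives, with the sampled
    logits corrected by log q to remove sampling bias."""
    corrected = logits[neg_idx] - np.log(q[neg_idx])
    z = np.concatenate([[logits[pos]], corrected])
    z -= z.max()                                         # numerical stability
    return -z[0] + np.log(np.exp(z).sum())

rng = np.random.default_rng(0)
n_classes = 10_000
logits = rng.normal(size=n_classes)
q = np.full(n_classes, 1.0 / n_classes)                  # uniform sampling distribution here
neg_idx = rng.choice(n_classes, size=64, replace=False, p=q)
neg_idx = neg_idx[neg_idx != 3]                          # drop the positive if it was sampled
print(sampled_softmax_loss(logits, pos=3, neg_idx=neg_idx, q=q))
```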
Why are Adaptive Methods Good for Attention Models?
Jingzhao Zhang
Sai Praneeth Karimireddy
Suvrit Sra
Advances in Neural Information Processing Systems (NeurIPS) (2020)
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an adaptive coordinate-wise clipping algorithm (ACClip) and demonstrate its superior performance on BERT pretraining and finetuning tasks.
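The clipping mechanism can be illustrated with a coordinate-wise rule in the spirit of ACClip: each coordinate of the gradient is clipped at a running estimate of its typical magnitude before an SGD step. The update below is an illustrative sketch under heavy-tailed noise, not the exact ACClip algorithm.

```python
# Coordinate-wise clipping against a running per-coordinate threshold, tested
# on a quadratic with heavy-tailed (Student-t) gradient noise. Illustrative only.
import numpy as np

def clipped_step(x, grad, state, lr=0.01, beta=0.99):
    """One SGD step with each gradient coordinate clipped to a running scale."""
    state["tau"] = beta * state["tau"] + (1 - beta) * np.abs(grad)   # per-coordinate scale
    clipped = np.clip(grad, -state["tau"], state["tau"])             # clip each coordinate
    return x - lr * clipped, state

rng = np.random.default_rng(0)
x, state = np.zeros(4), {"tau": np.ones(4)}
for _ in range(100):
    grad = x - 1.0 + rng.standard_t(df=2, size=4)     # gradient of 0.5*||x-1||^2 plus heavy tails
    x, state = clipped_step(x, grad, state)
print(x)                                              # drifts toward the optimum at 1.0
```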