Stability of Accuracy for Deep Neural Network Classifiers
We examine the stability of accuracy for loss-minimizing training processes that are used for deep neural networks (DNN) and other classifiers. While a classifier is optimized during training by minimizing the loss function, its performance is usually evaluated by the overall accuracy which quantifies the proportion of objects that are well classified. This leads to the question of stability of accuracy: does decreasing loss through training always result in increased accuracy? We formalize the notion of stability and provide examples of instability. We obtain three novel sufficient conditions for stability of training and derive tight bounds on accuracy as loss decreases in the training. The first two conditions apply to the classifier itself by identifying small clusters of misclassified objects as a cause of instability. The third geometric condition identifies flat portions of the training data manifold as sources of instability. The derivation of this condition relies on the propagation of the previous conditions backward through the DNN layers to the data manifold. The multiscale nature of the problem due to several sizes of the small clusters requires that the estimates in the proof have to be compatible with the presence of several scales. Our results do not depend on the algorithm used for training, as long as loss decreases with training.
This is joint work with my advisor L. Berlyand and P.-E. Jabin.
A video of this talk can be found here.
Provable Algorithms for Sampling Non-log-concave Distributions
A fundamental problem in Bayesian machine learning is sampling from a probability distribution given access to its log-pdf. Just as the theory of convex optimization is well-developed, so is the theory of sampling from log-concave distributions. Recent years have seen significant progress in understanding optimization beyond convexity. However, despite the ubiquity of non-log-concave distributions in practice, the theory of sampling from non-log-concave distributions is still in its infancy.
I will survey the challenges and progress in this field. A key problem is that the workhorse algorithm for sampling, Langevin Monte Carlo, can take exponential time to mix for multi-modal distributions. Addressing this problem requires bringing in more algorithmic tools and new methods of analysis.
As a case study, we consider the problem of sampling from a simple mixture of log-concave distributions. By combining Langevin diffusion with simulated tempering, we obtain a Markov process that mixes in polynomial time by transitioning between different temperatures. For the analysis, we introduce novel techniques for proving spectral gaps based on Markov process decomposition.
Covers joint work with Rong Ge and Andrej Risteski.
A recording of this talk can be found here.
Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent
Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion.
Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.
A video of the talk can be found on our youtube channel.
Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Networks
In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that its limit is a global minimizer if initialized properly. We also study the "implicit bias" of this algorithm when the objective is the unregularized logistic loss. We finally discuss what these results tell us about the generalization performance. This is based on joint work with Francis Bach.
A recording of the seminar can be found here.
Dimensionality reduction and matching datasets
Processing large datasets is a pervasive problem occurring across many different knowledge domains. In this talk we focus on two problems motivated from tasks concerning genetic data: dimensionality reduction and matching. First, given labeled points in a high-dimensional vector space, we seek a projection onto a low dimensional subspace that maintains the classification structure of the data. Taking inspiration from large margin nearest neighbor classification, we introduce SqueezeFit, a semidefinite relaxation of this problem. This relaxation is amenable to theoretical analysis, allowing us to provably recover a planted projection operator from the data. We apply a linear programming version of SqueezeFit to the genetic marker selection problem.
Second, we introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire data set. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.
A recording of the presentation can be found here.
A Few Thoughts on Deep Network Approximation
Deep network approximation is a powerful tool of function approximation via composition. We will present a few new thoughts on deep network approximation from the point of view of scientific computing in practice: given an arbitrary width and depth of neural networks, what is the optimal approximation rate of various function classes? Does the curse of dimensionality exist for generic functions? Can we obtain exponential convergence for generic functions?
Tradeoffs between Robustness and Accuracy
Standard machine learning produces models that are highly accurate on average but that degrade dramatically when the test distribution deviates from the training distribution. While one can train robust models, this often comes at the expense of standard accuracy (on the training distribution). We study this tradeoff in two settings, adversarial examples and minority groups, creating simple examples which highlight generalization issues as a major source of this tradeoff. For adversarial examples, we show that even augmenting with correctly annotated data to promote robustness can produce less accurate models, but we develop a simple method, robust self training, that mitigates this tradeoff using unlabeled data. For minority groups, we show that overparametrization of models can hurt accuracy on the minority groups, though it improves standard accuracy. These results suggest that the "more data" and "bigger models" strategy that works well for the standard setting where train and test distributions are close, need not work on out-of-domain settings.
This is based on joint work with Sang Michael Xie, Shiori Sagawa, Pang Wei Koh, Fanny Yang, John Duchi and Percy Liang.
A recording of the presentation is available here.
Thematic Day on the Mean Field Training of Deep Neural Networks
12pm: Roberto I. Oliveira - A mean-field theory for certain deep neural networks
1pm: Konstantinos Spiliopoulos - Mean field limits of neural networks: typical behavior and fluctuations
2pm: Huy Tuan Pham - A general framework for the mean field limit of multilayer neural networks
3pm: Stephan Wojtowytsch - On the Banach spaces for multi-layer networks and connections to mean field training
See here for abstracts and recordings of the presentations.
On the foundations of computational mathematics, Smale’s 18th problem and the potential limits of AI
There is a profound optimism on the impact of deep learning (DL) and AI in the sciences with Geoffrey Hinton concluding that 'They should stop educating radiologists now'. However, DL has an Achilles heel: it is universaly unstable so that small changes in the initial data can lead to large errors in the final result. This has been documented in a wide variety of applications. Paradoxically, the existence of stable neural networks for these applications is guaranteed by the celebrated Universal Approximation Theorem, however, the stable neural networks are not computed by the current training approaches. We will address this problem and the potential limitations of AI from a foundations point of view. Indeed, the current situation in AI is comparable to the situation in mathematics in the early 20th century, when David Hilbert’s optimism (typically reflected in his 10th problem) suggested no limitations to what mathematics could prove and no restrictions on what computers could compute. Hilbert’s optimism was turned upside down by Goedel and Turing, who established limitations on what mathematics can prove and which problems computers can solve (however, without limiting the impact of mathematics and computer science).
We predict a similar outcome for modern AI and DL, where the limitations of AI (the main topic of Smale’s 18th problem) will be established through the foundations of computational mathematics. We sketch the beginning of such a program by demonstrating how there exist neural networks approximating classical mappings in scientific computing, however, no algorithm (even randomised) can compute such a network to even 1-digit accuracy with probability better than 1/2. We will also show how instability is inherit in the methodology of DL demonstrating that there is no easy remedy, given the current methodology. Finally, we will demonstrate basic examples in inverse problems where there exists (untrained) neural networks that can easily compute a solution to the problem, however, the current DL techniques will need 10^80 data points in the training set to get even 1% success rate.
Trainability and accuracy of artificial neural networks
The methods and models of machine learning (ML) are rapidly becoming de facto tools for the analysis and interpretation of large data sets. Complex classification tasks such as speech and image recognition, automatic translation, decision making, etc. that were out of reach a decade ago are now routinely performed by computers with a high degree of reliability using (deep) neural networks. These performances suggest that DNNs may approximate high-dimensional functions with controllably small errors, potentially outperforming standard interpolation methods based e.g. on Galerkin truncation or finite elements that have been the workhorses of scientific computing. In support of this prospect, in this talk I will present results about the trainability and accuracy of neural networks, obtained by mapping the parameters of the network to a system of interacting particles relaxing on a potential determined by the loss function. This mapping can be used to prove a dynamical variant of the universal approximation theorem showing that the optimal neural network representation can be attained by (stochastic) gradient descent, with a approximation error scaling as the inverse of the network size. I will also show how these findings can be used to accelerate the training of networks and optimize their architecture, using e.g nonlocal transport involving birth/death processes in parameter space.
Towards a mathematical understanding of supervised learning: What we know and what we don't know
Two of the biggest puzzles in machine learning are: Why is it so successful and why is it quite fragile?
This talk will present a framework for unraveling these puzzles from the perspective of approximating functions in high dimensions. We will discuss what's known and what's not known about the approximation generalization properties of neural network type of hypothesis space as well as the dynamics and generalization properties of the training process. We will also discuss the relative merits of shallow vs. deep neural network models and suggest ways to formulate more robust machine learning models.
This is joint work with Chao Ma, Stephan Wojtowytsch and Lei Wu.