Conditional Sampling with Monotone GANs: Modifying Generative Models to Solve Inverse Problems
Generative models such as GANs, VAEs and Normalizing Flows have been very successful in the unsupervised learning task of generating samples from a high-dimensional probability distribution from its empirical approximation. However, the task of conditioning a high-dimensional distribution from limited empirical samples has attracted less attention in the literature. In this talk we will discuss some ideas in this direction by viewing generative modelling as a measure transport problem. In particular, we present a simple recipe using block-triangular maps and monotonicity constraints that enables standard models such as vanilla GAN to perform conditional sampling. We demonstrate the effectiveness of our method on various examples ranging from synthetic test sets to image in-painting and function space inference in porous medium flow.
The video of the talk is available here.
A Dynamical Central Limit Theorem for Shallow Neural Networks
Recent theoretical works have characterized the dynamics of wide shallow neural networks trained via gradient descent in an asymptotic mean-field limit when the number of parameters tends towards infinity. At initialization, the random sampling of the parameters leads to a fluctuation from the mean-field limit dictated by the classical Central Limit Theorem (CLT). However, as gradient descent induces correlations among the parameters, the question of how the deviation evolves during training remains unclear. Here, we prove that the deviation from the mean-field limit scaled by the width, in the width-asymptotic limit, remains bounded throughout training. The upper bound is given by a Monte-Carlo type resampling error, which does not depend explicitly on the dimension. It then motivates the use of the 2-norm of the underlying measure as a regularization term, which is related to the generalization error as well via the theory of variation-norm function spaces. Moreover, if the mean-field dynamics converges to a measure that interpolates the training data, we prove that the asymptotic deviation eventually vanishes in the CLT scaling. We also complement these results with numerical experiments.
Analysis of Stochastic Gradient Descent in Continuous Time
Stochastic gradient descent is an optimisation method that combines classical gradient descent with random subsampling within the target functional. In this work, we introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent. The stochastic gradient process is a dynamical system that is coupled with a continuous-time Markov process living on a finite state space. The dynamical system - a gradient flow - represents the gradient descent part, the process on the finite state space represents the random subsampling. Processes of this type are, for instance, used to model clonal populations in fluctuating environments. After introducing it, we study theoretical properties of the stochastic gradient process. We show that it converges weakly to the gradient flow with respect to the full target function, as the learning rate approaches zero. Moreover, we give assumptions under which the stochastic gradient process is exponentially ergodic in the Wasserstein sense. We then additionally assume that the single target functions are strongly convex and the learning rate goes to zero sufficiently slowly. In this case, the process converges with exponential rate to a distribution arbitrarily close to the point mass concentrated in the global minimum of the full target function. We conclude with a discussion of discretisation strategies for the stochastic gradient process and illustrate our concepts in numerical experiments.
How Important is the Train-Validation Split in Meta-Learning?
Meta-learning aims to perform fast adaptation on a new task through learning a “prior” from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct non-splitting method, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful on the linear centroid meta-learning problem, in the asymptotic setting where the number of tasks goes to infinity. We show that the splitting method converges to the optimal prior as expected, whereas the non-splitting method does not in general without structural assumptions on the data. In contrast, if the data are generated from linear models (the realizable regime), we show that both the splitting and non-splitting methods converge to the optimal prior. Further, perhaps surprisingly, our main result shows that the non-splitting method achieves a strictly better asymptotic excess risk under this data distribution, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that data splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the non-splitting method can indeed outperform the splitting method, on both simulations and real meta-learning tasks.
Consistency of Cheeger cuts: Total Variation, Isoperimetry, and Clustering
Clustering unlabeled point clouds is a fundamental problem in machine learning. One classical method for constructing clusters on graph-based data is to solve for Cheeger cuts, which balance between finding clusters that require cutting few graph edges and finding clusters which are similar in size. Although solving for Cheeger cuts on general graphs is very challenging, when the graph is constructed by sampling from a continuum domain one suspects that the problem may be more tractable. I will discuss recent work with Nicolás García Trillos and Matthew Thorpe which establishes quantified convergence rates of discrete Cheeger cuts on graphs over point clouds towards continuum Cheeger sets. This is accomplished by using i) novel estimates of the consistency of total variation energies on such graphs, and ii) recent stability results for classical, continuum isoperimetric problems. This mirrors a guiding principle that the stability of continuum variational problems can often be translated into the point cloud or geometric graph setting.
Predicting What You Already Know Helps: Provable Self-Supervised Learning
Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks), that do not require labeled data, to learn semantic representations. These pretext tasks are created solely using the input features, such as predicting a missing image patch, recovering the color channels of an image from context, or predicting missing words, yet predicting this known information helps in learning representations effective for downstream prediction tasks. In this talk, we posit a mechanism based on approximate conditional independence to formalize how solving certain pretext tasks can learn representations that provably decrease the sample complexity of downstream supervised tasks. Formally, we quantify how the approximate independence between the components of the pretext task (conditional on the label and latent variables) allows us to learn representations that can solve the downstream task with drastically reduced sample complexity by just training a linear layer on top of the learned representation.
The recording of this talk is available on our youtube channel.
Sparse Learning with CART
Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogenous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. In this talk, we explore some of the statistical properties of regression trees constructed with CART. We will see that the training error is governed by the Pearson correlation between the optimal decision stump and response data in each node, which can be bounded by constructing a prior distribution on the split points and solving a nonlinear optimization problem. We leverage this connection between the training error and Pearson correlation to show that CART with cost-complexity pruning achieves an optimal complexity/goodness-of-fit tradeoff when the depth scales with the logarithm of the sample size. Data dependent quantities, which adapt to the dimensionality and latent structure of the regression model, are seen to govern the rates of convergence of the prediction error.
The recording is available here.
Geometric Insights into Spectral Clustering by Graph Laplacian Embeddings
We present new theoretical results for procedures identifying coarse structures in a given data set by means of appropriate spectral embeddings. We combine ideas from spectral geometry, metastability, optimal transport, and spectral analysis of weighted graph Laplacians to describe the embedding geometry. Our analysis focuses on 1) studying the embedding step of data clustering and 2) comparing the spectra of graph and continuum Laplacians, linking the original spectral clustering problem with a continuum counterpart. This is joint work with Bamdad Hosseini (Caltech) and Nicolas Garcia Trillos (University of Wisconsin-Madison).
The video is available on our youtube channel.
Stability of Accuracy for Deep Neural Network Classifiers
We examine the stability of accuracy for loss-minimizing training processes that are used for deep neural networks (DNN) and other classifiers. While a classifier is optimized during training by minimizing the loss function, its performance is usually evaluated by the overall accuracy which quantifies the proportion of objects that are well classified. This leads to the question of stability of accuracy: does decreasing loss through training always result in increased accuracy? We formalize the notion of stability and provide examples of instability. We obtain three novel sufficient conditions for stability of training and derive tight bounds on accuracy as loss decreases in the training. The first two conditions apply to the classifier itself by identifying small clusters of misclassified objects as a cause of instability. The third geometric condition identifies flat portions of the training data manifold as sources of instability. The derivation of this condition relies on the propagation of the previous conditions backward through the DNN layers to the data manifold. The multiscale nature of the problem due to several sizes of the small clusters requires that the estimates in the proof have to be compatible with the presence of several scales. Our results do not depend on the algorithm used for training, as long as loss decreases with training.
This is joint work with my advisor L. Berlyand and P.-E. Jabin.
A video of this talk can be found here.
Provable Algorithms for Sampling Non-log-concave Distributions
A fundamental problem in Bayesian machine learning is sampling from a probability distribution given access to its log-pdf. Just as the theory of convex optimization is well-developed, so is the theory of sampling from log-concave distributions. Recent years have seen significant progress in understanding optimization beyond convexity. However, despite the ubiquity of non-log-concave distributions in practice, the theory of sampling from non-log-concave distributions is still in its infancy.
I will survey the challenges and progress in this field. A key problem is that the workhorse algorithm for sampling, Langevin Monte Carlo, can take exponential time to mix for multi-modal distributions. Addressing this problem requires bringing in more algorithmic tools and new methods of analysis.
As a case study, we consider the problem of sampling from a simple mixture of log-concave distributions. By combining Langevin diffusion with simulated tempering, we obtain a Markov process that mixes in polynomial time by transitioning between different temperatures. For the analysis, we introduce novel techniques for proving spectral gaps based on Markov process decomposition.
Covers joint work with Rong Ge and Andrej Risteski.
A recording of this talk can be found here.
Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent
Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion.
Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.
A video of the talk can be found on our youtube channel.
Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Networks
In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that its limit is a global minimizer if initialized properly. We also study the "implicit bias" of this algorithm when the objective is the unregularized logistic loss. We finally discuss what these results tell us about the generalization performance. This is based on joint work with Francis Bach.
A recording of the seminar can be found here.
Dimensionality reduction and matching datasets
Processing large datasets is a pervasive problem occurring across many different knowledge domains. In this talk we focus on two problems motivated from tasks concerning genetic data: dimensionality reduction and matching. First, given labeled points in a high-dimensional vector space, we seek a projection onto a low dimensional subspace that maintains the classification structure of the data. Taking inspiration from large margin nearest neighbor classification, we introduce SqueezeFit, a semidefinite relaxation of this problem. This relaxation is amenable to theoretical analysis, allowing us to provably recover a planted projection operator from the data. We apply a linear programming version of SqueezeFit to the genetic marker selection problem.
Second, we introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire data set. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.
A recording of the presentation can be found here.
A Few Thoughts on Deep Network Approximation
Deep network approximation is a powerful tool of function approximation via composition. We will present a few new thoughts on deep network approximation from the point of view of scientific computing in practice: given an arbitrary width and depth of neural networks, what is the optimal approximation rate of various function classes? Does the curse of dimensionality exist for generic functions? Can we obtain exponential convergence for generic functions?
Tradeoffs between Robustness and Accuracy
Standard machine learning produces models that are highly accurate on average but that degrade dramatically when the test distribution deviates from the training distribution. While one can train robust models, this often comes at the expense of standard accuracy (on the training distribution). We study this tradeoff in two settings, adversarial examples and minority groups, creating simple examples which highlight generalization issues as a major source of this tradeoff. For adversarial examples, we show that even augmenting with correctly annotated data to promote robustness can produce less accurate models, but we develop a simple method, robust self training, that mitigates this tradeoff using unlabeled data. For minority groups, we show that overparametrization of models can hurt accuracy on the minority groups, though it improves standard accuracy. These results suggest that the "more data" and "bigger models" strategy that works well for the standard setting where train and test distributions are close, need not work on out-of-domain settings.
This is based on joint work with Sang Michael Xie, Shiori Sagawa, Pang Wei Koh, Fanny Yang, John Duchi and Percy Liang.
A recording of the presentation is available here.
Thematic Day on the Mean Field Training of Deep Neural Networks
12pm: Roberto I. Oliveira - A mean-field theory for certain deep neural networks
1pm: Konstantinos Spiliopoulos - Mean field limits of neural networks: typical behavior and fluctuations
2pm: Huy Tuan Pham - A general framework for the mean field limit of multilayer neural networks
3pm: Stephan Wojtowytsch - On the Banach spaces for multi-layer networks and connections to mean field training
See here for abstracts and recordings of the presentations.
On the foundations of computational mathematics, Smale’s 18th problem and the potential limits of AI
There is a profound optimism on the impact of deep learning (DL) and AI in the sciences with Geoffrey Hinton concluding that 'They should stop educating radiologists now'. However, DL has an Achilles heel: it is universaly unstable so that small changes in the initial data can lead to large errors in the final result. This has been documented in a wide variety of applications. Paradoxically, the existence of stable neural networks for these applications is guaranteed by the celebrated Universal Approximation Theorem, however, the stable neural networks are not computed by the current training approaches. We will address this problem and the potential limitations of AI from a foundations point of view. Indeed, the current situation in AI is comparable to the situation in mathematics in the early 20th century, when David Hilbert’s optimism (typically reflected in his 10th problem) suggested no limitations to what mathematics could prove and no restrictions on what computers could compute. Hilbert’s optimism was turned upside down by Goedel and Turing, who established limitations on what mathematics can prove and which problems computers can solve (however, without limiting the impact of mathematics and computer science).
We predict a similar outcome for modern AI and DL, where the limitations of AI (the main topic of Smale’s 18th problem) will be established through the foundations of computational mathematics. We sketch the beginning of such a program by demonstrating how there exist neural networks approximating classical mappings in scientific computing, however, no algorithm (even randomised) can compute such a network to even 1-digit accuracy with probability better than 1/2. We will also show how instability is inherit in the methodology of DL demonstrating that there is no easy remedy, given the current methodology. Finally, we will demonstrate basic examples in inverse problems where there exists (untrained) neural networks that can easily compute a solution to the problem, however, the current DL techniques will need 10^80 data points in the training set to get even 1% success rate.
Trainability and accuracy of artificial neural networks
The methods and models of machine learning (ML) are rapidly becoming de facto tools for the analysis and interpretation of large data sets. Complex classification tasks such as speech and image recognition, automatic translation, decision making, etc. that were out of reach a decade ago are now routinely performed by computers with a high degree of reliability using (deep) neural networks. These performances suggest that DNNs may approximate high-dimensional functions with controllably small errors, potentially outperforming standard interpolation methods based e.g. on Galerkin truncation or finite elements that have been the workhorses of scientific computing. In support of this prospect, in this talk I will present results about the trainability and accuracy of neural networks, obtained by mapping the parameters of the network to a system of interacting particles relaxing on a potential determined by the loss function. This mapping can be used to prove a dynamical variant of the universal approximation theorem showing that the optimal neural network representation can be attained by (stochastic) gradient descent, with a approximation error scaling as the inverse of the network size. I will also show how these findings can be used to accelerate the training of networks and optimize their architecture, using e.g nonlocal transport involving birth/death processes in parameter space.
Towards a mathematical understanding of supervised learning: What we know and what we don't know
Two of the biggest puzzles in machine learning are: Why is it so successful and why is it quite fragile?
This talk will present a framework for unraveling these puzzles from the perspective of approximating functions in high dimensions. We will discuss what's known and what's not known about the approximation generalization properties of neural network type of hypothesis space as well as the dynamics and generalization properties of the training process. We will also discuss the relative merits of shallow vs. deep neural network models and suggest ways to formulate more robust machine learning models.
This is joint work with Chao Ma, Stephan Wojtowytsch and Lei Wu.