# Past Events

# 2021

**Wed May 5**

12 noon ET

**Adversarial Classification, Optimal Transport, and Geometric Flows**

The purpose of this talk is to provide an explicit link between the three topics that form the talk's title, and to introduce a new perspective (more dynamic and geometric) to understand robust classification problems. For concreteness, we will discuss a version of adversarial classification where an adversary is empowered to corrupt data inputs up to some distance \epsilon. We will first describe necessary conditions associated with the optimal classifier subject to such an adversary. Then, using the necessary conditions we derive a geometric evolution equation which can be used to track the change in classification boundaries as \veps varies. This evolution equation may be described as an uncoupled system of differential equations in one dimension, or as a mean curvature type equation in higher dimension. In one dimension we rigorously prove that one can use the initial value problem starting from \veps=0, which is simply the Bayes classifier, to solve for the global minimizer of the adversarial problem. Global optimality is certified using a duality principle between the original adversarial problem and an optimal transport problem. Several open questions and directions for further research will be discussed.

**Wed Apr 28**

12 noon ET

**Barriers to Deploying Deep Learning Models During the COVID-19 Pandemic**

A promising application for deep learning models is in assisting clinicians with interpreting X-ray and CT scans, especially when treating respiratory diseases. At the onset of the COVID-19 pandemic, radiologists had to quickly learn how to identify a new disease on chest X-rays and CT scans, and use this information to decide how to allocate scarce resources like ventilators. Researchers around the world developed deep learning models to help clinicians with these decisions, and some models were deployed after only three weeks of testing.

Our group reviewed over 1,000 studies that introduce deep learning models for interpreting chest X-rays or CT scans of COVID-19 patients to determine which models, if any, have the potential to help clinicians during the pandemic. In this talk, I will present our findings and discuss how this pandemic could inform researchers creating deployable deep learning models in healthcare.

This talk is based on the paper [1] (arXiv version). The recording can be found here.

[1] Roberts, M., Driggs, D., Thorpe, M., and the AIX-COVNET Collaboration. "Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans”. *Nat. Mach. Intel.* 3, 199–217 (2021).

**Wed Apr 21**

12 noon ET

**Machine Learning and Dynamical Systems meet in Reproducing Kernel Hilbert Spaces **** **

Since its inception in the 19th century through the efforts of Poincaré and Lyapunov, the theory of dynamical systems addresses the qualitative behaviour of dynamical systems as understood from models. From this perspective, the modeling of dynamical processes in applications requires a detailed understanding of the processes to be analyzed. This deep understanding leads to a model, which is an approximation of the observed reality and is often expressed by a system of Ordinary/Partial, Underdetermined (Control), Deterministic/Stochastic differential or difference equations. While models are very precise for many processes, for some of the most challenging applications of dynamical systems (such as climate dynamics, brain dynamics, biological systems or the financial markets), the development of such models is notably difficult. On the other hand, the field of machine learning is concerned with algorithms designed to accomplish a certain task, whose performance improves with the input of more data. Applications for machine learning methods include computer vision, stock market analysis, speech recognition, recommender systems and sentiment analysis in social media. The machine learning approach is invaluable in settings where no explicit model is formulated, but measurement data is available. This is frequently the case in many systems of interest, and the development of data-driven technologies is becoming increasingly important in many applications. The intersection of the fields of dynamical systems and machine learning is largely unexplored and the objective of this talk is to show that working in reproducing kernel Hilbert spaces offers tools for a data-based theory of nonlinear dynamical systems. In this talk, we introduce a data-based approach to estimating key quantities which arise in the study of nonlinear autonomous, control and random dynamical systems. Our approach hinges on the observation that much of the existing linear theory may be readily extended to nonlinear systems - with a reasonable expectation of success- once the nonlinear system has been mapped into a high or infinite dimensional Reproducing Kernel Hilbert Space. In particular, we develop computable, non-parametric estimators approximating controllability and observability energies for nonlinear systems. We apply this approach to the problem of model reduction of nonlinear control systems. It is also shown that the controllability energy estimator provides a key means for approximating the invariant measure of an ergodic, stochastically forced nonlinear system. We also show how kernel methods can be used to detect critical transitions for some multi scale dynamical systems. We also use the method of kernel flows to predict some chaotic dynamical systems. Finally, we show how kernel methods can be used to approximate center manifolds, propose a data-based version of the centre manifold theorem and construct Lyapunov functions for nonlinear ODEs. This is joint work with Jake Bouvrie (MIT, USA), Peter Giesl (University of Sussex, UK), Christian Kuehn (TUM, Munich/Germany), Romit Malik (ANNL), Sameh Mohamed (SUTD, Singapore), Houman Owhadi (Caltech), Martin Rasmussen (Imperial College London), Kevin Webster (Imperial College London), Bernard Hasasdonk and Dominik Wittwar (University of Stuttgart), Gabriele Santin (Fondazione Bruno Kessler).

**Wed Apr 7**

12 noon ET

**Discrete Optimization Methods for Group Model Selection in Compressed Sensing**

In this talk we study the problem of signal recovery for group models. More precisely for a given set of groups, each containing a small subset of indices, and for given linear sketches of the true signal vector which is known to be group-sparse in the sense that its support is contained in the union of a small number of these groups, we study algorithms which successfully recover the true signal just by the knowledge of its linear sketches. We derive model projection complexity results and algorithms for more general group models than the state-of-the-art. We consider two versions of the classical Iterative Hard Thresholding algorithm (IHT). The classical version iteratively calculates the exact projection of a vector onto the group model, while the approximate version (AM-IHT) uses a head- and a tail-approximation iteratively. We apply both variants to group models and analyze the two cases where the sensing matrix is a Gaussian matrix and a model expander matrix.

slides (updated concerning Head and Tail approximation)

**Wed Mar 24**

12 noon ET

**Random walks and PDEs in graph-based learning**

I will discuss some applications of random walks and PDEs in graph-based learning, both for theoretical analysis and algorithm development. Graph-based learning is a field within machine learning that uses similarities between datapoints to create efficient representations of high-dimensional data for tasks like semi-supervised classification, clustering and dimension reduction. There has been considerable interest recently in semi-supervised learning problems with very few labeled examples (e.g., 1 label per class). The widely used Laplacian regularization is ill-posed at low label rates and gives very poor classification results. In the first part of the talk, we will use the random walk interpretation of the graph Laplacian to precisely characterize the lowest label rate at which Laplacian regularized semi-supervised learning is well-posed. At lower label rates, where Laplace learning performs poorly, we will show how our random walk analysis leads to a new algorithm, called Poisson learning, that is probably more stable and informative than Laplace learning. We will conclude with some applications of Poisson learning to image classification and mesh segmentation of broken bone fragments of interest in anthropology.

**Wed Mar 17**

12 noon ET

**Function Approximation via Sparse Random Fourier Features**

Random feature methods have been successful in various machine learning tasks, are easy to compute, and come with theoretical accuracy bounds. They serve as an alternative approach to standard neural networks since they can represent similar function spaces without a costly training phase. However, for accuracy, random feature methods require more measurements than trainable parameters, limiting their use for data-scarce applications or problems in scientific machine learning. This paper introduces the sparse random feature method that learns parsimonious random feature models utilizing techniques from compressive sensing. We provide uniform bounds on the approximation error for functions in a reproducing kernel Hilbert space depending on the number of samples and the distribution of features. The error bounds improve with additional structural conditions, such as coordinate sparsity, compact clusters of the spectrum, or rapid spectral decay. We show that the sparse random feature method outperforms shallow networks for well-structured functions and applications to scientific machine learning tasks.

*Link available upon request until March 24th.*

**Wed Mar 10**

12 noon ET

**Finite Width, Large Depth Neural Networks as Perturbatively Solvable Models**

Deep neural networks are often considered to be complicated "black boxes," for which a systematic analysis is not only out of reach but potentially impossible. In this talk, which is based on ongoing joint work with Dan Roberts and Sho Yaida, I will make the opposite claim. Namely, that deep neural networks at initialization are perturbatively solvable models. The perturbative parameter is the width n of the network and we can obtain corrections to all orders in n. Our approach applies to networks at finite width n and large depth L. A key point is an emergent tension between depth and width. Large values of n make neural networks more like Gaussian processes, which are well behaved but incapable of feature learning due to a frozen NTK (at least with standard initialization schemes). Large values of L, in contrast, amplify higher cumulants and change in the NTK, both of which scale with the network aspect ratio L/n.

**Wed Mar 3**

12 noon ET

**Structure preservation and convergence in scientific machine learning **

Physics-informed techniques have emerged as a means of incorporating prior knowledge into machine learning. These techniques generally function by minimizing a hybrid loss, regularizing a traditional $\ell_2$ error with a PDE residual. While remarkably effective, these approaches suffer two major shortcomings. Firstly, such neural network (NN) solutions of PDEs generally fail to converge with increasing architecture size. Despite recent work establishing NNs may approximate at least as well as hp-finite element spaces, in practice when training with gradient methods O(1) optimization errors prevent realizing consistency. Secondly, the regularized losses introduce physics via a penalized residual, and it is well known from classical numerical analysis that the approximation space must be designed in tandem with the residual to ensure converge to a given PDE.

We conjecture that the same tools used to design convergent and structure-preserving properties in forward simulation may be used to design scientific ML architectures with similar guarantees. In this talk, we present two current works which address each of these issues. First, we introduce partition of unity networks (POUnets) to develop convergent approximation with deep networks. It has been shown that traditional feed forward networks may approximate by emulating partitions of unity (POU), and then emulating monomials on each partition, ultimately yielding a localized polynomial approximation and associated hp-convergence. Rather than emulating these components, POUnets function by directly incorporating both the POU and polynomials into the architecture. The resulting approximation breaks the curse of dimensionality and admits a fast least-squares optimization strategy. Predictions are competitive with high-order finite element spaces, and provide superior approximation for problems with reduced regularity.

Secondly, we introduce a data-driven exterior calculus (DDEC) which may be used to endow scientific ML architectures with the structure-preserving properties of mimetic PDE discretization. Traditional mimetic methods function by exploiting the exterior calculus structures offered by a mesh to construct discrete operators that exactly mimic the topological properties of continuum operators. We show how graphs may be used as a surrogate for the topology offered by graphs, and present new network architectures which allows "physics-informed" machine learning which exactly preserves conservation, guarantees extraction of well-posed problems, and allows handling of the non-trivial null-spaces occurring in fields such as electromagnetics.

If time permits, we will additionally share some current results applying these tools in challenging data-driven modeling effort at Sandia, related to data-driven shock hydrodynamics in metals and discovery of surrogates for semiconductors in radiation environments.

**Wed Feb 17**

12 noon ET

**Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability**

In this paper we revisit some classic problems in classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate η. In a recent work, Diakonikolas, Gouleakis, and Tzamos resolved a long-standing problem by giving the first efficient learner which achieves error η+ϵ for any ϵ>0. However, their algorithm outputs a complicated hypothesis which partitions space into poly(d,1/ϵ) regions. Here, we give a simpler algorithm and in the process resolve a number of outstanding open questions: (1) We give the first proper learner for Massart halfspaces that achieves error η+ϵ. We also give improved bounds on the sample complexity achievable by polynomial time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure which converts an arbitrarily complex classifier into an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show that any SQ algorithm requires super-polynomially many queries to achieve error 𝖮𝖯𝖳+ϵ. Moreover, we study classification under generalized linear models where E[Y|X] = \sigma(w \cdot X) for any odd, monotone, and Lipschitz function \sigma This family includes the previously mentioned halfspace models as a special case, but also includes models like logistic regression where the noise level increases near the decision boundary. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Finally, we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties.

**Wed Feb 10**

12 noon ET

**Convergence and optimality of single-layer neural networks for reinforcement learning**

Recent groundbreaking results have established a convergence theory for wide neural networks in the supervised learning setting. Depending on the scaling of parameters at initialization, the (stochastic) gradient descent dynamics of these models converge towards different deterministic limits known as the mean-field and the lazy training regimes.

In this talk, we extend some of these recent results to examples of prototypical algorithms in reinforcement learning: Temporal-Difference (TD) learning and Policy Gradients. In the first case, we prove convergence and optimality of wide neural network training dynamics in the lazy and mean-field regime, respectively. To establish these results, we bypass the lack of gradient structure of the TD learning dynamics by leveraging Lyapunov function techniques in the lazy training regime and sufficient expressivity of the activation function in the mean-field framework. We further show that similar optimality results hold for wide, single layer neural networks trained by entropy-regularized softmax Policy Gradients despite the nonlinear and nonconvex nature of the risk function in this setting.

This is work in collaboration with Jianfeng Lu.

**Wed Feb 3**

12 noon ET

**Total ****V****ariation ****M****inimization on ****G****raphs for ****S****emisupervised and ****U****nsupervised ****M****achine ****L****earning**** **

I will speak about a general class of machine learning problems in which data lives on similarity graphs and the goal is to solve a penalized graph min-cut problem. Applications include semi-supervised learning, unsupervised learning, and modularity optimization – originally developed for community detection on networks – but recast as an unsupervised machine learning problem. These problems have a mathematical connection to total variation minimization in Euclidean space and this analogy leads to a natural class of machine learning algorithms that mimic pseudo-spectral methods in nonlinear partial differential equations. The methods are graph analogues of geometric motion – e.g. Motion by mean curvature and the MBO scheme to approximate that dynamics.

**Wed Jan 20**

12 noon ET

**Geometric Methods for Machine Learning and Optimization**

Many machine learning applications involve non-Euclidean data, such as graphs, strings or matrices. In such cases, exploiting Riemannian geometry can deliver algorithms that are computationally superior to standard (Euclidean) nonlinear programming approaches. This observation has resulted in an increasing interest in Riemannian methods in the optimization and machine learning community.

In the first part of the talk, we consider the task of learning a robust classifier in hyperbolic space. Such spaces have received a surge of interest for representing large-scale, hierarchical data, due to the fact that they achieve better representation accuracy with lower dimensions. We present the first theoretical guarantees for the (robust) large-margin learning problem in hyperbolic space and discuss conditions under which hyperbolic methods are guaranteed to surpass the performance of their Euclidean counterparts. In the second part, we introduce Riemannian Frank-Wolfe (RFW) methods for constraint optimization on manifolds. Here, the goal of the theoretical analysis is two-fold: We first show that RFW converges at a nonasymptotic sublinear rate, recovering the best-known guarantees for its Euclidean counterpart. Secondly, we discuss how to implement the method efficiently on matrix manifolds. Finally, we consider applications of RFW to the computation of Riemannian centroids and Wasserstein barycenters, which are crucial subroutines in many machine learning methods.

Based on joint work with Suvrit Sra (MIT) and Manzil Zaheer, Ankit Singh Rawat, Aditya Menon and Sanjiv Kumar (all Google Research).

**Wed Jan 13**

12 noon ET

**Machine Learned Regularization for Solving Inverse Problems**

Inverse problems are about the reconstruction of an unknown physical quantity from indirect measurements. Most inverse problems of interest are ill-posed and require appropriate mathematical treatment for recovering meaningful solutions. Regularization is one of the main mechanisms to turn inverse problems into well-posed ones by adding prior information about the unknown quantity to the problem, often in the form of assumed regularity of solutions. Classically, such regularization approaches are handcrafted. Examples include Tikhonov regularization, the total variation and several sparsity-promoting regularizers such as the L1 norm of Wavelet coefficients of the solution. While such handcrafted approaches deliver mathematically and computationally robust solutions to inverse problems, providing a universal approach to their solution, they are also limited by our ability to model solution properties and to realise these regularization approaches computationally.

Recently, a new paradigm has been introduced to the regularization of inverse problems, which derives regularization approaches for inverse problems in a data driven way. Here, regularization is not mathematically modelled in the classical sense, but modelled by highly over-parametrised models, typically deep neural networks, that are adapted to the inverse problems at hand by appropriately selected (and usually plenty of) training data.

In this talk, I will review some machine learning based regularization techniques, present some work on unsupervised and deeply learned convex regularisers and their application to image reconstruction from tomographic and blurred measurements, and finish by discussing some open mathematical problems.

# 2020

**Wed Dec 16**

12 noon ET

**The dual of the margin: improved analyses and rates for gradient descent’s implicit bias**

The implicit bias of gradient descent, and specifically its margin maximization properties, have arisen as a promising explanation for the good generalization of deep networks. The purpose of this talk is to demonstrate the effectiveness of a dual problem to smoothed margin maximization. Concretely, this talk will develop this dual, as well as a variety of consequences in linear and nonlinear settings.

In the linear case, this dual perspective firstly will yield fast 1/t rates for margin maximization and implicit bias. This is faster than any prior first-order hard-margin SVM solver, which achieves 1/sqrt{t} at best.

Secondly, the dual analysis also allows a characterization of the implicit bias, even outside the standard setting of exponentially-tailed losses; in this sense, it is gradient descent, and not a particular loss structure which leads to implicit bias.

In the nonlinear case, duality will enable the proof of a gradient alignment property: asymptotically, the parameters and their gradients become colinear. Although abstract, this property in turn implies various existing and new margin maximization results.

Joint work with Matus Telgarsky.

**Wed Dec 09**

12 noon ET

**A PDE Interpretation of Prediction with Expert Advice**

We study the problem of prediction of binary sequences with expert advice in the online setting, which is a classic example of online machine learning. We interpret the binary sequence as the price history of a stock, and view the predictor as an investor, which converts the problem into a stock prediction problem. In this framework, an investor, who predicts the daily movements of a stock, and an adversarial market, who controls the stock, play against each other over N turns. The investor combines the predictions of n ≥ 2 experts in order to make a decision about how much to invest at each turn, and aims to minimize their regret with respect to the best-performing expert at the end of the game. We consider the problem with history-dependent experts, in which each expert uses the previous d days of history of the market in making their predictions. The prediction problem is played (in part) over a discrete graph called the d dimensional de Bruijn graph.

We focus on an appropriate continuum limit and using methods from optimal control, graph theory, and partial differential equations, we discuss strategies for the investor and the adversarial market. We prove that the value function for this game, rescaled appropriately, converges as N → ∞ at a rate of O(N-1/2) (for C4 payoff functions) to the viscosity solution of a nonlinear degenerate parabolic PDE. It can be understood as the Hamilton-Jacobi-Issacs equation for the two-person game. As a result, we are able to deduce asymptotically optimal strategies for the investor.

This is joint work with Robert Kohn and Jeff Calder.

**Wed Nov 25**

12 noon ET

**Neural network performance for classification problems with boundaries of Barron class**

We study classification problems in which the distances between the different classes are not necessarily positive, but for which the boundaries between the classes are well-behaved. More precisely, we assume these boundaries to be locally described by graphs of functions of Barron-class. ReLU neural networks can approximate and estimate classification functions of this type with rates independent of the ambient dimension. More formally, three-layer networks with $N$ neurons can approximate such functions with $L^1$-error bounded by $O(N^{-1/2})$. Furthermore, given $m$ training samples from such a function, and using ReLU networks of a suitable architecture as the hypothesis space, any empirical risk minimizer has generalization error bounded by $O(m^{-1/4})$. All implied constants depend only polynomially on the input dimension. We also discuss the optimality of these rates. Our results mostly rely on the "Fourier-analytic" Barron spaces that consist of functions with finite first Fourier moment. But since several different function spaces have been dubbed "Barron spaces'' in the recent literature, we discuss how these spaces relate to each other. We will see that they differ more than the existing literature suggests.

**Wed Nov 18**

12 noon ET

**Conditional Sampling with Monotone GANs: Modifying Generative Models to Solve Inverse Problems**

Generative models such as GANs, VAEs and Normalizing Flows have been very successful in the unsupervised learning task of generating samples from a high-dimensional probability distribution from its empirical approximation. However, the task of conditioning a high-dimensional distribution from limited empirical samples has attracted less attention in the literature. In this talk we will discuss some ideas in this direction by viewing generative modelling as a measure transport problem. In particular, we present a simple recipe using block-triangular maps and monotonicity constraints that enables standard models such as vanilla GAN to perform conditional sampling. We demonstrate the effectiveness of our method on various examples ranging from synthetic test sets to image in-painting and function space inference in porous medium flow.

The video of the talk is available here.

**Wed Nov 11**

12 noon ET

**A Dynamical Central Limit Theorem for Shallow Neural Networks**

Recent theoretical works have characterized the dynamics of wide shallow neural networks trained via gradient descent in an asymptotic mean-field limit when the number of parameters tends towards infinity. At initialization, the random sampling of the parameters leads to a fluctuation from the mean-field limit dictated by the classical Central Limit Theorem (CLT). However, as gradient descent induces correlations among the parameters, the question of how the deviation evolves during training remains unclear. Here, we prove that the deviation from the mean-field limit scaled by the width, in the width-asymptotic limit, remains bounded throughout training. The upper bound is given by a Monte-Carlo type resampling error, which does not depend explicitly on the dimension. It then motivates the use of the 2-norm of the underlying measure as a regularization term, which is related to the generalization error as well via the theory of variation-norm function spaces. Moreover, if the mean-field dynamics converges to a measure that interpolates the training data, we prove that the asymptotic deviation eventually vanishes in the CLT scaling. We also complement these results with numerical experiments.

**Wed Nov 04**

12 noon ET

**Analysis of Stochastic Gradient Descent in Continuous Time**

Stochastic gradient descent is an optimisation method that combines classical gradient descent with random subsampling within the target functional. In this work, we introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent. The stochastic gradient process is a dynamical system that is coupled with a continuous-time Markov process living on a finite state space. The dynamical system - a gradient flow - represents the gradient descent part, the process on the finite state space represents the random subsampling. Processes of this type are, for instance, used to model clonal populations in fluctuating environments. After introducing it, we study theoretical properties of the stochastic gradient process. We show that it converges weakly to the gradient flow with respect to the full target function, as the learning rate approaches zero. Moreover, we give assumptions under which the stochastic gradient process is exponentially ergodic in the Wasserstein sense. We then additionally assume that the single target functions are strongly convex and the learning rate goes to zero sufficiently slowly. In this case, the process converges with exponential rate to a distribution arbitrarily close to the point mass concentrated in the global minimum of the full target function. We conclude with a discussion of discretisation strategies for the stochastic gradient process and illustrate our concepts in numerical experiments.

**Wed Oct 28**

12 noon ET

**How Important is the Train-Validation Split in Meta-Learning?**

Meta-learning aims to perform fast adaptation on a new task through learning a “prior” from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct non-splitting method, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful on the linear centroid meta-learning problem, in the asymptotic setting where the number of tasks goes to infinity. We show that the splitting method converges to the optimal prior as expected, whereas the non-splitting method does not in general without structural assumptions on the data. In contrast, if the data are generated from linear models (the realizable regime), we show that both the splitting and non-splitting methods converge to the optimal prior. Further, perhaps surprisingly, our main result shows that the non-splitting method achieves a strictly better asymptotic excess risk under this data distribution, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that data splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the non-splitting method can indeed outperform the splitting method, on both simulations and real meta-learning tasks.

**Wed Oct 21**

12 noon ET

### Consistency of Cheeger cuts: Total Variation, Isoperimetry, and Clustering

Clustering unlabeled point clouds is a fundamental problem in machine learning. One classical method for constructing clusters on graph-based data is to solve for Cheeger cuts, which balance between finding clusters that require cutting few graph edges and finding clusters which are similar in size. Although solving for Cheeger cuts on general graphs is very challenging, when the graph is constructed by sampling from a continuum domain one suspects that the problem may be more tractable. I will discuss recent work with Nicolás García Trillos and Matthew Thorpe which establishes quantified convergence rates of discrete Cheeger cuts on graphs over point clouds towards continuum Cheeger sets. This is accomplished by using i) novel estimates of the consistency of total variation energies on such graphs, and ii) recent stability results for classical, continuum isoperimetric problems. This mirrors a guiding principle that the stability of continuum variational problems can often be translated into the point cloud or geometric graph setting.

**Wed Oct 7**

12 noon ET

### Predicting What You Already Know Helps: Provable Self-Supervised Learning

Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks), that do not require labeled data, to learn semantic representations. These pretext tasks are created solely using the input features, such as predicting a missing image patch, recovering the color channels of an image from context, or predicting missing words, yet predicting this known information helps in learning representations effective for downstream prediction tasks. In this talk, we posit a mechanism based on approximate conditional independence to formalize how solving certain pretext tasks can learn representations that provably decrease the sample complexity of downstream supervised tasks. Formally, we quantify how the approximate independence between the components of the pretext task (conditional on the label and latent variables) allows us to learn representations that can solve the downstream task with drastically reduced sample complexity by just training a linear layer on top of the learned representation.

The recording of this talk is available on our youtube channel.

**Wed Sept 30**

12 noon ET

**Sparse Learning with CART**

**Sparse Learning with CART**

Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogenous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. In this talk, we explore some of the statistical properties of regression trees constructed with CART. We will see that the training error is governed by the Pearson correlation between the optimal decision stump and response data in each node, which can be bounded by constructing a prior distribution on the split points and solving a nonlinear optimization problem. We leverage this connection between the training error and Pearson correlation to show that CART with cost-complexity pruning achieves an optimal complexity/goodness-of-fit tradeoff when the depth scales with the logarithm of the sample size. Data dependent quantities, which adapt to the dimensionality and latent structure of the regression model, are seen to govern the rates of convergence of the prediction error.

The recording is available here.

**Wed Sept 23**

12 noon ET

**Geometric **I**nsights into **S**pectral **C**lustering by **G**raph Laplacian **E**mbeddings **

**Geometric**I

**nsights into**S

**pectral**C

**lustering by**G

**raph Laplacian**E

**mbeddings**

We present new theoretical results for procedures identifying coarse structures in a given data set by means of appropriate spectral embeddings. We combine ideas from spectral geometry, metastability, optimal transport, and spectral analysis of weighted graph Laplacians to describe the embedding geometry. Our analysis focuses on 1) studying the embedding step of data clustering and 2) comparing the spectra of graph and continuum Laplacians, linking the original spectral clustering problem with a continuum counterpart. This is joint work with Bamdad Hosseini (Caltech) and Nicolas Garcia Trillos (University of Wisconsin-Madison).

The video is available on our youtube channel.

**Wed Sept 16**

12 noon ET

### Stability of Accuracy for Deep Neural Network Classifiers

We examine the stability of accuracy for loss-minimizing training processes that are used for deep neural networks (DNN) and other classifiers. While a classifier is optimized during training by minimizing the loss function, its performance is usually evaluated by the overall accuracy which quantifies the proportion of objects that are well classified. This leads to the question of stability of accuracy: does decreasing loss through training always result in increased accuracy? We formalize the notion of stability and provide examples of instability. We obtain three novel sufficient conditions for stability of training and derive tight bounds on accuracy as loss decreases in the training. The first two conditions apply to the classifier itself by identifying small clusters of misclassified objects as a cause of instability. The third geometric condition identifies flat portions of the training data manifold as sources of instability. The derivation of this condition relies on the propagation of the previous conditions backward through the DNN layers to the data manifold. The multiscale nature of the problem due to several sizes of the small clusters requires that the estimates in the proof have to be compatible with the presence of several scales. Our results do not depend on the algorithm used for training, as long as loss decreases with training.

This is joint work with my advisor L. Berlyand and P.-E. Jabin.

A video of this talk can be found here.

**Wed Sept 9**

12 noon ET

**Provable Algorithms for Sampling Non-log-concave Distributions**

**Provable Algorithms for Sampling Non-log-concave Distributions**

A fundamental problem in Bayesian machine learning is sampling from a probability distribution given access to its log-pdf. Just as the theory of convex optimization is well-developed, so is the theory of sampling from log-concave distributions. Recent years have seen significant progress in understanding optimization beyond convexity. However, despite the ubiquity of non-log-concave distributions in practice, the theory of sampling from non-log-concave distributions is still in its infancy.

I will survey the challenges and progress in this field. A key problem is that the workhorse algorithm for sampling, Langevin Monte Carlo, can take exponential time to mix for multi-modal distributions. Addressing this problem requires bringing in more algorithmic tools and new methods of analysis.

As a case study, we consider the problem of sampling from a simple mixture of log-concave distributions. By combining Langevin diffusion with simulated tempering, we obtain a Markov process that mixes in polynomial time by transitioning between different temperatures. For the analysis, we introduce novel techniques for proving spectral gaps based on Markov process decomposition.

Covers joint work with Rong Ge and Andrej Risteski.

A recording of this talk can be found here.

**Wed Sept 2**

12 noon ET

### Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent

Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion.

Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.

A video of the talk can be found on our youtube channel.

**Wed Aug 26**

12 noon ET

**Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Network****s**

**Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Network**

**s**

In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that its limit is a global minimizer if initialized properly. We also study the "implicit bias" of this algorithm when the objective is the unregularized logistic loss. We finally discuss what these results tell us about the generalization performance. This is based on joint work with Francis Bach.

A recording of the seminar can be found here.

**Wed Aug 19**

12 noon ET

**Dimensionality reduction and matching datasets**

**Dimensionality reduction and matching datasets**

Processing large datasets is a pervasive problem occurring across many different knowledge domains. In this talk we focus on two problems motivated from tasks concerning genetic data: dimensionality reduction and matching. First, given labeled points in a high-dimensional vector space, we seek a projection onto a low dimensional subspace that maintains the classification structure of the data. Taking inspiration from large margin nearest neighbor classification, we introduce SqueezeFit, a semidefinite relaxation of this problem. This relaxation is amenable to theoretical analysis, allowing us to provably recover a planted projection operator from the data. We apply a linear programming version of SqueezeFit to the genetic marker selection problem.

Second, we introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire data set. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.

A recording of the presentation can be found here.

**Wed Aug 12**

12 noon ET

**A Few Thoughts on Deep Network Approximation**

**A Few Thoughts on Deep Network Approximation**

Deep network approximation is a powerful tool of function approximation via composition. We will present a few new thoughts on deep network approximation from the point of view of scientific computing in practice: given an arbitrary width and depth of neural networks, what is the optimal approximation rate of various function classes? Does the curse of dimensionality exist for generic functions? Can we obtain exponential convergence for generic functions?

**Wed July 29**

12 noon ET

**Tradeoffs between Robustness and Accuracy**

**Tradeoffs between Robustness and Accuracy**

Standard machine learning produces models that are highly accurate on average but that degrade dramatically when the test distribution deviates from the training distribution. While one can train robust models, this often comes at the expense of standard accuracy (on the training distribution). We study this tradeoff in two settings, adversarial examples and minority groups, creating simple examples which highlight generalization issues as a major source of this tradeoff. For adversarial examples, we show that even augmenting with correctly annotated data to promote robustness can produce less accurate models, but we develop a simple method, robust self training, that mitigates this tradeoff using unlabeled data. For minority groups, we show that overparametrization of models can hurt accuracy on the minority groups, though it improves standard accuracy. These results suggest that the "more data" and "bigger models" strategy that works well for the standard setting where train and test distributions are close, need not work on out-of-domain settings.

This is based on joint work with Sang Michael Xie, Shiori Sagawa, Pang Wei Koh, Fanny Yang, John Duchi and Percy Liang.

A recording of the presentation is available here.

**Sat July 25** 12 noon ET

### Thematic Day on the Mean Field Training of Deep Neural Networks

12pm: Roberto I. Oliveira - **A mean-field theory for certain deep neural networks**

1pm: Konstantinos Spiliopoulos - **Mean field limits of neural networks: typical behavior and fluctuations**

2pm: Huy Tuan Pham - **A general framework for the mean field limit of multilayer neural networks **

3pm: Stephan Wojtowytsch - **On the Banach spaces for multi-layer networks and connections to mean field training**

See here for abstracts and recordings of the presentations.

**Wed July 15**

12 noon ET

**On the foundations of computational mathematics, Smale’s 18th problem and the potential limits of AI**

**On the foundations of computational mathematics, Smale’s 18th problem and the potential limits of AI**

There is a profound optimism on the impact of deep learning (DL) and AI in the sciences with Geoffrey Hinton concluding that 'They should stop educating radiologists now'. However, DL has an Achilles heel: it is universaly unstable so that small changes in the initial data can lead to large errors in the final result. This has been documented in a wide variety of applications. Paradoxically, the existence of stable neural networks for these applications is guaranteed by the celebrated Universal Approximation Theorem, however, the stable neural networks are not computed by the current training approaches. We will address this problem and the potential limitations of AI from a foundations point of view. Indeed, the current situation in AI is comparable to the situation in mathematics in the early 20th century, when David Hilbert’s optimism (typically reflected in his 10th problem) suggested no limitations to what mathematics could prove and no restrictions on what computers could compute. Hilbert’s optimism was turned upside down by Goedel and Turing, who established limitations on what mathematics can prove and which problems computers can solve (however, without limiting the impact of mathematics and computer science).

We predict a similar outcome for modern AI and DL, where the limitations of AI (the main topic of Smale’s 18th problem) will be established through the foundations of computational mathematics. We sketch the beginning of such a program by demonstrating how there exist neural networks approximating classical mappings in scientific computing, however, no algorithm (even randomised) can compute such a network to even 1-digit accuracy with probability better than 1/2. We will also show how instability is inherit in the methodology of DL demonstrating that there is no easy remedy, given the current methodology. Finally, we will demonstrate basic examples in inverse problems where there exists (untrained) neural networks that can easily compute a solution to the problem, however, the current DL techniques will need 10^80 data points in the training set to get even 1% success rate.

*A recording of this talk is available **here**. The slides are available **here**. A summary of the zoom chat Q&A during the seminar by **Matthew Colbrook** is available **here**.*

**Wed July 08**

12 noon ET

### Trainability and accuracy of artificial neural networks

The methods and models of machine learning (ML) are rapidly becoming de facto tools for the analysis and interpretation of large data sets. Complex classification tasks such as speech and image recognition, automatic translation, decision making, etc. that were out of reach a decade ago are now routinely performed by computers with a high degree of reliability using (deep) neural networks. These performances suggest that DNNs may approximate high-dimensional functions with controllably small errors, potentially outperforming standard interpolation methods based e.g. on Galerkin truncation or finite elements that have been the workhorses of scientific computing. In support of this prospect, in this talk I will present results about the trainability and accuracy of neural networks, obtained by mapping the parameters of the network to a system of interacting particles relaxing on a potential determined by the loss function. This mapping can be used to prove a dynamical variant of the universal approximation theorem showing that the optimal neural network representation can be attained by (stochastic) gradient descent, with a approximation error scaling as the inverse of the network size. I will also show how these findings can be used to accelerate the training of networks and optimize their architecture, using e.g nonlocal transport involving birth/death processes in parameter space.

*A recording of this talk is available **here**. The sildes are available **here**.*

**Wed July 01**

12 noon ET

### Towards a mathematical understanding of supervised learning: What we know and what we don't know

Two of the biggest puzzles in machine learning are: Why is it so successful and why is it quite fragile?

This talk will present a framework for unraveling these puzzles from the perspective of approximating functions in high dimensions. We will discuss what's known and what's not known about the approximation generalization properties of neural network type of hypothesis space as well as the dynamics and generalization properties of the training process. We will also discuss the relative merits of shallow vs. deep neural network models and suggest ways to formulate more robust machine learning models.

This is joint work with Chao Ma, Stephan Wojtowytsch and Lei Wu.