# Past Events

# 2022

**Wed Sept 21**12 noon ET

**Implications of the implicit bias in neural networks**

When training large neural networks, there are generally many weight combinations that will perfectly fit the training data. However, gradient-based training methods somehow tend to reach those which generalize well, and understanding this "implicit bias" has been a subject of extensive research. In this talk, I will discuss recent works which show several implications of the implicit bias (in homogeneous neural networks trained with the logistic loss): (1) In shallow univariate neural networks the implicit bias provably implies generalization; (2) By using a characterization of the implicit bias, it is possible to reconstruct a significant fraction of the training data from the parameters of a trained neural network, which might shed light on representation learning and memorization in deep learning, but might also have negative potential implications on privacy; (3) In certain settings, the implicit bias provably implies convergence to non-robust networks, i.e., networks which are susceptible to adversarial examples.

Based on joint works with Niv Haim, Itay Safran, Gilad Yehudai, Michal Irani, Jason D. Lee, and Ohad Shamir.

**Wed Sept 14**12 noon ET

**High-dimensional asymptotics of feature learning in the early phase of NN training**

We study the first gradient descent step on the first-layer weights W in a two-layer neural network, where the parameters are randomly initialized, and the training objective is the empirical MSE loss. In the proportional asymptotic limit (where the training set size n, the number of input features d, and the network width N all diverge at the same rate), and under an idealized student-teacher setting, we show that the first gradient update contains a rank-1 "spike", which results in an alignment between the first-layer weights and the linear component of the teacher model f*. To characterize the impact of this alignment, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on W with learning rate \eta. We consider two scalings of the first step learning rate \eta. For small \eta, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random feature model, but cannot defeat the best linear model on the input. Whereas for sufficiently large \eta, we prove that for certain f^*, the same ridge estimator on trained features can go beyond this "linear regime" and outperform a wide range of (fixed) kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.

**Wed July 6**

12 noon ET

**Compressing Variational Bayes**

Recent research has shown surprising connections between variational inference (VI) and data compression. This talk will review several theoretical and applied contributions in this domain. I will first show VI enables estimating the information rate-distortion function of a real-world data set, a quantity that could previously not be estimated beyond a few dimensions. I will then show how semi-amortized variational inference can improve neural image compression, how posterior uncertainties enable compression at variable bitrates, and how sequential variational autoencoders can be converted into practical video codecs. Finally, I show how neural video codecs can inspire design choices for extending diffusion models to video data.

**Wed June 22**

6 am ET

NOTE THE DIFFERENT TIME

**SGD Through the Lens of Kolmogorov Complexity**

In this talk, we present global convergence guarantees for stochastic gradient descent (SGD) via an entropy compression argument. We do so under two main assumptions: (1. Local progress) The model accuracy improves on average over batches. (2. Models compute simple functions) The function computed by the model has low Kolmogorov complexity. It is sufficient that these assumptions hold only for a tiny fraction of the epochs. Intuitively, our results imply that intermittent local progress of SGD implies global progress. Assumption 2 trivially holds for underparameterized models, hence, our work gives the first convergence guarantee for general, underparameterized models. Furthermore, this is the first result that is completely model agnostic - we don't require the model to have any specific architecture or activation function, it may not even be a neural network.

**Wed June 8 **12 noon ET

**Symmetry in Neural Network Parameters and Non-Linearities**

The success of equivariant neural networks has made clear the advantages of considering symmetries of data domains and task functions when designing model architectures. Here, we consider instead symmetries inside the parameter space of neural networks. Group actions on the parameter space which leave the loss invariant can be exploited to accelerate optimization or perform model compression. We study, in particular, a class of models with radial rescaling activations which provide an interesting alternative to pointwise activations and increase symmetry in the parameter space. We prove this class has universal approximation even in the case of fixed width and unbounded domain. We also prove a lossless compression algorithm and show that gradient descent for the compressed model corresponds to a form of projected gradient descent for the original model. In the general case, I will discuss non-linear group actions in the parameter space which can be used to teleport model parameters and increase convergence rate. This talk includes joint work with Iordan Ganev and Twan van Laarhoven as well as Bo Zhao, Nima Dehmamy, and Rose Yu.

**Wed May 25**

12 noon ET

**Beyond Regression: Operators and Extrapolation in Machine Learning**

In this talk we first suggest a unification of regression-based machine learning methods, including kernel regression and various types of neural networks. We then consider the limitations of such methods, especially the curse-of-dimensionality, and the various potential solutions that have been proposed including: (1) Barron's existence result, (2) leveraging regularity, and (3) assuming special structure in the data such as independence or redundancy. Finally, we consider operator-learning and extrapolation as emerging directions for machine learning. Operator-learning is the more developed of the two, and we show how learning operators allows intrinsic regularization, uncertainty quantification, and can represent many-to-one and one-to-many mappings. However, extrapolation remains the final frontier in machine learning, and we discuss an emerging approach and the mathematics that may underly it.

**Wed May 11**

12 noon ET

**Smale’s 18th Problem and the Barriers of Deep Learning**

Deep learning (DL) has had unprecedented success and is now rapidly entering scientific computing (SC). However, DL suffers from a universal phenomenon: instability, despite universal approximation results that often guarantee the existence of stable and accurate neural networks (NNs). We show the following paradox. There are well-conditioned problems in SC where one can prove the existence of NNs with great approximation qualities, however, there does not exist any algorithm that can train such a NN. For any positive integers *n>2* and M, there are cases where simultaneously: (a) no algorithm can train a NN correct to *n* digits, (b) there exists an algorithm that trains a NN with *n-1* correct digits, but any such algorithm needs arbitrarily many training data, (c) there exists an algorithm that trains a NN with *n-2* correct digits using no more than *M* training samples. These results provide basic foundations for Smale's 18th problem and imply a classification theory describing conditions under which (stable) NNs with a given accuracy can be trained. We begin this theory by initiating a unified theory for compressed sensing and DL, leading to sufficient conditions for the existence of training algorithms for stable NNs in inverse problems. We introduce FIRENETs, which we prove and numerically verify are stable. FIRENETs only require O(log(1/eps)) hidden layers for an eps-accurate solution to the inverse problem.

[1] M.J. Colbrook, V. Antun, and A.C. Hansen, “The difficulty of computing stable and accurate neural networks: On the barriers of deep learning and Smale’s 18th problem,” PNAS 119.12 (2022).

[2] M.J. Colbrook, “WARPd: A linearly convergent first-order method for inverse problems with approximate sharpness conditions.” arXiv:2110.12437 (2021).

[3] C.Q. Choi, “Some AI Systems May Be Impossible to Compute,” IEEE Spectrum, https://spectrum.ieee.org/deep-neural-network

**Wed May 4**

12 noon ET

**Understanding and improving generalization in multitask and transfer learning**

A broad research question in ML is how to build models that generalize across various tasks. This talk will describe my recent works addressing this question. First, I will talk about some new analyses of multitask learning, which explains negative transfer using recent developments in deep learning theory. Second, I will talk about transfer learning from pretrained deep neural networks. I will describe our approach to avoid overfitting, with provable generalization using Hessians. Lastly, I will briefly talk about a related multigroup learning problem, and a recent contrastive learning approach for solving this problem.

**Wed April 27**

12 noon ET

**Explaining Neural Network Classifiers: Hurdles and Progress**

Neural Networks have become the standard tools for high-dimensional decision making, e.g. medical imaging, autonomous driving, playing complex games.

Even in high-stakes areas they generally operate as black-box algorithms without a legible decision process. This has birthed the field of explainable artificial intelligence (XAI). The first step for XAI-methods is to discern between the relevant and irrelevant input components for a decision.

In this talk, we formalise this idea by extending the concept of prime implicants from abductive reasoning to a probabilistic setting. This setting captures what many XAI practitioners intuitively aim for.

We show that finding such small implicants, even approximately, is a computationally hard problem. Furthermore, good solutions depend strongly on the underlying probability distribution. We present strategies to overcome both problems and discuss what challenges still remain.

**Wed April ****20**** **12 noon ET

**Computational Graph Completion**

We present a framework for generating, organizing, and reasoning with computational knowledge. It is motivated by the observation that most problems in Computational Sciences and Engineering (CSE) can be formulated as that of completing (from data) a computational graph (or hypergraph) representing dependencies between functions and variables. Nodes represent variables, and edges represent functions. Functions and variables may be known, unknown, or random. Data comes in the form of observations of distinct values of a finite number of subsets of the variables of the graph (satisfying its functional dependencies). The underlying problem combines a regression problem (approximating unknown functions) with a matrix completion problem (recovering unobserved variables in the data). Replacing unknown functions by Gaussian Processes (GPs) and conditioning on observed data provides a simple but efficient approach to completing such graphs. Since this completion process can be automatized, as one solves $\sqrt{2}$ on a pocket calculator without thinking about it, one could, with the proposed framework, solve a complex CSE problem by drawing a diagram.

**Wed April 13 **12 noon ET

**Towards understanding the role of noise in non-convex machine learning dynamics**

It has been empirically shown that the noise induced by the Stochastic Gradient Descent algorithm when training neural networks generally enhances its generalisation performance in comparison to full-batch training (gradient descent). In this talk, we will try to understand how SGD-like noise biases the training dynamics towards specific prediction functions for regression tasks. More precisely, we will first show that the dynamics of SGD over diagonal linear networks converges towards a sparser linear estimator than the one retrieved by GD. Going further, we will also show that adding label noise biases the dynamics towards implicitly solving a Lasso program. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances of stochastic dynamics over deterministic ones, as observed in practice.

**Wed April 6**

12 noon ET

**Deep Learning Approximation of Diffeomorphisms via Linear-Control Systems**

In the last years it has been observed that Residual Neural Networks (ResNets) can be interpreted as discretizations of control systems. This can be a valuable tool for a further mathematical understanding of Machine Learning, since it bridges ResNets (and, more generally, Deep Learning) with Control Theory. This parallelism can be useful to study existing architectures and to develop new ones. In particular, in the present seminar we investigate ResNets obtained from linear-control systems. Despite their simplicity, recent theoretical results guarantee that they could be surprisingly expressive. We will focus on the problem of producing an approximation of a diffeomorphism after observing its action on a finite ensemble of points (the dataset). In this framework, the training of the ResNet corresponds to the resolution of a proper Optimal Control problem. Finally, we will see that, owing to the linear dependence of the system in the controls, the training algorithms based on Pontryagin Maximum Principle can be carried out with low computational effort.

# 2021

**Wed December 8**

12 noon ET

**R-Adaptivity, Deep Learning and Optimal Transport**

PINNS (physics inspired neural networks) have recently become popular as a means of solving ODEs and PDES by using the tools of deep learning. They have both shown promise for solving some differential equations, and have struggled to solve others. Whilst advertised as being 'mesh free methods' they do rely on the use of collocation points. The accuracy of the numerical solution of PDEs using Finite Element methods depends crucially on the choice of an appropriate mesh. This can be obtained an r-adaptive strategy, which equidistributes the error over the mesh elements based on a-priori/posteriori knowledge of the solution. The core of this talk will describe how r-adaptivity can be useful in the context of Deep Learning. First, we will show that a one-dimensional mesh can be equidistributed by training a feed forward Neural Network. This approach yields better results than other standard numerical methods. We will then explain the training process of Physics-informed Neural Networks (PINNs) for solving Boundary value problems (BVPs) and show numerical results for a reaction-diffusion and convection-dominated equation. It appears that PINNs fail to be trained in the latter case unless the homotropy method is employed. Finally, we will introduce the Deep-Ritz-Network (DRN) for solving the Poisson equation on a non-convex 2-dimensional domain. If the collocation points are uniformly random sampled and fixed for the entire training process, we obtain a solution with poor accuracy. On the contrary, the adoption of an Optimal Transport strategy, which determines the 'optimal' collocation points, results in a more stable training process and a much more accurate solution.

**Wed December 1**

12 noon ET

**Deep Learning in High Dimension: Neural Network Approximation of Analytic Maps of Gaussians**

For artificial deep neural networks with ReLU activation,

we prove new expression rate bounds for parametric, analytic functions where the parameter dimension could be infinite.

Approximation rates are in mean square on the unbounded parameter range with respect to product gaussian measure. Approximation rate bounds are free from the CoD, and determined by summability of Wiener-Hermite PC expansion coefficients.

Sufficient conditions for summability are quantified holomorphy on products of strips in the complex domain. Applications comprise DNN expression rate bounds of deep-NNs for response surfaces of elliptic PDEs with log-gaussian random field inputs, and for the posterior densities of the corresponding Bayesian inverse problems.

Variants of proofs which are constructive are outlined.

(joint work with Jakob Zech, University of Heidelberg, Germany, and with Dinh Dung and Nguyen Van Kien, Hanoi, Vietnam)

References:

**Wed November 24**

12 noon ET

**Wasserstein Embeddings in the Deep Learning Era**

Computational optimal transport has found many applications in machine learning and, more specifically, deep learning as a fundamental tool to manipulate and compare probability distributions. The Wasserstein distances arising from the optimal transport problem have been of particular interest in recent years. However, a consistent roadblock against the more prevalent use of transport-based methods has been their computational cost. Besides the more well-known ideas for faster computational approaches, including entropy regularization, several fundamental concepts have emerged that enable the integration of transport-based methods as part of the computational graph of a deep neural network. Sliced-Wasserstein distances and the Linear Optimal Transport (LOT) framework are among fundamental concepts well suited for integration into today's deep neural networks. In this talk, we will present the idea of Linear Optimal Transport (otherwise known as the Wasserstein Embedding) and its extension to Sliced-Wasserstein Embeddings and demonstrate their various applications in deep learning with a particular interest in learning from graphs and set-structured data. The talk will be an overview of our recent ICLR 2021 and NeurIPS 2021 publications.

**Wed November 10**

12 noon ET

**Mean field theory in Inverse Problems: from Bayesian inference to overparameterization of networks**

Bayesian sampling and neural networks are seemingly two different machine learning areas, but they both deal with many particle systems. In sampling, one evolves a large number of samples (particles) to match a target distribution function, and in optimizing over-parameterized neural networks, one can view neurons particles that feed each other information in the DNN flow. These perspectives allow us to employ mean-field theory, a powerful tool that translates dynamics of many particle system into a partial differential equation (PDE), so rich PDE analysis techniques can be used to understand both the convergence of sampling methods and the zero-loss property of over-parameterization of ResNets. We showcase the use of mean-field theory in these two machine learning areas, and we also invite the audience to brainstorm other possible applications.

**Wed November 3**

12 noon ET

**Joint Reconstruction-Segmentation with Graph PDEs**

In most practical image segmentation tasks, the image to be segmented will need to first be reconstructed from indirect, damaged, and/or noisy observations. Traditionally, this reconstruction-segmentation task would be done in sequence: first apply the reconstruction method, and then the segmentation method. Joint reconstruction-segmentation is a method for using segmentation and reconstruction techniques simultaneously, to use information from the segmentation to guide the reconstruction, and vice versa. Past work on this has employed relatively simple segmentation algorithms, such as the Chan–Vese algorithm. In this talk, we will demonstrate how joint reconstruction-segmentation can be done using the graph-PDE-based segmentation techniques developed by Bertozzi & Flenner (2012) and Merkurjev, Kostic, & Bertozzi (2013), with ideas drawn from Budd & van Gennip (2020) and Budd, van Gennip, & Latz (2021).

**Wed October 27**

12 noon ET

**Asymptotic Analysis of Deep Residual Networks**

Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature: one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. The scaling regime one ends up with depends on certain features of the network architecture, such as the smoothness of the activation function. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit. In the case where the scaling limit is a stochastic differential equation, the deep network limit is shown to be described by a system of forward-backward stochastic differential equations. Joint work with: Alain-Sam Cohen (InstaDeep Ltd), Alain Rossier (Oxford), RenYuan Xu (University of Southern California).

**Wed October 20**

12 noon ET

**Kernel Stein Discrepancy Descent**

Among dissimilarities between probability distributions, the Kernel Stein Discrepancy (KSD) has received much interest recently. We investigate the properties of its Wasserstein gradient flow to approximate a target probability distribution known up to a normalization constant. This leads to a straightforwardly implementable, deterministic score-based method, named KSD Descent, which uses a set of particles to approximate the target distribution. Remarkably, owing to a tractable loss function, KSD Descent can leverage robust parameter-free optimization schemes such as L-BFGS; this contrasts with other popular particle-based schemes such as the Stein Variational Gradient Descent algorithm. We study the convergence properties of KSD Descent and demonstrate its practical relevance. However, we also highlight failure cases by showing that the algorithm can get stuck in spurious local minima.

**Wed October 13**

12 noon ET

**What Kinds of Functions Do Neural Networks Learn?**

Neural nets have made an amazing comeback during the past decade. Their empirical success has been truly phenomenal, but neural nets are poorly understood in a mathematical sense compared to classical methods like splines, kernels, and wavelets. This talk describes recent steps towards a mathematical theory of neural networks comparable to the foundations we have for classical nonparametric methods. Surprisingly, neural nets are minimax optimal in a wide variety of classical univariate function spaces, including those handled by splines and wavelets. In multivariate settings, neural nets are solutions to data-fitting problems cast in entirely new types of multivariate function spaces characterized through total variation (TV) measured in the Radon transform domain. Deep multilayer neural nets naturally represent compositions of functions in these Radon-TV spaces. This theory provides novel explanations for many notable empirical discoveries in deep learning and suggests new approaches to training neural networks.

This is joint work with Rahul Parhi.

**Wed October 6**

12 noon ET

**Approximation properties of two-layer neural networks with values in a Banach space**

Approximation properties of infinitely wide neural networks have been studied by several authors in the last few years. New function spaces have been introduced that consist of functions that can be efficiently (i.e., with dimension-independent rates) approximated by neural networks of finite width. Typically, these functions are supposed to act between Euclidean spaces, typically with a high-dimensional input space and a lower-dimensional output space. As neural networks gain popularity in inherently infinite-dimensional settings such as inverse problems and imaging, it becomes necessary to analyse the properties of neural networks as nonlinear operators acting between infinite-dimensional spaces. In this talk, I will present dimension-independent Monte-Carlo rates for neural networks acting between Banach spaces with a partial order (vector lattices), where the ReLU nonlinearity will be interpreted as the lattice operation of taking the positive part.

**Wed September 22**

12 noon ET

**Robust W-GAN-Based Estimation Under Wasserstein Contamination**

Robust estimation is an important problem in statistics which aims at providing a reasonable estimator when the data-generating distribution lies within an appropriately defined ball around an uncontaminated distribution. Although minimax rates of estimation have been established in recent years, many existing robust estimators with provably optimal convergence rates are also computationally intractable. In this talk, we study several estimation problems under a Wasserstein contamination model and present computationally tractable estimators motivated by generative adversarial networks (GANs). Specifically, we analyze properties of Wasserstein GAN-based estimators for location estimation, covariance matrix estimation, and linear regression and show that our proposed estimators are minimax optimal in many scenarios. Finally, we present numerical results which demonstrate the effectiveness of our estimators. This is joint work with Zheng Liu (UW-Madison).

**Wed September 15**

12 noon ET

**Bilevel Learning for Inverse Problems**

Variational regularization techniques are dominant in the field of inverse problems. A drawback of these techniques is that they are dependent on a number of parameters which have to be set by the user. This issue can be approached by machine learning where we estimate these parameters from data. This is known as "Bilevel Learning" and has been successfully applied to many tasks, some as small-dimensional as learning a regularization parameter, others as high-dimensional as learning a sampling pattern in MRI. While mathematically appealing this strategy leads to a nested optimization problem which is computationally difficult to handle. In this talk we discuss several applications of bilevel learning for imaging as well as new computational approaches. There are quite a few open problems in this relatively recent field of study, some of which I will highlight along the way.

**Wed September 8**

12 noon ET

**Generalized Energy-Based Models**

I will describe Generalized Energy Based Models (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model, as in a Generative Adversarial Network), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). In particular, while the energy function is analogous to the GAN critic function, it is not discarded after training. GEBMs are trained by alternating between learning the energy and the base, much like a GAN. Both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via Langevin diffusion-based methods (MALA, UAL, HMC), thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also return state-of-the-art performance on density modelling tasks, and when using base measures with an explicit form.

**Wed September 1**

12 noon ET

**Learning Linear Operators**

Neural networks have shown great success at learning function approximators between spaces X and Y. In many problems arising in physics it is desirable to learn maps between spaces of functions X and Y; this may be either for the purposes of scientific discovery, or to provide cheap surrogate models which accelerate computations. New ideas are needed to successfully address this learning problem in a scalable, efficient manner. I will highlight recent progress in this area, explaining different approaches taken, and describing numerical results which demonstrate empirical success of the methodology.

I will then describe recent work concerned with this problem in the setting of learning linear operators between Hilbert spaces, distinguishing between compact, bounded and unbounded operators; the work is hence relevant to the question of learning solution operators for inverse problems, as well as forward problems. A Bayesian approach is adopted, and linked to a statistical learning perspective. Sharp rates are obtained for posterior contraction, and related to bounds on the excess risk and generalization gap.

Joint work with Maarten De Hoop, Nik Kovachki and Nick Nelsen.

**Wed August 25**

12 noon ET

**Structure-preserving machine learning for inverse problems**

Inverse problems naturally arise in many scientific settings and the study of these problems has been crucial in the development of important technologies such as medical imaging. In inverse problems, the goal is to estimate an underlying ground truth object, typically an image, from corresponding measurements, where the measurements and the ground truth are connected by some forward operator and noise-generating process (both of which are generally assumed to be known). The solution of inverse problems is usually complicated by ill-posedness. Variational regularisation is a well-established approach to overcoming this ill-posedness. In this approach an image is reconstructed from measurements by solving a minimisation problem that trades off data fit with a penalty on unrealistic images. While this approach has proven very successful, it generally requires the parts that make up the optimisation problem to be carefully chosen, and the optimisation problem may require considerable computational effort to solve. There is an active line of research into overcoming these issues using data-driven approaches. In this talk I will discuss ways in which favourable properties of the variational regularisation approach can be combined with a data-driven approach to solving inverse problems. I will conclude by speaking about some interesting directions for future work.

**Wed August 18**

12 noon ET

**Approximating functions, functionals, and operators using deep neural networks for diverse applications**

We will review Physics-informed neural network and summarize available theoretical results.. We will also introduce new NNs that learn functionals and nonlinear operators from functions and corresponding responses for system identification. The universal approximation theorem of operators is suggestive of the potential of NNs in learning from scattered data any continuous operator or complex system. We first generalize the theorem to deep neural networks, and subsequently we apply it to design a new composite NN with small generalization error, the deep operator network (DeepONet), consisting of a NN for encoding the discrete input function space (branch net) and another NN for encoding the domain of the output functions (trunk net). We demonstrate that DeepONet can learn various explicit operators, e.g., integrals, Laplace transforms and fractional Laplacians, as well as implicit operators that represent deterministic and stochastic differential equations. More generally, DeepOnet can learn multiscale operators spanning across many scales and trained by diverse sources of data simultaneously.

**Wed August 4**

12 noon ET

**Deep neural networks for inverse problems with pseudodifferential operators: an application to limited-angle tomography**

In this talk, I present a novel convolutional neural network (CNN), called $\Psi$DONet, designed for learning pseudodifferential operators in the context of linear inverse problems. The starting point is the Iterative Soft Thresholding Algorithm (ISTA), a well-known algorithm to solve sparsity-promoting minimization problems. I will show that, under rather general assumptions on the forward operator, the unfolded iterations of ISTA can be interpreted as the successive layers of a CNN. In turn this provides fairly general network architectures that, for a specific choice of the parameters involved, allow to reproduce ISTA, or a perturbation of ISTA for which the coefficients of the filters can be bounded. I will use as case study the limited-angle X-ray transform and its application to limited-angle computed tomography (LA-CT). In particular, I will prove that, in the case of LA-CT, the operations of upscaling, downscaling and convolution, which characterize $\Psi$DONet and most deep learning schemes, can be exactly determined by combining the convolutional nature of the limited angle X-ray transform and basic properties defining an orthogonal wavelet system. Finally, I propose two different implementations of $\Psi$DONet which are tested on simulated data from LA geometry, generated from the ellipse data set. This is a joint work with L. Ratti, M. Galinier, M. Lassas, M. Prato and S. Siltanen.

**Wed July 28**

12 noon ET

**PDE Inspired Graph Neural Networks **

In this talk we discuss new architectures for graph neural networks. We show that by small modifications of existing architectures we obtain networks with predicted characters that can be tailored for specific tasks. We derive networks that are based on diffusion and on wave propagation on a graph and show their theoretical as well as their practical characters. This understanding enables us to either compare or beat the state of the art for many known benchmarks.

**Wed July 21**

12 noon ET

**Geometric Deep Learning: Grids, Graphs, Groups, Geodesics and Gauges**

The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to be beyond reach –such as computer vision, playing Go, or protein folding – are in fact feasible with appropriate computational scale. Remarkably, the essence of deep learning is built from two simple algorithmic principles: first, the notion of representation or feature learning, whereby adapted, often hierarchical, features capture the appropriate notion of regularity for each task, and second, learning by local gradient-descent type methods, typically implemented as backpropagation.

While learning generic functions in high dimensions is a cursed estimation problem, most tasks of interest are not generic, and come with essential pre-defined regularities arising from the underlying low-dimensionality and structure of the physical world. This talk is concerned with exposing these regularities through unified geometric principles that can be applied throughout a wide spectrum of applications.

Such a ‘geometric unification’ endeavour in the spirit of Felix Klein's Erlangen Program serves a dual purpose: on one hand, it provides a common mathematical framework to study the most successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. On the other hand, it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provide principled way to build future architectures yet to be invented.

**Wed July 14**

**Advances of Momentum in Optimization Algorithms and Neural Architecture Design**

We will present a few recent results on leveraging momentum techniques to improve stochastic optimization and neural architecture design.

First, designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called {\em MomentumRNNs}. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. Also, we show the empirical advantage of the momentum-enhanced RNNs over the baseline models.

Second, we will present the recent advances of adaptive momentum in accelerating the stochastic gradient descent (SGD). The adaptive momentum-assisted SGD remarkably improves the deep neural network training in terms of acceleration and improved generalization and significantly reduces the effort for hyperparameter tuning.

**Wed June 23**

12 noon ET

**Variational models and gradient flows for graph clustering**

Discrete graph-based variants of the Allen--Cahn and total variation variational models have proven to be successful tools for clustering and classification on graphs. In this talk we will study these models and the gradient flows that are derived from them. We will see deep connections between the various discrete gradient flows as well as between the discrete gradient flows and their continuum relatives.

**Wed June 9**

12 noon ET

**Convergence of Stochastic Gradient Descent for analytic target functions**

In this talk we discuss almost sure convergence of Stochastic Gradient Descent in discrete and continuous time for a given twice continuously-differentiable target function F. In a first step we give assumptions on the step-sizes and perturbation size to ensure convergence of the target value F and gradient f=DF assuming that f is locally Hölder-continuous. This result entails convergence of the iterates itself in the case where F does not possess a continuum of critical points.

In a general non-convex setting with F possibly containing a rich set of critical points, convergence of the process itself is sometimes taken for granted, but actually is a non-trivial issue as there are solutions to the gradient flow ODE for smooth target functions that stay in a compact set but do not converge. Using the Lojasiewicz-inequality we give sharp bounds on the step-sizes and the size of the perturbation in order to guarantee convergence of the SGD scheme for analytic target functions. Also, we derive the convergence rate of the function value under the assumptions that F satisfies a particular Lojasiewicz-inequality with exponent in [1/2,1). Finally, we compare the discrete and continuous time results and discuss optimality of the assumptions. This is joint work with Steffen Dereich (WWU Münster).

**Wed June 2**

12 noon ET

**Learning based multi-scale modeling**

The behavior of materials involve physics at multiple length and time scales: electronic, atomistic, domains, defects etc. The engineering properties that we observe and exploit in application are a sum total of all these interactions. Multiscale modeling seeks to understand how the physics at the finer scales affect the coarser scales. This can be challenging for two reasons. First, it is computationally expensive due to the need to repeatedly solve the finer scale model. Second, it requires a priori (empirical) knowledge of the aspects of the finer-scale behavior that affect the coarser scale (order parameters, state variables, descriptors, etc.). This is especially challenging in situations where the behavior depends on time. We regard the solution of the finer-scale model as an input-output map (possibly between infinite dimensional spaces), and introduce a a general framework for the data-driven approximation of such maps. The proposed approach is motivated by the recent successes of neural networks and deep learning, in combination with ideas from model reduction. This combination results in a neural network approximation that is computationally inexpensive, independent of the need for a priori knowledge, and can be used directly in the coarser scale calculations. We demonstrate the ideas with examples drawn from first principles study of defects and crystal plasticity study of inelastic impact. The work draws from collaborations with the Caltech PDE-ML group and in particular Burigede Liu, Nikola Kovachki and Ying Shi Teh.

Kaushik Bhattacharya is Howell N. Tyson, Sr., Professor of Mechanics and Professor of Materials Science as well as the Vice-Provost at the California Institute of Technology. He received his B.Tech degree from the Indian Institute of Technology, Madras, India in 1986, his Ph.D from the University of Minnesota in 1991 and his post-doctoral training at the Courant Institute for Mathematical Sciences during 1991-1993. He joined Caltech in 1993. His research concerns the mechanical behavior of materials, and specifically uses theory to guide the development of new materials. He has received the von Kármán Medal of the Society of Industrial and Applied Mathematics (2020), Distinguished Alumni Award of the Indian Institute of Technology, Madras (2019), the Outstanding Achievement Award of the University of Minnesota (2018), the Warner T. Koiter Medal of the American Society of Mechanical Engineering (2015) and the Graduate Student Council Teaching and Mentoring Award at Caltech (2013).

**Wed May 26**

12 noon ET

**The Stein geometry in machine learning: gradient flows, optimal transport and large deviations**

Sampling or approximating high-dimensional probability distributions is a key challenge in computational statistics and machine learning. This talk will present connections to gradient flow PDEs, optimal transport and interacting particle systems, focusing on the recently introduced Stein variational gradient descent methodology and some variations. The construction induces a novel geometrical structure on the set of probability distributions related to a positive definite kernel function. We discuss the corresponding geodesic equations, infinitesimal optimal transport maps, as well as large deviation functionals. This is joint work with A. Duncan (Imperial College London), L. Szpruch (University of Edinburgh) and M. Renger (Weierstrass Institute Berlin).

**Wed May 19**

12 noon ET

**Smooth bilevel programming for sparse regularisation**

Nonsmooth regularisers are widely used in machine learning for enforcing solution structures (such as the l1 norm for sparsity or the nuclear norm for low rank). State of the art solvers are typically first order methods or coordinate descent methods which handle nonsmoothness by careful smooth approximations and support pruning. In this work, we revisit the approach of iteratively reweighted least squares (IRLS) and show how a simple reparameterization coupled with a bilevel resolution leads to a smooth unconstrained problem. We are therefore able to exploit the machinery of smooth optimisation, such as BFGS, to obtain local superlinear convergence. The result is a highly versatile approach which is able to significantly outperform state of the art methods for a wide range of problems.

The video recording will be available until 20 June 2021.

**Wed May 12**

12 noon ET

**Transport information Bregman divergences**

In this talk, we talk about a joint intersection between optimal transport and information geometry. We study Bregman divergences in probability density space embedded with the Wasserstein-2 metric. Several properties and dualities of transport Bregman divergences are provided. In particular, we derive the transport Kullback-Leibler (KL) divergence by a Bregman divergence of negative Boltzmann-Shannon entropy in Wasserstein-2 space. We also derive analytical formulas of transport KL divergence for one-dimensional probability densities and Gaussian families.

**Wed May 5**

12 noon ET

**Adversarial Classification, Optimal Transport, and Geometric Flows**

The purpose of this talk is to provide an explicit link between the three topics that form the talk's title, and to introduce a new perspective (more dynamic and geometric) to understand robust classification problems. For concreteness, we will discuss a version of adversarial classification where an adversary is empowered to corrupt data inputs up to some distance \epsilon. We will first describe necessary conditions associated with the optimal classifier subject to such an adversary. Then, using the necessary conditions we derive a geometric evolution equation which can be used to track the change in classification boundaries as \veps varies. This evolution equation may be described as an uncoupled system of differential equations in one dimension, or as a mean curvature type equation in higher dimension. In one dimension we rigorously prove that one can use the initial value problem starting from \veps=0, which is simply the Bayes classifier, to solve for the global minimizer of the adversarial problem. Global optimality is certified using a duality principle between the original adversarial problem and an optimal transport problem. Several open questions and directions for further research will be discussed.

**Wed Apr 28**

12 noon ET

**Barriers to Deploying Deep Learning Models During the COVID-19 Pandemic**

A promising application for deep learning models is in assisting clinicians with interpreting X-ray and CT scans, especially when treating respiratory diseases. At the onset of the COVID-19 pandemic, radiologists had to quickly learn how to identify a new disease on chest X-rays and CT scans, and use this information to decide how to allocate scarce resources like ventilators. Researchers around the world developed deep learning models to help clinicians with these decisions, and some models were deployed after only three weeks of testing.

Our group reviewed over 1,000 studies that introduce deep learning models for interpreting chest X-rays or CT scans of COVID-19 patients to determine which models, if any, have the potential to help clinicians during the pandemic. In this talk, I will present our findings and discuss how this pandemic could inform researchers creating deployable deep learning models in healthcare.

This talk is based on the paper [1] (arXiv version). The recording can be found here.

[1] Roberts, M., Driggs, D., Thorpe, M., and the AIX-COVNET Collaboration. "Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans”. *Nat. Mach. Intel.* 3, 199–217 (2021).

**Wed Apr 21**

12 noon ET

**Machine Learning and Dynamical Systems meet in Reproducing Kernel Hilbert Spaces **** **

Since its inception in the 19th century through the efforts of Poincaré and Lyapunov, the theory of dynamical systems addresses the qualitative behaviour of dynamical systems as understood from models. From this perspective, the modeling of dynamical processes in applications requires a detailed understanding of the processes to be analyzed. This deep understanding leads to a model, which is an approximation of the observed reality and is often expressed by a system of Ordinary/Partial, Underdetermined (Control), Deterministic/Stochastic differential or difference equations. While models are very precise for many processes, for some of the most challenging applications of dynamical systems (such as climate dynamics, brain dynamics, biological systems or the financial markets), the development of such models is notably difficult. On the other hand, the field of machine learning is concerned with algorithms designed to accomplish a certain task, whose performance improves with the input of more data. Applications for machine learning methods include computer vision, stock market analysis, speech recognition, recommender systems and sentiment analysis in social media. The machine learning approach is invaluable in settings where no explicit model is formulated, but measurement data is available. This is frequently the case in many systems of interest, and the development of data-driven technologies is becoming increasingly important in many applications. The intersection of the fields of dynamical systems and machine learning is largely unexplored and the objective of this talk is to show that working in reproducing kernel Hilbert spaces offers tools for a data-based theory of nonlinear dynamical systems. In this talk, we introduce a data-based approach to estimating key quantities which arise in the study of nonlinear autonomous, control and random dynamical systems. Our approach hinges on the observation that much of the existing linear theory may be readily extended to nonlinear systems - with a reasonable expectation of success- once the nonlinear system has been mapped into a high or infinite dimensional Reproducing Kernel Hilbert Space. In particular, we develop computable, non-parametric estimators approximating controllability and observability energies for nonlinear systems. We apply this approach to the problem of model reduction of nonlinear control systems. It is also shown that the controllability energy estimator provides a key means for approximating the invariant measure of an ergodic, stochastically forced nonlinear system. We also show how kernel methods can be used to detect critical transitions for some multi scale dynamical systems. We also use the method of kernel flows to predict some chaotic dynamical systems. Finally, we show how kernel methods can be used to approximate center manifolds, propose a data-based version of the centre manifold theorem and construct Lyapunov functions for nonlinear ODEs. This is joint work with Jake Bouvrie (MIT, USA), Peter Giesl (University of Sussex, UK), Christian Kuehn (TUM, Munich/Germany), Romit Malik (ANNL), Sameh Mohamed (SUTD, Singapore), Houman Owhadi (Caltech), Martin Rasmussen (Imperial College London), Kevin Webster (Imperial College London), Bernard Hasasdonk and Dominik Wittwar (University of Stuttgart), Gabriele Santin (Fondazione Bruno Kessler).

**Wed Apr 7**

12 noon ET

**Discrete Optimization Methods for Group Model Selection in Compressed Sensing**

In this talk we study the problem of signal recovery for group models. More precisely for a given set of groups, each containing a small subset of indices, and for given linear sketches of the true signal vector which is known to be group-sparse in the sense that its support is contained in the union of a small number of these groups, we study algorithms which successfully recover the true signal just by the knowledge of its linear sketches. We derive model projection complexity results and algorithms for more general group models than the state-of-the-art. We consider two versions of the classical Iterative Hard Thresholding algorithm (IHT). The classical version iteratively calculates the exact projection of a vector onto the group model, while the approximate version (AM-IHT) uses a head- and a tail-approximation iteratively. We apply both variants to group models and analyze the two cases where the sensing matrix is a Gaussian matrix and a model expander matrix.

slides (updated concerning Head and Tail approximation)

**Wed Mar 24**

12 noon ET

**Random walks and PDEs in graph-based learning**

I will discuss some applications of random walks and PDEs in graph-based learning, both for theoretical analysis and algorithm development. Graph-based learning is a field within machine learning that uses similarities between datapoints to create efficient representations of high-dimensional data for tasks like semi-supervised classification, clustering and dimension reduction. There has been considerable interest recently in semi-supervised learning problems with very few labeled examples (e.g., 1 label per class). The widely used Laplacian regularization is ill-posed at low label rates and gives very poor classification results. In the first part of the talk, we will use the random walk interpretation of the graph Laplacian to precisely characterize the lowest label rate at which Laplacian regularized semi-supervised learning is well-posed. At lower label rates, where Laplace learning performs poorly, we will show how our random walk analysis leads to a new algorithm, called Poisson learning, that is probably more stable and informative than Laplace learning. We will conclude with some applications of Poisson learning to image classification and mesh segmentation of broken bone fragments of interest in anthropology.

**Wed Mar 17**

12 noon ET

**Function Approximation via Sparse Random Fourier Features**

Random feature methods have been successful in various machine learning tasks, are easy to compute, and come with theoretical accuracy bounds. They serve as an alternative approach to standard neural networks since they can represent similar function spaces without a costly training phase. However, for accuracy, random feature methods require more measurements than trainable parameters, limiting their use for data-scarce applications or problems in scientific machine learning. This paper introduces the sparse random feature method that learns parsimonious random feature models utilizing techniques from compressive sensing. We provide uniform bounds on the approximation error for functions in a reproducing kernel Hilbert space depending on the number of samples and the distribution of features. The error bounds improve with additional structural conditions, such as coordinate sparsity, compact clusters of the spectrum, or rapid spectral decay. We show that the sparse random feature method outperforms shallow networks for well-structured functions and applications to scientific machine learning tasks.

*Link available upon request until March 24th.*

**Wed Mar 10**

12 noon ET

**Finite Width, Large Depth Neural Networks as Perturbatively Solvable Models**

Deep neural networks are often considered to be complicated "black boxes," for which a systematic analysis is not only out of reach but potentially impossible. In this talk, which is based on ongoing joint work with Dan Roberts and Sho Yaida, I will make the opposite claim. Namely, that deep neural networks at initialization are perturbatively solvable models. The perturbative parameter is the width n of the network and we can obtain corrections to all orders in n. Our approach applies to networks at finite width n and large depth L. A key point is an emergent tension between depth and width. Large values of n make neural networks more like Gaussian processes, which are well behaved but incapable of feature learning due to a frozen NTK (at least with standard initialization schemes). Large values of L, in contrast, amplify higher cumulants and change in the NTK, both of which scale with the network aspect ratio L/n.

**Wed Mar 3**

12 noon ET

**Structure preservation and convergence in scientific machine learning **

Physics-informed techniques have emerged as a means of incorporating prior knowledge into machine learning. These techniques generally function by minimizing a hybrid loss, regularizing a traditional $\ell_2$ error with a PDE residual. While remarkably effective, these approaches suffer two major shortcomings. Firstly, such neural network (NN) solutions of PDEs generally fail to converge with increasing architecture size. Despite recent work establishing NNs may approximate at least as well as hp-finite element spaces, in practice when training with gradient methods O(1) optimization errors prevent realizing consistency. Secondly, the regularized losses introduce physics via a penalized residual, and it is well known from classical numerical analysis that the approximation space must be designed in tandem with the residual to ensure converge to a given PDE.

We conjecture that the same tools used to design convergent and structure-preserving properties in forward simulation may be used to design scientific ML architectures with similar guarantees. In this talk, we present two current works which address each of these issues. First, we introduce partition of unity networks (POUnets) to develop convergent approximation with deep networks. It has been shown that traditional feed forward networks may approximate by emulating partitions of unity (POU), and then emulating monomials on each partition, ultimately yielding a localized polynomial approximation and associated hp-convergence. Rather than emulating these components, POUnets function by directly incorporating both the POU and polynomials into the architecture. The resulting approximation breaks the curse of dimensionality and admits a fast least-squares optimization strategy. Predictions are competitive with high-order finite element spaces, and provide superior approximation for problems with reduced regularity.

Secondly, we introduce a data-driven exterior calculus (DDEC) which may be used to endow scientific ML architectures with the structure-preserving properties of mimetic PDE discretization. Traditional mimetic methods function by exploiting the exterior calculus structures offered by a mesh to construct discrete operators that exactly mimic the topological properties of continuum operators. We show how graphs may be used as a surrogate for the topology offered by graphs, and present new network architectures which allows "physics-informed" machine learning which exactly preserves conservation, guarantees extraction of well-posed problems, and allows handling of the non-trivial null-spaces occurring in fields such as electromagnetics.

If time permits, we will additionally share some current results applying these tools in challenging data-driven modeling effort at Sandia, related to data-driven shock hydrodynamics in metals and discovery of surrogates for semiconductors in radiation environments.

**Wed Feb 17**

12 noon ET

**Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability**

In this paper we revisit some classic problems in classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate η. In a recent work, Diakonikolas, Gouleakis, and Tzamos resolved a long-standing problem by giving the first efficient learner which achieves error η+ϵ for any ϵ>0. However, their algorithm outputs a complicated hypothesis which partitions space into poly(d,1/ϵ) regions. Here, we give a simpler algorithm and in the process resolve a number of outstanding open questions: (1) We give the first proper learner for Massart halfspaces that achieves error η+ϵ. We also give improved bounds on the sample complexity achievable by polynomial time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure which converts an arbitrarily complex classifier into an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show that any SQ algorithm requires super-polynomially many queries to achieve error 𝖮𝖯𝖳+ϵ. Moreover, we study classification under generalized linear models where E[Y|X] = \sigma(w \cdot X) for any odd, monotone, and Lipschitz function \sigma This family includes the previously mentioned halfspace models as a special case, but also includes models like logistic regression where the noise level increases near the decision boundary. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Finally, we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties.

**Wed Feb 10**

12 noon ET

**Convergence and optimality of single-layer neural networks for reinforcement learning**

Recent groundbreaking results have established a convergence theory for wide neural networks in the supervised learning setting. Depending on the scaling of parameters at initialization, the (stochastic) gradient descent dynamics of these models converge towards different deterministic limits known as the mean-field and the lazy training regimes.

In this talk, we extend some of these recent results to examples of prototypical algorithms in reinforcement learning: Temporal-Difference (TD) learning and Policy Gradients. In the first case, we prove convergence and optimality of wide neural network training dynamics in the lazy and mean-field regime, respectively. To establish these results, we bypass the lack of gradient structure of the TD learning dynamics by leveraging Lyapunov function techniques in the lazy training regime and sufficient expressivity of the activation function in the mean-field framework. We further show that similar optimality results hold for wide, single layer neural networks trained by entropy-regularized softmax Policy Gradients despite the nonlinear and nonconvex nature of the risk function in this setting.

This is work in collaboration with Jianfeng Lu.

**Wed Feb 3**

12 noon ET

**Total ****V****ariation ****M****inimization on ****G****raphs for ****S****emisupervised and ****U****nsupervised ****M****achine ****L****earning**** **

I will speak about a general class of machine learning problems in which data lives on similarity graphs and the goal is to solve a penalized graph min-cut problem. Applications include semi-supervised learning, unsupervised learning, and modularity optimization – originally developed for community detection on networks – but recast as an unsupervised machine learning problem. These problems have a mathematical connection to total variation minimization in Euclidean space and this analogy leads to a natural class of machine learning algorithms that mimic pseudo-spectral methods in nonlinear partial differential equations. The methods are graph analogues of geometric motion – e.g. Motion by mean curvature and the MBO scheme to approximate that dynamics.

**Wed Jan 20**

12 noon ET

**Geometric Methods for Machine Learning and Optimization**

Many machine learning applications involve non-Euclidean data, such as graphs, strings or matrices. In such cases, exploiting Riemannian geometry can deliver algorithms that are computationally superior to standard (Euclidean) nonlinear programming approaches. This observation has resulted in an increasing interest in Riemannian methods in the optimization and machine learning community.

In the first part of the talk, we consider the task of learning a robust classifier in hyperbolic space. Such spaces have received a surge of interest for representing large-scale, hierarchical data, due to the fact that they achieve better representation accuracy with lower dimensions. We present the first theoretical guarantees for the (robust) large-margin learning problem in hyperbolic space and discuss conditions under which hyperbolic methods are guaranteed to surpass the performance of their Euclidean counterparts. In the second part, we introduce Riemannian Frank-Wolfe (RFW) methods for constraint optimization on manifolds. Here, the goal of the theoretical analysis is two-fold: We first show that RFW converges at a nonasymptotic sublinear rate, recovering the best-known guarantees for its Euclidean counterpart. Secondly, we discuss how to implement the method efficiently on matrix manifolds. Finally, we consider applications of RFW to the computation of Riemannian centroids and Wasserstein barycenters, which are crucial subroutines in many machine learning methods.

Based on joint work with Suvrit Sra (MIT) and Manzil Zaheer, Ankit Singh Rawat, Aditya Menon and Sanjiv Kumar (all Google Research).

**Wed Jan 13**

12 noon ET

**Machine Learned Regularization for Solving Inverse Problems**

Inverse problems are about the reconstruction of an unknown physical quantity from indirect measurements. Most inverse problems of interest are ill-posed and require appropriate mathematical treatment for recovering meaningful solutions. Regularization is one of the main mechanisms to turn inverse problems into well-posed ones by adding prior information about the unknown quantity to the problem, often in the form of assumed regularity of solutions. Classically, such regularization approaches are handcrafted. Examples include Tikhonov regularization, the total variation and several sparsity-promoting regularizers such as the L1 norm of Wavelet coefficients of the solution. While such handcrafted approaches deliver mathematically and computationally robust solutions to inverse problems, providing a universal approach to their solution, they are also limited by our ability to model solution properties and to realise these regularization approaches computationally.

Recently, a new paradigm has been introduced to the regularization of inverse problems, which derives regularization approaches for inverse problems in a data driven way. Here, regularization is not mathematically modelled in the classical sense, but modelled by highly over-parametrised models, typically deep neural networks, that are adapted to the inverse problems at hand by appropriately selected (and usually plenty of) training data.

In this talk, I will review some machine learning based regularization techniques, present some work on unsupervised and deeply learned convex regularisers and their application to image reconstruction from tomographic and blurred measurements, and finish by discussing some open mathematical problems.

# 2020

**Wed Dec 16**

12 noon ET

**The dual of the margin: improved analyses and rates for gradient descent’s implicit bias**

The implicit bias of gradient descent, and specifically its margin maximization properties, have arisen as a promising explanation for the good generalization of deep networks. The purpose of this talk is to demonstrate the effectiveness of a dual problem to smoothed margin maximization. Concretely, this talk will develop this dual, as well as a variety of consequences in linear and nonlinear settings.

In the linear case, this dual perspective firstly will yield fast 1/t rates for margin maximization and implicit bias. This is faster than any prior first-order hard-margin SVM solver, which achieves 1/sqrt{t} at best.

Secondly, the dual analysis also allows a characterization of the implicit bias, even outside the standard setting of exponentially-tailed losses; in this sense, it is gradient descent, and not a particular loss structure which leads to implicit bias.

In the nonlinear case, duality will enable the proof of a gradient alignment property: asymptotically, the parameters and their gradients become colinear. Although abstract, this property in turn implies various existing and new margin maximization results.

Joint work with Matus Telgarsky.

**Wed Dec 09**

12 noon ET

**A PDE Interpretation of Prediction with Expert Advice**

We study the problem of prediction of binary sequences with expert advice in the online setting, which is a classic example of online machine learning. We interpret the binary sequence as the price history of a stock, and view the predictor as an investor, which converts the problem into a stock prediction problem. In this framework, an investor, who predicts the daily movements of a stock, and an adversarial market, who controls the stock, play against each other over N turns. The investor combines the predictions of n ≥ 2 experts in order to make a decision about how much to invest at each turn, and aims to minimize their regret with respect to the best-performing expert at the end of the game. We consider the problem with history-dependent experts, in which each expert uses the previous d days of history of the market in making their predictions. The prediction problem is played (in part) over a discrete graph called the d dimensional de Bruijn graph.

We focus on an appropriate continuum limit and using methods from optimal control, graph theory, and partial differential equations, we discuss strategies for the investor and the adversarial market. We prove that the value function for this game, rescaled appropriately, converges as N → ∞ at a rate of O(N-1/2) (for C4 payoff functions) to the viscosity solution of a nonlinear degenerate parabolic PDE. It can be understood as the Hamilton-Jacobi-Issacs equation for the two-person game. As a result, we are able to deduce asymptotically optimal strategies for the investor.

This is joint work with Robert Kohn and Jeff Calder.

**Wed Nov 25**

12 noon ET

**Neural network performance for classification problems with boundaries of Barron class**

We study classification problems in which the distances between the different classes are not necessarily positive, but for which the boundaries between the classes are well-behaved. More precisely, we assume these boundaries to be locally described by graphs of functions of Barron-class. ReLU neural networks can approximate and estimate classification functions of this type with rates independent of the ambient dimension. More formally, three-layer networks with $N$ neurons can approximate such functions with $L^1$-error bounded by $O(N^{-1/2})$. Furthermore, given $m$ training samples from such a function, and using ReLU networks of a suitable architecture as the hypothesis space, any empirical risk minimizer has generalization error bounded by $O(m^{-1/4})$. All implied constants depend only polynomially on the input dimension. We also discuss the optimality of these rates. Our results mostly rely on the "Fourier-analytic" Barron spaces that consist of functions with finite first Fourier moment. But since several different function spaces have been dubbed "Barron spaces'' in the recent literature, we discuss how these spaces relate to each other. We will see that they differ more than the existing literature suggests.

**Wed Nov 18**

12 noon ET

**Conditional Sampling with Monotone GANs: Modifying Generative Models to Solve Inverse Problems**

Generative models such as GANs, VAEs and Normalizing Flows have been very successful in the unsupervised learning task of generating samples from a high-dimensional probability distribution from its empirical approximation. However, the task of conditioning a high-dimensional distribution from limited empirical samples has attracted less attention in the literature. In this talk we will discuss some ideas in this direction by viewing generative modelling as a measure transport problem. In particular, we present a simple recipe using block-triangular maps and monotonicity constraints that enables standard models such as vanilla GAN to perform conditional sampling. We demonstrate the effectiveness of our method on various examples ranging from synthetic test sets to image in-painting and function space inference in porous medium flow.

The video of the talk is available here.

**Wed Nov 11**

12 noon ET

**A Dynamical Central Limit Theorem for Shallow Neural Networks**

Recent theoretical works have characterized the dynamics of wide shallow neural networks trained via gradient descent in an asymptotic mean-field limit when the number of parameters tends towards infinity. At initialization, the random sampling of the parameters leads to a fluctuation from the mean-field limit dictated by the classical Central Limit Theorem (CLT). However, as gradient descent induces correlations among the parameters, the question of how the deviation evolves during training remains unclear. Here, we prove that the deviation from the mean-field limit scaled by the width, in the width-asymptotic limit, remains bounded throughout training. The upper bound is given by a Monte-Carlo type resampling error, which does not depend explicitly on the dimension. It then motivates the use of the 2-norm of the underlying measure as a regularization term, which is related to the generalization error as well via the theory of variation-norm function spaces. Moreover, if the mean-field dynamics converges to a measure that interpolates the training data, we prove that the asymptotic deviation eventually vanishes in the CLT scaling. We also complement these results with numerical experiments.

**Wed Nov 04**

12 noon ET

**Analysis of Stochastic Gradient Descent in Continuous Time**

Stochastic gradient descent is an optimisation method that combines classical gradient descent with random subsampling within the target functional. In this work, we introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent. The stochastic gradient process is a dynamical system that is coupled with a continuous-time Markov process living on a finite state space. The dynamical system - a gradient flow - represents the gradient descent part, the process on the finite state space represents the random subsampling. Processes of this type are, for instance, used to model clonal populations in fluctuating environments. After introducing it, we study theoretical properties of the stochastic gradient process. We show that it converges weakly to the gradient flow with respect to the full target function, as the learning rate approaches zero. Moreover, we give assumptions under which the stochastic gradient process is exponentially ergodic in the Wasserstein sense. We then additionally assume that the single target functions are strongly convex and the learning rate goes to zero sufficiently slowly. In this case, the process converges with exponential rate to a distribution arbitrarily close to the point mass concentrated in the global minimum of the full target function. We conclude with a discussion of discretisation strategies for the stochastic gradient process and illustrate our concepts in numerical experiments.

**Wed Oct 28**

12 noon ET

**How Important is the Train-Validation Split in Meta-Learning?**

Meta-learning aims to perform fast adaptation on a new task through learning a “prior” from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct non-splitting method, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful on the linear centroid meta-learning problem, in the asymptotic setting where the number of tasks goes to infinity. We show that the splitting method converges to the optimal prior as expected, whereas the non-splitting method does not in general without structural assumptions on the data. In contrast, if the data are generated from linear models (the realizable regime), we show that both the splitting and non-splitting methods converge to the optimal prior. Further, perhaps surprisingly, our main result shows that the non-splitting method achieves a strictly better asymptotic excess risk under this data distribution, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that data splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the non-splitting method can indeed outperform the splitting method, on both simulations and real meta-learning tasks.

**Wed Oct 21**

12 noon ET

### Consistency of Cheeger cuts: Total Variation, Isoperimetry, and Clustering

Clustering unlabeled point clouds is a fundamental problem in machine learning. One classical method for constructing clusters on graph-based data is to solve for Cheeger cuts, which balance between finding clusters that require cutting few graph edges and finding clusters which are similar in size. Although solving for Cheeger cuts on general graphs is very challenging, when the graph is constructed by sampling from a continuum domain one suspects that the problem may be more tractable. I will discuss recent work with Nicolás García Trillos and Matthew Thorpe which establishes quantified convergence rates of discrete Cheeger cuts on graphs over point clouds towards continuum Cheeger sets. This is accomplished by using i) novel estimates of the consistency of total variation energies on such graphs, and ii) recent stability results for classical, continuum isoperimetric problems. This mirrors a guiding principle that the stability of continuum variational problems can often be translated into the point cloud or geometric graph setting.

**Wed Oct 7**

12 noon ET

### Predicting What You Already Know Helps: Provable Self-Supervised Learning

Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks), that do not require labeled data, to learn semantic representations. These pretext tasks are created solely using the input features, such as predicting a missing image patch, recovering the color channels of an image from context, or predicting missing words, yet predicting this known information helps in learning representations effective for downstream prediction tasks. In this talk, we posit a mechanism based on approximate conditional independence to formalize how solving certain pretext tasks can learn representations that provably decrease the sample complexity of downstream supervised tasks. Formally, we quantify how the approximate independence between the components of the pretext task (conditional on the label and latent variables) allows us to learn representations that can solve the downstream task with drastically reduced sample complexity by just training a linear layer on top of the learned representation.

The recording of this talk is available on our youtube channel.

**Wed Sept 30**

12 noon ET

**Sparse Learning with CART**

**Sparse Learning with CART**

Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogenous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. In this talk, we explore some of the statistical properties of regression trees constructed with CART. We will see that the training error is governed by the Pearson correlation between the optimal decision stump and response data in each node, which can be bounded by constructing a prior distribution on the split points and solving a nonlinear optimization problem. We leverage this connection between the training error and Pearson correlation to show that CART with cost-complexity pruning achieves an optimal complexity/goodness-of-fit tradeoff when the depth scales with the logarithm of the sample size. Data dependent quantities, which adapt to the dimensionality and latent structure of the regression model, are seen to govern the rates of convergence of the prediction error.

The recording is available here.

**Wed Sept 23**

12 noon ET

**Geometric **I**nsights into **S**pectral **C**lustering by **G**raph Laplacian **E**mbeddings **

**Geometric**I

**nsights into**S

**pectral**C

**lustering by**G

**raph Laplacian**E

**mbeddings**

We present new theoretical results for procedures identifying coarse structures in a given data set by means of appropriate spectral embeddings. We combine ideas from spectral geometry, metastability, optimal transport, and spectral analysis of weighted graph Laplacians to describe the embedding geometry. Our analysis focuses on 1) studying the embedding step of data clustering and 2) comparing the spectra of graph and continuum Laplacians, linking the original spectral clustering problem with a continuum counterpart. This is joint work with Bamdad Hosseini (Caltech) and Nicolas Garcia Trillos (University of Wisconsin-Madison).

The video is available on our youtube channel.

**Wed Sept 16**

12 noon ET

### Stability of Accuracy for Deep Neural Network Classifiers

We examine the stability of accuracy for loss-minimizing training processes that are used for deep neural networks (DNN) and other classifiers. While a classifier is optimized during training by minimizing the loss function, its performance is usually evaluated by the overall accuracy which quantifies the proportion of objects that are well classified. This leads to the question of stability of accuracy: does decreasing loss through training always result in increased accuracy? We formalize the notion of stability and provide examples of instability. We obtain three novel sufficient conditions for stability of training and derive tight bounds on accuracy as loss decreases in the training. The first two conditions apply to the classifier itself by identifying small clusters of misclassified objects as a cause of instability. The third geometric condition identifies flat portions of the training data manifold as sources of instability. The derivation of this condition relies on the propagation of the previous conditions backward through the DNN layers to the data manifold. The multiscale nature of the problem due to several sizes of the small clusters requires that the estimates in the proof have to be compatible with the presence of several scales. Our results do not depend on the algorithm used for training, as long as loss decreases with training.

This is joint work with my advisor L. Berlyand and P.-E. Jabin.

A video of this talk can be found here.

**Wed Sept 9**

12 noon ET

**Provable Algorithms for Sampling Non-log-concave Distributions**

**Provable Algorithms for Sampling Non-log-concave Distributions**

A fundamental problem in Bayesian machine learning is sampling from a probability distribution given access to its log-pdf. Just as the theory of convex optimization is well-developed, so is the theory of sampling from log-concave distributions. Recent years have seen significant progress in understanding optimization beyond convexity. However, despite the ubiquity of non-log-concave distributions in practice, the theory of sampling from non-log-concave distributions is still in its infancy.

I will survey the challenges and progress in this field. A key problem is that the workhorse algorithm for sampling, Langevin Monte Carlo, can take exponential time to mix for multi-modal distributions. Addressing this problem requires bringing in more algorithmic tools and new methods of analysis.

As a case study, we consider the problem of sampling from a simple mixture of log-concave distributions. By combining Langevin diffusion with simulated tempering, we obtain a Markov process that mixes in polynomial time by transitioning between different temperatures. For the analysis, we introduce novel techniques for proving spectral gaps based on Markov process decomposition.

Covers joint work with Rong Ge and Andrej Risteski.

A recording of this talk can be found here.

**Wed Sept 2**

12 noon ET

### Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent

Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion.

Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.

A video of the talk can be found on our youtube channel.

**Wed Aug 26**

12 noon ET

**Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Network****s**

**Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Network**

**s**

In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that its limit is a global minimizer if initialized properly. We also study the "implicit bias" of this algorithm when the objective is the unregularized logistic loss. We finally discuss what these results tell us about the generalization performance. This is based on joint work with Francis Bach.

A recording of the seminar can be found here.

**Wed Aug 19**

12 noon ET

**Dimensionality reduction and matching datasets**

**Dimensionality reduction and matching datasets**

Processing large datasets is a pervasive problem occurring across many different knowledge domains. In this talk we focus on two problems motivated from tasks concerning genetic data: dimensionality reduction and matching. First, given labeled points in a high-dimensional vector space, we seek a projection onto a low dimensional subspace that maintains the classification structure of the data. Taking inspiration from large margin nearest neighbor classification, we introduce SqueezeFit, a semidefinite relaxation of this problem. This relaxation is amenable to theoretical analysis, allowing us to provably recover a planted projection operator from the data. We apply a linear programming version of SqueezeFit to the genetic marker selection problem.

Second, we introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire data set. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.

A recording of the presentation can be found here.

**Wed Aug 12**

12 noon ET

**A Few Thoughts on Deep Network Approximation**

**A Few Thoughts on Deep Network Approximation**

Deep network approximation is a powerful tool of function approximation via composition. We will present a few new thoughts on deep network approximation from the point of view of scientific computing in practice: given an arbitrary width and depth of neural networks, what is the optimal approximation rate of various function classes? Does the curse of dimensionality exist for generic functions? Can we obtain exponential convergence for generic functions?

**Wed July 29**

12 noon ET

**Tradeoffs between Robustness and Accuracy**

**Tradeoffs between Robustness and Accuracy**

Standard machine learning produces models that are highly accurate on average but that degrade dramatically when the test distribution deviates from the training distribution. While one can train robust models, this often comes at the expense of standard accuracy (on the training distribution). We study this tradeoff in two settings, adversarial examples and minority groups, creating simple examples which highlight generalization issues as a major source of this tradeoff. For adversarial examples, we show that even augmenting with correctly annotated data to promote robustness can produce less accurate models, but we develop a simple method, robust self training, that mitigates this tradeoff using unlabeled data. For minority groups, we show that overparametrization of models can hurt accuracy on the minority groups, though it improves standard accuracy. These results suggest that the "more data" and "bigger models" strategy that works well for the standard setting where train and test distributions are close, need not work on out-of-domain settings.

This is based on joint work with Sang Michael Xie, Shiori Sagawa, Pang Wei Koh, Fanny Yang, John Duchi and Percy Liang.

A recording of the presentation is available here.

**Sat July 25** 12 noon ET

### Thematic Day on the Mean Field Training of Deep Neural Networks

12pm: Roberto I. Oliveira - **A mean-field theory for certain deep neural networks**

1pm: Konstantinos Spiliopoulos - **Mean field limits of neural networks: typical behavior and fluctuations**

2pm: Huy Tuan Pham - **A general framework for the mean field limit of multilayer neural networks **

3pm: Stephan Wojtowytsch - **On the Banach spaces for multi-layer networks and connections to mean field training**

See here for abstracts and recordings of the presentations.

**Wed July 15**

12 noon ET

**On the foundations of computational mathematics, Smale’s 18th problem and the potential limits of AI**

**On the foundations of computational mathematics, Smale’s 18th problem and the potential limits of AI**

There is a profound optimism on the impact of deep learning (DL) and AI in the sciences with Geoffrey Hinton concluding that 'They should stop educating radiologists now'. However, DL has an Achilles heel: it is universaly unstable so that small changes in the initial data can lead to large errors in the final result. This has been documented in a wide variety of applications. Paradoxically, the existence of stable neural networks for these applications is guaranteed by the celebrated Universal Approximation Theorem, however, the stable neural networks are not computed by the current training approaches. We will address this problem and the potential limitations of AI from a foundations point of view. Indeed, the current situation in AI is comparable to the situation in mathematics in the early 20th century, when David Hilbert’s optimism (typically reflected in his 10th problem) suggested no limitations to what mathematics could prove and no restrictions on what computers could compute. Hilbert’s optimism was turned upside down by Goedel and Turing, who established limitations on what mathematics can prove and which problems computers can solve (however, without limiting the impact of mathematics and computer science).

We predict a similar outcome for modern AI and DL, where the limitations of AI (the main topic of Smale’s 18th problem) will be established through the foundations of computational mathematics. We sketch the beginning of such a program by demonstrating how there exist neural networks approximating classical mappings in scientific computing, however, no algorithm (even randomised) can compute such a network to even 1-digit accuracy with probability better than 1/2. We will also show how instability is inherit in the methodology of DL demonstrating that there is no easy remedy, given the current methodology. Finally, we will demonstrate basic examples in inverse problems where there exists (untrained) neural networks that can easily compute a solution to the problem, however, the current DL techniques will need 10^80 data points in the training set to get even 1% success rate.

*A recording of this talk is available **here**. The slides are available **here**. A summary of the zoom chat Q&A during the seminar by **Matthew Colbrook** is available **here**.*

**Wed July 08**

12 noon ET

### Trainability and accuracy of artificial neural networks

The methods and models of machine learning (ML) are rapidly becoming de facto tools for the analysis and interpretation of large data sets. Complex classification tasks such as speech and image recognition, automatic translation, decision making, etc. that were out of reach a decade ago are now routinely performed by computers with a high degree of reliability using (deep) neural networks. These performances suggest that DNNs may approximate high-dimensional functions with controllably small errors, potentially outperforming standard interpolation methods based e.g. on Galerkin truncation or finite elements that have been the workhorses of scientific computing. In support of this prospect, in this talk I will present results about the trainability and accuracy of neural networks, obtained by mapping the parameters of the network to a system of interacting particles relaxing on a potential determined by the loss function. This mapping can be used to prove a dynamical variant of the universal approximation theorem showing that the optimal neural network representation can be attained by (stochastic) gradient descent, with a approximation error scaling as the inverse of the network size. I will also show how these findings can be used to accelerate the training of networks and optimize their architecture, using e.g nonlocal transport involving birth/death processes in parameter space.

*A recording of this talk is available **here**. The sildes are available **here**.*

**Wed July 01**

12 noon ET

### Towards a mathematical understanding of supervised learning: What we know and what we don't know

Two of the biggest puzzles in machine learning are: Why is it so successful and why is it quite fragile?

This talk will present a framework for unraveling these puzzles from the perspective of approximating functions in high dimensions. We will discuss what's known and what's not known about the approximation generalization properties of neural network type of hypothesis space as well as the dynamics and generalization properties of the training process. We will also discuss the relative merits of shallow vs. deep neural network models and suggest ways to formulate more robust machine learning models.

This is joint work with Chao Ma, Stephan Wojtowytsch and Lei Wu.