# Past Events

Wed Nov. 22

Transformers Meet Image Denoising: Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model’s depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.

Wed Nov. 15

Understanding adversarial robustness via optimal transport perspective: theory and numeric

In this talk, I will present the recent progress of understanding adversarial multiclass classification problems, motivated by the empirical observation of the sensitivity of neural networks by small adversarial attacks. Applying minimax theorem to the adversarial training problem, we can obtain reformulations, which are 'generalized barycenter problem' and a family of multimarginal optimal transport problems. These new theoretical results reveal a rich geometric structure of adversarial training problems in multiclass classification and extend recent results restricted to the binary classification setting. Furthermore, based on this optimal transport reformulation, we can provide two efficient approximate methods which providing a lower bound of the optimal adversarial risk. The basic idea is the truncation of effective interactions between classes: with small adversarial budget, high-order interactions disappear, which implies the redundancy of higher order tensor computations.

Wed Nov. 8

Simple bias in deep learning

Why do neural network models that look so complex usually generalize well? To understand this problem, we find two simple biases in deep learning training. The first is the frequency principle that neural networks learn from low frequency to high frequency. The second is the parameter condensation, which makes the number of effective neurons in large networks far less than the actual number of neurons. These implicit biases all reflect the characteristic that neural networks tend to use simple functions to fit data in the training process, so as to achieve better generalization.

Wed Nov. 1

Adrien Weihs

Asymptotic consistency analysis of graph-based learning and Sobolev seminorm approximations

We consider the asymptotics of fractional Laplacian regularization in semi-supervised learning on graphs. In particular, we characterize the well-posed regime of the algorithm based on the value of the graph-localization parameter and we show that our method is equivalent to minimizing a graph variant of the Sobolev W^{s,2}-seminorm with pointwise constraints. Furthermore, we put this problem into the broader context of approximating Sobolev seminorms by comparing it with the W^{1,p}-case. We discuss how a nonlocal formula for the latter case further helps us to establish rates of convergence in a supervised learning problem. Finally, we demonstrate how general W^{k,p}-seminorms can be discretized on hypergraphs using the nonlocal approximation approach.

Wed Oct. 25

The discrete Schrödinger bridge, and the ensuing chaos Cancelled

E. Schrödinger studied in the 1930s a thought experiment about hot gas in which a cloud of particles evolves in time from an initial distribution to another one, possibly quite different from the initial one. He posed the problem of determining the most likely evolution among the many possible ones, a problem now known as the Schrödinger bridge problem. H. Föllmer later in the 1980s framed the problem as an entropy regularized variational problem. The Schrödinger problem underlies a number of methods in machine learning and data science, while its mathematical and statistical foundations are still being understood. After introducing the problem, several variations, and their connections to regularized optimal transport, we will study the asymptotics of the discrete Schrödinger bridge towards a continuum counterpart for a large number of particles. This will lead to a central limit theorem as well as second order Gaussian chaos limit. A novel chaos decomposition of the discrete Schrödinger bridge by polynomial functions of the pair of empirical distributions as a first and second order Taylor approximations in the space of measures is key to this asymptotic analysis. We will conclude with a brief overview of recent developments and venues for future research. This is joint work with Lang Liu and Soumik Pal <https://arxiv.org/abs/2011.08963>.

Wed Oct. 18

Kernel methods for operator learning

We introduce a kernel-based framework for learning operators between Banach spaces. We show that even with simple kernels, our approach is competitive in terms of cost-accuracy trade-off and either matches or beats the performance of Neural Network methods on a majority of PDE-based benchmarks. Additionally, our framework offers several advantages inherited from kernel methods: simplicity, interpretability, convergence guarantees, a priori error estimates, and Bayesian UQ.

Wed Oct. 4

Bridge Discrete Variables with Back-Propagation and Beyond

Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables and sparse computations. To address the challenge posed by discrete latent variables, we first assess the Straight-Through (ST) heuristic, formally establishing it works as a first-order gradient approximation. Guided by our findings, we propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs. ReinMax does not require Hessian or other second-order derivatives, thus having negligible computation overheads. Extensive experimental results on benchmark datasets demonstrate the superiority of ReinMax over the state of the art.

In addition, the inherent sparse computation of the Mixture-of-Expert (MoE) models poses unique challenges on gradient estimations. To reconciles the dense backpropagation with sparse expert routing, we present SparseMixer, a gradient estimator specifically designed for MoE. Unlike typical MoE training which strategically neglects certain gradient terms for the sake of sparse computation and scalability, SparseMixer provides scalable gradient approximations for these terms, enabling reliable gradient estimation in MoE training. Also grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a different second-order ODE solver, to deliver precise gradient approximations with negligible computational overhead. Applying SparseMixer to Switch Transformer on both pretraining and machine translation tasks, SparseMixer showcases considerable performance gain, accelerating training convergence by up to 2 times.

Wed July 5

A Fast Graph-Based Data Classification Method with Applications to 3D Sensory Data in the Form of Point Clouds

Data classification, where the goal is to divide data into predefined classes, is a fundamental problem in machine learning with many applications, including the classification of 3D sensory data. In this talk, we present a data classification method which can be applied to both semi-supervised and unsupervised learning tasks. The algorithm is derived by unifying complementary region-based and edge-based approaches; a gradient flow of the optimization energy is performed using modified auction dynamics. In addition to being unconditionally stable and efficient, the method is equipped with several properties allowing it to perform accurately even with small labeled training sets, often with considerably fewer labeled training elements compared to competing methods; this is an important advantage due to the scarcity of labeled training data. Some of the properties are: the embedding of data into a weighted similarity graph, the in-depth construction of the weights using, e.g., geometric information, the use of a combination of region-based and edge-based techniques, the incorporation of class size information and integration of random fluctuations. The effectiveness of the method is demonstrated by experiments on classification of 3D point clouds; the algorithm classifies a point cloud of more than a million points in 1-2 minutes.

Wed June 28

12 noon ET

The role of randomness in recurrent neural networks

Recurrent neural networks (RNNs) constitute the simplest machine learning paradigm that is able to handle variable-length data sequences while tracking long-term dependencies and taking into account the temporal order of the received information. In this talk, we will discuss two aspects of randomness in linear RNNs:

1. We examine the functioning of a classifying biological neural network from the perspective of statistical learning theory, modelled, in a simplified setting, as a continuous-time stochastic RNN with identity activation function. We give a generalisation error bound that holds with high probability, thus showing that the empirical risk minimiser is the best-in-class hypothesis. We show that RNNs retain a partial signature of the paths they are fed as the unique information exploited for training and classification tasks. We argue that these RNNs are easy to train and robust and back these observations with numerical experiments. This is based on a joint work with W. Bartolomeus (RWTH Aachen), S. Nestler (FZ Jülich) and H. Rauhut (RWTH Aachen).

2. Reservoir computing (i.e. randomly choosing the connectivity matrix of the RNN) is a paradigm based on the idea that universal approximation properties can be achieved for several dynamical systems without the need to optimise all parameters. This technique simplifies the training of RNNs and has shown exceptional performance in a variety of tasks. Despite this, there is a fundamental lack in the mathematical understanding of the success of such approach. We will discuss the separation capacity of such random reservoirs and how the parameters of the problem (dimension of the reservoir, geometry of the classes of time-series, etc.) impact the performance of the architecture. This is based on an ongoing work.

Wed June 21

12 noon ET

Reproducing kernel Hilbert spaces in the mean field limit

In many applications of machine learning, a large number of variables are present. Motivated by machine learning in the context of interacting particle systems, we consider kernel methods in the situation when the number of input variables goes to infinity. First, given a sequence of kernels with an increasing number of inputs, we establish the existence of a mean field limit kernel. Next, we clarify the relation between the reproducing kernel Hilbert spaces (RKHSs) of the finite-input kernels and the RKHS of the mean field limit kernel. In particular, it turns out that the latter can be interpreted as a mean field limit of the sequence of RKHSs. Finally, we consider kernel methods in the standard statistical learning theory setup, focusing for concreteness on Support Vector Machines (SVMs). We formulate an appropriate notion of mean field limit of learning problems, and show that both empirical as well as infinite-sample solutions of SVMs converge in mean field, as do their corresponding risks. On the one hand, our developments demonstrate that the mean field limit can be "pulled through" a kernel-based statistical learning theory setup, analogously to the situation in mean field optimal control. On the other hand, our investigations add a new angle to the growing body of mean field theory in the context of machine learning.

Wed June 7

12 noon ET

FISTA is an automatic geometrically optimized algorithm for strongly convex functions

This talk is devoted to the study of first order deterministic optimization schemes. This can be seen as a first step to understanding first order stochastic optimization algorithms such as ADAM that are classically used in machine learning.

More specifically, we are interested in the famous FISTA algorithm. We show that FISTA is an automatic geometrically optimized algorithm for functions satisfying a quadratic growth assumption. This explains why FISTA works better than the standard Forward-Backward algorithm (FB) in such a case, although FISTA is known to have a polynomial asymptotical convergence rate while FB is exponential. We provide a simple rule to tune the α parameter within the FISTA algorithm to reach an ε-solution with an optimal number of iterations. These new results highlight the efficiency of FISTA algorithms, and they rely on new non asymptotic bounds for FISTA.

This a joint work with Charles Dossal and Aude Rondepierre (INSA Toulouse).

Wed May 24

12 noon ET

Transfer Learning in Physics-Based Applications with Deep Neural Operators

Traditional machine learning algorithms are learned in isolation, i.e., a predictive model is trained for a single task. In cases where multiple tasks are considered, this learning approach can be computationally prohibitive. Furthermore, when only few and insufficient labeled data are available for a given task, training a model from scratch might lead to overfitting. Transfer learning allows us to leverage information from a model trained on a source domain with sufficient labeled data and transfer it to a different but closely related target domain for which only a small number of data is available. We propose a new TL framework for task-specific learning of partial differential equations (PDEs) under multiple domains that are heterogeneous but subtly correlated, based on the deep operator network (DeepONet). After the training of a source operator regression model, additional given tasks are learned via the fine-tuning of task-specific layers based on a hybrid loss function that allows for the matching of individual target samples while also preserving the global properties of the conditional distribution of target data. The task-specific training is based on the conditional embedding operator theory where conditional target distributions are embedded onto a reproducing kernel Hilbert space. We demonstrate the advantages of our approach for various TL scenarios involving nonlinear PDEs under diverse conditions due to shifts in the geometric domain and model dynamics.

Wed May 17

12 noon ET

Neural Networks Efficiently Learn Low-Dimensional Representations with SGD To Be Rearranged

We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input $x\in \mathbb{R}^d$ is Gaussian and the target $y\in \mathbb{R}$ follows a multiple-index model, with a noisy link function g. We prove that the first-layer weights of the NN converge to the k-dimensional principal subspace spanned by the vectors $u_1, \ldots, u_k$ of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when $k \ll d$. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of $\mathcal{O}(\sqrt{\frac{kd}{T}})$ after T iterations of SGD, which is independent of the width of the NN. We further demonstrate that, SGD-trained ReLU NNs can learn a single-index target of the form $y = f(\langle u, x \rangle) + \epsilon$ by recovering the principal direction, with a sample complexity linear in d (up to log factors), where f is a monotonic function with at most polynomial growth, and $\epsilon$ is the noise. This is in contrast to the known $d^{\Omega(p)}$ sample requirement to learn any degree p polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD.

This is a joint work with Alireza Mousavi-Hosseini (UofT, Vector), Sejun Park (Korea University), Ioannis Mitliagkas (UdeM, Mila), and Murat A. Erdogdu (UofT, Vector).

Wed May 10

12 noon ET

Predicting and improving generalization by measuring loss landscapes and weight matrices

This talk will present several useful metrics obtained from loss landscapes and weight matrices of neural networks. I will show how one can use these metrics to do model selection without access to any training or testing data and improve neural network test-time performance. The talk will start with a brief recap of the “phase plots” often studied in statistical physics, which can be used to taxonomize and diagnose machine learning problems based on the structure of optimization loss landscapes. Then, I will use these phase plots to motivate new ideas of network pruning and ensemble learning. Finally, I will show how to use generalization metrics from the heavy-tail analysis of neural network weight matrices to predict the quality of natural language processing models, e.g., to do model selection on Huggingface Transformers. I will also connect these generalization metrics to prior work in statistical physics and information theory.

Wed May 3

12 noon ET

Energy-driven sampling for PSD-matrix low-rank approximation

The low-rank approximation of large-scale matrices through column sampling is a widespread class of techniques in machine learning and scientific computing. As its name suggests, this approach consists in defining a low-rank approximation of a given matrix from a subset of its the columns, naturally raising questions related to the characterisation of subsets leading to efficient approximations. In practice, the column selection problem is made challenging by its combinatorial nature and the cost inherent to the assessment of the approximation errors. In this talk, we will present a pseudoconvex differentiable relaxation for the column-subset selection problem for positive-semidefinite (PSD) matrix approximation. The considered relaxation is based on the isometric representation of weighted PSD matrices as potential-vectors, significantly reducing the numerical cost related to the exploration of the underlying approximation landscape. We will describe some sampling strategies which leverage the considered framework, and illustrate their behaviour on a series of examples.

Wed Apr 26

12 noon ET

Combining longitudinal and multimodality information for predicting outcomes to cancer therapies

Cancer patients are imaged with multiple imaging modalities as part of routine cancer care. However, the rich information available from the images are not fully exploited to better manage patient care and guide precise targeted treatments. In this talk, I will present how we combine robust formulations of tissue deformations into unsupervised deep learning methods for tracking tumor and tissue changes during treatment and at follow up for early predicting treatment effect and subsequent development of tissue toxicity from systemic and radiation treatments. I will also present some recent work on incorporating unbalanced optimal mass transport with AI automated radiomics analysis methods for quantifying tumor heterogeneity and relating imaging with genomics markers.

Wed Apr 19

12 noon ET

Prediction-Assisted Screening and Discovery with Conformal p-values

Decision making or scientific discovery pipelines such as job hiring and drug discovery often involve multiple stages: before any resource-intensive step, there is often an initial screening that uses predictions from a machine learning model to shortlist a few candidates from a large pool. We study screening procedures that aim to select candidates whose unobserved outcomes exceed user-specified values. We develop a method that wraps around any prediction model to produce a subset of candidates while controlling the proportion of falsely selected units. Building upon the conformal inference framework, our method first constructs p-values that quantify the statistical evidence for large outcomes; it then determines the shortlist by comparing the p-values to a threshold introduced in the multiple testing literature. We will discuss the theoretical underpinning of this method, and some ongoing work on extending this framework to distribution shift scenarios. We demonstrate the empirical performance of our method via simulations, and apply it to job hiring and drug discovery datasets.

Wed Apr 12

12 noon ET

Ensuring Exploration and Exploitation in Graph-Based Active Learning

Semi-supervised learning methods leverage both labeled and unlabeled data to achieve an accurate classification with significantly fewer training points as compared to supervised learning methods that only use labeled data. Simultaneously, the choice of training points can significantly affect classifier performance, especially due to the limited size of the training set of labeled data in semi-supervised learning. Active learning seeks to judiciously select a limited number of query points from the unlabeled data that will inform the machine learning task at hand. These points are then labeled by an expert, or human in the loop, with the aim of significantly improving the classifier performance.

Uncertainty sampling has traditionally been the de facto, simplest acquisition function for active learning in semi-supervised learning. Comparatively cheap to compute and straightforward to interpret, uncertainty sampling has been known to suffer from myopic sampling bias that fails to properly explore the extent of geometric structure of the dataset prior to exploiting learning decision boundaries. As such, most work in active learning for graph-based learning has focused on the design of more intricate acquisition functions that are explorative in nature, though are almost always more costly to compute and therefore do not scale well to larger datasets. We will present two different uncertainty sampling methods for graph-based active learning that (1) have provable guarantees for exploration and exploitation in graph-based semi-supervised learning given an assumption on the clustering structure of the data and (2) are efficient to compute to allow for scalability in downstream applications.

Wed Apr 5

12 noon ET

Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints

Penalty-based regularization is extremely popular in ML. However, this powerful technique can require an expensive trial-and-error process for tuning the penalty coefficient. In this paper, we take sparse training of deep neural networks as a case study to illustrate the advantages of a constrained optimization approach: improved tunability, and a more interpretable hyperparameter. Our proposed technique (i) has a negligible computational overhead, (ii) reliably achieves arbitrary sparsity targets “in one shot” while retaining high accuracy, and (iii) scales successfully to large residual models and datasets.

In this talk, I will also give a brief introduction to Cooper (https://github.com/cooper-org/cooper/), a general-purpose, deep learning-first library for constrained optimization in Pytorch. Cooper was developed as part of the research direction above, and was born out of the need to handle constrained optimization problems for which the loss or constraints may not be "nicely behaved" or "theoretically tractable", as is often the case in DL.

Wed Mar 29

12 noon ET

Images and fibers of the realization map for architectures of feedforward ReLU neural networks

The system of neural architectures is key to the mathematical richness and success of neural networks for at least two reasons. Firstly, the stratification of the space of all functions according to which architecture(s) can realize the function exploits a tension between flexibility and rigidity -- in the sense that any function can be approximated within a sufficiently complicated architecture, but each architecture can represented only a small subset of the space of all functions. Secondly, a neural architecture provides a parameterization of the set of associated functions -- the realization map takes a point in parameter space (a list of weight and biases) and outputs a function. The image of the realization map (for a fixed choice of architecture) is the set of functions that can be represented by that architecture; a fiber of the realization map is a set of parameters that all determine the same function. Neural networks practitioners interact with the space of all functions by selecting an architecture and then using (e.g. during training) the parameterization of function space provided by the realization map on parameter space. Thus, for any fixed choice of architecture, an important question is "What are the image and fibers of the realization map?"

I will present some partial answers to this question for architectures of feedforward ReLU neural networks. In particular, I will present examples of fibers with nontrivial topological structures. I will also describe results about the unbounded rays in and dimension of the images of neural network maps of given architectures, and constraints imposed on the topology of sublevel sets by the architecture. This is based on joint work with Eli Grigsby, David Rolnick, Robert Meyerhoff, and Chenxi Wu

Wed Mar 22

12 noon ET

Topology in machine learning: A case study on persistence images

Wigner described the unreasonable effectiveness of mathematics in the natural sciences: ideas from mathematics are unreasonably effective in advancing applications, and ideas from applications are unreasonably effective in advancing mathematics. We describe a case study on "persistence images", which live at the intersection of topology and machine learning, and which can be thought of as a way to "vectorize" geometry for use in machine learning tasks. I will survey applications arising from materials science, computer vision, and agent-based modeling (modeling a flock of birds or a school of fish), and show how these applications also inspire new topology questions. Joint work with Sofya Chepushtanova, Tegan Emerson, Eric Hanson, Michael Kirby, Francis Motta, Rachel Neville, Chris Peterson, Patrick Shipman, and Lori Ziegelmeier.

Wed Mar 15

12 noon ET

Combining longitudinal and multimodality information for predicting outcomes to cancer therapies To Be Rearranged

Cancer patients are imaged with multiple imaging modalities as part of routine cancer care. However, the rich information available from the images are not fully exploited to better manage patient care and guide precise targeted treatments. In this talk, I will present how we combine robust formulations of tissue deformations into unsupervised deep learning methods for tracking tumor and tissue changes during treatment and at follow up for early predicting treatment effect and subsequent development of tissue toxicity from systemic and radiation treatments. I will also present some recent work on incorporating unbalanced optimal mass transport with AI automated radiomics analysis methods for quantifying tumor heterogeneity and relating imaging with genomics markers.

This seminar will be rearranged.

Wed Mar 8

12 noon ET

Provable Training of Neural Nets With One Layer of Activation

Provable neural training is a fundamental challenge in the field of deep-learning theory – and it largely remains an open question for almost any neural net of practical relevance. The quest for provable convergence for neural training algorithms almost always leads to exciting new questions in mathematics. In this talk, I shall give an overview of three convergence proofs of ours in this territory: (1) In 2016 we had shown the first deterministic algorithm that converges to the exact global minima of any convex loss function for any depth 2 ReLU neural net for any training data in time that is only polynomial in the training data size. (2) In 2020 we showed the first stochastic algorithm that converges to the global minima of a single ReLU gate in linear time (exponentially fast convergence) for realizable data whilst not assuming any specific distribution for the inputs. (3) In 2022, in a first-of-its-kind result we leveraged the theory of SDEs and Villani functions to show that SGD converges to the global minima of an appropriately Frobenius norm regularized squared loss on any depth 2 neural net with tanh or sigmoid activations – for arbitrary width and data. We shall end the talk delineating various open questions in this direction that can possibly be tackled in the near future.

Wed Mar 1

12 noon ET

Using Algebraic Factorizations for Interpretable Learning

Non-negative Matrix Factorization (NMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. In this talk, we will introduce the main concept of NMF, its implementation, and its online and streaming variations. We will showcase how mathematical tools like this can be used for interpretable learning tasks. These applications range from imaging and medicine to forecasting and collaborative filtering. Discussion and questions are welcome.

Wed Feb 22

12 noon ET

Self-Supervised Learning for 3D Shape Analysis

While neural networks have swept the field of computer vision and are replacing classical methods in many areas of image analysis and beyond, extending their power to the domain of 3D shape analysis remains an important open challenge. In my presentation, I will focus on the problems of shape matching, correspondence estimation and shape interpolation and develop suitable deep learning approaches to tackle these challenges. In particular, I will focus on the difficult problem of computing correspondence and interpolation for pairs of shapes from different classes -- say a human and a horse -- where traditional isometry assumptions no longer hold.

Wed Feb 15

12 noon ET

Functional dimension of ReLU Networks

The parameter space for any fixed architecture of neural networks serves as a proxy during training for the associated class of functions - but how faithful is this representation? For any fixed feedforward ReLU network architecture with at least one hidden layer, it is well-known that many different parameter settings can determine the same function. It is less well-known that the degree of this redundancy is inhomogeneous across parameter space. This inhomogeneity should impact the dynamics of training via gradient descent, especially when compared with recent work suggesting that SGD favors flat minima of the loss landscape.

In this talk, I will carefully define the notion of the local functional dimension of a feedforward ReLU network function, discuss the relationship between local functional dimension of a parameter and the geometry of the underlying decomposition of the domain into linear regions, and present some preliminary experimental results suggesting that the probability distribution of the functional dimension at initialization is both interesting and architecture-dependent. Some of this work is joint with Kathryn Lindsey, Rob Meyerhoff, and Chenxi Wu, and some is joint with Kathryn Lindsey and David Rolnick.

Wed Feb 8

12 noon ET

Pruning Deep Neural Networks for Lottery Tickets

Deep learning continues to impress us with breakthroughs across disciplines but comes at severe computational and memory costs that limit the global participation in the development of related technologies. Can we address some of these challenges by finding and training smaller models? The lottery ticket hypothesis has given us hope that this question might be answered by pruning randomly initialized neural networks. In this talk, we will prove a strong version of this hypothesis in realistic settings. Inspired by our theory, we will create a test environment that allows us to highlight current limitations of state-of-the-art algorithms in finding extremely sparse lottery tickets and highlight some opportunities for future progress.

Wed Jan 25

12 noon ET

Learning Low Bending and Low Distortion Manifold Embeddings

Autoencoders, which consist of an encoder and a decoder, are widely used in machine learning for dimension reduction of high-dimensional data.

The encoder embeds the input data manifold into a lower-dimensional latent space, while the decoder represents the inverse map, providing a parametrization of the data manifold by the manifold in latent space. A good regularity and structure of the embedded manifold may substantially simplify further data processing tasks such as cluster analysis or data interpolation.

In this talk we propose and analyze a novel regularization for learning the encoder component of an autoencoder: a loss functional that prefers isometric, extrinsically flat embeddings and allows to train the encoder on its own.

To perform the training it is assumed that for pairs of nearby points on the input manifold their local Riemannian distance and their local Riemannian average can be evaluated. The loss functional is computed via Monte Carlo integration with different sampling strategies for pairs of points on the input manifold. The main theorem identifies a geometric loss functional of the embedding map as the $\Gamma$-limit of the sampling-dependent loss functionals.

Numerical tests, using image data that encodes different explicitly given data manifolds, show that smooth manifold embeddings into latent space are obtained. Due to the promotion of extrinsic flatness, these embeddings are regular enough such that interpolation between not too distant points on the manifold is well approximated by linear interpolation in latent space as one possible postprocessing.

This is joint work with Juliane Braunsmann, Marko Rajkovic, and Benedikt Wirth.

Wed Jan 18

6 AM ET

Principled scaling of deep neural networks

Neural networks have achieved impressive performance in many applications such as image recognition and generation, and speech recognition. State-of-the-art performance is usually achieved via a series of engineered modifications to existing neural architectures and their training procedures. However, a common feature of these systems is their large-scale nature. Indeed, modern neural networks usually contain millions - if not billions - of trainable parameters, and empirical evaluations (generally) support the claim that increasing the scale of neural networks (e.g. width and depth in a multi-layer perceptron) enhances the model performance. However, given a neural network model, it is not straightforward to address the crucial question `how do we scale the network?'. In this talk, I will discuss certain properties of large-scale neural networks and show how we can leverage different mathematical results to build robust networks with empirically confirmed benefits.

# 2022

Wed Dec 14

12 noon ET

Unbiasing Procedures for Scale-invariant Multi-reference Alignment

Recent advances in applications such as cryo-electron microscopy have sparked increased interest in the mathematical analysis of multi-reference alignment (MRA) problems, where the goal is to recover a hidden signal from many noisy observations. The simplest model considers observations of a 1-d hidden signal which have been randomly translated and corrupted by high additive noise. This talk generalizes this classic problem by incorporating random dilations into the data model in addition to random translations and additive noise, and explores multiple approaches to its solution based on translation invariant representations. Random dilations cause large perturbations in the high frequencies, making this a challenging model. When the dilation distribution is unknown, the power spectrum of the hidden signal can be approximated by applying a nonlinear unbiasing procedure to a wavelet-based, translation invariant representation and then solving an optimization problem. When the dilation distribution is known, a more accurate unbiasing procedure can be applied directly to the empirical Fourier invariants to obtain an unbiased estimator of the Fourier invariants of the hidden signal, and the convergence rate of the estimator can be precisely quantified in terms of the sample size and noise levels. Theoretical results are supported by extensive numerical experiments on a wide range of signals. Time permitting, we will also see how these signal processing tools can be applied in the novel context of distribution learning from biased, sparse batches.

Wed Dec 7

12 noon ET

CANCELLED

Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints

This seminar has been postponed.

Penalty-based regularization is extremely popular in ML. However, this powerful technique can require an expensive trial-and-error process for tuning the penalty coefficient. In this paper, we take sparse training of deep neural networks as a case study to illustrate the advantages of a constrained optimization approach: improved tunability, and a more interpretable hyperparameter. Our proposed technique (i) has a negligible computational overhead, (ii) reliably achieves arbitrary sparsity targets “in one shot” while retaining high accuracy, and (iii) scales successfully to large residual models and datasets.

In this talk, I will also give a brief introduction to Cooper (https://github.com/cooper-org/cooper/), a general-purpose, deep learning-first library for constrained optimization in Pytorch. Cooper was developed as part of the research direction above, and was born out of the need to handle constrained optimization problems for which the loss or constraints may not be "nicely behaved" or "theoretically tractable", as is often the case in DL.

Wed Nov 30

12 noon ET

Information theory through kernel methods

Estimating and computing entropies of probability distributions are key computational tasks throughout data science. In many situations, the underlying distributions are only known through the expectation of some feature vectors, which has led to a series of works within kernel methods. In this talk, I will explore the particular situation where the feature vector is a rank-one positive definite matrix, and show how the associated expectations (a covariance matrix) can be used with information divergences from quantum information theory to draw direct links with the classical notions of Shannon entropies.

Wed Nov 16

12 noon ET

The mathematical foundations of deep learning: from rating impossibility to practical existence theorems

Deep learning is having a profound impact on scientific research. Yet, while deep neural networks continue show impressive performance in a wide variety of fields, their mathematical foundations are far from being well established. In this talk, we will present recent developments in this area by illustrating two case studies.

First, motivated by applications in cognitive science, we will present “rating impossibility" theorems. These theorems identify frameworks where neural networks are provably unable to generalize outside the training set while performing the seemingly simple task of learning identity effects, i.e. classifying pairs of objects as identical or not.

Second, motivated by applications in scientific computing, we will illustrate “practical existence" theorems. These theorems combine universal approximation results for deep neural networks with compressed sensing and high-dimensional polynomial approximation theory. As a result, they yield sufficient conditions on the network architecture, the training strategy, and the number of samples able to guarantee accurate approximation of smooth functions of many variables.

Time permitting, we will also discuss work in progress and open questions in this research area.

Wed Nov 9

12 noon ET

Towards a new generation of neural PDE surrogates

Partial differential equations (PDEs) see widespread use in sciences and engineering to describe simulation of physical processes as scalar and vector fields interacting and coevolving over time.

Due to the computationally expensive nature of their standard solution methods, neural PDE surrogates have become an active research topic to accelerate these simulations.

However, the practical utility of training such surrogates is contingent on their ability to model complex multi-scale spatio-temporal phenomena.

In this talk, we address two challenges of neural network PDE surrogates:

(i) How to take into account the relationship between different fields and their internal components, which are often correlated?

We therefore view the time evolution of correlated fields through the lens of multivector fields allows which allows us to introduce new operations that are grounded on the algebraic properties of Clifford algebras.

(ii) How to model multi-scale phenomena and generalize across timescales and equations? We therefore analyze design considerations of Fourier and U-Net based PDE surrogate models, showing promising results on generalization to different PDE parameters and time-scales with a single surrogate model.

We conclude by given an outlook how the proposed methods directly help to tackle imminent challenges in neural PDE surrogate modeling.

Wed Nov 2

12 noon ET

Testing Independence of Exchangeable Random Variables

Given well-shuffled data, can we determine whether the data items are statistically (in)dependent? Formally, we consider the problem of testing whether a set of exchangeable random variables are independent. We will show that this is possible and develop tests that can confidently reject the null hypothesis that data is independent and identically distributed and have high power for (some) exchangeable distributions. We will make no structural assumptions on the underlying sample space. One potential application is in Deep Learning, where data is often scraped from the whole internet, with duplications abound, which can render data non-iid and test-set evaluation prone to give wrong answers.

Wed Oct 26

12 noon ET

Uniform convergence rates for infinity Laplacian equations on graphs

The last few years have seen a deluge of discrete-to-continuum convergence results for various problems in graph-based semi-supervised learning. This theory makes connections between discrete numerical models on one hand and continuum partial differential equations or variational problems on the other hand. Current works use either a Gamma-convergence framework or PDE techniques, which give uniform convergence rates, albeit with more restrictive conditions on the graph bandwidth that often rule out the sparse graphs used in practice. In this talk I will show how to obtain uniform convergence rates for Lipschitz learning, i.e., solutions of the graph infinity Laplacian equation which depend only on the convergence rates of the graph distance function. The proof utilizes the comparison-with-cones characterization and involves a homogenized non-local operator with a much larger bandwidth. This allows us to prove convergence rates for all graph bandwidths strictly above the connectivity threshold. Furthermore, using percolation theory we can even obtain improved convergence rates for graph bandwidths on the connectivity threshold.

This is joint work with Jeff Calder and Tim Roith.

Wed Oct 19

12 noon ET

Adaptivity in Domain Adaptation and Friends

Domain adaptation, transfer, multitask, meta, few-shots, representation, or lifelong learning … these are all important recent directions in ML that all touch at the core of what we might mean by ‘AI’. As these directions all concern learning in heterogeneous and ever-changing environments, they all share a central question: what information a data distribution may have about another, critically, in the context of a given estimation problem, e.g., classification, regression, bandits, etc.

Our understanding of these problems is still rather fledgeling. We plan to present both some recent positive results and also some negative ones. On one hand, recent measures of discrepancy between distributions, fine-tuned to given estimation problems (classification, bandits, etc) offer a more optimistic picture than existing probability metrics (e.g. Wasserstein, TV) or divergences (KL, Renyi, etc) in terms of achievable rates. On the other hand, when considering seemingly simple extensions to choices between multiple datasets (as in multitask), or multiple prediction models (as in Structural Risk Minimization), it turns out that minimax oracle rates are not always adaptively achievable, i.e., using just the available data without side information. These negative results suggest that domain adaptation is more structured in practice than captured by common invariants considered in the literature.

The talk will be based on joint work with collaborators over the last few years, namely, G. Martinet, S. Hanneke, J. Suk.

Wed Oct 12

12 noon ET

Circumventing the curse of dimensionality with deep neural networks

Although the application of deep neural networks to real-world problems has become ubiquitous, the question of why they are so effective has not yet been satisfactorily answered. However, some progress has been made in establishing an underlying mathematical foundation. This talk surveys results on statistical risk bounds of deep neural networks. In particular, we focus on the question of when neural networks bypass the curse of dimensionality. Here we discuss results for vanilla feedforward and convolutional neural networks as well as regression and classification settings. This talk is based on several joint works with Alina Braun, Michael Kohler, Adam Krzyzak, Johannes Schmidt-Hieber and Harro Walk.

Wed Oct 5

12 noon ET

On the Resolution of a Theoretical Question Related to the Nature of Local Training in Federated Learning

We study distributed optimization methods based on the local training (LT) paradigm - achieving improved communication efficiency by performing richer local gradient-based training on the clients before parameter averaging - which is of key importance in federated learning. Looking back at the progress of the field in the last decade, we identify 5 generations of LT methods: 1) heuristic, 2) homogeneous, 3) sublinear, 4) linear, and 5) accelerated. The 5th generation, initiated by the ProxSkip method of Mishchenko et al (2022) and its analysis, is characterized by the first theoretical confirmation that LT is a communication acceleration mechanism. In this talk, I will explain the problem, its solution, and some subsequent work generalizing, extending and improving the ProxSkip method in various ways.

References:

1. Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich and Peter Richtárik. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally! Proceedings of the 39th International Conference on Machine Learning, 2022

2. Grigory Malinovsky, Kai Yi and Peter Richtárik. Variance reduced ProxSkip: Algorithm, theory and application to federated learning, arXiv:2207.04338, 2022

3. Laurent Condat and Peter Richtárik. RandProx: Primal-dual optimization algorithms with randomized proximal updates, arXiv:2207.12891, 2022

4. Abdurakhmon Sadiev, Dmitry Kovalev and Peter Richtárik. Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact prox, arXiv:2207.03957, 2022

Wed Sept 21

12 noon ET

Implications of the implicit bias in neural networks

When training large neural networks, there are generally many weight combinations that will perfectly fit the training data. However, gradient-based training methods somehow tend to reach those which generalize well, and understanding this "implicit bias" has been a subject of extensive research. In this talk, I will discuss recent works which show several implications of the implicit bias (in homogeneous neural networks trained with the logistic loss): (1) In shallow univariate neural networks the implicit bias provably implies generalization; (2) By using a characterization of the implicit bias, it is possible to reconstruct a significant fraction of the training data from the parameters of a trained neural network, which might shed light on representation learning and memorization in deep learning, but might also have negative potential implications on privacy; (3) In certain settings, the implicit bias provably implies convergence to non-robust networks, i.e., networks which are susceptible to adversarial examples.

Based on joint works with Niv Haim, Itay Safran, Gilad Yehudai, Michal Irani, Jason D. Lee, and Ohad Shamir.

Wed Sept 14

12 noon ET

High-dimensional asymptotics of feature learning in the early phase of NN training

We study the first gradient descent step on the first-layer weights W in a two-layer neural network, where the parameters are randomly initialized, and the training objective is the empirical MSE loss. In the proportional asymptotic limit (where the training set size n, the number of input features d, and the network width N all diverge at the same rate), and under an idealized student-teacher setting, we show that the first gradient update contains a rank-1 "spike", which results in an alignment between the first-layer weights and the linear component of the teacher model f*. To characterize the impact of this alignment, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on W with learning rate \eta. We consider two scalings of the first step learning rate \eta. For small \eta, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random feature model, but cannot defeat the best linear model on the input. Whereas for sufficiently large \eta, we prove that for certain f^*, the same ridge estimator on trained features can go beyond this "linear regime" and outperform a wide range of (fixed) kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.

Wed July 6

12 noon ET

Compressing Variational Bayes

Recent research has shown surprising connections between variational inference (VI) and data compression. This talk will review several theoretical and applied contributions in this domain. I will first show VI enables estimating the information rate-distortion function of a real-world data set, a quantity that could previously not be estimated beyond a few dimensions. I will then show how semi-amortized variational inference can improve neural image compression, how posterior uncertainties enable compression at variable bitrates, and how sequential variational autoencoders can be converted into practical video codecs. Finally, I show how neural video codecs can inspire design choices for extending diffusion models to video data.

Wed June 22

6 am ET

NOTE THE DIFFERENT TIME

SGD Through the Lens of Kolmogorov Complexity

In this talk, we present global convergence guarantees for stochastic gradient descent (SGD) via an entropy compression argument. We do so under two main assumptions: (1. Local progress) The model accuracy improves on average over batches. (2. Models compute simple functions) The function computed by the model has low Kolmogorov complexity. It is sufficient that these assumptions hold only for a tiny fraction of the epochs. Intuitively, our results imply that intermittent local progress of SGD implies global progress. Assumption 2 trivially holds for underparameterized models, hence, our work gives the first convergence guarantee for general, underparameterized models. Furthermore, this is the first result that is completely model agnostic - we don't require the model to have any specific architecture or activation function, it may not even be a neural network.

Wed June 8

12 noon ET

Symmetry in Neural Network Parameters and Non-Linearities

The success of equivariant neural networks has made clear the advantages of considering symmetries of data domains and task functions when designing model architectures. Here, we consider instead symmetries inside the parameter space of neural networks. Group actions on the parameter space which leave the loss invariant can be exploited to accelerate optimization or perform model compression. We study, in particular, a class of models with radial rescaling activations which provide an interesting alternative to pointwise activations and increase symmetry in the parameter space. We prove this class has universal approximation even in the case of fixed width and unbounded domain. We also prove a lossless compression algorithm and show that gradient descent for the compressed model corresponds to a form of projected gradient descent for the original model. In the general case, I will discuss non-linear group actions in the parameter space which can be used to teleport model parameters and increase convergence rate. This talk includes joint work with Iordan Ganev and Twan van Laarhoven as well as Bo Zhao, Nima Dehmamy, and Rose Yu.

Wed May 25

12 noon ET

Beyond Regression: Operators and Extrapolation in Machine Learning

In this talk we first suggest a unification of regression-based machine learning methods, including kernel regression and various types of neural networks. We then consider the limitations of such methods, especially the curse-of-dimensionality, and the various potential solutions that have been proposed including: (1) Barron's existence result, (2) leveraging regularity, and (3) assuming special structure in the data such as independence or redundancy. Finally, we consider operator-learning and extrapolation as emerging directions for machine learning. Operator-learning is the more developed of the two, and we show how learning operators allows intrinsic regularization, uncertainty quantification, and can represent many-to-one and one-to-many mappings. However, extrapolation remains the final frontier in machine learning, and we discuss an emerging approach and the mathematics that may underly it.

Wed May 11

12 noon ET

Smale’s 18th Problem and the Barriers of Deep Learning

Deep learning (DL) has had unprecedented success and is now rapidly entering scientific computing (SC). However, DL suffers from a universal phenomenon: instability, despite universal approximation results that often guarantee the existence of stable and accurate neural networks (NNs). We show the following paradox. There are well-conditioned problems in SC where one can prove the existence of NNs with great approximation qualities, however, there does not exist any algorithm that can train such a NN. For any positive integers n>2 and M, there are cases where simultaneously: (a) no algorithm can train a NN correct to n digits, (b) there exists an algorithm that trains a NN with n-1 correct digits, but any such algorithm needs arbitrarily many training data, (c) there exists an algorithm that trains a NN with n-2 correct digits using no more than M training samples. These results provide basic foundations for Smale's 18th problem and imply a classification theory describing conditions under which (stable) NNs with a given accuracy can be trained. We begin this theory by initiating a unified theory for compressed sensing and DL, leading to sufficient conditions for the existence of training algorithms for stable NNs in inverse problems. We introduce FIRENETs, which we prove and numerically verify are stable. FIRENETs only require O(log(1/eps)) hidden layers for an eps-accurate solution to the inverse problem.

[1] M.J. Colbrook, V. Antun, and A.C. Hansen, “The difficulty of computing stable and accurate neural networks: On the barriers of deep learning and Smale’s 18th problem,” PNAS 119.12 (2022).

[2] M.J. Colbrook, “WARPd: A linearly convergent first-order method for inverse problems with approximate sharpness conditions.” arXiv:2110.12437 (2021).

[3] C.Q. Choi, “Some AI Systems May Be Impossible to Compute,” IEEE Spectrum, https://spectrum.ieee.org/deep-neural-network

Wed May 4

12 noon ET

Understanding and improving generalization in multitask and transfer learning

A broad research question in ML is how to build models that generalize across various tasks. This talk will describe my recent works addressing this question. First, I will talk about some new analyses of multitask learning, which explains negative transfer using recent developments in deep learning theory. Second, I will talk about transfer learning from pretrained deep neural networks. I will describe our approach to avoid overfitting, with provable generalization using Hessians. Lastly, I will briefly talk about a related multigroup learning problem, and a recent contrastive learning approach for solving this problem.

Wed April 27

12 noon ET

Explaining Neural Network Classifiers: Hurdles and Progress

Neural Networks have become the standard tools for high-dimensional decision making, e.g. medical imaging, autonomous driving, playing complex games.

Even in high-stakes areas they generally operate as black-box algorithms without a legible decision process. This has birthed the field of explainable artificial intelligence (XAI). The first step for XAI-methods is to discern between the relevant and irrelevant input components for a decision.

In this talk, we formalise this idea by extending the concept of prime implicants from abductive reasoning to a probabilistic setting. This setting captures what many XAI practitioners intuitively aim for.

We show that finding such small implicants, even approximately, is a computationally hard problem. Furthermore, good solutions depend strongly on the underlying probability distribution. We present strategies to overcome both problems and discuss what challenges still remain.

Wed April 20 12 noon ET

Computational Graph Completion

We present a framework for generating, organizing, and reasoning with computational knowledge. It is motivated by the observation that most problems in Computational Sciences and Engineering (CSE) can be formulated as that of completing (from data) a computational graph (or hypergraph) representing dependencies between functions and variables. Nodes represent variables, and edges represent functions. Functions and variables may be known, unknown, or random. Data comes in the form of observations of distinct values of a finite number of subsets of the variables of the graph (satisfying its functional dependencies). The underlying problem combines a regression problem (approximating unknown functions) with a matrix completion problem (recovering unobserved variables in the data). Replacing unknown functions by Gaussian Processes (GPs) and conditioning on observed data provides a simple but efficient approach to completing such graphs. Since this completion process can be automatized, as one solves $\sqrt{2}$ on a pocket calculator without thinking about it, one could, with the proposed framework, solve a complex CSE problem by drawing a diagram.

Wed April 13 12 noon ET

Towards understanding the role of noise in non-convex machine learning dynamics

It has been empirically shown that the noise induced by the Stochastic Gradient Descent algorithm when training neural networks generally enhances its generalisation performance in comparison to full-batch training (gradient descent). In this talk, we will try to understand how SGD-like noise biases the training dynamics towards specific prediction functions for regression tasks. More precisely, we will first show that the dynamics of SGD over diagonal linear networks converges towards a sparser linear estimator than the one retrieved by GD. Going further, we will also show that adding label noise biases the dynamics towards implicitly solving a Lasso program. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances of stochastic dynamics over deterministic ones, as observed in practice.

Wed April 6

12 noon ET

Deep Learning Approximation of Diffeomorphisms via Linear-Control Systems

In the last years it has been observed that Residual Neural Networks (ResNets) can be interpreted as discretizations of control systems. This can be a valuable tool for a further mathematical understanding of Machine Learning, since it bridges ResNets (and, more generally, Deep Learning) with Control Theory. This parallelism can be useful to study existing architectures and to develop new ones. In particular, in the present seminar we investigate ResNets obtained from linear-control systems. Despite their simplicity, recent theoretical results guarantee that they could be surprisingly expressive. We will focus on the problem of producing an approximation of a diffeomorphism after observing its action on a finite ensemble of points (the dataset). In this framework, the training of the ResNet corresponds to the resolution of a proper Optimal Control problem. Finally, we will see that, owing to the linear dependence of the system in the controls, the training algorithms based on Pontryagin Maximum Principle can be carried out with low computational effort.

# 2021

Wed December 8

12 noon ET

R-Adaptivity, Deep Learning and Optimal Transport

PINNS (physics inspired neural networks) have recently become popular as a means of solving ODEs and PDES by using the tools of deep learning. They have both shown promise for solving some differential equations, and have struggled to solve others. Whilst advertised as being 'mesh free methods' they do rely on the use of collocation points. The accuracy of the numerical solution of PDEs using Finite Element methods depends crucially on the choice of an appropriate mesh. This can be obtained an r-adaptive strategy, which equidistributes the error over the mesh elements based on a-priori/posteriori knowledge of the solution. The core of this talk will describe how r-adaptivity can be useful in the context of Deep Learning. First, we will show that a one-dimensional mesh can be equidistributed by training a feed forward Neural Network. This approach yields better results than other standard numerical methods. We will then explain the training process of Physics-informed Neural Networks (PINNs) for solving Boundary value problems (BVPs) and show numerical results for a reaction-diffusion and convection-dominated equation. It appears that PINNs fail to be trained in the latter case unless the homotropy method is employed. Finally, we will introduce the Deep-Ritz-Network (DRN) for solving the Poisson equation on a non-convex 2-dimensional domain. If the collocation points are uniformly random sampled and fixed for the entire training process, we obtain a solution with poor accuracy. On the contrary, the adoption of an Optimal Transport strategy, which determines the 'optimal' collocation points, results in a more stable training process and a much more accurate solution.

Wed December 1

12 noon ET

Deep Learning in High Dimension: Neural Network Approximation of Analytic Maps of Gaussians

For artificial deep neural networks with ReLU activation,

we prove new expression rate bounds for parametric, analytic functions where the parameter dimension could be infinite.

Approximation rates are in mean square on the unbounded parameter range with respect to product gaussian measure. Approximation rate bounds are free from the CoD, and determined by summability of Wiener-Hermite PC expansion coefficients.

Sufficient conditions for summability are quantified holomorphy on products of strips in the complex domain. Applications comprise DNN expression rate bounds of deep-NNs for response surfaces of elliptic PDEs with log-gaussian random field inputs, and for the posterior densities of the corresponding Bayesian inverse problems.

Variants of proofs which are constructive are outlined.

(joint work with Jakob Zech, University of Heidelberg, Germany, and with Dinh Dung and Nguyen Van Kien, Hanoi, Vietnam)

References:

Wed November 24

12 noon ET

Wasserstein Embeddings in the Deep Learning Era

Computational optimal transport has found many applications in machine learning and, more specifically, deep learning as a fundamental tool to manipulate and compare probability distributions. The Wasserstein distances arising from the optimal transport problem have been of particular interest in recent years. However, a consistent roadblock against the more prevalent use of transport-based methods has been their computational cost. Besides the more well-known ideas for faster computational approaches, including entropy regularization, several fundamental concepts have emerged that enable the integration of transport-based methods as part of the computational graph of a deep neural network. Sliced-Wasserstein distances and the Linear Optimal Transport (LOT) framework are among fundamental concepts well suited for integration into today's deep neural networks. In this talk, we will present the idea of Linear Optimal Transport (otherwise known as the Wasserstein Embedding) and its extension to Sliced-Wasserstein Embeddings and demonstrate their various applications in deep learning with a particular interest in learning from graphs and set-structured data. The talk will be an overview of our recent ICLR 2021 and NeurIPS 2021 publications.

Wed November 10

12 noon ET

Mean field theory in Inverse Problems: from Bayesian inference to overparameterization of networks

Bayesian sampling and neural networks are seemingly two different machine learning areas, but they both deal with many particle systems. In sampling, one evolves a large number of samples (particles) to match a target distribution function, and in optimizing over-parameterized neural networks, one can view neurons particles that feed each other information in the DNN flow. These perspectives allow us to employ mean-field theory, a powerful tool that translates dynamics of many particle system into a partial differential equation (PDE), so rich PDE analysis techniques can be used to understand both the convergence of sampling methods and the zero-loss property of over-parameterization of ResNets. We showcase the use of mean-field theory in these two machine learning areas, and we also invite the audience to brainstorm other possible applications.

Wed November 3

12 noon ET

Joint Reconstruction-Segmentation with Graph PDEs

In most practical image segmentation tasks, the image to be segmented will need to first be reconstructed from indirect, damaged, and/or noisy observations. Traditionally, this reconstruction-segmentation task would be done in sequence: first apply the reconstruction method, and then the segmentation method. Joint reconstruction-segmentation is a method for using segmentation and reconstruction techniques simultaneously, to use information from the segmentation to guide the reconstruction, and vice versa. Past work on this has employed relatively simple segmentation algorithms, such as the Chan–Vese algorithm. In this talk, we will demonstrate how joint reconstruction-segmentation can be done using the graph-PDE-based segmentation techniques developed by Bertozzi & Flenner (2012) and Merkurjev, Kostic, & Bertozzi (2013), with ideas drawn from Budd & van Gennip (2020) and Budd, van Gennip, & Latz (2021).

Wed October 27

12 noon ET

Asymptotic Analysis of Deep Residual Networks

Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in neural ODE literature: one may obtain an alternative ODE limit, a stochastic differential equation or neither of these. The scaling regime one ends up with depends on certain features of the network architecture, such as the smoothness of the activation function. These findings cast doubts on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit. In the case where the scaling limit is a stochastic differential equation, the deep network limit is shown to be described by a system of forward-backward stochastic differential equations. Joint work with: Alain-Sam Cohen (InstaDeep Ltd), Alain Rossier (Oxford), RenYuan Xu (University of Southern California).

Wed October 20

12 noon ET

Kernel Stein Discrepancy Descent

Among dissimilarities between probability distributions, the Kernel Stein Discrepancy (KSD) has received much interest recently. We investigate the properties of its Wasserstein gradient flow to approximate a target probability distribution known up to a normalization constant. This leads to a straightforwardly implementable, deterministic score-based method, named KSD Descent, which uses a set of particles to approximate the target distribution. Remarkably, owing to a tractable loss function, KSD Descent can leverage robust parameter-free optimization schemes such as L-BFGS; this contrasts with other popular particle-based schemes such as the Stein Variational Gradient Descent algorithm. We study the convergence properties of KSD Descent and demonstrate its practical relevance. However, we also highlight failure cases by showing that the algorithm can get stuck in spurious local minima.

Wed October 13

12 noon ET

What Kinds of Functions Do Neural Networks Learn?

Neural nets have made an amazing comeback during the past decade. Their empirical success has been truly phenomenal, but neural nets are poorly understood in a mathematical sense compared to classical methods like splines, kernels, and wavelets. This talk describes recent steps towards a mathematical theory of neural networks comparable to the foundations we have for classical nonparametric methods. Surprisingly, neural nets are minimax optimal in a wide variety of classical univariate function spaces, including those handled by splines and wavelets. In multivariate settings, neural nets are solutions to data-fitting problems cast in entirely new types of multivariate function spaces characterized through total variation (TV) measured in the Radon transform domain. Deep multilayer neural nets naturally represent compositions of functions in these Radon-TV spaces. This theory provides novel explanations for many notable empirical discoveries in deep learning and suggests new approaches to training neural networks.

This is joint work with Rahul Parhi.

Wed October 6

12 noon ET

Approximation properties of two-layer neural networks with values in a Banach space

Approximation properties of infinitely wide neural networks have been studied by several authors in the last few years. New function spaces have been introduced that consist of functions that can be efficiently (i.e., with dimension-independent rates) approximated by neural networks of finite width. Typically, these functions are supposed to act between Euclidean spaces, typically with a high-dimensional input space and a lower-dimensional output space. As neural networks gain popularity in inherently infinite-dimensional settings such as inverse problems and imaging, it becomes necessary to analyse the properties of neural networks as nonlinear operators acting between infinite-dimensional spaces. In this talk, I will present dimension-independent Monte-Carlo rates for neural networks acting between Banach spaces with a partial order (vector lattices), where the ReLU nonlinearity will be interpreted as the lattice operation of taking the positive part.

Wed September 22

12 noon ET

Robust W-GAN-Based Estimation Under Wasserstein Contamination

Robust estimation is an important problem in statistics which aims at providing a reasonable estimator when the data-generating distribution lies within an appropriately defined ball around an uncontaminated distribution. Although minimax rates of estimation have been established in recent years, many existing robust estimators with provably optimal convergence rates are also computationally intractable. In this talk, we study several estimation problems under a Wasserstein contamination model and present computationally tractable estimators motivated by generative adversarial networks (GANs). Specifically, we analyze properties of Wasserstein GAN-based estimators for location estimation, covariance matrix estimation, and linear regression and show that our proposed estimators are minimax optimal in many scenarios. Finally, we present numerical results which demonstrate the effectiveness of our estimators. This is joint work with Zheng Liu (UW-Madison).

Wed September 15

12 noon ET

Bilevel Learning for Inverse Problems

Variational regularization techniques are dominant in the field of inverse problems. A drawback of these techniques is that they are dependent on a number of parameters which have to be set by the user. This issue can be approached by machine learning where we estimate these parameters from data. This is known as "Bilevel Learning" and has been successfully applied to many tasks, some as small-dimensional as learning a regularization parameter, others as high-dimensional as learning a sampling pattern in MRI. While mathematically appealing this strategy leads to a nested optimization problem which is computationally difficult to handle. In this talk we discuss several applications of bilevel learning for imaging as well as new computational approaches. There are quite a few open problems in this relatively recent field of study, some of which I will highlight along the way.

Wed September 8

12 noon ET

Generalized Energy-Based Models

I will describe Generalized Energy Based Models (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model, as in a Generative Adversarial Network), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). In particular, while the energy function is analogous to the GAN critic function, it is not discarded after training. GEBMs are trained by alternating between learning the energy and the base, much like a GAN. Both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via Langevin diffusion-based methods (MALA, UAL, HMC), thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also return state-of-the-art performance on density modelling tasks, and when using base measures with an explicit form.

Wed September 1

12 noon ET

Learning Linear Operators

Neural networks have shown great success at learning function approximators between spaces X and Y. In many problems arising in physics it is desirable to learn maps between spaces of functions X and Y; this may be either for the purposes of scientific discovery, or to provide cheap surrogate models which accelerate computations. New ideas are needed to successfully address this learning problem in a scalable, efficient manner. I will highlight recent progress in this area, explaining different approaches taken, and describing numerical results which demonstrate empirical success of the methodology.

I will then describe recent work concerned with this problem in the setting of learning linear operators between Hilbert spaces, distinguishing between compact, bounded and unbounded operators; the work is hence relevant to the question of learning solution operators for inverse problems, as well as forward problems. A Bayesian approach is adopted, and linked to a statistical learning perspective. Sharp rates are obtained for posterior contraction, and related to bounds on the excess risk and generalization gap.

Joint work with Maarten De Hoop, Nik Kovachki and Nick Nelsen.

Wed August 25

12 noon ET

Structure-preserving machine learning for inverse problems

Inverse problems naturally arise in many scientific settings and the study of these problems has been crucial in the development of important technologies such as medical imaging. In inverse problems, the goal is to estimate an underlying ground truth object, typically an image, from corresponding measurements, where the measurements and the ground truth are connected by some forward operator and noise-generating process (both of which are generally assumed to be known). The solution of inverse problems is usually complicated by ill-posedness. Variational regularisation is a well-established approach to overcoming this ill-posedness. In this approach an image is reconstructed from measurements by solving a minimisation problem that trades off data fit with a penalty on unrealistic images. While this approach has proven very successful, it generally requires the parts that make up the optimisation problem to be carefully chosen, and the optimisation problem may require considerable computational effort to solve. There is an active line of research into overcoming these issues using data-driven approaches. In this talk I will discuss ways in which favourable properties of the variational regularisation approach can be combined with a data-driven approach to solving inverse problems. I will conclude by speaking about some interesting directions for future work.

Wed August 18

12 noon ET

Approximating functions, functionals, and operators using deep neural networks for diverse applications

We will review Physics-informed neural network and summarize available theoretical results.. We will also introduce new NNs that learn functionals and nonlinear operators from functions and corresponding responses for system identification. The universal approximation theorem of operators is suggestive of the potential of NNs in learning from scattered data any continuous operator or complex system. We first generalize the theorem to deep neural networks, and subsequently we apply it to design a new composite NN with small generalization error, the deep operator network (DeepONet), consisting of a NN for encoding the discrete input function space (branch net) and another NN for encoding the domain of the output functions (trunk net). We demonstrate that DeepONet can learn various explicit operators, e.g., integrals, Laplace transforms and fractional Laplacians, as well as implicit operators that represent deterministic and stochastic differential equations. More generally, DeepOnet can learn multiscale operators spanning across many scales and trained by diverse sources of data simultaneously.

Wed August 4

12 noon ET

Deep neural networks for inverse problems with pseudodifferential operators: an application to limited-angle tomography

In this talk, I present a novel convolutional neural network (CNN), called $\Psi$DONet, designed for learning pseudodifferential operators in the context of linear inverse problems. The starting point is the Iterative Soft Thresholding Algorithm (ISTA), a well-known algorithm to solve sparsity-promoting minimization problems. I will show that, under rather general assumptions on the forward operator, the unfolded iterations of ISTA can be interpreted as the successive layers of a CNN. In turn this provides fairly general network architectures that, for a specific choice of the parameters involved, allow to reproduce ISTA, or a perturbation of ISTA for which the coefficients of the filters can be bounded. I will use as case study the limited-angle X-ray transform and its application to limited-angle computed tomography (LA-CT). In particular, I will prove that, in the case of LA-CT, the operations of upscaling, downscaling and convolution, which characterize $\Psi$DONet and most deep learning schemes, can be exactly determined by combining the convolutional nature of the limited angle X-ray transform and basic properties defining an orthogonal wavelet system. Finally, I propose two different implementations of $\Psi$DONet which are tested on simulated data from LA geometry, generated from the ellipse data set. This is a joint work with L. Ratti, M. Galinier, M. Lassas, M. Prato and S. Siltanen.

Wed July 28

12 noon ET

PDE Inspired Graph Neural Networks

In this talk we discuss new architectures for graph neural networks. We show that by small modifications of existing architectures we obtain networks with predicted characters that can be tailored for specific tasks. We derive networks that are based on diffusion and on wave propagation on a graph and show their theoretical as well as their practical characters. This understanding enables us to either compare or beat the state of the art for many known benchmarks.

Wed July 21

12 noon ET

Geometric Deep Learning: Grids, Graphs, Groups, Geodesics and Gauges

The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to be beyond reach –such as computer vision, playing Go, or protein folding – are in fact feasible with appropriate computational scale. Remarkably, the essence of deep learning is built from two simple algorithmic principles: first, the notion of representation or feature learning, whereby adapted, often hierarchical, features capture the appropriate notion of regularity for each task, and second, learning by local gradient-descent type methods, typically implemented as backpropagation.

While learning generic functions in high dimensions is a cursed estimation problem, most tasks of interest are not generic, and come with essential pre-defined regularities arising from the underlying low-dimensionality and structure of the physical world. This talk is concerned with exposing these regularities through unified geometric principles that can be applied throughout a wide spectrum of applications.

Such a ‘geometric unification’ endeavour in the spirit of Felix Klein's Erlangen Program serves a dual purpose: on one hand, it provides a common mathematical framework to study the most successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. On the other hand, it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provide principled way to build future architectures yet to be invented.

Wed July 14

Advances of Momentum in Optimization Algorithms and Neural Architecture Design

We will present a few recent results on leveraging momentum techniques to improve stochastic optimization and neural architecture design.

First, designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called {\em MomentumRNNs}. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. Also, we show the empirical advantage of the momentum-enhanced RNNs over the baseline models.

Second, we will present the recent advances of adaptive momentum in accelerating the stochastic gradient descent (SGD). The adaptive momentum-assisted SGD remarkably improves the deep neural network training in terms of acceleration and improved generalization and significantly reduces the effort for hyperparameter tuning.

Wed June 23

12 noon ET

Variational models and gradient flows for graph clustering

Discrete graph-based variants of the Allen--Cahn and total variation variational models have proven to be successful tools for clustering and classification on graphs. In this talk we will study these models and the gradient flows that are derived from them. We will see deep connections between the various discrete gradient flows as well as between the discrete gradient flows and their continuum relatives.

Wed June 9

12 noon ET

Convergence of Stochastic Gradient Descent for analytic target functions

In this talk we discuss almost sure convergence of Stochastic Gradient Descent in discrete and continuous time for a given twice continuously-differentiable target function F. In a first step we give assumptions on the step-sizes and perturbation size to ensure convergence of the target value F and gradient f=DF assuming that f is locally Hölder-continuous. This result entails convergence of the iterates itself in the case where F does not possess a continuum of critical points.

In a general non-convex setting with F possibly containing a rich set of critical points, convergence of the process itself is sometimes taken for granted, but actually is a non-trivial issue as there are solutions to the gradient flow ODE for smooth target functions that stay in a compact set but do not converge. Using the Lojasiewicz-inequality we give sharp bounds on the step-sizes and the size of the perturbation in order to guarantee convergence of the SGD scheme for analytic target functions. Also, we derive the convergence rate of the function value under the assumptions that F satisfies a particular Lojasiewicz-inequality with exponent in [1/2,1). Finally, we compare the discrete and continuous time results and discuss optimality of the assumptions. This is joint work with Steffen Dereich (WWU Münster).

Wed June 2

12 noon ET

Learning based multi-scale modeling

The behavior of materials involve physics at multiple length and time scales: electronic, atomistic, domains, defects etc. The engineering properties that we observe and exploit in application are a sum total of all these interactions. Multiscale modeling seeks to understand how the physics at the finer scales affect the coarser scales. This can be challenging for two reasons. First, it is computationally expensive due to the need to repeatedly solve the finer scale model. Second, it requires a priori (empirical) knowledge of the aspects of the finer-scale behavior that affect the coarser scale (order parameters, state variables, descriptors, etc.). This is especially challenging in situations where the behavior depends on time. We regard the solution of the finer-scale model as an input-output map (possibly between infinite dimensional spaces), and introduce a a general framework for the data-driven approximation of such maps. The proposed approach is motivated by the recent successes of neural networks and deep learning, in combination with ideas from model reduction. This combination results in a neural network approximation that is computationally inexpensive, independent of the need for a priori knowledge, and can be used directly in the coarser scale calculations. We demonstrate the ideas with examples drawn from first principles study of defects and crystal plasticity study of inelastic impact. The work draws from collaborations with the Caltech PDE-ML group and in particular Burigede Liu, Nikola Kovachki and Ying Shi Teh.

Kaushik Bhattacharya is Howell N. Tyson, Sr., Professor of Mechanics and Professor of Materials Science as well as the Vice-Provost at the California Institute of Technology. He received his B.Tech degree from the Indian Institute of Technology, Madras, India in 1986, his Ph.D from the University of Minnesota in 1991 and his post-doctoral training at the Courant Institute for Mathematical Sciences during 1991-1993. He joined Caltech in 1993. His research concerns the mechanical behavior of materials, and specifically uses theory to guide the development of new materials. He has received the von Kármán Medal of the Society of Industrial and Applied Mathematics (2020), Distinguished Alumni Award of the Indian Institute of Technology, Madras (2019), the Outstanding Achievement Award of the University of Minnesota (2018), the Warner T. Koiter Medal of the American Society of Mechanical Engineering (2015) and the Graduate Student Council Teaching and Mentoring Award at Caltech (2013).

Wed May 26

12 noon ET

The Stein geometry in machine learning: gradient flows, optimal transport and large deviations

Sampling or approximating high-dimensional probability distributions is a key challenge in computational statistics and machine learning. This talk will present connections to gradient flow PDEs, optimal transport and interacting particle systems, focusing on the recently introduced Stein variational gradient descent methodology and some variations. The construction induces a novel geometrical structure on the set of probability distributions related to a positive definite kernel function. We discuss the corresponding geodesic equations, infinitesimal optimal transport maps, as well as large deviation functionals. This is joint work with A. Duncan (Imperial College London), L. Szpruch (University of Edinburgh) and M. Renger (Weierstrass Institute Berlin).

Wed May 19

12 noon ET

Smooth bilevel programming for sparse regularisation

Nonsmooth regularisers are widely used in machine learning for enforcing solution structures (such as the l1 norm for sparsity or the nuclear norm for low rank). State of the art solvers are typically first order methods or coordinate descent methods which handle nonsmoothness by careful smooth approximations and support pruning. In this work, we revisit the approach of iteratively reweighted least squares (IRLS) and show how a simple reparameterization coupled with a bilevel resolution leads to a smooth unconstrained problem. We are therefore able to exploit the machinery of smooth optimisation, such as BFGS, to obtain local superlinear convergence. The result is a highly versatile approach which is able to significantly outperform state of the art methods for a wide range of problems.

The video recording will be available until 20 June 2021.

Wed May 12

12 noon ET

Transport information Bregman divergences

In this talk, we talk about a joint intersection between optimal transport and information geometry. We study Bregman divergences in probability density space embedded with the Wasserstein-2 metric. Several properties and dualities of transport Bregman divergences are provided. In particular, we derive the transport Kullback-Leibler (KL) divergence by a Bregman divergence of negative Boltzmann-Shannon entropy in Wasserstein-2 space. We also derive analytical formulas of transport KL divergence for one-dimensional probability densities and Gaussian families.

Wed May 5

12 noon ET

Adversarial Classification, Optimal Transport, and Geometric Flows

The purpose of this talk is to provide an explicit link between the three topics that form the talk's title, and to introduce a new perspective (more dynamic and geometric) to understand robust classification problems. For concreteness, we will discuss a version of adversarial classification where an adversary is empowered to corrupt data inputs up to some distance \epsilon. We will first describe necessary conditions associated with the optimal classifier subject to such an adversary. Then, using the necessary conditions we derive a geometric evolution equation which can be used to track the change in classification boundaries as \veps varies. This evolution equation may be described as an uncoupled system of differential equations in one dimension, or as a mean curvature type equation in higher dimension. In one dimension we rigorously prove that one can use the initial value problem starting from \veps=0, which is simply the Bayes classifier, to solve for the global minimizer of the adversarial problem. Global optimality is certified using a duality principle between the original adversarial problem and an optimal transport problem. Several open questions and directions for further research will be discussed.

Wed Apr 28

12 noon ET

Barriers to Deploying Deep Learning Models During the COVID-19 Pandemic

A promising application for deep learning models is in assisting clinicians with interpreting X-ray and CT scans, especially when treating respiratory diseases. At the onset of the COVID-19 pandemic, radiologists had to quickly learn how to identify a new disease on chest X-rays and CT scans, and use this information to decide how to allocate scarce resources like ventilators. Researchers around the world developed deep learning models to help clinicians with these decisions, and some models were deployed after only three weeks of testing.

Our group reviewed over 1,000 studies that introduce deep learning models for interpreting chest X-rays or CT scans of COVID-19 patients to determine which models, if any, have the potential to help clinicians during the pandemic. In this talk, I will present our findings and discuss how this pandemic could inform researchers creating deployable deep learning models in healthcare.

This talk is based on the paper [1] (arXiv version). The recording can be found here.

[1] Roberts, M., Driggs, D., Thorpe, M., and the AIX-COVNET Collaboration. "Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans”. Nat. Mach. Intel. 3, 199–217 (2021).

Wed Apr 21

12 noon ET

Machine Learning and Dynamical Systems meet in Reproducing Kernel Hilbert Spaces

Since its inception in the 19th century through the efforts of Poincaré and Lyapunov, the theory of dynamical systems addresses the qualitative behaviour of dynamical systems as understood from models. From this perspective, the modeling of dynamical processes in applications requires a detailed understanding of the processes to be analyzed. This deep understanding leads to a model, which is an approximation of the observed reality and is often expressed by a system of Ordinary/Partial, Underdetermined (Control), Deterministic/Stochastic differential or difference equations. While models are very precise for many processes, for some of the most challenging applications of dynamical systems (such as climate dynamics, brain dynamics, biological systems or the financial markets), the development of such models is notably difficult. On the other hand, the field of machine learning is concerned with algorithms designed to accomplish a certain task, whose performance improves with the input of more data. Applications for machine learning methods include computer vision, stock market analysis, speech recognition, recommender systems and sentiment analysis in social media. The machine learning approach is invaluable in settings where no explicit model is formulated, but measurement data is available. This is frequently the case in many systems of interest, and the development of data-driven technologies is becoming increasingly important in many applications. The intersection of the fields of dynamical systems and machine learning is largely unexplored and the objective of this talk is to show that working in reproducing kernel Hilbert spaces offers tools for a data-based theory of nonlinear dynamical systems. In this talk, we introduce a data-based approach to estimating key quantities which arise in the study of nonlinear autonomous, control and random dynamical systems. Our approach hinges on the observation that much of the existing linear theory may be readily extended to nonlinear systems - with a reasonable expectation of success- once the nonlinear system has been mapped into a high or infinite dimensional Reproducing Kernel Hilbert Space. In particular, we develop computable, non-parametric estimators approximating controllability and observability energies for nonlinear systems. We apply this approach to the problem of model reduction of nonlinear control systems. It is also shown that the controllability energy estimator provides a key means for approximating the invariant measure of an ergodic, stochastically forced nonlinear system. We also show how kernel methods can be used to detect critical transitions for some multi scale dynamical systems. We also use the method of kernel flows to predict some chaotic dynamical systems. Finally, we show how kernel methods can be used to approximate center manifolds, propose a data-based version of the centre manifold theorem and construct Lyapunov functions for nonlinear ODEs. This is joint work with Jake Bouvrie (MIT, USA), Peter Giesl (University of Sussex, UK), Christian Kuehn (TUM, Munich/Germany), Romit Malik (ANNL), Sameh Mohamed (SUTD, Singapore), Houman Owhadi (Caltech), Martin Rasmussen (Imperial College London), Kevin Webster (Imperial College London), Bernard Hasasdonk and Dominik Wittwar (University of Stuttgart), Gabriele Santin (Fondazione Bruno Kessler).

Wed Apr 7

12 noon ET

Discrete Optimization Methods for Group Model Selection in Compressed Sensing

In this talk we study the problem of signal recovery for group models. More precisely for a given set of groups, each containing a small subset of indices, and for given linear sketches of the true signal vector which is known to be group-sparse in the sense that its support is contained in the union of a small number of these groups, we study algorithms which successfully recover the true signal just by the knowledge of its linear sketches. We derive model projection complexity results and algorithms for more general group models than the state-of-the-art. We consider two versions of the classical Iterative Hard Thresholding algorithm (IHT). The classical version iteratively calculates the exact projection of a vector onto the group model, while the approximate version (AM-IHT) uses a head- and a tail-approximation iteratively. We apply both variants to group models and analyze the two cases where the sensing matrix is a Gaussian matrix and a model expander matrix.

slides (updated concerning Head and Tail approximation)

Wed Mar 24

12 noon ET

Random walks and PDEs in graph-based learning

I will discuss some applications of random walks and PDEs in graph-based learning, both for theoretical analysis and algorithm development. Graph-based learning is a field within machine learning that uses similarities between datapoints to create efficient representations of high-dimensional data for tasks like semi-supervised classification, clustering and dimension reduction. There has been considerable interest recently in semi-supervised learning problems with very few labeled examples (e.g., 1 label per class). The widely used Laplacian regularization is ill-posed at low label rates and gives very poor classification results. In the first part of the talk, we will use the random walk interpretation of the graph Laplacian to precisely characterize the lowest label rate at which Laplacian regularized semi-supervised learning is well-posed. At lower label rates, where Laplace learning performs poorly, we will show how our random walk analysis leads to a new algorithm, called Poisson learning, that is probably more stable and informative than Laplace learning. We will conclude with some applications of Poisson learning to image classification and mesh segmentation of broken bone fragments of interest in anthropology.

Wed Mar 17

12 noon ET

Function Approximation via Sparse Random Fourier Features

Random feature methods have been successful in various machine learning tasks, are easy to compute, and come with theoretical accuracy bounds. They serve as an alternative approach to standard neural networks since they can represent similar function spaces without a costly training phase. However, for accuracy, random feature methods require more measurements than trainable parameters, limiting their use for data-scarce applications or problems in scientific machine learning. This paper introduces the sparse random feature method that learns parsimonious random feature models utilizing techniques from compressive sensing. We provide uniform bounds on the approximation error for functions in a reproducing kernel Hilbert space depending on the number of samples and the distribution of features. The error bounds improve with additional structural conditions, such as coordinate sparsity, compact clusters of the spectrum, or rapid spectral decay. We show that the sparse random feature method outperforms shallow networks for well-structured functions and applications to scientific machine learning tasks.

Link available upon request until March 24th.

Wed Mar 10

12 noon ET

Finite Width, Large Depth Neural Networks as Perturbatively Solvable Models

Deep neural networks are often considered to be complicated "black boxes," for which a systematic analysis is not only out of reach but potentially impossible. In this talk, which is based on ongoing joint work with Dan Roberts and Sho Yaida, I will make the opposite claim. Namely, that deep neural networks at initialization are perturbatively solvable models. The perturbative parameter is the width n of the network and we can obtain corrections to all orders in n. Our approach applies to networks at finite width n and large depth L. A key point is an emergent tension between depth and width. Large values of n make neural networks more like Gaussian processes, which are well behaved but incapable of feature learning due to a frozen NTK (at least with standard initialization schemes). Large values of L, in contrast, amplify higher cumulants and change in the NTK, both of which scale with the network aspect ratio L/n.

Wed Mar 3

12 noon ET

Structure preservation and convergence in scientific machine learning

Physics-informed techniques have emerged as a means of incorporating prior knowledge into machine learning. These techniques generally function by minimizing a hybrid loss, regularizing a traditional $\ell_2$ error with a PDE residual. While remarkably effective, these approaches suffer two major shortcomings. Firstly, such neural network (NN) solutions of PDEs generally fail to converge with increasing architecture size. Despite recent work establishing NNs may approximate at least as well as hp-finite element spaces, in practice when training with gradient methods O(1) optimization errors prevent realizing consistency. Secondly, the regularized losses introduce physics via a penalized residual, and it is well known from classical numerical analysis that the approximation space must be designed in tandem with the residual to ensure converge to a given PDE.

We conjecture that the same tools used to design convergent and structure-preserving properties in forward simulation may be used to design scientific ML architectures with similar guarantees. In this talk, we present two current works which address each of these issues. First, we introduce partition of unity networks (POUnets) to develop convergent approximation with deep networks. It has been shown that traditional feed forward networks may approximate by emulating partitions of unity (POU), and then emulating monomials on each partition, ultimately yielding a localized polynomial approximation and associated hp-convergence. Rather than emulating these components, POUnets function by directly incorporating both the POU and polynomials into the architecture. The resulting approximation breaks the curse of dimensionality and admits a fast least-squares optimization strategy. Predictions are competitive with high-order finite element spaces, and provide superior approximation for problems with reduced regularity.

Secondly, we introduce a data-driven exterior calculus (DDEC) which may be used to endow scientific ML architectures with the structure-preserving properties of mimetic PDE discretization. Traditional mimetic methods function by exploiting the exterior calculus structures offered by a mesh to construct discrete operators that exactly mimic the topological properties of continuum operators. We show how graphs may be used as a surrogate for the topology offered by graphs, and present new network architectures which allows "physics-informed" machine learning which exactly preserves conservation, guarantees extraction of well-posed problems, and allows handling of the non-trivial null-spaces occurring in fields such as electromagnetics.

If time permits, we will additionally share some current results applying these tools in challenging data-driven modeling effort at Sandia, related to data-driven shock hydrodynamics in metals and discovery of surrogates for semiconductors in radiation environments.

Wed Feb 17

12 noon ET

Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability

In this paper we revisit some classic problems in classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate η. In a recent work, Diakonikolas, Gouleakis, and Tzamos resolved a long-standing problem by giving the first efficient learner which achieves error η+ϵ for any ϵ>0. However, their algorithm outputs a complicated hypothesis which partitions space into poly(d,1/ϵ) regions. Here, we give a simpler algorithm and in the process resolve a number of outstanding open questions: (1) We give the first proper learner for Massart halfspaces that achieves error η+ϵ. We also give improved bounds on the sample complexity achievable by polynomial time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure which converts an arbitrarily complex classifier into an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show that any SQ algorithm requires super-polynomially many queries to achieve error 𝖮𝖯𝖳+ϵ. Moreover, we study classification under generalized linear models where E[Y|X] = \sigma(w \cdot X) for any odd, monotone, and Lipschitz function \sigma This family includes the previously mentioned halfspace models as a special case, but also includes models like logistic regression where the noise level increases near the decision boundary. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Finally, we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties.

Wed Feb 10

12 noon ET

Convergence and optimality of single-layer neural networks for reinforcement learning

Recent groundbreaking results have established a convergence theory for wide neural networks in the supervised learning setting. Depending on the scaling of parameters at initialization, the (stochastic) gradient descent dynamics of these models converge towards different deterministic limits known as the mean-field and the lazy training regimes.

In this talk, we extend some of these recent results to examples of prototypical algorithms in reinforcement learning: Temporal-Difference (TD) learning and Policy Gradients. In the first case, we prove convergence and optimality of wide neural network training dynamics in the lazy and mean-field regime, respectively. To establish these results, we bypass the lack of gradient structure of the TD learning dynamics by leveraging Lyapunov function techniques in the lazy training regime and sufficient expressivity of the activation function in the mean-field framework. We further show that similar optimality results hold for wide, single layer neural networks trained by entropy-regularized softmax Policy Gradients despite the nonlinear and nonconvex nature of the risk function in this setting.

This is work in collaboration with Jianfeng Lu.

Wed Feb 3

12 noon ET

Total Variation Minimization on Graphs for Semisupervised and Unsupervised Machine Learning

I will speak about a general class of machine learning problems in which data lives on similarity graphs and the goal is to solve a penalized graph min-cut problem. Applications include semi-supervised learning, unsupervised learning, and modularity optimization – originally developed for community detection on networks – but recast as an unsupervised machine learning problem. These problems have a mathematical connection to total variation minimization in Euclidean space and this analogy leads to a natural class of machine learning algorithms that mimic pseudo-spectral methods in nonlinear partial differential equations. The methods are graph analogues of geometric motion – e.g. Motion by mean curvature and the MBO scheme to approximate that dynamics.

Wed Jan 20

12 noon ET

Geometric Methods for Machine Learning and Optimization

Many machine learning applications involve non-Euclidean data, such as graphs, strings or matrices. In such cases, exploiting Riemannian geometry can deliver algorithms that are computationally superior to standard (Euclidean) nonlinear programming approaches. This observation has resulted in an increasing interest in Riemannian methods in the optimization and machine learning community.

In the first part of the talk, we consider the task of learning a robust classifier in hyperbolic space. Such spaces have received a surge of interest for representing large-scale, hierarchical data, due to the fact that they achieve better representation accuracy with lower dimensions. We present the first theoretical guarantees for the (robust) large-margin learning problem in hyperbolic space and discuss conditions under which hyperbolic methods are guaranteed to surpass the performance of their Euclidean counterparts. In the second part, we introduce Riemannian Frank-Wolfe (RFW) methods for constraint optimization on manifolds. Here, the goal of the theoretical analysis is two-fold: We first show that RFW converges at a nonasymptotic sublinear rate, recovering the best-known guarantees for its Euclidean counterpart. Secondly, we discuss how to implement the method efficiently on matrix manifolds. Finally, we consider applications of RFW to the computation of Riemannian centroids and Wasserstein barycenters, which are crucial subroutines in many machine learning methods.

Based on joint work with Suvrit Sra (MIT) and Manzil Zaheer, Ankit Singh Rawat, Aditya Menon and Sanjiv Kumar (all Google Research).

Wed Jan 13

12 noon ET

Machine Learned Regularization for Solving Inverse Problems

Inverse problems are about the reconstruction of an unknown physical quantity from indirect measurements. Most inverse problems of interest are ill-posed and require appropriate mathematical treatment for recovering meaningful solutions. Regularization is one of the main mechanisms to turn inverse problems into well-posed ones by adding prior information about the unknown quantity to the problem, often in the form of assumed regularity of solutions. Classically, such regularization approaches are handcrafted. Examples include Tikhonov regularization, the total variation and several sparsity-promoting regularizers such as the L1 norm of Wavelet coefficients of the solution. While such handcrafted approaches deliver mathematically and computationally robust solutions to inverse problems, providing a universal approach to their solution, they are also limited by our ability to model solution properties and to realise these regularization approaches computationally.

Recently, a new paradigm has been introduced to the regularization of inverse problems, which derives regularization approaches for inverse problems in a data driven way. Here, regularization is not mathematically modelled in the classical sense, but modelled by highly over-parametrised models, typically deep neural networks, that are adapted to the inverse problems at hand by appropriately selected (and usually plenty of) training data.

In this talk, I will review some machine learning based regularization techniques, present some work on unsupervised and deeply learned convex regularisers and their application to image reconstruction from tomographic and blurred measurements, and finish by discussing some open mathematical problems.

# 2020

Wed Dec 16

12 noon ET

The dual of the margin: improved analyses and rates for gradient descent’s implicit bias

The implicit bias of gradient descent, and specifically its margin maximization properties, have arisen as a promising explanation for the good generalization of deep networks. The purpose of this talk is to demonstrate the effectiveness of a dual problem to smoothed margin maximization. Concretely, this talk will develop this dual, as well as a variety of consequences in linear and nonlinear settings.

In the linear case, this dual perspective firstly will yield fast 1/t rates for margin maximization and implicit bias. This is faster than any prior first-order hard-margin SVM solver, which achieves 1/sqrt{t} at best.

Secondly, the dual analysis also allows a characterization of the implicit bias, even outside the standard setting of exponentially-tailed losses; in this sense, it is gradient descent, and not a particular loss structure which leads to implicit bias.

In the nonlinear case, duality will enable the proof of a gradient alignment property: asymptotically, the parameters and their gradients become colinear. Although abstract, this property in turn implies various existing and new margin maximization results.

Joint work with Matus Telgarsky.

Wed Dec 09

12 noon ET

A PDE Interpretation of Prediction with Expert Advice

We study the problem of prediction of binary sequences with expert advice in the online setting, which is a classic example of online machine learning. We interpret the binary sequence as the price history of a stock, and view the predictor as an investor, which converts the problem into a stock prediction problem. In this framework, an investor, who predicts the daily movements of a stock, and an adversarial market, who controls the stock, play against each other over N turns. The investor combines the predictions of n ≥ 2 experts in order to make a decision about how much to invest at each turn, and aims to minimize their regret with respect to the best-performing expert at the end of the game. We consider the problem with history-dependent experts, in which each expert uses the previous d days of history of the market in making their predictions. The prediction problem is played (in part) over a discrete graph called the d dimensional de Bruijn graph.

We focus on an appropriate continuum limit and using methods from optimal control, graph theory, and partial differential equations, we discuss strategies for the investor and the adversarial market. We prove that the value function for this game, rescaled appropriately, converges as N → ∞ at a rate of O(N-1/2) (for C4 payoff functions) to the viscosity solution of a nonlinear degenerate parabolic PDE. It can be understood as the Hamilton-Jacobi-Issacs equation for the two-person game. As a result, we are able to deduce asymptotically optimal strategies for the investor.

This is joint work with Robert Kohn and Jeff Calder.

Wed Nov 25

12 noon ET

Neural network performance for classification problems with boundaries of Barron class

We study classification problems in which the distances between the different classes are not necessarily positive, but for which the boundaries between the classes are well-behaved. More precisely, we assume these boundaries to be locally described by graphs of functions of Barron-class. ReLU neural networks can approximate and estimate classification functions of this type with rates independent of the ambient dimension. More formally, three-layer networks with $N$ neurons can approximate such functions with $L^1$-error bounded by $O(N^{-1/2})$. Furthermore, given $m$ training samples from such a function, and using ReLU networks of a suitable architecture as the hypothesis space, any empirical risk minimizer has generalization error bounded by $O(m^{-1/4})$. All implied constants depend only polynomially on the input dimension. We also discuss the optimality of these rates. Our results mostly rely on the "Fourier-analytic" Barron spaces that consist of functions with finite first Fourier moment. But since several different function spaces have been dubbed "Barron spaces'' in the recent literature, we discuss how these spaces relate to each other. We will see that they differ more than the existing literature suggests.

Wed Nov 18

12 noon ET

Conditional Sampling with Monotone GANs: Modifying Generative Models to Solve Inverse Problems

Generative models such as GANs, VAEs and Normalizing Flows have been very successful in the unsupervised learning task of generating samples from a high-dimensional probability distribution from its empirical approximation. However, the task of conditioning a high-dimensional distribution from limited empirical samples has attracted less attention in the literature. In this talk we will discuss some ideas in this direction by viewing generative modelling as a measure transport problem. In particular, we present a simple recipe using block-triangular maps and monotonicity constraints that enables standard models such as vanilla GAN to perform conditional sampling. We demonstrate the effectiveness of our method on various examples ranging from synthetic test sets to image in-painting and function space inference in porous medium flow.

The video of the talk is available here.

Wed Nov 11

12 noon ET

A Dynamical Central Limit Theorem for Shallow Neural Networks

Recent theoretical works have characterized the dynamics of wide shallow neural networks trained via gradient descent in an asymptotic mean-field limit when the number of parameters tends towards infinity. At initialization, the random sampling of the parameters leads to a fluctuation from the mean-field limit dictated by the classical Central Limit Theorem (CLT). However, as gradient descent induces correlations among the parameters, the question of how the deviation evolves during training remains unclear. Here, we prove that the deviation from the mean-field limit scaled by the width, in the width-asymptotic limit, remains bounded throughout training. The upper bound is given by a Monte-Carlo type resampling error, which does not depend explicitly on the dimension. It then motivates the use of the 2-norm of the underlying measure as a regularization term, which is related to the generalization error as well via the theory of variation-norm function spaces. Moreover, if the mean-field dynamics converges to a measure that interpolates the training data, we prove that the asymptotic deviation eventually vanishes in the CLT scaling. We also complement these results with numerical experiments.

Wed Nov 04

12 noon ET

Analysis of Stochastic Gradient Descent in Continuous Time

Stochastic gradient descent is an optimisation method that combines classical gradient descent with random subsampling within the target functional. In this work, we introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent. The stochastic gradient process is a dynamical system that is coupled with a continuous-time Markov process living on a finite state space. The dynamical system - a gradient flow - represents the gradient descent part, the process on the finite state space represents the random subsampling. Processes of this type are, for instance, used to model clonal populations in fluctuating environments. After introducing it, we study theoretical properties of the stochastic gradient process. We show that it converges weakly to the gradient flow with respect to the full target function, as the learning rate approaches zero. Moreover, we give assumptions under which the stochastic gradient process is exponentially ergodic in the Wasserstein sense. We then additionally assume that the single target functions are strongly convex and the learning rate goes to zero sufficiently slowly. In this case, the process converges with exponential rate to a distribution arbitrarily close to the point mass concentrated in the global minimum of the full target function. We conclude with a discussion of discretisation strategies for the stochastic gradient process and illustrate our concepts in numerical experiments.

Wed Oct 28

12 noon ET

How Important is the Train-Validation Split in Meta-Learning?

Meta-learning aims to perform fast adaptation on a new task through learning a “prior” from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct non-splitting method, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful on the linear centroid meta-learning problem, in the asymptotic setting where the number of tasks goes to infinity. We show that the splitting method converges to the optimal prior as expected, whereas the non-splitting method does not in general without structural assumptions on the data. In contrast, if the data are generated from linear models (the realizable regime), we show that both the splitting and non-splitting methods converge to the optimal prior. Further, perhaps surprisingly, our main result shows that the non-splitting method achieves a strictly better asymptotic excess risk under this data distribution, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that data splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the non-splitting method can indeed outperform the splitting method, on both simulations and real meta-learning tasks.

Wed Oct 21

12 noon ET

### Consistency of Cheeger cuts: Total Variation, Isoperimetry, and Clustering

Clustering unlabeled point clouds is a fundamental problem in machine learning. One classical method for constructing clusters on graph-based data is to solve for Cheeger cuts, which balance between finding clusters that require cutting few graph edges and finding clusters which are similar in size. Although solving for Cheeger cuts on general graphs is very challenging, when the graph is constructed by sampling from a continuum domain one suspects that the problem may be more tractable. I will discuss recent work with Nicolás García Trillos and Matthew Thorpe which establishes quantified convergence rates of discrete Cheeger cuts on graphs over point clouds towards continuum Cheeger sets. This is accomplished by using i) novel estimates of the consistency of total variation energies on such graphs, and ii) recent stability results for classical, continuum isoperimetric problems. This mirrors a guiding principle that the stability of continuum variational problems can often be translated into the point cloud or geometric graph setting.

Wed Oct 7

12 noon ET

### Predicting What You Already Know Helps: Provable Self-Supervised Learning

Self-supervised representation learning solves auxiliary prediction tasks (known as pretext tasks), that do not require labeled data, to learn semantic representations. These pretext tasks are created solely using the input features, such as predicting a missing image patch, recovering the color channels of an image from context, or predicting missing words, yet predicting this known information helps in learning representations effective for downstream prediction tasks. In this talk, we posit a mechanism based on approximate conditional independence to formalize how solving certain pretext tasks can learn representations that provably decrease the sample complexity of downstream supervised tasks. Formally, we quantify how the approximate independence between the components of the pretext task (conditional on the label and latent variables) allows us to learn representations that can solve the downstream task with drastically reduced sample complexity by just training a linear layer on top of the learned representation.

The recording of this talk is available on our youtube channel.

Wed Sept 30

12 noon ET

### Sparse Learning with CART

Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogenous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. In this talk, we explore some of the statistical properties of regression trees constructed with CART. We will see that the training error is governed by the Pearson correlation between the optimal decision stump and response data in each node, which can be bounded by constructing a prior distribution on the split points and solving a nonlinear optimization problem. We leverage this connection between the training error and Pearson correlation to show that CART with cost-complexity pruning achieves an optimal complexity/goodness-of-fit tradeoff when the depth scales with the logarithm of the sample size. Data dependent quantities, which adapt to the dimensionality and latent structure of the regression model, are seen to govern the rates of convergence of the prediction error.

The recording is available here.

Wed Sept 23

12 noon ET

### Geometric Insights into Spectral Clustering by Graph Laplacian Embeddings

We present new theoretical results for procedures identifying coarse structures in a given data set by means of appropriate spectral embeddings. We combine ideas from spectral geometry, metastability, optimal transport, and spectral analysis of weighted graph Laplacians to describe the embedding geometry. Our analysis focuses on 1) studying the embedding step of data clustering and 2) comparing the spectra of graph and continuum Laplacians, linking the original spectral clustering problem with a continuum counterpart. This is joint work with Bamdad Hosseini (Caltech) and Nicolas Garcia Trillos (University of Wisconsin-Madison).

The video is available on our youtube channel.

Wed Sept 16

12 noon ET

### Stability of Accuracy for Deep Neural Network Classifiers

We examine the stability of accuracy for loss-minimizing training processes that are used for deep neural networks (DNN) and other classifiers. While a classifier is optimized during training by minimizing the loss function, its performance is usually evaluated by the overall accuracy which quantifies the proportion of objects that are well classified. This leads to the question of stability of accuracy: does decreasing loss through training always result in increased accuracy? We formalize the notion of stability and provide examples of instability. We obtain three novel sufficient conditions for stability of training and derive tight bounds on accuracy as loss decreases in the training. The first two conditions apply to the classifier itself by identifying small clusters of misclassified objects as a cause of instability. The third geometric condition identifies flat portions of the training data manifold as sources of instability. The derivation of this condition relies on the propagation of the previous conditions backward through the DNN layers to the data manifold. The multiscale nature of the problem due to several sizes of the small clusters requires that the estimates in the proof have to be compatible with the presence of several scales. Our results do not depend on the algorithm used for training, as long as loss decreases with training.

This is joint work with my advisor L. Berlyand and P.-E. Jabin.

A video of this talk can be found here.

Wed Sept 9

12 noon ET

### Provable Algorithms for Sampling Non-log-concave Distributions

A fundamental problem in Bayesian machine learning is sampling from a probability distribution given access to its log-pdf. Just as the theory of convex optimization is well-developed, so is the theory of sampling from log-concave distributions. Recent years have seen significant progress in understanding optimization beyond convexity. However, despite the ubiquity of non-log-concave distributions in practice, the theory of sampling from non-log-concave distributions is still in its infancy.

I will survey the challenges and progress in this field. A key problem is that the workhorse algorithm for sampling, Langevin Monte Carlo, can take exponential time to mix for multi-modal distributions. Addressing this problem requires bringing in more algorithmic tools and new methods of analysis.

As a case study, we consider the problem of sampling from a simple mixture of log-concave distributions. By combining Langevin diffusion with simulated tempering, we obtain a Markov process that mixes in polynomial time by transitioning between different temperatures. For the analysis, we introduce novel techniques for proving spectral gaps based on Markov process decomposition.

Covers joint work with Rong Ge and Andrej Risteski.

A recording of this talk can be found here.

Wed Sept 2

12 noon ET

### Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent

Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion.

Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.

A video of the talk can be found on our youtube channel.

Wed Aug 26

12 noon ET

### Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Networks

In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that its limit is a global minimizer if initialized properly. We also study the "implicit bias" of this algorithm when the objective is the unregularized logistic loss. We finally discuss what these results tell us about the generalization performance. This is based on joint work with Francis Bach.

A recording of the seminar can be found here.

Wed Aug 19

12 noon ET

### Dimensionality reduction and matching datasets

Processing large datasets is a pervasive problem occurring across many different knowledge domains. In this talk we focus on two problems motivated from tasks concerning genetic data: dimensionality reduction and matching. First, given labeled points in a high-dimensional vector space, we seek a projection onto a low dimensional subspace that maintains the classification structure of the data. Taking inspiration from large margin nearest neighbor classification, we introduce SqueezeFit, a semidefinite relaxation of this problem. This relaxation is amenable to theoretical analysis, allowing us to provably recover a planted projection operator from the data. We apply a linear programming version of SqueezeFit to the genetic marker selection problem.

Second, we introduce and study MREC, a recursive decomposition algorithm for computing matchings between data sets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire data set. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large data sets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.

A recording of the presentation can be found here.

Wed Aug 12

12 noon ET

### A Few Thoughts on Deep Network Approximation

Deep network approximation is a powerful tool of function approximation via composition. We will present a few new thoughts on deep network approximation from the point of view of scientific computing in practice: given an arbitrary width and depth of neural networks, what is the optimal approximation rate of various function classes? Does the curse of dimensionality exist for generic functions? Can we obtain exponential convergence for generic functions?

Wed July 29

12 noon ET

### Tradeoffs between Robustness and Accuracy

Standard machine learning produces models that are highly accurate on average but that degrade dramatically when the test distribution deviates from the training distribution. While one can train robust models, this often comes at the expense of standard accuracy (on the training distribution). We study this tradeoff in two settings, adversarial examples and minority groups, creating simple examples which highlight generalization issues as a major source of this tradeoff. For adversarial examples, we show that even augmenting with correctly annotated data to promote robustness can produce less accurate models, but we develop a simple method, robust self training, that mitigates this tradeoff using unlabeled data. For minority groups, we show that overparametrization of models can hurt accuracy on the minority groups, though it improves standard accuracy. These results suggest that the "more data" and "bigger models" strategy that works well for the standard setting where train and test distributions are close, need not work on out-of-domain settings.

This is based on joint work with Sang Michael Xie, Shiori Sagawa, Pang Wei Koh, Fanny Yang, John Duchi and Percy Liang.

A recording of the presentation is available here.

Sat July 25

12 noon ET

### Thematic Day on the Mean Field Training of Deep Neural Networks

12pm: Roberto I. Oliveira - A mean-field theory for certain deep neural networks

1pm: Konstantinos Spiliopoulos - Mean field limits of neural networks: typical behavior and fluctuations

2pm: Huy Tuan Pham - A general framework for the mean field limit of multilayer neural networks

3pm: Stephan Wojtowytsch - On the Banach spaces for multi-layer networks and connections to mean field training

See here for abstracts and recordings of the presentations.

Wed July 15

12 noon ET

### On the foundations of computational mathematics, Smale’s 18th problem and the potential limits of AI

There is a profound optimism on the impact of deep learning (DL) and AI in the sciences with Geoffrey Hinton concluding that 'They should stop educating radiologists now'. However, DL has an Achilles heel: it is universaly unstable so that small changes in the initial data can lead to large errors in the final result. This has been documented in a wide variety of applications. Paradoxically, the existence of stable neural networks for these applications is guaranteed by the celebrated Universal Approximation Theorem, however, the stable neural networks are not computed by the current training approaches. We will address this problem and the potential limitations of AI from a foundations point of view. Indeed, the current situation in AI is comparable to the situation in mathematics in the early 20th century, when David Hilbert’s optimism (typically reflected in his 10th problem) suggested no limitations to what mathematics could prove and no restrictions on what computers could compute. Hilbert’s optimism was turned upside down by Goedel and Turing, who established limitations on what mathematics can prove and which problems computers can solve (however, without limiting the impact of mathematics and computer science).

We predict a similar outcome for modern AI and DL, where the limitations of AI (the main topic of Smale’s 18th problem) will be established through the foundations of computational mathematics. We sketch the beginning of such a program by demonstrating how there exist neural networks approximating classical mappings in scientific computing, however, no algorithm (even randomised) can compute such a network to even 1-digit accuracy with probability better than 1/2. We will also show how instability is inherit in the methodology of DL demonstrating that there is no easy remedy, given the current methodology. Finally, we will demonstrate basic examples in inverse problems where there exists (untrained) neural networks that can easily compute a solution to the problem, however, the current DL techniques will need 10^80 data points in the training set to get even 1% success rate.

A recording of this talk is available here. The slides are available here. A summary of the zoom chat Q&A during the seminar by Matthew Colbrook is available here.

Wed July 08

12 noon ET

### Trainability and accuracy of artificial neural networks

The methods and models of machine learning (ML) are rapidly becoming de facto tools for the analysis and interpretation of large data sets. Complex classification tasks such as speech and image recognition, automatic translation, decision making, etc. that were out of reach a decade ago are now routinely performed by computers with a high degree of reliability using (deep) neural networks. These performances suggest that DNNs may approximate high-dimensional functions with controllably small errors, potentially outperforming standard interpolation methods based e.g. on Galerkin truncation or finite elements that have been the workhorses of scientific computing. In support of this prospect, in this talk I will present results about the trainability and accuracy of neural networks, obtained by mapping the parameters of the network to a system of interacting particles relaxing on a potential determined by the loss function. This mapping can be used to prove a dynamical variant of the universal approximation theorem showing that the optimal neural network representation can be attained by (stochastic) gradient descent, with a approximation error scaling as the inverse of the network size. I will also show how these findings can be used to accelerate the training of networks and optimize their architecture, using e.g nonlocal transport involving birth/death processes in parameter space.

A recording of this talk is available here. The sildes are available here.

Wed July 01

12 noon ET

### Towards a mathematical understanding of supervised learning: What we know and what we don't know

Two of the biggest puzzles in machine learning are: Why is it so successful and why is it quite fragile?

This talk will present a framework for unraveling these puzzles from the perspective of approximating functions in high dimensions. We will discuss what's known and what's not known about the approximation generalization properties of neural network type of hypothesis space as well as the dynamics and generalization properties of the training process. We will also discuss the relative merits of shallow vs. deep neural network models and suggest ways to formulate more robust machine learning models.

This is joint work with Chao Ma, Stephan Wojtowytsch and Lei Wu.