Towards a mathematical understanding of supervised learning: What we know and what we don't know
Two of the biggest puzzles in machine learning are: Why is it so successful and why is it quite fragile?
This talk will present a framework for unraveling these puzzles from the perspective of approximating functions in high dimensions. We will discuss what's known and what's not known about the approximation generalization properties of neural network type of hypothesis space as well as the dynamics and generalization properties of the training process. We will also discuss the relative merits of shallow vs. deep neural network models and suggest ways to formulate more robust machine learning models.
This is joint work with Chao Ma, Stephan Wojtowytsch and Lei Wu.
Trainability and accuracy of artificial neural networks
The methods and models of machine learning (ML) are rapidly becoming de facto tools for the analysis and interpretation of large data sets. Complex classification tasks such as speech and image recognition, automatic translation, decision making, etc. that were out of reach a decade ago are now routinely performed by computers with a high degree of reliability using (deep) neural networks. These performances suggest that DNNs may approximate high-dimensional functions with controllably small errors, potentially outperforming standard interpolation methods based e.g. on Galerkin truncation or finite elements that have been the workhorses of scientific computing. In support of this prospect, in this talk I will present results about the trainability and accuracy of neural networks, obtained by mapping the parameters of the network to a system of interacting particles relaxing on a potential determined by the loss function. This mapping can be used to prove a dynamical variant of the universal approximation theorem showing that the optimal neural network representation can be attained by (stochastic) gradient descent, with a approximation error scaling as the inverse of the network size. I will also show how these findings can be used to accelerate the training of networks and optimize their architecture, using e.g nonlocal transport involving birth/death processes in parameter space.