Abstract : Deep neural networks have recently been used to design innovative, and arguably revolutionary, methods for solving a large number of challenging problems from science and engineering which are modeled by differential equations. Conversely, differential equations provide an important set of tools for understanding methods based upon neural networks. This minisymposium is dedicated to recent progress at the interface between neural networks and differential equations. Topics include the theoretical convergence analysis and computation of neural networks for solving high-dimensional PDEs, as well as the analysis, training, and design of neural networks using perspectives from the study of differential equations.
Organizer(s) : Yulong Lu, Jonathan Siegel, Stephan Wojtowytsch
[01481] Operator learning and nonlinear model reduction by deep neural networks
Format : Talk at Waseda University
Author(s) :
Wenjing Liao (Georgia Institute of Technology)
Abstract : This talk focuses on operator learning, that is, learning an operator between function spaces from data. Model reduction plays a significant role by reducing the data dimension and the problem size. We consider a neural network architecture that exploits low-dimensional nonlinear structures in model reduction, and prove an upper bound on the generalization error. Our theory shows that the sample complexity depends on the intrinsic dimension of the model.
[01838] A deep learning framework for geodesics under spherical Wasserstein-Fisher-Rao metric
Format : Talk at Waseda University
Author(s) :
Yang Jing (Shanghai Jiao Tong University)
Abstract : The Wasserstein-Fisher-Rao (WFR) distance is a family of metrics that gauge the discrepancy between two measures. Compared to the Wasserstein distance, geodesics under the spherical WFR metric are less well understood. We develop a deep learning framework to compute geodesics under the spherical WFR metric, and the learned geodesics can be adopted to generate weighted samples. Our framework can be beneficial for applications with given weighted samples, especially in Bayesian inference.
[01927] Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks
Format : Talk at Waseda University
Author(s) :
Marco Mondelli (Institute of Science and Technology Austria (ISTA))
Abstract : We consider a two-layer ReLU network trained via SGD for univariate regression. By connecting the SGD dynamics to the solution of a gradient flow, we show that SGD is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of "knot" points - i.e., points where the tangent of the ReLU network estimator changes - between two consecutive training inputs is at most three.
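As a small illustration of the "knot" terminology above, the following sketch locates the knots of a univariate two-layer ReLU network and counts how many fall between consecutive training inputs. The weights, the training inputs, and the helper names (`relu_net`, `knots`, `knots_between`) are hypothetical and chosen only for illustration; the sketch does not train anything and does not reproduce the talk's analysis.

```python
from typing import List, Sequence


def relu_net(x: float, w: Sequence[float], b: Sequence[float], a: Sequence[float]) -> float:
    """Two-layer ReLU network f(x) = sum_i a_i * max(w_i * x + b_i, 0)."""
    return sum(ai * max(wi * x + bi, 0.0) for wi, bi, ai in zip(w, b, a))


def knots(w: Sequence[float], b: Sequence[float], a: Sequence[float],
          tol: float = 1e-12) -> List[float]:
    """Sorted points where the slope of the network actually changes."""
    jumps = {}
    for wi, bi, ai in zip(w, b, a):
        if abs(wi) < tol:
            continue  # a neuron with zero input weight is constant in x
        x0 = -bi / wi  # breakpoint of this neuron
        # The slope of a_i * relu(w_i * x + b_i) jumps by a_i * |w_i| at x0.
        jumps[x0] = jumps.get(x0, 0.0) + ai * abs(wi)
    # Breakpoints whose slope jumps cancel are not knots of the network.
    return sorted(x0 for x0, jump in jumps.items() if abs(jump) > tol)


def knots_between(points: Sequence[float], lo: float, hi: float) -> int:
    """Number of knots strictly between two consecutive training inputs."""
    return sum(1 for p in points if lo < p < hi)


# Hypothetical weights and training inputs, for illustration only.
w = [1.0, 1.0, -1.0]
b = [0.5, -1.0, 2.0]
a = [1.0, -2.0, 0.5]
train_x = [-1.0, 1.5, 3.0]

all_knots = knots(w, b, a)
counts = [knots_between(all_knots, lo, hi) for lo, hi in zip(train_x[:-1], train_x[1:])]
print(all_knots, counts)
```

In this toy configuration every interval between consecutive training inputs contains fewer than three knots; the talk's result concerns networks obtained by SGD at convergence, which this sketch does not model.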
[02274] On the Mean-field Theory of Neural Network Training
Format : Talk at Waseda University
Author(s) :
Zhengdao Chen (Google)
Abstract : To understand the training of wide neural networks, prior studies have considered the infinite-width mean-field limit of two-layer neural networks, whose training dynamics are characterized as the Wasserstein gradient flow of a probability distribution. In this talk, we will present a mean-field theory for a type of three-layer neural network whose first-layer weights are random and fixed, through which we prove a linear-rate convergence guarantee and generalization bounds.
[01535] Machine learning for/with Differential Equation Modeling: Statistics and Computation
Format : Online Talk on Zoom
Author(s) :
Yiping Lu (Stanford University)
Abstract : Massive data collection and computational capabilities have enabled data-driven scientific discoveries and control of engineering systems. However, there are still several questions that should be answered to understand the fundamental limits of just how much can be discovered with data and what the value of additional information is. For example, 1) How can we learn a physics law or economic principle purely from data? 2) How hard is this task, both computationally and statistically? 3) What’s the impact on hardness when we add further information (e.g., adding data, model information)? I’ll answer these three questions in this talk for two learning tasks. A key insight in both cases is that using direct plug-in estimators can result in statistically suboptimal inference.
The first learning task I’ll discuss is linear operator learning/functional data analysis, which has wide applications in causal inference, time series modeling, and conditional probability learning. We establish the first minimax lower bound for this problem. The minimax rate has a particular structure where the more challenging parts of the input and output spaces determine the hardness of learning a linear operator. Our analysis also shows that an intuitive discretization of the infinite-dimensional operator could lead to a suboptimal statistical learning rate. Then, I’ll discuss how, by suitably trading off bias and variance, we can construct an estimator with an optimal learning rate for learning a linear operator between infinite-dimensional spaces. We also illustrate how this theory can inspire a multilevel machine-learning algorithm of potential practical use.
For the second learning task, we focus on variational formulations for differential equation models. We discuss a prototypical Poisson equation. We provide a minimax lower bound for this problem. Based on the lower bounds, we discover that the variance in the direct plug-in estimator makes the sample complexity suboptimal. We also consider the optimization dynamics for different variational forms. Finally, based on our theory, we explain the implicit acceleration obtained by using a Sobolev norm as the objective function for training.
[02223] Momentum Based Acceleration for Stochastic Gradient Descent
Format : Talk at Waseda University
Author(s) :
Kanan Gupta (Texas A&M University)
Stephan Wojtowytsch (Texas A&M University)
Jonathan Wolfram Siegel (Texas A&M University)
Abstract : We will discuss first-order optimization algorithms based on discretizations of the heavy ball ODE. In particular, we introduce a discretization that leads to an accelerated gradient descent algorithm, which provably achieves an accelerated rate of convergence for convex objective functions, even if the stochastic noise in the gradient estimates is significantly larger than the gradient. We compare the optimizer’s performance with other popular optimizers on the non-convex problem of training neural networks on some standard datasets.
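For readers unfamiliar with the heavy ball ODE x'' + γ x' + ∇f(x) = 0, the sketch below shows the classical (Polyak) discretization with step size h, which gives x_{k+1} = x_k + β (x_k − x_{k−1}) − η ∇f(x_k) with β = 1 − γh and η = h². This is the textbook scheme, not the new discretization introduced in the talk, and it uses exact gradients rather than the noisy estimates the abstract refers to. The quadratic test objective, the parameter values, and all names are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def heavy_ball(grad: Callable[[Vector], Vector], x0: Sequence[float],
               lr: float, beta: float, steps: int) -> Vector:
    """Classical heavy-ball iteration; beta = 0 recovers plain gradient descent."""
    x = list(x0)
    v = [0.0] * len(x)  # velocity, i.e. x_k - x_{k-1}
    for _ in range(steps):
        g = grad(x)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


# Hypothetical ill-conditioned convex objective f(x) = 0.5 * (MU * x0^2 + L * x1^2).
MU, L = 1.0, 100.0


def grad_quadratic(x: Vector) -> Vector:
    return [MU * x[0], L * x[1]]


def norm(x: Vector) -> float:
    return math.sqrt(sum(xi * xi for xi in x))


x_start = [1.0, 1.0]
root_kappa = math.sqrt(L / MU)

# Gradient descent with the standard step size 2 / (MU + L).
gd = heavy_ball(grad_quadratic, x_start, lr=2.0 / (MU + L), beta=0.0, steps=100)

# Heavy ball with the parameters tuned for this quadratic.
hb = heavy_ball(
    grad_quadratic,
    x_start,
    lr=4.0 / (math.sqrt(L) + math.sqrt(MU)) ** 2,
    beta=((root_kappa - 1.0) / (root_kappa + 1.0)) ** 2,
    steps=100,
)

gd_err, hb_err = norm(gd), norm(hb)  # distance to the minimizer at the origin
print(f"gradient descent: {gd_err:.3e}, heavy ball: {hb_err:.3e}")
```

On this toy quadratic the momentum iterate ends much closer to the minimizer than plain gradient descent after the same number of steps. The tuned parameters are specific to quadratics; the guarantees for general convex objectives and noisy gradients discussed in the talk require a different scheme.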
[03468] The Effects of Activation Functions on the Over-smoothing of GCNs
Format : Online Talk on Zoom
Author(s) :
Bao Wang (University of Utah)
Abstract : Smoothness has been shown to be crucial for the success of graph convolutional networks (GCNs); however, over-smoothing has become inevitable. In this talk, I will present a geometric characterization of how activation functions of a graph convolution layer affect the smoothness of their input, leveraging the distance of graph node features to the eigenspace of the largest eigenvalue of the (augmented) normalized adjacency matrix, denoted as M. In particular, we show that 1) the input and output of ReLU or leaky ReLU activation function are related by a high-dimensional ball, 2) activation functions can increase, decrease, or preserve the smoothness of node features, and 3) adjusting the component of the input in the eigenspace M can control the smoothness of the output of activation functions. Informed by our theory, we propose a universal smooth control term to modulate the smoothness of learned node features and improve the performance of existing graph neural networks.
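The distance to the eigenspace mentioned above can be made concrete with a small sketch. It assumes a connected, undirected graph, for which the largest eigenvalue of the augmented normalized adjacency matrix (D + I)^{-1/2} (A + I) (D + I)^{-1/2} is 1 with eigenvector proportional to (D + I)^{1/2} 1, and it measures distance in the Frobenius norm. The three-node path graph, the feature values, and all function names are hypothetical; the smooth control term proposed in the talk is not implemented here.

```python
import math
from typing import List

Matrix = List[List[float]]


def aug_norm_adj(adj: Matrix) -> Matrix:
    """Augmented normalized adjacency (D + I)^{-1/2} (A + I) (D + I)^{-1/2}."""
    n = len(adj)
    deg = [sum(row) + 1.0 for row in adj]  # degrees after adding self-loops
    return [[(adj[i][j] + (1.0 if i == j else 0.0)) / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]


def top_eigvec(adj: Matrix) -> List[float]:
    """Unit eigenvector for eigenvalue 1, assuming the graph is connected."""
    deg = [sum(row) + 1.0 for row in adj]
    scale = math.sqrt(sum(deg))
    return [math.sqrt(d) / scale for d in deg]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dist_to_eigenspace(x: Matrix, u: List[float]) -> float:
    """Frobenius distance from the features x to the span {u c^T}."""
    n, d = len(x), len(x[0])
    coeff = [sum(u[i] * x[i][j] for i in range(n)) for j in range(d)]
    return math.sqrt(sum((x[i][j] - u[i] * coeff[j]) ** 2
                         for i in range(n) for j in range(d)))


def relu(x: Matrix) -> Matrix:
    return [[max(v, 0.0) for v in row] for row in x]


# Hypothetical path graph 0 - 1 - 2 with two features per node.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]

a_hat = aug_norm_adj(adj)
u = top_eigvec(adj)
d_input = dist_to_eigenspace(features, u)
d_conv = dist_to_eigenspace(matmul(a_hat, features), u)
d_relu = dist_to_eigenspace(relu(matmul(a_hat, features)), u)
print(d_input, d_conv, d_relu)
```

Multiplying by the augmented normalized adjacency never increases this distance, because its remaining eigenvalues have modulus below one on a connected graph. The activation step has no such one-directional guarantee, which is the effect the talk characterizes.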
[03523] On the generalization and training of Deep Operator Networks
Format : Online Talk on Zoom
Author(s) :
Yeonjong Shin (North Carolina State University)
Sanghyun Lee (Florida State University)
Abstract : We propose a novel training method for Deep Operator Networks (DeepONets). DeepONets are constructed as a sum of products of two sub-networks, namely, branch and trunk networks. The goal is to effectively learn DeepONets that accurately approximate nonlinear operators from data. The standard approaches train the two sub-networks simultaneously via first-order optimization methods. The proposed method, however, trains the trunk networks first based on norm-minimization, and then trains the branch networks in sequence. Furthermore, we estimate the generalization error of DeepONets in terms of the number of training samples, the number of sensors in inputs and outputs, and the DeepONet size. Several numerical examples including Darcy flow in heterogeneous porous media are presented to illustrate the effectiveness of the proposed training method.
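The "sum of products" structure described above can be sketched as a forward pass: the branch network maps sensor values of the input function u to coefficients b_k(u), the trunk network maps a query point y to basis values t_k(y), and the output is the sum over k of b_k(u) t_k(y). The layer sizes, the tanh activation, the random initialization, and all names below are assumptions for illustration. The sketch covers only evaluation and implements neither the standard simultaneous training nor the sequential training proposed in the talk.

```python
import math
import random
from typing import List, Sequence

Layer = List[List[float]]  # one row per output unit: weights, then bias last


def init_mlp(rng: random.Random, sizes: Sequence[int]) -> List[Layer]:
    """Random weights for a fully connected network with the given layer sizes."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def mlp(x: Sequence[float], layers: List[Layer]) -> List[float]:
    """Forward pass with tanh on hidden layers and a linear output layer."""
    h = list(x)
    for depth, layer in enumerate(layers):
        z = [sum(w * v for w, v in zip(row[:-1], h)) + row[-1] for row in layer]
        h = z if depth == len(layers) - 1 else [math.tanh(v) for v in z]
    return h


def deeponet(u_sensors: Sequence[float], y: Sequence[float],
             branch: List[Layer], trunk: List[Layer]) -> float:
    """Approximate G(u)(y) by sum_k b_k(u) * t_k(y)."""
    b = mlp(u_sensors, branch)  # coefficients depending on the input function
    t = mlp(y, trunk)           # basis functions evaluated at the query point
    return sum(bk * tk for bk, tk in zip(b, t))


# Hypothetical sizes: 5 sensors, 4 product terms, scalar query point.
rng = random.Random(0)
M_SENSORS, P_TERMS = 5, 4
branch = init_mlp(rng, [M_SENSORS, 8, P_TERMS])
trunk = init_mlp(rng, [1, 8, P_TERMS])

# Input function u(s) = sin(2 * pi * s) sampled at equispaced sensors on [0, 1].
u_sensors = [math.sin(2.0 * math.pi * i / (M_SENSORS - 1)) for i in range(M_SENSORS)]
value = deeponet(u_sensors, [0.3], branch, trunk)
print(value)
```

Because the output is linear in the branch coefficients once the trunk is fixed, training the two sub-networks one after the other is a natural design option, which is the direction the proposed method takes.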