Abstract : Deep learning has become increasingly popular in recent years due to its superior performance on a variety of tasks. However, understanding the mathematical foundations of deep networks is challenging because they are highly parameterized and built using complex computational heuristics.
This session features recent works in mathematical deep learning theory based on empirically-driven, phenomenological approaches: the works first reveal prevalent phenomena observed across multiple deep learning settings; then, they propose new mathematical explanations for these phenomena. These mathematical results give important insights into the generalization of deep nets, the efficiency of their optimization, and their robustness to adversarial noise and data imbalance.
[02983] How different optimizers select the global minimizer in deep learning
Format : Talk at Waseda University
Author(s) :
Weinan E (Peking University)
Abstract : It has been observed in deep learning that different optimization processes converge to different global minima. We analyze this behavior by studying the stability of the minima under the optimization algorithm. We find numerically that, in practice, the solutions often live at the so-called "edge of stability". We also study the relationship between the "sharpness" and "non-uniformity" of the global minima. We present arguments that support the hypothesis that flat solutions tend to generalize better. This is joint work with Chao Ma of Stanford University and Lei Wu of Peking University.
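As a point of reference for the stability analysis (a standard heuristic, not a claim quoted from the abstract): for gradient descent with step size $\eta$, a minimum $\theta^*$ is linearly stable only if its sharpness, the largest eigenvalue of the loss Hessian, satisfies
\[
\lambda_{\max}\!\big(\nabla^2 L(\theta^*)\big) \;\le\; \frac{2}{\eta},
\]
and the "edge of stability" refers to trajectories whose sharpness hovers near this threshold $2/\eta$ during training.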
[03263] A Law of Data Separation in Deep Learning
Format : Talk at Waseda University
Author(s) :
Hangfeng He (University of Rochester)
Weijie Su (University of Pennsylvania)
Abstract : While deep learning has enabled significant advances in many areas of science, its black-box nature hinders architecture design for future artificial intelligence applications and interpretation for high-stakes decision making. We address this issue by studying the fundamental question of how deep neural networks process data in the intermediate layers. Our finding is a simple and quantitative law that governs how deep neural networks separate data according to class membership throughout all layers for classification. This law shows that each layer improves data separation at a constant geometric rate, and its emergence is observed in a collection of network architectures and datasets during training. This law offers practical guidelines for designing architectures, improving model robustness and out-of-sample performance, as well as interpreting the predictions.
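A minimal sketch of how this law can be checked empirically (the specific separation measure below is an illustrative choice, not necessarily the one used by the authors): compute a within-class versus between-class scatter ratio $D_\ell$ from the features at every layer $\ell$ and verify that $\log D_\ell$ decreases approximately linearly in $\ell$, i.e., that $D_\ell$ shrinks at a constant geometric rate.

import numpy as np

def separation_fuzziness(features, labels):
    # Within-/between-class scatter ratio trace(Sigma_W @ pinv(Sigma_B)) for the
    # features of one layer; features: (n_samples, dim), labels: (n_samples,).
    n, d = features.shape
    global_mean = features.mean(axis=0)
    Sigma_W = np.zeros((d, d))
    Sigma_B = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = features[labels == c]
        mu_c = Xc.mean(axis=0)
        Sigma_W += (Xc - mu_c).T @ (Xc - mu_c) / n
        diff = (mu_c - global_mean)[:, None]
        Sigma_B += (len(Xc) / n) * (diff @ diff.T)
    return np.trace(Sigma_W @ np.linalg.pinv(Sigma_B))

def separation_decay_rate(layerwise_features, labels):
    # Fit log D_l = a - rho * l; an approximately constant slope rho across depth
    # is the "constant geometric rate" of data separation.
    logD = np.log([separation_fuzziness(F, labels) for F in layerwise_features])
    slope, _ = np.polyfit(np.arange(len(logD)), logD, 1)
    return -slope, logD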
[03057] Neural Collapse in Deep Learning
Format : Talk at Waseda University
Author(s) :
Vardan Papyan (University of Toronto)
Abstract : In this talk, we'll delve into Neural Collapse - a recently discovered phenomenon in deep learning that emerges in the final phase of model training. Additionally, we'll explore how this phenomenon relates to the interpretability, robustness, and generalization performance of deep learning models.
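For background (a standard formulation of the phenomenon, not quoted from the talk): with $K$ balanced classes, neural collapse says that in the terminal phase of training the last-layer features concentrate at their class means, the centered class means $\mu_k - \mu_G$ form a simplex equiangular tight frame with equal norms and pairwise angles
\[
\frac{\langle \mu_j - \mu_G,\ \mu_k - \mu_G\rangle}{\|\mu_j - \mu_G\|\,\|\mu_k - \mu_G\|}\;\longrightarrow\;-\frac{1}{K-1}\qquad (j \neq k),
\]
the classifier weights align with this frame, and the network's decision rule reduces to choosing the nearest class mean.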
[03168] Memorization-Dilation: A Novel Model for Neural Collapse in Deep Neural Network Classifiers
Format : Online Talk on Zoom
Author(s) :
Gitta Kutyniok (LMU Munich)
Duc Anh Nguyen (---)
Ron Levie (Technion)
Julian Lienen (University of Paderborn)
Eyke Hüllermeier (LMU Munich)
Abstract : The notion of neural collapse refers to several emergent phenomena that have been empirically observed across various canonical classification problems. In this talk, we introduce a more realistic mathematical model than the classical layer-peeled model, one that takes both the positivity of the features and the limited expressivity of the network into account. For this model, we then show results on the generalization performance of the trained network.
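For orientation (stated schematically; the constraints below are the standard ones, not necessarily those of the talk): the classical layer-peeled model treats the last-layer features $h_{k,i}$ and classifier rows $w_k$ as free variables and solves
\[
\min_{W,\,H}\ \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{n_k}\mathcal{L}\big(W h_{k,i},\,k\big)
\quad\text{s.t.}\quad \frac{1}{K}\sum_{k}\|w_k\|_2^2 \le E_W,\qquad \frac{1}{N}\sum_{k,i}\|h_{k,i}\|_2^2 \le E_H,
\]
whose minimizers exhibit neural collapse; the memorization-dilation model discussed in the talk departs from this idealization by requiring the features to be nonnegative and by accounting for the network's limited expressivity.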
00432 (2/2) : 2E @E812 [Chair: X.Y. Han & Weijie Su]
[04878] Understanding Deep Learning Through Optimization Geometry
Format : Talk at Waseda University
Author(s) :
Nathan Srebro (TTIC)
Abstract : I will survey the approach emerging in recent years for understanding deep learning through the optimization geometry in function space induced by the architecture. To appreciate this view, its ability to explain empirical phenomena, and its limitations, we will see how a range of phenomena can be understood through a detailed study of the optimization geometry of a simple deep linear model.
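One textbook instance of this viewpoint (given as an illustration; the talk's model may differ): a linear predictor can be parameterized directly as $f_w(x)=\langle w, x\rangle$ or through a depth-two diagonal linear network $f_{u,v}(x)=\langle u\odot v,\ x\rangle$. Both parameterizations represent exactly the same class of functions, yet gradient descent on the squared loss returns the minimum-$\ell_2$-norm interpolant in the first case and, from small initialization, is biased toward small-$\ell_1$-norm (sparse) interpolants in the second, so the architecture changes the optimization geometry in function space without changing what the model can express.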
[04289] Feature Learning in Two-Layer Neural Networks
Format : Talk at Waseda University
Author(s) :
Murat Erdogdu (University of Toronto)
Abstract : We study the effect of gradient-based optimization on feature learning in two-layer neural networks. We consider the non-asymptotic setting, and we show that a network trained via SGD exhibits low-dimensional representations, with applications in learning a monotone single-index model.
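For concreteness (standard terminology, not a quotation from the abstract): a monotone single-index model assumes
\[
y = g\big(\langle w^{*}, x\rangle\big) + \varepsilon,
\]
with an unknown monotone link $g$ and an unknown direction $w^{*}$; feature learning here means that after gradient-based training the first-layer weights acquire a significant component along $w^{*}$, in contrast to a fixed random-features representation.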
[03006] On the Implicit Geometry of Deep-net Classifiers
Format : Talk at Waseda University
Author(s) :
Tina Behnia (University of British Columbia)
Ganesh Ramachandra Kini (University of California, Santa Barbara)
Vala Vakilian (University of British Columbia)
Christos Thrampoulidis (University of British Columbia)
Abstract : The talk will address the following questions:
What are the unique structural properties of models learned by deep-net classifiers?
Is there an implicit bias towards solutions of a certain geometry and how does this vary across architectures and data?
Specifically, how does this implicit geometry change under label imbalances, and is it possible to use this information to design better loss functions for learning with imbalances?
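To make the last question concrete, one well-known example of such a loss-function adjustment (given for context; not necessarily the construction proposed by the speakers) is the logit-adjusted cross-entropy
\[
\ell\big(y, f(x)\big) = -\log\frac{e^{f_y(x) + \tau\log\pi_y}}{\sum_{c} e^{f_c(x) + \tau\log\pi_c}},
\]
where $\pi_c$ is the empirical frequency of class $c$ and $\tau>0$; additive and multiplicative corrections of this kind reshape the implicit geometry of the learned classifier under label imbalance.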
[05316] Does the loss function matter in overparameterized models?
Format : Online Talk on Zoom
Author(s) :
Vidya Muthukumar (Georgia Institute of Technology)
Abstract : Recent years have seen substantial interest in a first-principles theoretical understanding of the behavior of overparameterized models that interpolate noisy training data, motivated by their surprising empirical success. In this talk, I compare classification and regression tasks in the overparameterized linear model. On the one hand, we show that with sufficient overparameterization, solutions obtained by training on the squared loss (minimum-norm interpolation), typically used for regression, are identical to those produced by training on exponential and polynomially-tailed losses, typically used for classification. On the other hand, we show that there exist regimes where these solutions are consistent when evaluated by the 0-1 test loss function, but inconsistent if evaluated by the mean-squared-error test loss function. Our results demonstrate that: a) different loss functions at the training (optimization) phase can yield similar solutions, and b) a significantly higher level of effective overparameterization admits good generalization in classification tasks as compared to regression tasks.
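A minimal numerical sketch of the first claim, under illustrative assumptions (Gaussian features, binary labels, no intercept; the variable names are ours, not the speaker's): in a heavily overparameterized linear model, the minimum-$\ell_2$-norm interpolator of the labels, the "regression" solution, coincides in direction with the hard-margin SVM solution, which is the directional limit of training with exponential-type "classification" losses.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 30, 2000                          # heavily overparameterized: d >> n
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# (1) Regression-style solution: minimum-l2-norm interpolator of the labels.
w_reg = X.T @ np.linalg.solve(X @ X.T, y)

# (2) Classification-style solution: hard-margin SVM without intercept, solved
#     via its dual  max_a sum(a) - 0.5 a^T G a  with  G = (y y^T) * (X X^T), a >= 0.
G = np.outer(y, y) * (X @ X.T)
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               x0=np.ones(n) / n,
               jac=lambda a: G @ a - np.ones(n),
               bounds=[(0.0, None)] * n,
               method="L-BFGS-B")
w_svm = X.T @ (res.x * y)

cos = (w_reg @ w_svm) / (np.linalg.norm(w_reg) * np.linalg.norm(w_svm))
print(f"cosine(min-norm interpolator, max-margin direction) = {cos:.6f}")
# With d/n large enough, every training point ends up a support vector and the
# two solutions agree up to scaling, so the printed cosine is numerically 1.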