
Stochastic Mirror Descent on Overparameterized Nonlinear Models

Azizan, Navid and Lale, Sahin and Hassibi, Babak (2022) Stochastic Mirror Descent on Overparameterized Nonlinear Models. IEEE Transactions on Neural Networks and Learning Systems, 33 (12). 7717 - 7727. ISSN 2162-2388. doi:10.1109/TNNLS.2021.3087480.



Most modern learning problems are highly overparameterized, i.e., have many more model parameters than the number of training data points. As a result, the training loss may have infinitely many global minima (parameter vectors that perfectly “interpolate” the training data). It is therefore imperative to understand which interpolating solutions we converge to, how they depend on the initialization and learning algorithm, and whether they yield different test errors. In this article, we study these questions for the family of stochastic mirror descent (SMD) algorithms, of which stochastic gradient descent (SGD) is a special case. Recently, it has been shown that for overparameterized linear models, SMD converges to the closest global minimum to the initialization point, where closeness is in terms of the Bregman divergence corresponding to the potential function of the mirror descent. With appropriate initialization, this yields convergence to the minimum-potential interpolating solution, a phenomenon referred to as implicit regularization. On the theory side, we show that for sufficiently overparameterized nonlinear models, SMD with a (small enough) fixed step size converges to a global minimum that is “very close” (in Bregman divergence) to the minimum-potential interpolating solution, thus attaining approximate implicit regularization. On the empirical side, our experiments on the MNIST and CIFAR-10 datasets consistently confirm that the above phenomenon occurs in practical scenarios. They further indicate a clear difference in the generalization performances of different SMD algorithms: experiments on the CIFAR-10 dataset with different regularizers, ℓ₁ to encourage sparsity, ℓ₂ (SGD) to encourage small Euclidean norm, and ℓ∞ to discourage large components, surprisingly show that the ℓ∞ norm consistently yields better generalization performance than SGD, which in turn generalizes better than the ℓ₁ norm.
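
The SMD update described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a toy overparameterized linear model (the setting of the earlier result the abstract cites, not the nonlinear setting of the article), a squared loss, zero initialization, and the potential ψ(w) = (1/q) Σ|w_j|^q. The function names, data, step size, and the choice q = 3 are all hypothetical. Each step applies the gradient update to the mirrored variable z = ∇ψ(w) and then maps back to w; with q = 2 the mirror map is the identity and the update reduces to SGD.

```python
import random

def mirror_map(w, q):
    """Gradient of the assumed potential psi(w) = (1/q) * sum(|w_j|**q)."""
    return [(1.0 if x >= 0 else -1.0) * abs(x) ** (q - 1) for x in w]

def inverse_mirror_map(z, q):
    """Inverse of mirror_map: recovers w from z = grad psi(w)."""
    return [(1.0 if x >= 0 else -1.0) * abs(x) ** (1.0 / (q - 1)) for x in z]

def smd_linear(X, y, q=2.0, step=0.05, epochs=500, seed=0):
    """SMD on the loss 0.5 * (x_i . w - y_i)**2, one sample per update.

    Hypothetical helper for illustration; starts from w = 0.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    z = mirror_map(w, q)
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # one shuffled pass over the data
            residual = sum(wj * xj for wj, xj in zip(w, X[i])) - y[i]
            # Gradient step in the mirrored (dual) variable ...
            z = [zj - step * residual * xj for zj, xj in zip(z, X[i])]
            # ... then map back to the parameters.
            w = inverse_mirror_map(z, q)
    return w

# Toy problem (made up): one equation, three unknowns, so infinitely many
# parameter vectors interpolate the single data point.
X = [[1.0, 2.0, 0.0]]
y = [1.0]
w_sgd = smd_linear(X, y, q=2.0)  # q = 2 is plain SGD
w_q3 = smd_linear(X, y, q=3.0)   # a different potential
print(w_sgd, w_q3)
```

Both runs fit the data point, but they settle on different interpolating solutions, which is the dependence on the choice of potential that the abstract refers to as implicit regularization.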

Item Type:Article
Azizan, Navid (ORCID: 0000-0002-4299-2963)
Lale, Sahin (ORCID: 0000-0002-7191-346X)
Hassibi, Babak (ORCID: 0000-0002-1375-5838)
Alternate Title:Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization
Additional Information:© 2021 IEEE. Manuscript received June 8, 2020; revised December 7, 2020 and May 10, 2021; accepted May 24, 2021. This work was supported in part by the National Science Foundation under Grant ECCS-1509977, in part by Qualcomm Inc., in part by NASA’s Jet Propulsion Laboratory through the President and Director’s Fund, in part by Amazon Web Services Inc., and in part by PIMCO, LLC. This article was presented in part at the 2019 International Conference on Machine Learning (ICML) Generalization Workshop, Long Beach, CA, USA.
Funding Agency: Grant Number
JPL President and Director's Fund: UNSPECIFIED
Amazon Web Services: UNSPECIFIED
Subject Keywords:Deep learning, implicit regularization, mirror descent, overparameterization, stochastic gradient descent (SGD)
Issue or Number:12
Record Number:CaltechAUTHORS:20190628-084821622
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:96811
Deposited By: Tony Diaz
Deposited On:28 Jun 2019 17:06
Last Modified:23 Dec 2022 16:49
