CaltechAUTHORS
  A Caltech Library Service

Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization

Azizan, Navid and Lale, Sahin and Hassibi, Babak (2019) Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization. (Unpublished) http://resolver.caltech.edu/CaltechAUTHORS:20190628-084821622

PDF (Submitted Version), 1631 kB. See Usage Policy.


Abstract

Most modern learning problems are highly overparameterized, meaning that there are many more parameters than training data points, and as a result, the training loss may have infinitely many global minima (parameter vectors that perfectly interpolate the training data). It is therefore important to understand which interpolating solution we converge to, how it depends on the initialization point and the learning algorithm, and whether different solutions lead to different generalization performances. In this paper, we study these questions for the family of stochastic mirror descent (SMD) algorithms, of which the popular stochastic gradient descent (SGD) is a special case. Our contributions are both theoretical and experimental. On the theory side, we show that in the overparameterized nonlinear setting, if the initialization is close enough to the manifold of global minima (something that comes for free in the highly overparameterized case), SMD with sufficiently small step size converges to a global minimum that is approximately the closest one in Bregman divergence. On the experimental side, our extensive experiments on standard datasets and models, using various initializations, various mirror descent algorithms, and various Bregman divergences, consistently confirm that this phenomenon occurs in deep learning. Our experiments further indicate that there is a clear difference in the generalization performance of the solutions obtained by different SMD algorithms. Experiments on a standard image dataset and network architecture with SMD under different kinds of implicit regularization (ℓ_1 to encourage sparsity, ℓ_2 yielding SGD, and ℓ_10 to discourage large components in the parameter vector) consistently and definitively show that ℓ_10-SMD has better generalization performance than SGD, which in turn has better generalization performance than ℓ_1-SMD.
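The abstract refers to the SMD update and to the solution being approximately the closest global minimum in Bregman divergence. As a rough illustration only, the sketch below implements one SMD step and the induced Bregman divergence for a q-norm potential ψ(w) = (1/q)‖w‖_q^q with q > 1 (q = 2 recovers plain SGD). The function names and the use of NumPy are illustrative assumptions, not code from the paper.

```python
import numpy as np

def grad_potential(w, q):
    """Gradient of the q-norm potential psi(w) = (1/q) * sum_i |w_i|^q (q > 1)."""
    return np.sign(w) * np.abs(w) ** (q - 1)

def smd_step(w, stochastic_grad, lr, q=2.0):
    """One stochastic mirror descent update: grad_psi(w_next) = grad_psi(w) - lr * g.
    For q = 2 the mirror map is the identity, so this reduces to an SGD step."""
    z = grad_potential(w, q) - lr * stochastic_grad   # step in the mirror (dual) space
    return np.sign(z) * np.abs(z) ** (1.0 / (q - 1))  # map back via (grad_psi)^{-1}

def bregman_divergence(w, w_ref, q=2.0):
    """Bregman divergence D_psi(w, w_ref) induced by the q-norm potential;
    for q = 2 this is (1/2) * ||w - w_ref||_2^2."""
    psi = lambda v: np.sum(np.abs(v) ** q) / q
    return psi(w) - psi(w_ref) - grad_potential(w_ref, q) @ (w - w_ref)
```

In this illustration, running smd_step repeatedly with q = 10 corresponds to the ℓ_10-SMD variant mentioned in the abstract, and the Bregman divergence measures which interpolating solution the iterates end up closest to.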


Item Type: Report or Paper (Discussion Paper)
Related URLs: http://arxiv.org/abs/1906.03830 (arXiv, Discussion Paper)
ORCID: Azizan, Navid (0000-0002-4299-2963)
Record Number: CaltechAUTHORS:20190628-084821622
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:20190628-084821622
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 96811
Collection: CaltechAUTHORS
Deposited By: Tony Diaz
Deposited On: 28 Jun 2019 17:06
Last Modified: 28 Jun 2019 17:06
