
Continuous Time Analysis of Momentum Methods

Kovachki, Nikola B. and Stuart, Andrew M. (2021) Continuous Time Analysis of Momentum Methods. Journal of Machine Learning Research, 22 (17). pp. 1-40. ISSN 1533-7928. https://resolver.caltech.edu/CaltechAUTHORS:20210503-091850360

PDF - Published Version (Creative Commons Attribution, 850kB)

Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20210503-091850360

Abstract

Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence are a significant component of the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent, such as Polyak's Heavy Ball method (HB) and Nesterov's method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply we work in the deterministic setting. Our approach is to derive continuous time approximations of the discrete algorithms; these continuous time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. Firstly, we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent, in the limit of vanishing learning rate. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent, when implemented in the non-asymptotic regime with fixed small, but non-zero, learning rate. We achieve this by proving approximations to continuous time limits in which the small but fixed learning rate appears as a parameter; this is known as the method of modified equations in the numerical analysis literature, recently rediscovered as the high resolution ODE approximation in the machine learning context. In our second result we show that the momentum method is approximated by a continuous time gradient flow, with an additional momentum-dependent second order time-derivative correction, proportional to the learning rate; this may be used to explain the stabilizing effect of momentum algorithms in their transient phase. Furthermore, in a third result, we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduces, approximately, to a gradient flow with respect to a modified loss function, equal to the original loss function plus a small perturbation proportional to the learning rate; this small correction provides convexification of the loss function and encodes additional robustness present in momentum methods, beyond the transient phase.
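The first result above can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it runs the standard fixed-momentum Heavy Ball iteration on a toy quadratic loss and compares the iterates to the exact solution of a time-rescaled gradient flow, checking that the deviation shrinks roughly linearly as the learning rate decreases. The quadratic loss Phi(x) = x^2/2 and the 1/(1 - lam) rescaling factor are assumptions chosen here for illustration, not quantities stated in the abstract.

```python
# Sketch: fixed-momentum Heavy Ball vs. a time-rescaled gradient flow on a
# toy quadratic loss. Assumed for illustration: Phi(x) = x^2/2 and the
# rescaling factor 1/(1 - lam); neither is quoted from the paper itself.
import numpy as np

def heavy_ball(grad, x0, h, lam, steps):
    """Fixed-momentum Heavy Ball: x_{k+1} = x_k - h*grad(x_k) + lam*(x_k - x_{k-1})."""
    x_prev, x = x0, x0
    traj = [x0]
    for _ in range(steps):
        # Right-hand side uses the old x and x_prev before the swap.
        x_prev, x = x, x - h * grad(x) + lam * (x - x_prev)
        traj.append(x)
    return np.array(traj)

grad = lambda x: x          # gradient of Phi(x) = x^2 / 2
x0, lam, T = 1.0, 0.5, 2.0  # initial point, momentum factor, time horizon

for h in (1e-2, 1e-3, 1e-4):
    steps = int(T / h)
    traj = heavy_ball(grad, x0, h, lam, steps)
    t = h * np.arange(steps + 1)
    # Exact solution of the rescaled flow x'(t) = -(1/(1 - lam)) * grad(x).
    flow = x0 * np.exp(-t / (1.0 - lam))
    print(f"h = {h:.0e}: max |HB - rescaled flow| = {np.max(np.abs(traj - flow)):.2e}")
```

Under these assumptions the printed deviations decrease roughly in proportion to h, consistent with the abstract's claim that fixed momentum methods approximate a time-rescaled gradient descent flow as the learning rate shrinks to zero.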


Item Type:Article
Related URLs:
  URL: https://jmlr.org/papers/v22/19-466.html
  URL Type: Publisher
  Description: Article
ORCID:
  Kovachki, Nikola B.: 0000-0002-3650-2972
Additional Information:© 2021 Nikola B. Kovachki and Andrew M. Stuart. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Submitted 6/19; Revised 9/20; Published 1/21. Both authors are supported, in part, by the US National Science Foundation (NSF) grant DMS 1818977, the US Office of Naval Research (ONR) grant N00014-17-1-2079, and the US Army Research Office (ARO) grant W911NF-12-2-0022. Both authors are also grateful to the anonymous reviewers for their invaluable suggestions which have helped to significantly strengthen this work.
Funders:
  NSF: DMS-1818977
  Office of Naval Research (ONR): N00014-17-1-2079
  Army Research Laboratory: W911NF-12-2-0022
Issue or Number:17
Record Number:CaltechAUTHORS:20210503-091850360
Persistent URL:https://resolver.caltech.edu/CaltechAUTHORS:20210503-091850360
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:108917
Collection:CaltechAUTHORS
Deposited By: Tony Diaz
Deposited On:03 May 2021 17:59
Last Modified:03 May 2021 17:59
