Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
We thank Ben Recht for pointing out to us the Barron's work on approximation bounds for neural networks (Barron, 1993), and thank Arthur Gretton for discussion about estimation of score function. We also acknowledge fruitful discussion with Roi Weiss about the presentation of proof of Theorem 5 on combining estimation and approximation bounds, and his detailed editorial comments about the preliminary version of the draft. We thank Peter Bartlett for detailed discussions and pointing us to several classical results on neural networks. We are very thankful for Andrew Barron for detailed discussion and for encouraging us to explore alternate incremental training methods for estimating remaining parameters after the tensor decomposition step and we have added this discussion to the paper. We thank Daniel Hsu for discussion on random design analysis of ridge regression. We thank Percy Liang for discussion about score function. M. Janzamin is supported by NSF BIGDATA award FG16455. H. Sedghi is supported by NSF Career award FG15890. A. Anandkumar is supported in part by Microsoft Faculty Fellowship, NSF Career award CCF-1254106, and ONR Award N00014-14-1-0665.
Submitted - 1506.08473.pdf