Stochastic Gradient Descent-type (SGD) algorithms have been widely applied to many non-convex optimization problems in machine learning, e.g., training deep neural networks, variational Bayesian inference and collaborative filtering. Due to current technical limit, however, establishing convergence properties of SGD for these highly complicated practical non-convex problems is generally infeasible. Therefore, we propose to analyze the behavior of the SGD-type algorithms through two simpler but non-trivial non-convex problems – (1) Streaming Principal Component Analysis and (2) Training Non-overlapping Two-layer Convolutional Neural Networks. Specifically, we prove that for both examples, SGD attains a sub-linear rate of convergence to the global optima with high probability. Our theory not only helps us better understand SGD, but also provides new insights on more complicated non-convex optimization problems in machine learning.
- Tags
-