Talk Title: Order-optimal convergence rates with adaptive SGD
Abstract: We study the convergence rate of AdaGrad-Norm, as an exemplar of adaptive stochastic gradient descent (SGD) methods that adjust their step sizes based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non-adaptive methods in this setting. Specifically, all prior studies rely on some subset of the following assumptions, none of which is needed in the analysis of fixed step-size SGD: (i) uniformly bounded gradient norms, (ii) uniformly bounded stochastic gradient variance (or even bounded noise support), (iii) conditional independence between the step size and the stochastic gradient. In this work, we show that AdaGrad-Norm achieves an order-optimal convergence rate (up to polylogarithmic factors), scaling inversely with the square root of T after T iterations, under the same standard assumptions as well-tuned fixed step-size SGD: unbounded gradient norms and affine noise variance scaling. We also go beyond the standard smoothness assumption, proving similar results for (L_0, L_1)-smooth (non-convex) functions, where the Hessian norm can grow affinely with the gradient norm, as for the exponential function. This is the first convergence rate for any algorithm in this smoothness and noise regime, even in the special case of uniformly bounded variance. I will conclude the talk by discussing some of my broader research directions. This talk is based on joint work with Isidoros Tziotis, Litu Rout, Constantine Caramanis, Aryan Mokhtari, Rachel Ward, and Sanjay Shakkottai.
References: https://arxiv.org/abs/2202.05791 (COLT 2022) and https://arxiv.org/abs/2302.06570 (COLT 2023).
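
For readers unfamiliar with the method, below is a minimal sketch of the AdaGrad-Norm update: a single scalar step size that shrinks as the accumulated squared stochastic gradient norms grow. The function names, the toy quadratic, and the hyperparameter values (eta, b0) are illustrative assumptions, not the setup from the referenced papers.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, eta=1.0, b0=1e-2, T=1000):
    """Minimal AdaGrad-Norm sketch (assumed interface: grad_fn(x) returns a
    stochastic gradient at x)."""
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2  # accumulator initialized at b_0^2
    for _ in range(T):
        g = grad_fn(x)                     # observed stochastic gradient g_t
        b_sq += np.dot(g, g)               # b_{t+1}^2 = b_t^2 + ||g_t||^2
        x = x - (eta / np.sqrt(b_sq)) * g  # x_{t+1} = x_t - (eta / b_{t+1}) g_t
    return x

# Toy usage: a quadratic objective with additive Gaussian gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: 2 * x + 0.1 * rng.standard_normal(x.shape)
x_final = adagrad_norm(noisy_grad, x0=np.ones(5), T=5000)
```

Note that the step size at iteration t uses the accumulator after it has absorbed the current gradient g_t, so the step size and the stochastic gradient are correlated; this is precisely why the conditional independence assumption (iii) above does not hold for AdaGrad-Norm as implemented.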
Tags: