Convergence of Gradient Algorithms for Nonconvex C^{1+α} Cost Functions

Zixuan Wang, Shanjian Tang

Chinese Annals of Mathematics, Series B, 2023, Vol. 44, Issue 3: 445-464. DOI: 10.1007/s11401-023-0024-y

Abstract

This paper is concerned with the convergence of stochastic gradient algorithms with momentum terms in the nonconvex setting. A class of stochastic momentum methods, including stochastic gradient descent, the heavy-ball method and Nesterov's accelerated gradient, is analyzed in a general framework under mild assumptions. Building on the convergence result for expected gradients, the authors prove almost sure convergence through a detailed analysis of the effects of momentum and of the number of upcrossings. It is worth noting that no additional restrictions are imposed on the objective function or the stepsize. Another improvement over previous results is that the usual Lipschitz condition on the gradient is relaxed to Hölder continuity. As a byproduct, the authors apply a localization procedure to extend the results to stochastic stepsizes.
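As a rough illustration (not taken from the paper), the three methods named in the abstract can be written as a single parametric update: with momentum parameter β and an extrapolation flag μ, setting β = 0 recovers stochastic gradient descent, μ = 0 the heavy-ball method, and μ = 1 Nesterov's accelerated gradient. The Python sketch below assumes this common unified formulation, a diminishing stepsize of the form η/k^{0.75}, and a toy nonconvex objective with Hölder-continuous gradient; all of these choices are illustrative assumptions rather than the paper's exact setting.

```python
import numpy as np

def stochastic_momentum(grad, x0, eta=0.1, beta=0.9, mu=1.0,
                        noise_std=0.1, n_iters=50_000, seed=0):
    """Unified stochastic momentum iteration (a common formulation,
    not necessarily the exact one analyzed in the paper):

        v_{k+1} = beta * v_k - eta_k * g(x_k + mu * beta * v_k)
        x_{k+1} = x_k + v_{k+1}

    where g is a noisy (unbiased) gradient oracle.

    beta = 0            -> plain stochastic gradient descent
    beta > 0, mu = 0.0  -> stochastic heavy ball
    beta > 0, mu = 1.0  -> Nesterov's accelerated gradient
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for k in range(1, n_iters + 1):
        eta_k = eta / k**0.75          # diminishing stepsize (illustrative choice)
        g = grad(x + mu * beta * v) + noise_std * rng.standard_normal(x.shape)
        v = beta * v - eta_k * g
        x = x + v
    return x

# Toy nonconvex objective f(x) = sum_i x_i^2 / (1 + x_i^2); its gradient,
# 2 x / (1 + x^2)^2, is Lipschitz (hence Hölder) continuous.
grad_f = lambda x: 2 * x / (1 + x**2) ** 2
print(stochastic_momentum(grad_f, x0=np.array([1.5, -1.0])))
# the iterate should drift toward the stationary point at the origin
```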

Keywords

Gradient descent methods / Nonconvex optimization / Accelerated gradient descent / Heavy-ball momentum

Cite this article

Zixuan Wang, Shanjian Tang. Convergence of Gradient Algorithms for Nonconvex C^{1+α} Cost Functions. Chinese Annals of Mathematics, Series B, 2023, 44(3): 445-464. DOI: 10.1007/s11401-023-0024-y

