Adam revisited: a weighted past gradients perspective
Hui ZHONG, Zaiyi CHEN, Chuan QIN, Zai HUANG, Vincent W. ZHENG, Tong XU, Enhong CHEN
Adaptive learning rate methods have been applied successfully in many fields, especially in training deep neural networks. Recent results have shown that adaptive methods with exponentially increasing weights on the squared past gradients (e.g., ADAM, RMSPROP) may fail to converge to the optimal solution. Although several algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix this non-convergence issue, achieving a data-dependent regret bound similar to or better than that of ADAGRAD remains a challenge for these methods. In this paper, we propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issue. Unlike AMSGRAD and ADAMNC, we adopt a milder weighting strategy on the squared past gradients, in which the weights grow linearly rather than exponentially. Based on this idea, we propose the weighted adaptive gradient method framework (WAGMF) and implement WADA within this framework. Moreover, we prove that WADA achieves a weighted data-dependent regret bound, which can be better than the original regret bound of ADAGRAD when the gradients decrease rapidly. This bound may partially explain the good performance of ADAM in practice. Finally, extensive experiments demonstrate the effectiveness of WADA and its variants in comparison with several variants of ADAM on convex problems and on training deep neural networks.
adaptive learning rate methods / stochastic gradient descent / online learning
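The weighting strategy described in the abstract can be made concrete with a small sketch. The Python snippet below is a minimal illustration, not the authors' reference implementation: it accumulates squared gradients with linearly growing weights (w_t = t), in contrast to ADAGRAD's equal weights and ADAM's exponentially increasing weights. The step size alpha, the smoothing term eps, and the toy quadratic objective are illustrative assumptions; the full WADA update (step-size schedule, momentum, and normalization) is specified in the paper itself.

import numpy as np

def linearly_weighted_adaptive_sgd(grad_fn, x0, steps=1000, alpha=0.1, eps=1e-8):
    # Illustrative sketch only: squared gradients are accumulated with
    # linearly growing weights w_t = t, instead of equal weights (ADAGRAD)
    # or exponentially increasing weights (ADAM, RMSPROP).
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)                    # weighted sum of squared past gradients
    for t in range(1, steps + 1):
        g = np.asarray(grad_fn(x), dtype=float)
        v += t * g ** 2                     # weight on g_t^2 grows linearly in t
        x -= alpha / (np.sqrt(v) + eps) * g # per-coordinate adaptive step
    return x

# Toy usage on the convex quadratic f(x) = ||x||^2 / 2, whose gradient is x.
x_min = linearly_weighted_adaptive_sgd(lambda x: x, x0=np.ones(5))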