Accelerating local SGD for non-IID data using variance reduction
Xianfeng LIANG, Shuheng SHEN, Enhong CHEN, Jingchang LIU, Qi LIU, Yifei CHENG, Zhen PAN
Distributed stochastic gradient descent and its variants, which train machine learning models with multiple workers in parallel, have been widely adopted. Among them, local-based algorithms, including Local SGD and FedAvg, have gained much attention due to their desirable properties, such as low communication cost and privacy preservation. Nevertheless, when the data distributions across workers are non-identical, local-based algorithms suffer a significant degradation in convergence rate. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to handle such heterogeneous data. Without incurring extra communication cost, VRL-SGD reduces the gradient variance among workers caused by heterogeneous data, and thereby prevents local-based algorithms from converging slowly. Moreover, we present VRL-SGD-W, which adds an effective warm-up mechanism for scenarios where the data among workers are highly diverse. By eliminating the impact of heterogeneous data, we theoretically prove that VRL-SGD achieves a linear iteration speedup with lower communication complexity even when workers access non-identical datasets. We conduct experiments on three machine learning tasks. The experimental results demonstrate that VRL-SGD performs markedly better than Local SGD on heterogeneous data, and that VRL-SGD-W is considerably more robust under high data variance among workers.
distributed optimization / variance reduction / local SGD / federated learning / non-IID data
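As a rough illustration of the idea described in the abstract, the following is a minimal, self-contained sketch of local SGD augmented with a per-worker variance-reduction (correction) term on a synthetic non-IID least-squares problem. This is not the authors' reference implementation: the toy data, the hyperparameters, and the particular rule used to refresh each correction term from the gap between the local model and the averaged model are illustrative assumptions.

```python
# Illustrative sketch (not the paper's reference implementation): local SGD on a
# toy least-squares problem where each worker holds non-IID data, augmented with
# a per-worker variance-reduction term refreshed at every synchronization.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, local_steps, rounds, lr = 4, 10, 20, 50, 0.05

# Non-IID data: each worker's targets come from a different "true" model.
A = [rng.normal(size=(100, dim)) for _ in range(n_workers)]
b = [A[i] @ rng.normal(loc=i, size=dim) for i in range(n_workers)]

def grad(i, x, batch=10):
    """Mini-batch gradient of 0.5 * ||A_i x - b_i||^2 / batch on worker i."""
    idx = rng.integers(0, A[i].shape[0], size=batch)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ x - bi) / batch

x_global = np.zeros(dim)
delta = [np.zeros(dim) for _ in range(n_workers)]   # per-worker correction terms

for _ in range(rounds):
    local = []
    for i in range(n_workers):
        x = x_global.copy()
        for _ in range(local_steps):
            # Local step with the correction subtracted, so each worker roughly
            # follows the global descent direction despite its biased local data.
            x -= lr * (grad(i, x) - delta[i])
        local.append(x)
    new_global = np.mean(local, axis=0)
    # Refresh the correction from the gap between local and averaged models;
    # this particular update rule is an assumption made for illustration.
    delta = [delta[i] + (new_global - local[i]) / (local_steps * lr)
             for i in range(n_workers)]
    x_global = new_global

print("final objective:",
      np.mean([np.mean((A[i] @ x_global - b[i]) ** 2) for i in range(n_workers)]))
```

Note that in this sketch the correction terms start at zero and always sum to zero across workers, so they cancel in the model average and only compensate the per-worker drift introduced by heterogeneous data, matching the abstract's claim that no extra communication is required.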
Xianfeng Liang is currently an MS student in the School of Computer Science and Technology at the University of Science and Technology of China (USTC), China. His major research interests include machine learning and optimization.
Shuheng Shen received the MS degree from the University of Science and Technology of China (USTC), China in 2020. His major research interests include machine learning, stochastic optimization and distributed optimization.
Enhong Chen is a professor and vice dean of the School of Computer Science at the University of Science and Technology of China (USTC), China. He received the PhD degree from USTC, China. His general area of research includes data mining and machine learning, social network analysis and recommender systems. He has published more than 100 papers in refereed conferences and journals, including IEEE Trans. KDE, IEEE Trans. MC, KDD, ICDM, NIPS, and CIKM. He has served on the program committees of numerous conferences, including KDD, ICDM, and SDM. He is a senior member of the IEEE.
Jingchang Liu received the MS degree from the University of Science and Technology of China (USTC), China in 2019. His major research interests include machine learning, stochastic optimization and distributed optimization.
Qi Liu is a professor at the University of Science and Technology of China (USTC), China. He received the PhD degree in computer science from USTC, China. His general area of research is data mining and knowledge discovery. He has published prolifically in refereed journals and conference proceedings, e.g., TKDE, TOIS, TKDD, TIST, KDD, IJCAI, AAAI, ICDM, SDM and CIKM. He has served regularly on the program committees of a number of conferences, and is a reviewer for leading academic journals in his field. He is a member of ACM and IEEE. Dr. Liu is the recipient of the KDD 2018 Best Student Paper Award (Research) and the ICDM 2011 Best Research Paper Award. He is supported by the Young Elite Scientist Sponsorship Program of CAST and the Youth Innovation Promotion Association of CAS.
Yifei Cheng is currently working toward the PhD degree in the School of Data Science, University of Science and Technology of China, China. His current research interests include machine learning, distributed optimization and federated learning.
Zhen Pan received the PhD degree from the University of Science and Technology of China, China in 2020. His major research interests include machine learning and data mining.